OVERLAPPED GEOMETRY PROCESSING IN A MULTICORE GPU

20230111909 · 2023-04-13

    Abstract

    A multicore graphics processing unit (GPU) and a method of operating a GPU having at least a first core and a second core. A client driver writes a series of geometry commands to a command buffer, along with associated dependency data that indicates the extent to which correct execution of the geometry commands is dependent on the completion of execution of other commands. The first core reads a first geometry command from the command buffer and executes it. The second core reads a second geometry command from the command buffer. The second core determines that the second geometry command is not dependent on the results of the first geometry command, and, in response, executes the second geometry command.

    Claims

    1. A method of operating a multicore graphics processing unit (GPU) configured to perform tile-based rendering, to enable overlapping processing of geometry commands, the multicore GPU comprising at least a first core and a second core, the method comprising: providing a series of geometry commands in a command buffer, the geometry commands having associated dependency data indicating which of the geometry commands are dependent on the completion of other commands; reading a first geometry command from the command buffer; starting to execute the first geometry command using the first core; reading a second geometry command from the command buffer; after reading the second geometry command from the command buffer, determining that the second geometry command is not dependent on the results of the first geometry command; and in response, starting to execute the second geometry command using the second core.

    2. The method of claim 1, comprising maintaining a write offset, denoting the position in the command buffer at which new commands should be written by a client driver.

    3. The method of claim 1, comprising maintaining: a first read offset, denoting the position at which the first core should read the next geometry command that it is to execute; a second read offset, denoting the position at which the second core should read the next geometry command that it is to execute; and a dependency offset, denoting the earliest dependency in the series of geometry commands that has not yet been satisfied.

    4. The method of claim 3, wherein determining that the second geometry command is not dependent on the results of the first geometry command comprises: advancing the second read offset until it reaches either the second geometry command or the dependency offset; comparing the second read offset with the dependency offset; if the second read offset is less than the dependency offset, determining that the second geometry command is not dependent on the results of the first geometry command; and otherwise, determining that the second geometry command is dependent on the results of the first geometry command.

    5. The method of claim 3, further comprising, when the first core finishes executing the first geometry command: advancing the first read offset beyond the second read offset, until it reaches either a third geometry command or the dependency offset; if the first read offset reaches the third geometry command, determining that the third geometry command is not dependent on the results of the second geometry command; and in response, executing the third geometry command using the first core.

    6. The method of claim 1, wherein the command buffer also includes at least one dependency indicator, indicating that a geometry command following the dependency indicator is dependent on another command.

    7. The method of claim 6, further comprising advancing the dependency offset (Doff) to the earliest dependency indicator in the command buffer whose dependency is not yet satisfied.

    8. The method of claim 1, wherein the command buffer also includes at least one update command; and wherein the at least one update command is executed by the core that has just executed the earliest geometry command in the series that was, immediately prior to the completion of its execution, the earliest unexecuted geometry command in the series.

    9. The method of claim 1, wherein the first core is configured to write the results of geometry commands that it executes to a first parameter buffer; and the second core is configured to write the results of geometry commands that it executes to a second parameter buffer, separate from the first parameter buffer.

    10. The method of claim 9, wherein each parameter buffer is subdivided and used by multiple renders.

    11. The method of claim 1, wherein both the first core and one or more other cores are configured to execute fragment processing commands based on the results of geometry commands executed by the first core.

    12. The method of claim 11, wherein each of the one or more other cores is configured to, when it finishes processing the results of a geometry command executed by the first core, signal this to the first core; and wherein the first core is configured to, in response to finishing the fragment processing commands and receiving the signal from each core of said at least one other core, free the memory that was used to store the respective results.

    13. The method of claim 1, wherein: a first affinity is set for the first geometry command such that, if the first core is interrupted while executing the first geometry command, the first geometry command will only resume processing on the first core; and a second affinity is set for the second geometry command such that, if the second core is interrupted while executing the second geometry command, the second geometry command will only resume processing on the second core.

    14. The method of claim 1, wherein the first and second geometry commands relate to different frames.

    15. The method of claim 1, wherein the first and second geometry commands relate to the same frame.

    16. The method of claim 1, wherein the command buffer is a circular command buffer.

    17. A multicore graphics processing unit (GPU) configured to enable overlapping processing of geometry commands during tile-based rendering, the multicore GPU comprising at least a first core and a second core, and a command buffer; wherein the command buffer is configured to hold a series of geometry commands written by a client driver, the geometry commands having associated dependency data indicating which of the geometry commands are dependent on the completion of other commands; wherein the first core is configured to: read a first geometry command from the command buffer, and start to execute the first geometry command; and wherein the second core is configured to: read a second geometry command from the command buffer, after reading the second geometry command from the command buffer, determine that the second geometry command is not dependent on the results of the first geometry command, and in response, start to execute the second geometry command.

    18. The multicore GPU of claim 17, wherein: the first core is configured to maintain a first read offset, denoting the position at which the first core should read the next geometry command that it is to execute; the second core is configured to maintain a second read offset, denoting the position at which the second core should read the next geometry command that it is to execute; and the GPU is configured to maintain a dependency offset, denoting the earliest dependency in the series of geometry commands that has not yet been satisfied.

    19. The multicore GPU of claim 17, wherein the command buffer also includes at least one update command, wherein each of the first core and the second core is configured to execute the at least one update command only if that core has just executed the earliest geometry command in the series that was, immediately prior to its execution, the earliest unexecuted geometry command in the series.

    20. A non-transitory computer readable storage medium having stored thereon computer readable code configured to cause the method of claim 1 to be performed when the code is run.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0091] Examples will now be described in detail with reference to the accompanying drawings in which:

    [0092] FIG. 1 is a block diagram of a GPU according to an example;

    [0093] FIG. 2 illustrates a portion of a circular command buffer for geometry processing, according to an example;

    [0094] FIG. 3 is a flowchart illustrating a method according to an example;

    [0095] FIG. 4 is a flowchart illustrating a further method according to an example;

    [0096] FIG. 5A shows an example of processing geometry sequentially according to a comparative example;

    [0097] FIG. 5B shows an example of processing geometry in parallel according to an example;

    [0098] FIG. 6 shows a computer system in which a graphics processing system is implemented; and

    [0099] FIG. 7 shows an integrated circuit manufacturing system for generating an integrated circuit embodying a graphics processing system.

    [0100] The accompanying drawings illustrate various examples. The skilled person will appreciate that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the drawings represent one example of the boundaries. It may be that in some examples, one element may be designed as multiple elements or that multiple elements may be designed as one element. Common reference numerals are used throughout the figures, where appropriate, to indicate similar features.

    DETAILED DESCRIPTION

    [0101] The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The present invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.

    [0102] Embodiments will now be described by way of example only.

    [0103] In the exemplary tile-based deferred rendering architecture, geometry processing is generally executed on one core, while fragment shading may be distributed across multiple cores. This parallelisation helps to speed up the rendering process. However, in scenes with very complex geometry, and therefore a large volume of geometry data, the geometry processing stage may be the rate-limiting step, since it needs to be carried out on a single core.

    [0104] The problem becomes more pronounced as the number of cores increases. As fragment shading is divided among a greater number of cores, the benefits of parallel processing mean that it can be completed faster. Therefore, it becomes even more likely that the geometry processing (which cannot simply be parallelised in the same way) becomes the bottleneck. Scalability suffers as a result, because increasing the number of cores delivers less and less benefit, in terms of speeding up the overall rendering process.

    [0105] A further challenge arises because of the relationship between the graphics pipeline and the applications it services. An application and client driver will issue commands to the GPU in a particular order. The application/client driver is entitled to rely on the understanding that commands will be processed by the GPU in the order that they are issued. If commands are processed out of sequence, unexpected (and erroneous) behaviour may result.

    [0106] It would be desirable to speed up the rendering of frames with complex geometry, in spite of the difficulties associated with parallelising the geometry processing. It would be particularly desirable to do this without requiring significant changes to the hardware pipeline, and in a manner that is transparent to the client driver. In particular, it would be desirable to continue to meet all of the conditions that are met for the existing client driver, in order that software using the client driver does not have to be rewritten.

    [0107] Some commands are dependent on the completion of earlier commands, but some are not. For example, in some cases, commands associated with successive frames may be independent of one another. Also, in some cases, the client driver may issue multiple commands within a single frame which are also independent of one another.

    [0108] In examples according to the present disclosure, firmware of the GPU can analyse the sequence of commands issued by the client driver, to determine which ones are dependent on the results of previous commands and which are not. This information can be extracted from “fences” set by the application or client driver. When the firmware finds a geometry command that is not dependent on the results of previous commands (which have not yet completed executing), it executes that command on the next available core. In the context of tile-based rendering, it will be understood that a geometry command is a command that instigates geometry phase processing. Thus, this allows independent geometry processing to proceed in parallel on multiple cores concurrently.

    [0109] In particular, the firmware examines the command buffer (which is a circular command buffer, or CCB, in the present implementation). According to the exemplary GPU architecture, the circular command buffer has three pointers associated with it: a write offset, a read offset, and a dependency offset.

    [0110] The write offset indicates the location where the client driver should write the next command. The read offset indicates the location of the next geometry command in the buffer to be executed. The dependency offset provides a guarantee that nothing before the dependency offset is waiting for any unsatisfied fences. In practice, if it has been updated by the firmware to advance it as far as possible, the dependency offset typically indicates the location of the next command that is dependent on completion of one or more previous commands.

    [0111] According to the exemplary GPU architecture, several circular command buffers may exist, one for each type of work performed by the GPU. For example, as well as the geometry CCB, there may be a fragment CCB for fragment processing commands, and a compute CCB, for compute commands. The dependency offset facilitates synchronisation between the different CCBs. For example, a fragment processing command will be dependent on completion of a respective geometry processing command. A dependency indicator would be inserted in the fragment CCB, to reflect this. Either a fragment processing command or a geometry processing command may depend on the result of a compute command. If so, the fragment CCB or the geometry CCB would contain a dependency indicator to reflect this, as appropriate. It is also possible that there may be multiple command buffers of the same type, e.g. multiple geometry CCBs. This may occur when different (unrelated) applications issue work to the same GPU, with different command buffers being created for each of the different applications. For the sake of simplicity, the following description assumes the presence of a single geometry command buffer, but it will be understood that when multiple geometry command buffers are present they could each be processed as described.

    [0112] In examples according to the present disclosure, this capability to track conditional dependencies between different commands in different CCBs is exploited to determine and respect conditional dependencies between different commands in the same CCB—namely, geometry commands in a geometry CCB.

    [0113] In examples according to the present disclosure, a second read offset is introduced. When the geometry processing for a given geometry command is sent (“kicked”) to a first core, the first read offset stops at that command. The firmware keeps moving the second read offset forwards. If it finds a further geometry command before it encounters the dependency offset, then it knows that this command is not dependent on the calculations currently being performed by the first core. It therefore kicks the further geometry command to a second core, to process the geometry in parallel with the ongoing processing in the first core.
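
    By way of illustration only, the scan performed with the second read offset may be sketched as follows. This is a minimal sketch and not the firmware itself: the list-of-names buffer model, the function name next_independent_geometry, the example buffer contents, and the use of linear (non-wrapping) indices are all assumptions made for brevity.

```python
def next_independent_geometry(buffer, read_offset, dependency_offset):
    """Advance a read offset until it finds a geometry command or hits Doff.

    Returns (new_offset, command); command is None if the dependency
    offset was reached first. The buffer is modelled as a list of command
    names, which is an assumption made for this sketch.
    """
    offset = read_offset
    while offset < dependency_offset:
        if buffer[offset].startswith("GEOM"):
            return offset, buffer[offset]   # independent: safe to kick to an idle core
        offset += 1                          # skip non-geometry entries
    return offset, None                      # blocked by an unsatisfied dependency


# Hypothetical buffer contents: "U*" and "D*" stand for non-geometry entries.
# Index 5 is assumed to hold the first unsatisfied dependency.
ccb = ["GEOM0", "U0", "D1", "GEOM1", "U1", "D2", "GEOM2"]
doff = 5

# The first core is busy with GEOM0 (index 0), so the second read offset
# starts scanning just past it.
roff2, cmd = next_independent_geometry(ccb, 1, doff)
print(roff2, cmd)                 # 3 GEOM1

# Scanning on from GEOM1 reaches the dependency offset before GEOM2.
blocked_at, blocked_cmd = next_independent_geometry(ccb, roff2 + 1, doff)
print(blocked_at, blocked_cmd)    # 5 None
```

    In this sketch, a returned command of None corresponds to the case in which the read offset has encountered the dependency offset, so that no further geometry command may be kicked until the dependency is satisfied.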

    [0114] In other words, one read offset is provided per core allocated to geometry processing. Whenever the respective core is idle, the firmware advances (that is, moves forward) the read offset for that core, searching for further geometry processing to execute, until the dependency offset is encountered. In this way, geometry processing can be carried out in parallel across multiple cores. The implementation is largely transparent to the client driver. The client driver should not insert unnecessary fences between geometry commands (as that could prevent the full benefit of parallelisation being achieved). At the same time, the client driver should explicitly highlight any implied dependencies within the CCB, using dependency fences where appropriate. In other words, the client driver is expected to be clear about dependencies within the CCB, and to use them sparingly where necessary. If the client driver overstates the dependencies, this may have a negative impact on performance; if the client driver understates the dependencies, it may create the potential for incorrect rendering. Otherwise, the client driver writes commands to the circular command buffer in the same way that it always did. It remains unaware of whether the geometry processing is performed by a single core or by multiple cores.

    [0115] The commands are processed with no gaps. In other words, each read offset is advanced only until it finds the first non-executed command, which is then executed by the associated core.

    [0116] Ordinarily, when the geometry processing associated with a particular command has finished executing, the firmware performs “fence updates”, to update the client driver about what memory can now be reused by the driver. This behaviour is modified according to examples of the present invention because, ordinarily, the fence updates would lead the client driver to believe that all commands in the command buffer up to this point have now completed executing. This is no longer necessarily the case when the geometry processing associated with different commands is being handled by different cores. The geometry processing for a later command may finish executing while the geometry processing for an earlier command is still ongoing.

    [0117] To accommodate this in the multicore case, the firmware only performs fence updates based on the trailing read offset. For any read offsets that are ahead of the trailing one, no fence updates are performed when geometry processing finishes. This avoids the risk of incorrect signalling to the client driver.
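
    The trailing-offset rule may be illustrated by the following minimal sketch. It assumes that read offsets can be compared as plain integers (wrap-around is ignored); the function name and the dictionary of per-core offsets are hypothetical and are not taken from the firmware.

```python
def should_perform_fence_update(core, read_offsets):
    """Return True only if `core` holds the trailing read offset.

    `read_offsets` maps a (hypothetical) core name to its read offset.
    A strict comparison is used, so a core performs the update only when
    its offset is behind every other core's offset.
    """
    mine = read_offsets[core]
    return all(mine < other
               for name, other in read_offsets.items()
               if name != core)


offsets = {"core0": 1, "core1": 3}
print(should_perform_fence_update("core0", offsets))   # True: core0 trails

offsets["core0"] = 4                                    # core0 has skipped ahead
print(should_perform_fence_update("core0", offsets))   # False
print(should_perform_fence_update("core1", offsets))   # True: core1 now trails
```

    The case in which two read offsets are equal at an update command is not addressed by this sketch.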

    [0118] Some other modifications are also necessary to the memory management mechanisms. Each core handling geometry processing is allocated its own parameter buffer, so that each has control of its own memory pool, independent of the other(s). As explained above, the parameter buffer is filled by performing geometry processing (including tiling), and the contents are consumed during fragment shading.

    [0119] In the multicore case, a parameter buffer populated by one core doing geometry processing may be consumed by several cores performing fragment shading (possibly including the core that did the geometry processing). Each of these cores performs a fragment shading task that involves fragment shading for one or more tiles. Since the individual (other) cores handling the fragment shading will not have complete knowledge about the use of memory in one parameter buffer—and since they may be handling fragment shading for more than one geometry core (and therefore using more than one parameter buffer) at different times—only the core that originally did the geometry processing is allowed to deallocate the memory. Thus, deallocation is left to the core that is responsible for that particular parameter buffer. This ensures that only the core that has the full overview of the contents of a parameter buffer is able to deallocate memory within that parameter buffer.

    [0120] When a fragment shading core completes a fragment shading task, it signals to the relevant “master” geometry core that it is finished. The “master” geometry core then deallocates/releases the associated memory space in the parameter buffer, once all of the fragment shading cores have finished their processing.
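
    The signalling between fragment shading cores and the "master" geometry core may be sketched as follows. The class name, the set of pending cores, and the boolean flag standing in for the actual deallocation are assumptions made purely for illustration.

```python
class ParameterBufferRegion:
    """Parameter buffer space owned by one geometry ("master") core."""

    def __init__(self, fragment_cores):
        self.pending = set(fragment_cores)  # cores still consuming the results
        self.freed = False

    def signal_done(self, core):
        """Record that `core` has finished its fragment shading task.

        The region is released only once every fragment shading core
        has signalled. Returns True once the region has been freed.
        """
        self.pending.discard(core)
        if not self.pending and not self.freed:
            self.freed = True   # stands in for the real deallocation
        return self.freed


region = ParameterBufferRegion(fragment_cores={"core0", "core1", "core2"})
print(region.signal_done("core1"))   # False: two cores still working
print(region.signal_done("core0"))   # False: one core still working
print(region.signal_done("core2"))   # True: all finished, memory released
```

    Because only the owning object changes the freed flag, the sketch mirrors the rule that deallocation is left to the core responsible for the parameter buffer.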

    [0121] It is possible that a geometry processing task is interrupted while running on a core. This can occur, in particular, due to multitasking and context switching. For example, firmware may context switch a geometry command if it needs to execute a higher priority geometry command from a more important software application. This is decided at runtime, and the firmware will return to the interrupted geometry command later. When this happens, the interrupted geometry processing task may maintain an affinity for the core on which it was running, such that it will only resume execution on the same core. This helps to simplify the signalling between the fragment shading cores and the geometry cores—the signalling for memory deallocation always goes back to the core that is the “master” for the relevant parameter buffer. There is no risk that ownership of the parameter buffer switches between geometry cores, while a fragment shading core is doing its work.
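
    The affinity rule may be illustrated by the following minimal sketch, in which a geometry task becomes bound to the first core that runs it. The class and method names are hypothetical, and the integer core identifiers are an assumption.

```python
class GeometryTask:
    """A geometry command that remembers the core it first ran on."""

    def __init__(self, name):
        self.name = name
        self.affinity = None          # no core chosen until first kicked

    def can_run_on(self, core):
        """A task may start anywhere but may only resume on its own core."""
        return self.affinity is None or self.affinity == core

    def start_or_resume(self, core):
        if not self.can_run_on(core):
            raise ValueError(f"{self.name} is bound to core {self.affinity}")
        self.affinity = core          # set (or confirm) the affinity


task = GeometryTask("GEOM0")
task.start_or_resume(0)               # first kick binds the task to core 0
# ... the task is interrupted by higher-priority work ...
print(task.can_run_on(1))             # False: may not migrate to core 1
print(task.can_run_on(0))             # True: resumes on the original core
```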

    [0122] FIG. 1 is a simplified schematic block diagram showing a GPU 10 according to an example. The GPU comprises a first core 12; a second core 14; and a circular command buffer (CCB) 18. Note that the CCB 18 shown in the drawing is the CCB for geometry commands. The GPU has other CCBs (not shown) for other types of commands.

    [0123] FIG. 2 shows the contents of a part of the CCB 18. A first read offset Roff for the first core is pointing to a first geometry command GEOM0; a second read offset Roff2 for the second core is pointing to a second geometry command GEOM1. The CCB 18 also includes update commands U0, U1, U2 (also known as “fence updates”), and dependency indicators D1, D2. These are set by the client driver. A dependency offset Doff is pointing to the first dependency indicator that has not yet been satisfied, D2. In the example illustrated in FIG. 2, the dependency indicator D1 before the second geometry command GEOM1 has already been satisfied (e.g. perhaps by a separate compute command being processed). Accordingly, the second geometry command GEOM1 is not dependent on the results of the first geometry command GEOM0, and these two geometry commands can be processed in an overlapping fashion by the first and second cores, respectively.
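
    For illustration, the CCB contents and the positioning of the dependency offset may be modelled as follows. The ordering of the entries is inferred from the description and is an assumption, as are the Entry type, the "kind" tags, the function name, and the use of linear (non-wrapping) indices.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    kind: str                # "GEOM", "UPDATE" or "DEP" (hypothetical tags)
    name: str
    satisfied: bool = True   # only meaningful for "DEP" entries


def advance_dependency_offset(buffer, doff, woff):
    """Move Doff forward to the first unsatisfied dependency, or to Woff."""
    while doff < woff:
        entry = buffer[doff]
        if entry.kind == "DEP" and not entry.satisfied:
            break            # nothing beyond this point may be executed yet
        doff += 1
    return doff


# Assumed layout: D1 is already satisfied, D2 is not.
ccb = [
    Entry("GEOM", "GEOM0"),
    Entry("UPDATE", "U0"),
    Entry("DEP", "D1", satisfied=True),
    Entry("GEOM", "GEOM1"),
    Entry("UPDATE", "U1"),
    Entry("DEP", "D2", satisfied=False),
    Entry("GEOM", "GEOM2"),
    Entry("UPDATE", "U2"),
]

doff = advance_dependency_offset(ccb, 0, len(ccb))
print(doff, ccb[doff].name)      # 5 D2

# Once D2 is satisfied, Doff can move on as far as the write offset.
ccb[5].satisfied = True
print(advance_dependency_offset(ccb, doff, len(ccb)))   # 8
```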

    [0124] FIG. 3 illustrates a method carried out by the GPU 10, based on processing the CCB shown in FIG. 2. In step 102, the client driver writes geometry commands to the CCB 18. Each geometry command is written to the CCB at the position pointed to by a write offset Woff (not shown in FIG. 2). The write offset is advanced after every command is written. As the command buffer is a circular command buffer in this example, when the write offset reaches the end of the buffer it wraps around to the start again. (This is true for each of the offset pointers.)
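
    The wrap-around of the write offset may be sketched as follows. The class name, the fixed slot count, and the one-command-per-slot model are assumptions made for illustration. A practical implementation would also need to prevent the write offset from overtaking commands that have not yet been read; that check is omitted here.

```python
class CircularCommandBuffer:
    """Fixed-size command buffer whose write offset wraps around."""

    def __init__(self, size):
        self.slots = [None] * size
        self.write_offset = 0

    def write(self, command):
        """Store a command at Woff, then advance Woff with wrap-around."""
        self.slots[self.write_offset] = command
        self.write_offset = (self.write_offset + 1) % len(self.slots)


ccb = CircularCommandBuffer(size=4)
for cmd in ["GEOM0", "U0", "GEOM1", "U1", "GEOM2"]:
    ccb.write(cmd)

# The fifth write wrapped around and reused slot 0.
print(ccb.slots)          # ['GEOM2', 'U0', 'GEOM1', 'U1']
print(ccb.write_offset)   # 1
```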

    [0125] In step 104, the GPU advances the dependency offset Doff to the first dependency indicator that has not yet been satisfied, D2. In step 106, the first core 12 reads the first geometry command GEOM0 at the first read offset Roff. Meanwhile, the second core 14 advances the second read offset Roff2 to the next geometry command GEOM1. Here, it reads the second geometry command GEOM1. Because it reached the second geometry command GEOM1 before it reached the dependency offset Doff, the second core can determine (in step 117) that the second geometry command GEOM1 does not depend on the output of the first geometry command GEOM0. Once this has been determined, the second core 14 knows that it is safe to execute the second geometry command GEOM1 (irrespective of whether execution of the first geometry command by the first core is completed yet). Having made this determination in step 117, the second core executes the second geometry command in step 118. In the present implementation, the execution of the second geometry command (by the second core) starts shortly after execution of the first geometry command (by the first core) and both proceed in parallel. It does not matter which execution finishes first.

    [0126] In step 109, the first core 12 writes the results of its geometry processing (executing the first geometry command) to a first parameter buffer. The first parameter buffer is reserved for writing by the first core, and the first core retains responsibility for allocating and deallocating memory in this parameter buffer. Meanwhile, in step 119, the second core 14 writes the results of its geometry processing (executing the second geometry command) to a second parameter buffer. The second parameter buffer is reserved for writing by the second core, and the second core retains responsibility for allocating and deallocating memory in this parameter buffer.

    [0127] Each core requests a memory allocation for its parameter buffer whenever it needs to write data (in particular, tile control streams and parameter blocks) if it does not have sufficient memory already allocated to write it. The later deallocation of memory will depend on signalling from the other cores performing fragment processing. This will be described with reference to FIG. 4.

    [0128] As explained above, it is known to parallelise fragment processing. For example, in the present tile-based architecture, fragment processing may be distributed among a plurality of cores by allocating a subset of the tiles to each core. There will typically be several cores handling fragment processing. In the present example, at least the first core 12 is involved in handling fragment processing for the results of its geometry processing. That is, the first core does at least some of the fragment processing for the results of the first geometry command. Likewise, the second core does at least some of the fragment processing for the results of the second geometry command. In addition, there may be cores in the system that are configured to perform fragment processing but not geometry processing. In other examples, the core involved with performing the geometry processing for a command may not be involved with the fragment processing other than to perform the memory deallocation mentioned above and below.

    [0129] In step 202, fragment processing is executed. Fragment processing is based on the data produced as output in the geometry processing phase. That is, fragment processing consumes the contents of the parameter buffer, which was written in the geometry processing phase. When a given fragment processing command has finished executing, the parameter buffer contents that it has consumed are no longer required, and the associated memory can be freed. According to the present implementation, the core that did the geometry processing work also does some of the fragment processing work. Any other core that is executing the fragment processing signals to the core that originally did the geometry processing work, when it has completed its part of the fragment processing—that is, when execution of the fragment processing command on that other core is finished. In particular, after the second core finishes executing fragment processing based on the results of the first geometry command, it will signal this (in step 204) to the first core. In response, once all cores (including the first core) handling the fragment processing based on the results of the first geometry command have finished the fragment processing, the first core frees the relevant allocated memory in the first parameter buffer (step 206). Likewise, after the first core finishes executing fragment processing based on the results of the second geometry processing command, it will signal this (in step 214) to the second core. In response, once all cores (including the second core) handling the fragment processing based on the results of the second geometry command have finished the fragment processing, the second core will free the relevant allocated memory in the second parameter buffer (in step 216). Because each core has its own parameter buffer, there is no risk of one core deallocating and freeing memory that is still in use by another core.

    [0130] Referring once again to the exemplary CCB contents shown in FIG. 2, let us consider what will happen after the first core finishes executing the first geometry command GEOM0 and the second core finishes executing the second geometry command GEOM1. Let us assume that the first core finishes its work first. In the example of FIG. 2, the first core will advance the first read offset Roff to the update command U0. Because Roff<Roff2 (i.e. Roff is behind Roff2 in the CCB), the first core executes the fence update command U0. That is, because the first core currently has the trailing read offset, it executes the fence update command (step 110 in FIG. 3). Update commands should only be executed once, and they should be executed by the core whose read offset is currently trailing. The first read offset Roff is then advanced again, to reach the second geometry command GEOM1. (Note that the dependency indicator D1 is ignored when advancing Roff—it has already been considered when positioning the dependency offset Doff.) At this point, Roff=Roff2 and Roff2<Doff; therefore, the first core knows that the second core will already be executing the second geometry command GEOM1. The first core therefore skips the second geometry command and advances the first read offset Roff, again. Now, Roff reaches the next update command U1. Because Roff>Roff2 at this point (i.e. Roff is ahead of Roff2 in the CCB), the first core knows that it must skip the fence update. This fence update will be carried out later, by the second core (whose read offset Roff2 is now trailing) when it completes execution of the second geometry command GEOM1. The first read offset Roff is advanced again. It reaches the dependency offset Doff; therefore it stops, as no core should execute any geometry commands beyond the dependency offset.
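
    The sequence of decisions made by the first core in this walkthrough may be replayed with the following minimal sketch. It is a simplified two-core model under stated assumptions: the buffer is a list of names whose order is inferred from the description, offsets are linear integer indices (no wrap-around), and the function name and log format are hypothetical.

```python
def walk_first_core(ccb, roff, roff2, doff):
    """Replay the first core's steps after it finishes its geometry command."""
    log = []
    roff += 1                                  # move past the completed command
    while roff < doff:                         # never go beyond Doff
        entry = ccb[roff]
        if entry.startswith("U"):
            # Update commands are executed only by the trailing read offset.
            log.append(("execute" if roff < roff2 else "skip", entry))
        elif entry.startswith("GEOM"):
            if roff == roff2:
                log.append(("skip", entry))    # already running on the other core
            else:
                log.append(("execute", entry))
                break                          # the core is busy again
        # Dependency indicators ("D*") are ignored here; Doff accounts for them.
        roff += 1
    return roff, log


# Assumed layout; Roff at GEOM0, Roff2 at GEOM1, Doff at D2.
ccb = ["GEOM0", "U0", "D1", "GEOM1", "U1", "D2", "GEOM2", "U2"]
roff, log = walk_first_core(ccb, roff=0, roff2=3, doff=5)

print(log)    # [('execute', 'U0'), ('skip', 'GEOM1'), ('skip', 'U1')]
print(roff)   # 5: stopped at the dependency offset
```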

    [0131] If the dependency offset Doff could be moved forward at this point, beyond the third geometry command GEOM2, then the first core would advance Roff to this command GEOM2 and would read and execute it.

    [0132] The second core meanwhile completes execution of the second geometry command GEOM1. The second core will advance the second read offset Roff2 to the fence update command U1. Because Roff2 is the trailing read offset by this stage (that is Roff2<Roff), the second core executes the update command U1.

    [0133] Note that, in some examples, a geometry command may be split by the client driver into two or more parts. In this case, the parts behave in the same way as when geometry processing is interrupted while running on a core—the subsequent parts maintain an affinity for the core that processed the first part. Splitting by the client driver may occur for various reasons. For example, the client driver may wish to pause one geometry command, start another, and return to the first one later.

    [0134] The two cores doing geometry processing can proceed in this way, skipping ahead of one another, until all of the geometry processing work is complete. This pattern of overlapping execution has the potential to significantly increase the throughput of geometry processing. This will be explained by reference to the comparative example illustrated schematically in FIG. 5A, and the example illustrated in FIG. 5B. In FIG. 5A, geometry commands are executed sequentially by a single core. The total time taken is the sum of the times taken to execute individual commands. GEOM0 is executed in two parts. This may be a result of context switching, or a result of splitting of the command into two parts by the client driver, as explained above. In FIG. 5B, independent geometry commands are executed by two cores according to the pattern explained above. The total time taken in this example is the sum of the times taken for the two parts of GEOM0 (since these are now executed sequentially by Core 1, and this takes longer than the execution of GEOM1, GEOM2 and GEOM3 by Core 2). Note that the second part of GEOM0 executes on the same core as the first part, following the rule about affinity explained below.
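
    The comparison between sequential and overlapped execution can be made concrete with a small worked calculation. The durations below are hypothetical values in arbitrary time units, chosen only so that the two parts of GEOM0 together take longer than GEOM1, GEOM2 and GEOM3; idle gaps and context-switch overheads are ignored.

```python
# Hypothetical execution times (arbitrary units).
durations = {"GEOM0_part1": 4, "GEOM0_part2": 4,
             "GEOM1": 2, "GEOM2": 2, "GEOM3": 2}

# Single core: the total is the sum of all individual times.
sequential = sum(durations.values())

# Two cores: Core 1 runs both parts of GEOM0, Core 2 runs the rest.
core1 = durations["GEOM0_part1"] + durations["GEOM0_part2"]
core2 = durations["GEOM1"] + durations["GEOM2"] + durations["GEOM3"]
overlapped = max(core1, core2)     # finish time is set by the busier core

print(sequential)   # 14
print(overlapped)   # 8
```

    With these assumed figures, the overlapped schedule finishes when Core 1 completes the second part of GEOM0, which is the behaviour described for the two-core example.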

    [0135] Thus, it can be seen that, according to the present example, geometry processing can be scaled effectively across multiple cores. This can be done without requiring the client driver to make any decisions about how to divide up the work.

    [0136] As mentioned above, once a core has started executing a given geometry command, that command has an “affinity” set for that core. This means that, if the execution is interrupted (for example, due to multitasking and/or context switching), it will only resume on that same core, and not on any other core. This simplifies memory management and can also help to reduce the use of memory access bandwidth. The cores performing fragment processing do not need to check whether geometry processing has been interrupted on a first core and resumed on a second core (in which case the second core would become responsible for deallocating and freeing the associated parameter buffer space). A core performing fragment processing has a guarantee that one specific core was responsible for the geometry processing and that the same core remains responsible for tidying up the memory management. Additionally, if a geometry processing command is interrupted and resumes on the same core, it is possible that some of the primitive data that was being processed by the command may still persist in a cache of that core. This may help to avoid reading that data from external memory a second time after resumption.
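
    The affinity rule can be sketched as a small bookkeeping structure. This is an illustrative model only: the class name, method names and command labels are assumptions, and the disclosure does not specify how affinity is recorded.

```python
from typing import Dict, Optional


class AffinityTable:
    """Tracks which core first started each geometry command (hypothetical)."""

    def __init__(self) -> None:
        self._owner: Dict[str, int] = {}

    def may_run(self, command: str, core: int) -> bool:
        """A command may start on any core, but may only resume on its owner."""
        owner = self._owner.get(command)
        return owner is None or owner == core

    def start_or_resume(self, command: str, core: int) -> None:
        """Record the first start, or check that a resume is on the owner."""
        if not self.may_run(command, core):
            raise ValueError(
                f"{command} has affinity for core {self._owner[command]}"
            )
        # The first start sets the affinity; a resume leaves it unchanged.
        self._owner.setdefault(command, core)

    def owner(self, command: str) -> Optional[int]:
        """Return the core responsible for the command, if it has started."""
        return self._owner.get(command)


table = AffinityTable()
table.start_or_resume("GEOM0", core=1)      # first part starts on core 1
resume_same = table.may_run("GEOM0", core=1)
resume_other = table.may_run("GEOM0", core=2)
```

    In this model a fragment-processing core could look up the single owner of a command to know which core remains responsible for freeing the associated parameter buffer space.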

    [0137] It should be understood that the scope of the present disclosure is not limited to the examples above. Many variations are possible, including but not limited to the following.

    [0138] The examples described above used two cores for geometry processing; however, it should be understood that this is non-limiting—a greater number of geometry cores could be used in other examples.

    [0139] Although the command buffer in the present implementation is a circular command buffer, this is not essential. Other suitable memory structures might include a linked list, for example.

    [0140] In the present implementation, geometry cores are started strictly sequentially. That is, each core is only started after the previously started cores have begun their geometry processing. This may be beneficial for simplicity of design and implementation; however, it is not essential. Geometry cores could start their geometry processing simultaneously, in some implementations.

    [0141] As mentioned earlier, it is possible for multiple geometry command buffers to be present, and each buffer may be processed by multiple cores as described above. That is, multiple different cores may work on the different command buffers at the same time, because there are no dependencies between the commands in one geometry CCB and those in another geometry CCB.

    [0142] Moreover, depending on the workloads and core availabilities, the different buffers may each be processed by different single cores in parallel. It may even be the case that those cores swap workloads, for example where it is preferable to use a specific core for geometry processing only if the other core(s) are busy. So, in an example scenario, a first core ‘core0’ may be working on a first command buffer ‘CCB0’ and so a second core ‘core1’ starts working on another command buffer ‘CCB1’. When core0 stops working on CCB0 (either because all the CCB0 commands have been processed or because an unfulfilled dependency has been encountered) it may start processing CCB1. That processing may continue in parallel with core1 also processing CCB1 or, if it becomes desirable to use core1 for something else, core1 may cease processing CCB1 so that core0 is processing CCB1 alone. Further, assuming core0 stopped processing CCB0 due to an unfulfilled dependency, and that dependency is fulfilled after both core0 has begun working on CCB1 and core1 has stopped working on CCB1, core1 may then start processing CCB0, such that both command buffers are now being processed in parallel again, but now by different cores compared to the start of the scenario.

    [0143] FIG. 6 shows a computer system in which the graphics processing systems described herein may be implemented. The computer system comprises a CPU 902, a GPU 904, a memory 906 and other devices 914, such as a display 916, speakers 918 and a camera 919. A processing block 910 (corresponding to GPU 10) is implemented on the GPU 904. The components of the computer system can communicate with each other via a communications bus 920.

    [0144] The GPU of FIG. 1 is shown as comprising a number of functional blocks. This is schematic only and is not intended to define a strict division between different logic elements of such entities. Each functional block may be provided in any suitable manner. It is to be understood that intermediate values described herein as being formed by a GPU need not be physically generated by the GPU at any point and may merely represent logical values which conveniently describe the processing performed by the GPU between its input and output.

    [0145] The GPUs described herein may be embodied in hardware on an integrated circuit. The GPUs described herein may be configured to perform any of the methods described herein. Generally, any of the functions, methods, techniques or components described above can be implemented in software, firmware, hardware (e.g., fixed logic circuitry), or any combination thereof. The terms “module,” “functionality,” “component”, “element”, “unit”, “block” and “logic” may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs the specified tasks when executed on a processor. The algorithms and methods described herein could be performed by one or more processors executing code that causes the processor(s) to perform the algorithms/methods. Examples of a computer-readable storage medium include a random-access memory (RAM), read-only memory (ROM), an optical disc, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions or other data and that can be accessed by a machine.

    [0146] The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for processors, including code expressed in a machine language, an interpreted language or a scripting language. Executable code includes binary code, machine code, bytecode, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in a programming language such as C, Java® or OpenCL. Executable code may be, for example, any kind of software, firmware, script, module or library which, when suitably executed, processed, interpreted, compiled, or executed at a virtual machine or other software environment, causes a processor (in particular, a GPU) of the computer system at which the executable code is supported to perform the tasks specified by the code.

    [0147] A processor, computer, or computer system may be any kind of device, machine or dedicated circuit, or collection or portion thereof, with processing capability such that it can execute instructions. A processor may be any kind of general purpose or dedicated processor, such as a CPU, GPU, NNA, System-on-chip, state machine, media processor, an application-specific integrated circuit (ASIC), a programmable logic array, a field-programmable gate array (FPGA), or the like. A computer or computer system may comprise one or more processors.

    [0148] It is also intended to encompass software which defines a configuration of hardware as described herein, such as HDL (hardware description language) software, as is used for designing integrated circuits, or for configuring programmable chips, to carry out desired functions. That is, there may be provided a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition dataset that when processed (i.e. run) in an integrated circuit manufacturing system configures the system to manufacture a GPU configured to perform any of the methods described herein, or to manufacture a GPU comprising any apparatus described herein. An integrated circuit definition dataset may be, for example, an integrated circuit description.

    [0149] Therefore, there may be provided a method of manufacturing, at an integrated circuit manufacturing system, a GPU as described herein. Furthermore, there may be provided an integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, causes the method of manufacturing a GPU to be performed.

    [0150] An integrated circuit definition dataset may be in the form of computer code, for example as a netlist, code for configuring a programmable chip, as a hardware description language defining hardware suitable for manufacture in an integrated circuit at any level, including as register transfer level (RTL) code, as high-level circuit representations such as Verilog or VHDL, and as low-level circuit representations such as OASIS® and GDSII. Higher level representations which logically define hardware suitable for manufacture in an integrated circuit (such as RTL) may be processed at a computer system configured for generating a manufacturing definition of an integrated circuit in the context of a software environment comprising definitions of circuit elements and rules for combining those elements in order to generate the manufacturing definition of an integrated circuit so defined by the representation. As is typically the case with software executing at a computer system so as to define a machine, one or more intermediate user steps (e.g. providing commands, variables etc.) may be required in order for a computer system configured for generating a manufacturing definition of an integrated circuit to execute code defining an integrated circuit so as to generate the manufacturing definition of that integrated circuit.

    [0151] An example of processing an integrated circuit definition dataset at an integrated circuit manufacturing system so as to configure the system to manufacture a GPU will now be described with respect to FIG. 7.

    [0152] FIG. 7 shows an example of an integrated circuit (IC) manufacturing system 1002 which is configured to manufacture a GPU as described in any of the examples herein. In particular, the IC manufacturing system 1002 comprises a layout processing system 1004 and an integrated circuit generation system 1006. The IC manufacturing system 1002 is configured to receive an IC definition dataset (e.g. defining a GPU as described in any of the examples herein), process the IC definition dataset, and generate an IC according to the IC definition dataset (e.g. which embodies a GPU as described in any of the examples herein). The processing of the IC definition dataset configures the IC manufacturing system 1002 to manufacture an integrated circuit embodying a GPU as described in any of the examples herein.

    [0153] The layout processing system 1004 is configured to receive and process the IC definition dataset to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art, and for example may involve synthesising RTL code to determine a gate level representation of a circuit to be generated, e.g. in terms of logical components (e.g. NAND, NOR, AND, OR, MUX and FLIP-FLOP components). A circuit layout can be determined from the gate level representation of the circuit by determining positional information for the logical components. This may be done automatically or with user involvement in order to optimise the circuit layout. When the layout processing system 1004 has determined the circuit layout it may output a circuit layout definition to the IC generation system 1006. A circuit layout definition may be, for example, a circuit layout description.

    [0154] The IC generation system 1006 generates an IC according to the circuit layout definition, as is known in the art. For example, the IC generation system 1006 may implement a semiconductor device fabrication process to generate the IC, which may involve a multiple-step sequence of photolithographic and chemical processing steps during which electronic circuits are gradually created on a wafer made of semiconducting material. The circuit layout definition may be in the form of a mask which can be used in a lithographic process for generating an IC according to the circuit definition. Alternatively, the circuit layout definition provided to the IC generation system 1006 may be in the form of computer-readable code which the IC generation system 1006 can use to form a suitable mask for use in generating an IC.

    [0155] The different processes performed by the IC manufacturing system 1002 may be implemented all in one location, e.g. by one party. Alternatively, the IC manufacturing system 1002 may be a distributed system such that some of the processes may be performed at different locations, and may be performed by different parties. For example, some of the stages of: (i) synthesising RTL code representing the IC definition dataset to form a gate level representation of a circuit to be generated, (ii) generating a circuit layout based on the gate level representation, (iii) forming a mask in accordance with the circuit layout, and (iv) fabricating an integrated circuit using the mask, may be performed in different locations and/or by different parties.

    [0156] In other examples, processing of the integrated circuit definition dataset at an integrated circuit manufacturing system may configure the system to manufacture a GPU without the IC definition dataset being processed so as to determine a circuit layout. For instance, an integrated circuit definition dataset may define the configuration of a reconfigurable processor, such as an FPGA, and the processing of that dataset may configure an IC manufacturing system to generate a reconfigurable processor having that defined configuration (e.g. by loading configuration data to the FPGA).

    [0157] In some embodiments, an integrated circuit manufacturing definition dataset, when processed in an integrated circuit manufacturing system, may cause an integrated circuit manufacturing system to generate a device as described herein. For example, the configuration of an integrated circuit manufacturing system in the manner described above with respect to FIG. 7 by an integrated circuit manufacturing definition dataset may cause a device as described herein to be manufactured.

    [0158] In some examples, an integrated circuit definition dataset could include software which runs on hardware defined at the dataset or in combination with hardware defined at the dataset. In the example shown in FIG. 7, the IC generation system may further be configured by an integrated circuit definition dataset to, on manufacturing an integrated circuit, load firmware onto that integrated circuit in accordance with program code defined at the integrated circuit definition dataset or otherwise provide program code with the integrated circuit for use with the integrated circuit.

    [0159] The implementation of concepts set forth in this application in devices, apparatus, modules, and/or systems (as well as in methods implemented herein) may give rise to performance improvements when compared with known implementations. The performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During manufacture of such devices, apparatus, modules, and systems (e.g. in integrated circuits) performance improvements can be traded-off against the physical implementation, thereby improving the method of manufacture. For example, a performance improvement may be traded against layout area, thereby matching the performance of a known implementation but using less silicon. This may be done, for example, by reusing functional blocks in a serialised fashion or sharing functional blocks between elements of the devices, apparatus, modules and/or systems. Conversely, concepts set forth in this application that give rise to improvements in the physical implementation of the devices, apparatus, modules, and systems (such as reduced silicon area) may be traded for improved performance. This may be done, for example, by manufacturing multiple instances of a module within a predefined area budget.

    [0160] The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.