System and method of data processing
09582321 ยท 2017-02-28
Assignee
Inventors
Cpc classification
G06F9/3887
PHYSICS
G06F9/3888
PHYSICS
G06F9/3885
PHYSICS
G06F9/4881
PHYSICS
International classification
G06F15/00
PHYSICS
G06F9/30
PHYSICS
G06F9/38
PHYSICS
G06F9/455
PHYSICS
Abstract
A data processing apparatus, a data processing method and a computer program product are disclosed. In an embodiment, the data processing apparatus comprises: a processor comprising a plurality of parallel lanes for parallel processing of sets of threads, each lane comprising a plurality of pipelined stages, the pipelined stages of each lane being operable to process instructions from the sets of threads; and scheduling logic operable to schedule instructions for processing by the lanes, the scheduling logic being operable to identify that one of the sets of threads being processed is to be split into a plurality of sub-sets of threads and to schedule at least two of the plurality of sub-sets of threads for processing by different pipelined stages concurrently.
Claims
1. A data processing apparatus comprising: a plurality of parallel lanes for parallel processing of sets of threads, each lane comprising a plurality of pipelined stages, said pipelined stages of each lane being operable to process instructions from said sets of threads; and scheduling logic operable to schedule instructions for processing by said lanes, said scheduling logic being operable to identify that one of said sets of threads being processed is to be split into a plurality of sub-sets of threads and to schedule at least two of said plurality of sub-sets of threads for processing by different pipelined stages concurrently, wherein said scheduling logic is operable to identify that a possible divergence in an instruction pointer value may occur which is identified by a divergence indicator within a thread included within said identified set of threads and to perform an adjustment of a reconvergence counter for said set of threads from an initial value on an occurrence of said divergence indicator.
2. The data processing apparatus of claim 1, wherein each thread within said set of threads scheduled for processing shares a common instruction pointer value identifying at least one common instruction for parallel processing by said lanes.
3. The data processing apparatus of claim 1, wherein said scheduling logic comprises storage operable to store an indication of said set of threads and each of said sub-sets of threads.
4. The data processing apparatus of claim 3, wherein said indication of said set of threads and each of said sub-sets of threads comprises an instruction pointer value associated with each thread.
5. The data processing apparatus of claim 1, wherein said scheduling logic is operable to identify that said set of threads is to be split into said plurality of sub-sets of threads when different instructions are identified for parallel processing by the same stage within said lanes.
6. The data processing apparatus of claim 1, wherein said scheduling logic is operable to identify that said set of threads is to be split into said plurality of sub-sets of threads when a divergence in instruction pointer value occurs for at least one thread of said set of threads.
7. The data processing apparatus of claim 1, wherein said scheduling logic is operable to include, in each sub-set of threads, those threads sharing a common instruction pointer value identifying at least one common instruction for parallel processing by said lanes.
8. The data processing apparatus of claim 1, wherein said scheduling logic is operable to identify recursively that one of said sub-sets of threads being processed is to be split into a further plurality of sub-sets of threads and to schedule at least two sub-sets of threads for processing by different pipelined stages concurrently.
9. The data processing apparatus of claim 1, wherein said scheduling logic is operable to schedule any at least two sub-sets of threads which have not themselves been split recursively into further sub-sets for processing by different pipelined stages concurrently.
10. The data processing apparatus of claim 1, wherein said scheduling logic is operable to prevent said set of threads from being scheduled for processing until said plurality of sub-sets of threads have reconverged again to share a common instruction pointer value.
11. The data processing apparatus of claim 8, wherein said scheduling logic is operable to reform a sub-set of threads when every further sub-set split from that sub-set of threads has reconverged again to share a common instruction pointer value.
12. The data processing apparatus of claim 1, wherein said scheduling logic is operable to reverse said adjustment of said reconvergence counter for said sub-set of threads when a possible reconvergence identified by a reconvergence indicator occurs within that sub-set of threads.
13. The data processing apparatus of claim 1, wherein said scheduling logic is operable to determine that a sub-set of threads has reached a possible reconvergence when said reconvergence counter returns to said initial value.
14. The data processing apparatus of claim 1, comprising logic operable to annotate an instruction stream comprising said instructions to provide said divergence and reconvergence indicators.
15. The data processing apparatus of claim 1, wherein said divergence and reconvergence indicators identify at least one of a single-entry single-exit region and a region of unstructured code.
16. The data processing apparatus of claim 1, wherein said scheduling logic is operable, on occurrence of a store indicator within a set of threads, to store in alternative storage contents of storage associated with said set of threads.
17. The data processing apparatus of claim 1, wherein said scheduling logic is operable, on occurrence of a restore indicator, to determine whether said set of threads within which said restore indicator occurred matches said set of threads whose contents are stored in said alternative storage and, if so, to overwrite contents in said storage associated with said set of threads with said contents from said alternative storage except for said instruction pointer value.
18. The data processing apparatus of claim 1, wherein said scheduling logic is operable, on occurrence of a restore indicator, to determine whether said set of threads within which said restore indicator occurred matches said set of threads whose contents are stored in said alternative storage and, if not, to remove an indication of said set of threads within which said restore indicator occurred from said contents and from contents associated with sub-sets of threads of said set of threads within which said restore indicator occurred prior to overwriting contents in said storage associated with said set of threads with said contents from said alternative storage except for said instruction pointer value and said indication of said set of threads.
19. A data processing method of scheduling instructions for processing by a data processing apparatus comprising a plurality of parallel lanes for parallel processing of sets of threads, each lane comprising a plurality of pipelined stages, said pipelined stages of each lane being operable to process instructions from said sets of threads, said method comprising: identifying that one of said sets of threads being processed is to be split into a plurality of sub-sets of threads; scheduling at least two of said plurality of sub-sets of threads for processing by different pipelined stages concurrently; identifying that a possible divergence in an instruction pointer value may occur which is identified by a divergence indicator within said identified one of said sets of threads; and, performing an adjustment of a reconvergence counter for said set of threads from an initial value on an occurrence of the divergence indicator.
20. A computer program product operable, when executed on a computer, to perform the method steps of claim 19.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Embodiments of the present invention will now be described further, with reference to the accompanying drawings, in which:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
DESCRIPTION OF THE EMBODIMENTS
(10) General Architecture
(11)
(12) The host processor 101 is a processor optimized for single-threaded processing. It can be a standard single or multi-core central processing unit (CPU), which may or may not use the same instruction set as the co-processor 102, or it may be omitted. If omitted, the co-processor 102 will perform the host-processor's 101 tasks.
(13) In one embodiment, the co-processor 102 is a Coherent Vector Threaded processor comprising any of the features of a reconvergence mechanism described in more detail below. The host processors 101 and co-processors 102 have a control communication link, here illustrated in the form of a control bus 103, which provides the host processor 101 with a channel to reset, configure, start and stop the co-processors 102. In addition, the host processors 101 and the co-processors 102 may have the ability to send each other interrupts and simple messages though the control bus 103.
(14) The host processors 101 and co-processors 102 are, together or individually, coupled to at least one memory bus 104, which provides access to a main memory (not shown), usually within another computer chip, but possibly on the same chip or in another form of storage.
(15) Processor Architecture
(16)
(17)
(18) The scheduling logic 204 retains (or accesses in other embodiments) information about the current state of the thread set in order to make decisions on which threads to schedule for processing at any particular time. In this example, the information is stored in a thread set memory 205 which, in this example is implemented as a single table although it will be appreciated that other implementation are possible where the information is stored in a more distributed or compact form.
(19) The execution unit 203 also includes a path 290 over which certain characteristics of the instructions being executed within the pipelined stages are reported to a control flow unit 202. In this example, the path 290 is provided at stage S2, but it will be appreciated that the path 290 may be provided elsewhere, depending on the particular implementation. For example, the path 290 is used to report whether a divergence in instruction pointer value occurs for a thread set being executed by pipelined stage S2, or whether a particular instruction is being processed. The control flow unit 203 uses this information to update the thread set memory 205.
(20) The thread set memory 205 contains a number of entries to identify both the thread sets and their relationships to assist the scheduling logic 204 when the scheduling instructions to be issued to each lane L0 to L3 of the execution unit 203. In particular, the thread set memory 205 includes an entry for each thread set or thread sub-set. In this example, each entry contains an instruction pointer field 250, a thread set mask field 260, a parent pointer field 270 and a depth counter field 280. The instruction pointer field 250 identifies, for each entry, the next instruction to be executed by the threads in the thread set identified for that entry. The thread set mask field 260 identifies the threads which are currently executable, as well as the lanes within which they may be executed (although in other embodiments the lane mapping is determined by the scheduler). The parent pointer field 270 identifies whether the entry has a parent thread set and, if so, which entry in the table is that parent thread set. The depth counter field 280 is used to reduce the number of entries in the table which may otherwise be made if a possible divergence is indicated within a thread (which would otherwise lead to further entries requiring to be made within the instruction thread set memory 205). Accordingly, the provision of the depth counter field 280 enables a degree of compression of entries within the instruction thread set memory 205 to be achieved and may also be used in other embodiments to selecting which thread set to schedule.
(21) Code Annotation
(22) In order to utilise the functionality of the thread set memory 205 to provide a reconvergence mechanism, the code being executed by the co-processor 102 needs to provide certain information relating to the existence of possible divergent points or convergent points in the control flow so that an assessment of the threads making thread sets and whether those thread sets can be scheduled for execution can be made. Typically, this is achieved by inserting additional instructions into the code or by annotating the existing instructions to identify these points.
(23) In embodiments where the host processor 101 and the co-processor 102 are implementing different instruction sets, code that is to be run on the co-processor will re-compiled from the host processor's to the co-processor's instruction set on demand during run-time through a process known as Just-in-Time compilation (JIT-compilation).
(24) JIT-compilation refers to re-compiling compiled instructions from the host-processor's instruction set, or another instruction set, to the co-processors instruction set. This is achieved by translating the instructions directly into their equivalent or a series of instructions achieving the same result, remapping registers where required. It can involve translating the code into an abstract program representation as an intermediate step, such as a directed acyclic graph, a single assignment form register allocation or an intermediate language. In an alternative embodiment, the code exists for two instruction sets before the program is run, and the distinction between the two is explicit, but requires no JIT-compilation. The code for the two instruction sets is compiled so that data is laid out and accessed in the same way, thereby allowing functions in either of the two versions to operate on the same data structures. The two compiled versions are referred to as Compatibly Compiled Instruction Streams (CCISs). Both these embodiments, JIT-compiling from a compiled instruction set to the co-processor's instruction set or alternatively creating the CCISs, enable seamless handover of tasks between the host-processor and the co-processor. Sections of code that are intended for the co-processor, but which cannot be re-compiled for the co-processor or are identified as inefficient on the co-processor, can run on the host-processor instead. A seamless integration is also available in the simple case where both host-processor and co-processor use the same instruction set.
(25) In one embodiment, compilation, including JIT-compilation or generation of the CCIS, may also include insertion of additional instructions or reconfiguration of the program flow to improve performance or execution flow on a CVT processor. Inserted instructions can include, for example, ENTER_REGION, EXIT_REGION, STORE_REGION, RESTORE_REGION, as explained in further detail below.
(26) Single Entry Single Exit (SeSe) Regions
(27) As part of the compilation, Single Entry Single Exit (SeSe) regions in the code are identified. SeSe regions are identified by applying the following set of criteria to the Instruction Stream:Instruction a is the entry and instruction b the exit to a SeSe Region if: Every control flow path from the first Instruction in the overall control flow to instruction b also reaches instruction a before Every path from Instruction a to the last Instruction in the overall control flow also reaches instruction b Whenever the control flow reaches a it must reach b before being able to reach a again Whenever the control flow reaches b it must reach a before being able to reach b again.
(28) The overall control flow refers to the order in which instructions are executed. First and last instructions refer to the first and the last Instruction that can be executed in the code being compiled. In embodiments, trivial SeSe Regions, where a and b are the same instruction or without control flow instructions, are not relevant. Furthermore, a SeSe region that is just an extension of another SeSe region, without at least one additional control flow instruction is also not relevant. A SeSe region will be used to refer to SeSe Regions where #b, containing control flow instructions and not being a simple extension of another SeSe region without additional control flow instruction.
(29) This definition of SeSe regions is based on two publications which are hereby incorporated by reference in their entirety: Richard Johnson, David Pearson, Keshav Pingali, The Program Structure Tree: Computing Control Regions in Linear Time (1994) and by the same authors, Finding Regions Fast: Single Entry Single Exit and Control Regions in Linear Time (1993).
(30) It will be appreciated that alternative identification methods for SeSe regions can also be utilized. Also, a SeSe region can contain another SeSe region and share entry point a or exit point b. Furthermore, when writing the code, or during compilation, the Instruction Stream can be rewritten to equivalent versions with identical functionality in order to change the number of SeSe regions in the Instruction Stream.
(31) Multiple Entry And Multiple Exit (MeMe) Regions
(32) The rules for SeSe regions defined above do not always apply to all regions within the Instruction Stream. Regions of the code that have more than one entry and/or more than one exit are referred to multiple entry and multiple exit regions. In one embodiment, multiple entry and multiple exit regions, and all regions they contain, are ignored for the purpose of reconvergence. In this case, only the surrounding SeSe region is considered. In another embodiment, the compiler will rewrite multiple entry and multiple exit regions into SeSe regions by changing the instructions. Methods available for rewriting unstructured code flow with multiple entry and multiple exit regions have been summarized by F. Zhang and E. H D'Hollander, Using hammock graphs to structure programs. IEEE Trans. Softw. Eng., pages 231-245 2004 which is hereby incorporated by reference in its entirety.
(33) Multiple entry or multiple exit regions may also be avoided during the code generation by changing the way a program is written, for example by avoiding using the unstructured control flow statements described below. After the SeSe regions have been identified, special instructions for reconvergence as mentioned below, are inserted into the Instruction Stream. If multiple SeSe regions are entered or exited at the same Instruction, then either multiple reconvergence instructions are inserted, or a version of the reconvergence instruction that includes a repeat count is inserted.
(34) During execution of the Instruction Stream a reconvergence mechanism ensures that threads that have been executing in lockstep and have diverged into different paths reconverge typically after as few instructions as possible. This reconvergence is possible when the threads reach the same point in the Instruction Stream again. This ensures that the execution lanes have the highest possible utilization. In one embodiment, the point where threads reconverge is identified by the exit of a SeSe region. In another embodiment, the co-processor 102 will prioritize the threads that are still inside a SeSe region, but may still allow threads to execute beyond the region exit without complete reconvergence, for example while the other threads are stalled and the processor is idle.
(35) SeSe Region Entries and Exits
(36) The location of SeSe region entries and exits are specified using a set of reconvergence instructions. The parameters to these instructions may be in the form of immediate values, they may be stored in registers or in memory, or they may be derived in some other way based on the current thread's state and instruction stream.
(37) In a one embodiment, only one additional instruction is used: ENTER_REGION [reconv.address]. The instruction is inserted at the place where the region starts, and reconv.address is a parameter which gives the address in the Instruction Stream where the region ends.
(38) In another embodiment, one instruction is used to mark the point of entry, and another to mark the point of the exit from the region: ENTER_REGION [id], EXIT_REGION [id] where id is a parameter that identifies the SeSe region. Depending on the underlying architecture, the parameter may be omitted. The id may also refer to a memory location where additional information is stored. In one embodiment, regions with more than one entry and exit may be supported in limited cases by identifying corresponding ENTER_REGION and EXIT_REGION through the id.
(39) The identification of SeSe regions as outlined above is sufficient to support structured program flow, such as but not limited to the following instructions (described as C and C++ examples but which will also equally apply to their corresponding implementation in any other programming language): if-then-else statements do-while or for-loops function calls indirect function calls (where destination is unknown at compile-time and may be different across lanes)
(40) To support popular programming languages, such as C and C++, it is also desirable to support a certain set of unstructured control flow statements, such as, but not limited to: continue break return goto setjump/longjump try/catch and throw
(41) In one embodiment, they will be partially or fully enabled by the compiler by rewriting the resulting multiple entry and/or multiple exit regions into SeSe regions as described above. In another embodiment, the following instructions are inserted into the Instruction Stream for the purpose of handling aforementioned instructions:
(42) STORE_REGION [location]
(43) This instruction records the region state of the executing thread(s), and associates it with the given location. In one embodiment, location is a pointer to memory. In another embodiment, location is a reference to the processor register containing a pointer to memory. In another embodiment, location is a key into a hash-map based data storage and retrieval unit. The location is used to record the region hierarchy for an unstructured jump destination, and allows the processor to construct the correct region state when such an unstructured jump is performed. The STORE_REGION instruction does not have to be located at the jump destination, but it must be at the same place in the region hierarchy. For example, it can be executed before a for-loop to prepare for the fact that a break instruction inside the for-loop might perform an unstructured jump to the instruction following the for-loop.
(44) RESTORE_REGION [location]
(45) This instruction restores the region state of the executing thread(s) from the given location. Location refers to a region state stored by a corresponding STORE_REGION instruction, as mentioned in the exemplary embodiments described above. STORE_REGION and RESTORE_REGION instructions are used to perform unstructured jumps within an otherwise structured part of the program without disrupting the reconvergence logic.
(46) In another embodiment, ENTER_REGION, EXIT_REGION, their parameters, and optionally STORE_REGION and RESTORE_REGION with their parameters, are encoded as part of another instruction. One or more bits of the instruction are reserved and used for the reconvergence instructions and its parameters. In another embodiment, where the Instruction Stream uses Very Large Instruction Word (VLIW) instructions that encode several operations together, ENTER_REGION, EXIT_REGION, their parameters, and optionally STORE_REGION and RESTORE_REGION with their parameters, are encoded inside these VLIW instructions. One or more bits of the instruction are reserved and used for the reconvergence instructions and their parameters.
(47) The implementation described below describes the reconvergence mechanism with ENTER_REGION, EXIT_REGION, STORE_REGION and RESTORE_REGION inserted as discrete instructions. The description presented below applies equally to all cases where the ENTER_REGION, EXIT_REGION and/or STORE_REGION, RESTORE_REGION are encoded as part of other instruction(s).
(48) In one embodiment, the compiler rewrites unstructured code, identified by multiple entry and multiple exit regions, to SeSe regions, as mentioned above. In another embodiment, STORE_REGION and RESTORE_REGION instructions are inserted to take care of unstructured branches as detailed below. These branches can then be ignored when identifying SeSe regions in the pre-processing step, allowing SeSe regions to be found, and reconvergence to be applied even in the presence of unstructured control flow.
(49) The STORE_REGION instruction is inserted by the compiler in parts of the Instruction Stream with unstructured code flow, as in the examples given below:
(50) TABLE-US-00001 Setjump STORE_REGION is inserted as part of the setjump function instructions in the Instruction Stream Try STORE_REGION is inserted together with the instructions added to the Instruction Stream by the try keyword Return/break/continue STORE_REGION is inserted after the ENTER_REGION instruction of the region that contains the next instruction executed after the return, break or continue Goto STORE_REGION is inserted after the ENTER_REGION instruction of the region that contains the next instruction executed after the goto. This can only be done if it is certain that the location of the STORE_REGION will be executed before the goto instruction. If this is not the case, the STORE_REGION/RESTORE_REGION approach cannot be used.
(51) The RESTORE_REGION instruction restores the region state info information for the thread set executing the instruction as if the thread set never progressed past the corresponding STORE_REGION instruction. The examples below illustrate the logic:
(52) TABLE-US-00002 Longjump RESTORE_REGION is inserted as part of the longjump function instructions in the Instruction Stream Catch RESTORE_REGION is inserted together with the instructions added to the Instruction Stream by the catch keyword Return/break/continue RESTORE_REGION is inserted together with the instructions added to the Instruction Stream by the return, break or continue keyword Goto RESTORE_REGION is inserted together with the instructions added to the Instruction Stream by the goto keyword. This can only be done if it is certain that the location of the STORE_REGION will be executed before the goto instruction. If this is not the case, the STORE_REGION/RESTORE_REGION approach cannot be used.
(53)
Thread Set Memory
Example Operation
(54) Returning now to the operation of the scheduling logic 204,
(55) Initialisation
(56) As shown in
(57) In embodiments, the value of the thread set field may be stored in a manner other than a bit mask such as, for example, a pointer to memory which indicates the thread set. In embodiments, rather than each entry in the thread set memory being associated with a particular thread set, instead each thread may be individually identified as an entry in the thread set table. Optionally, in this arrangement, the thread set value may be omitted because the memory stores the information by thread individually. The instruction pointer field 250 indicates the next instruction to be scheduled for each thread in the thread set. Given that this is the root thread set, the value of the parent pointer field 270 is set to X to indicate a null value and the depth counter is set to 0.
(58) Turning now to
(59) In embodiments, the thread sets may be scheduled from one or more different warps. Where different warps are utilized, then the thread set warp is indicated in the thread set memory 205 using, for example, a warp field or other annotation. Alternatively, where multiple warps are reported, then each warp may be provided with its own thread set memory or some memory.
(60) Thread DivergenceCreation of Branches 1 and 2
(61) When the root thread set reaches the ENTER_REGION instruction 502, and this instruction is executed at step S20 by stage S2 of the execution unit 203, then this it is indicated to the control flow unit 202 which determines at step S30 that it has encountered a divergence/reconvergence-related instruction. In this example, it is determined that an ENTER_REGION instruction has been encountered and so at step S40, the value of the depth counter field 280 is increased by 1 to 1 and the instruction continues to be executed.
(62) The occurrence of the control flow instruction 503, it is detected at step S30 and the instruction is continued to be executed at step S50.
(63) At step S60 it is determined that the instruction pointer value for the instruction within lane L0 diverges from the instruction pointer value for lanes L1 to L3.
(64) At step S70, the thread for lane L0 is formed into one group and the threads for lanes L1 to L3 are formed into another group.
(65) Accordingly, at step S80, a new entry is made in the instruction thread set memory 205 for each group, in this example entry 1 and entry 2. Entry 1 includes an instruction pointer value of 32 (the next instruction 504 for that thread (sub)set), a thread set value of 1000 (indicating just a single thread in the thread (sub)set), a parent value of 0 (indicating the root thread set) and depth counter value of 0. Entry 2 has an instruction pointer value of 132 (the next instruction 505 for that thread (sub)set), a thread set value of 0111 (indicating three threads in the thread (sub)set), a parent pointer value of 0 (indicating the root thread (sub)set) and a depth counter value of 0. In embodiments, the parent pointer may be omitted. In embodiments, a child thread counter, counting the number of threads in the thread set of the parent node prior to divergence, may be provided. In this example, the child thread counter for entry 0 would be set to 4 (since the thread set of entry 0 was set to 1111 prior to divergence). The thread set value of the parent (the root thread set) is cleared to 0000 and the instruction pointer value of the parent is nulled.
(66) Now the scheduling logic 204 will cease to schedule the root thread set since it is no longer schedulable, but will now instead schedule both of the thread (sub)sets identified by entries 1 and 2 in the thread set memory 205. In fact the scheduling logic 204 is free to schedule any thread (sub)set which has no further sub-sets (which is analogous to leaf nodes as illustrated in
(67) Hence, the thread (sub)set identified by entry 1 and then the thread (sub)set identified by entry 2 in the instruction thread set memory 205 are scheduled so that thread (sub)set identified by entry 1 flows through the pipeline stages of lane 0, followed by thread (sub)set identified by entry 2 in another pipelined stage of lanes L1 to L3. In other words, the thread (sub)set identified by entry 1 is issued to lane 0 of stage S0 first and, in a subsequent cycle, the thread (sub)set identified by entry 2 is issued to lanes L1 to L3 of stage S0 so that the two (sub)sets are then concurrently processable (albeit at different stages of the pipeline). This should be contrasted to existing arrangements which would need to wait until thread (sub)set identified by entry 1 has reached a reconvergence point prior to the thread (sub)set identified by entry 2 being issued to the pipeline. This concurrent processing provides for significant performance improvements.
(68) The scheduling logic 204 continues to schedule instructions for the thread (sub)set identified by entry 1 to lanes L1 to L3 and for the thread (sub)set identified by entry 2.
(69) Potential Thread DivergenceNo Branch
(70) After a sequence of instructions (denoted by parallel lines in
(71) The occurrence of the control flow instruction 506, it is detected at step S30 and the instruction is continued to be executed at step S50.
(72) At step S60, it is determined that all threads in the (sub)set are at or are set to the same instruction pointer value.
(73) The scheduling logic 204 continues to schedule instructions for the thread (sub)set identified by entry 1 to lanes L1 to L3 and for the thread (sub)set identified by entry 2.
(74) Further Thread DivergenceCreation of Branches X1, Y and X
(75) When the thread (sub)set identified by entry 2 reaches the ENTER_REGION instruction 507, and this instruction is executed at step S20 by stage S2 of the execution unit 203, then this it is indicated to the control flow unit 202 which determines at step S30 that it has encountered a divergence/reconvergence-related instruction. In this example, it is determined that an ENTER_REGION instruction has been encountered and so at step S40, the value of the depth counter field 280 is increased by 1 to 2 and the instruction continues to be executed.
(76) The occurrence of the control flow instruction 508 is detected at step S30 and the instruction is continued to be executed at step S50.
(77) At step S60 it is determined that the instruction pointer value for the instruction within lane L1 diverges from the instruction pointer value for lane L2, which also diverges from the instruction pointer value for lane L3.
(78) At step S70, the thread for lane L1 is formed into one group, thread for lane L2 is formed into one group and the thread for lane L3 is formed into one group.
(79) Accordingly, at step S80, a new entry is made in the thread set memory 205 for each group, in this example entry 3, entry 4, and entry 5.
(80) Entry 3 includes an instruction pointer value of 254 (the next instruction 509 for that thread (sub)set), a thread set value of 0100 (indicating just a single thread in the thread (sub)set), a parent value of 2 and depth counter value of 0. Entry 4 has an instruction pointer value of 154 (the next instruction 604 for that thread (sub)set), a thread set value of 0010 (indicating just a single thread in the thread (sub)set), a parent pointer value of 2 and a depth counter value of 0. Entry 5 has an instruction pointer value of 330 (the next instruction 510 for that thread (sub)set), a thread set value of 0001 (indicating just a single thread in the thread (sub)set), a parent pointer value of 2 and a depth counter value of 0. The thread set value of the parent (the thread set identified by entry 2) is cleared to 0000 and the instruction pointer value of the parent is nulled. In this example, the child thread counter for entry 2 would be set to 3 (since the thread set of entry 2 was set to 0111 prior to divergence).
(81) Now the scheduling logic 204 will cease to schedule the root thread set and the thread (sub)set identified by entry 2 since these are no longer schedulable, but will now instead schedule all of the thread (sub)sets identified by entries 1, 3, 4 and 5 in the thread set memory 205 (the unblocked, leaf nodes).
(82) Hence, the scheduling logic 204 remains free to schedule the thread sub-set for branch 1 (identified by entry 1 in the thread set memory 201), branch X1 (identified by entry 3 in thread set memory 201), branch Y (identified by entry four in the thread set memory 201) and/or branch X (identified by entry 5 in the thread set memory 201).
(83) Branch Y Reconvergence
(84) When executing branch Y, an EXIT_REGION instruction 604 is identified by the control flow unit 202 at step S90. The control flow unit 202 determines from the thread set memory 205 that the depth counter value for the corresponding entry (entry 4) is set at 0. Accordingly, at step S100, entry 4 is deleted from the thread set memory 201 and the threads identified by the thread set value are set in the parent entry. In this example, the thread set value in entry 2 is set to 0010 to indicate that the thread being executed by lane L2 has reached a reconvergence point.
(85) At step S105 the content of the parent entry is considered (in this example, entry 2). At step S110 it is determined that there are still thread sets pointing to the parent entry (entry 2) and so the parent entry still has child nodes and so it is still not possible for the scheduling logic 204 to schedule that thread set (entry 2) for execution. This can be done by either checking the parent pointer value in every entry within the thread set memory to see if they refer to entry 2, by storing a mask which indicates the thread set value for this entry prior to any further divergence occurring (in this case 0111) or by comparing the number of threads that have reached a reconvergence point (i.e. threads 0010 indicated by the thread set value=1 thread) with the child node counter of the parent entry (which is set at 3).
(86)
(87) As can be seen from the thread set memory 205 and the schematic hierarchical node representation of the content of the thread set memory 205, the thread sets identified by entries 0 and 2 cannot be scheduled for processing and are blocked. The thread sets of entries 1, 3 and 5 have no other entries pointing to them since they are leaf nodes and so can be freely selected for scheduling by the scheduling logic 204.
(88) Accordingly, at step S10, the scheduling logic 204 schedules each thread set separately for concurrent execution within different pipeline stages of the execution unit 205. At step S20, each scheduled thread set is executed.
(89) Branch x1 Reconvergence
(90) At step S30, it is identified that the thread (sub)set identified by entry 3 reaches the EXIT_REGION instruction 604.
(91) The control flow unit 202 determines from the thread set memory 205 that the depth counter value for the corresponding entry (entry 3) is set at 0. Accordingly, at step S100, entry 3 is deleted from the thread set memory 201 and the threads identified by the thread set value are set in the parent entry. In this example, the thread set value in entry 2 is set to 0110 to indicate that the thread being executed by lane L1 has reached a reconvergence point.
(92) At step S105 the content of the parent entry is considered (in this example, entry 2). At step S110 it is determined that there are still thread sets pointing to the parent entry (entry 2) and so the parent entry still has child nodes and so it is still not possible for the scheduling logic 204 to schedule that thread set (entry 2) for execution. This can be done by either checking the parent pointer value in every entry within the thread set memory to see if they refer to entry 2, by storing a mask which indicates the thread set value for this entry prior to any further divergence occurring (in this case 0111) or by comparing the number of threads that have reached a reconvergence point (i.e. threads 0110 indicated by the thread set value=2 threads) with the child node counter of the parent entry (which is set at 3).
(93) Accordingly, the thread sets identified by entries 0 and 2 cannot be scheduled for processing and are blocked. The thread sets of entries 1 and 5 have no other entries pointing to them since they are leaf nodes and so can be freely selected for scheduling by the scheduling logic 204.
(94) Branch X Reconvergence
(95) At step S30, it is identified that the thread (sub)set identified by entry 5 reaches the EXIT_REGION instruction 604.
(96) The control flow unit 202 determines from the thread set memory 205 that the depth counter value for the corresponding entry (entry 5) is set at 0. Accordingly, at step S100, entry 5 is deleted from the thread set memory 201 and the threads identified by the thread set value are set in the parent entry. In this example, the thread set value in entry 2 is set to 0111 to indicate that the thread being executed by lane L3 has reached a reconvergence point.
(97) At step S105 the content of the parent entry is considered (in this example, entry 2). At step S110 it is determined that there are no thread sets pointing to the parent entry (entry 2) and so the parent entry has no child nodes. This can be done by either checking the parent pointer value in every entry within the thread set memory to see if they refer to entry 2, by storing a mask which indicates the thread set value for this entry prior to any further divergence occurring (again in this case 0111) or by comparing the number of threads that have reached a reconvergence point (i.e. threads 0111 indicated by the thread set value=3 threads) with the child node counter of the parent entry (which is set at 3).
(98) Accordingly, at step S120 the instruction pointer value of the parent entry (entry 2) is updated to indicate the instruction following the instruction pointer value which was stored by entry 5 (155) and the annotation is removed and processing proceeds to step S90.
(99) At step S90, the control flow unit 202 identifies that the depth counter value of the parent entry (entry 2) is set to 2 and so decreases the depth counter value to 1 at step S130.
(100) Accordingly, the control flow unit 202 is now free to schedule the thread sets identified by either or both entries 1 and 2.
(101) Potential Reconvergence
(102) During the execution of the thread set identified by entry 2, a further EXIT_REGION instruction 605a is identified at step S30.
(103) At step S90, the control flow unit 202 identifies that the depth counter value is set to 1 and so decreases the depth counter value for entry 2 to 0 at step S130. The presence of the depth counter value merely indicated that a potential divergence was identified previously by the ENTER_REGION instruction 505a but that no actual divergence occurred. Accordingly, the presence of the EXIT_REGION instruction 605a cannot be considered to indicate a reconvergence until the counter has been restored to its initial value of 0.
(104) The control flow unit 202 is still free to schedule the thread sets identified by either or both entries 1 and 2.
(105) Branch 2 Reconvergence
(106) The control flow unit 202 continues to schedule the thread set identified by entry 2 until the EXIT_REGION instruction 608 is encountered. At this stage, the control flow unit 202 identifies that the depth counter for the thread set associated with entry 2 has a value of 0. Accordingly, the entry is deleted and any bits in the thread set value which are set at 1 are also set to 1 in the entry identified by the parent pointer (in this case entry 0). Accordingly, entry 0 now has a thread set value of 0111.
(107) At step S105 the content of the parent entry is considered (in this example, entry 0). At step S110 it is determined that there are still thread sets pointing to the parent entry (entry 0) and so the parent entry still has child nodes and so it is still not possible for the scheduling logic 204 to schedule that thread set (entry 0) for execution. This can be done by either checking the parent pointer value in every entry within the thread set memory to see if they refer to entry 0, by storing a mask which indicates the thread set value for this entry prior to any further divergence occurring (in this case 1111) or by comparing the number of threads that have reached a reconvergence point (i.e. threads 0111 indicated by the thread set value=3 threads) with the child node counter of the parent entry (which is set at 4).
(108) Branch 1 ReconvergenceRoot Thread Set Reformation
(109) Accordingly, the schedule logic 204 can only continue to schedule the thread set associated with entry 1 until the EXIT_REGION instruction 608 is encountered. At this point, the control flow unit 202 identifies at step S90 that the depth counter value for entry 1 is 0 and so sets any bits of the thread set value for entry 1 which are set to 1 to also be 1 in the thread set value of entry 0. Accordingly, the thread set value for entry 0 is now 1111 and entry 1 is deleted.
(110) At step S105 the content of the parent entry is considered (in this example, entry 0).
(111) The control flow unit 202 determines that no other entries within the thread set memory 201 point to entry 0 either by checking the parent pointer value of every entry to see if they refer to entry 0, by comparing the thread set value for entry 0 with a thread set mask value indicating the value of the thread set when the entry was first made (in this case also having the value 1111) or by comparing the number of threads that have reached a reconvergence point (i.e. threads 1111 indicated by the thread set value=4 threads) with the child node counter of the parent entry (which is set at 4). The control flow unit 202 updates the instruction pointer value for entry 0 to be the instruction after that which was held by entry 1 (in this case the value 191) and removes the annotation that the instructions associated with entry 0 cannot be scheduled for processing.
(112) At step S90 it is determined that the depth counter for entry 0 is set at 1 and so it is decreased to 0 at step S130.
(113) Accordingly, it can be seen that the root thread set has been reconverged and that the scheduler is now able to schedule this thread set when desired. Also, it can be seen that through this approach it is possible to schedule any thread set which is not pointed to by any other thread set for execution simultaneously within the pipeline stages along with any other such thread set. This allows for the simultaneous execution of such leaf node thread sets, which increases the performance of the processor significantly.
(114) Unstructured Code Operation
(115) When processing potentially unstructured code, STORE_REGION and RESTORE_REGION instructions will have been inserted into the instruction set, as mentioned above.
(116) At step S200, the execution of a STORE_REGION instruction is identified.
(117) At step S210, the content of the entry for the thread set currently being executed which encounters the STORE_REGION instruction is stored in another location. For example, the content of the entry can be stored in another part of the thread set memory 205 or stored in another memory.
(118) At step S220 processing of the STORE_REGION instruction is complete. Accordingly, it can be seen that the state of the entry for the thread set is stored in memory for later use.
(119)
(120) At step S230, the RESTORE_REGION instruction is reported.
(121) At step S240, the content of the entry for the thread set being executed which encountered the RESTORE_REGION instruction is loaded from the location it was stored at as a result of the corresponding STORE_REGION instruction.
(122) At step S250, it is determined whether the threads in the thread set currently being executed which encountered the RESTORE_REGION instruction match the threads in the thread set that is being loaded from the location mentioned above. For example, if the thread set field 260 of the entry loaded from the location had the value 0111, then the control flow unit 202 will determine whether the value of the thread set field 260 for the thread set which encountered the RESTORE_REGION instruction also has the value of 0111. If the two thread set fields 260 match, then processing proceed to step S260 where the current entry in the thread set memory 205 is overwritten with all the content of the entry retrieved from the location mentioned above, but the instruction pointer value remains unchanged. Hence, this restores the table entry back to the state it was when the STORE_REGION instruction was encountered, but retains the current position in the instruction stream.
(123) If it is determined at step S250 that not all of the threads of thread set that is being loaded from the location mentioned above are being executed, then processing proceeds to step S270. For example, this would happen if the thread set field 260 of the entry loaded from the location mentioned had the value 0111 but the thread set field 260 for the thread set which encountered the RESTORE_REGION instruction has the different value of 0001.
(124) Accordingly, at step S270, the entry in the thread set memory 205 in which the thread set field 260 matches the thread set field 260 of the entry loaded from the location is identified. From this identified entry, the threads in the thread set field 260 of the executing thread set are removed. For example, in the above-mentioned case with the thread set field 260 loaded from the location having a value of 0111 and the thread set field 260 from the executing threads having a value of 0001, the entry is identified in the thread set memory having a thread set field 260 matching 0111. From this identified entry the reference to the executing thread set, represented by 0001, is removed from its thread set field 260. Accordingly, the thread set field 260 of the identified entry is changed from 0111 to 0110 in order to remove the reference to the executing thread set 0001.
(125) The reference to the executing threads set (i.e. 0001) is also removed from all sub-sets of the identified thread set memory entry, excluding the thread set executing the RESTORE_REGION instruction.
(126) The current entry in the thread set memory 205 of the thread set executing the RESTORE_REGION instruction is overwritten with all the content of the entry loaded from the aforementioned location, but the instruction pointer value and thread set entry 260 remain unchanged. This restores the table entry back to the state it was when the STORE_REGION instruction was encountered, but only for the threads executing RESTORE_REGION, and retains the current position in the instruction stream. This effectively decouples the threads executing RESTORE_REGION from other thread sets and thereby enables restoring the stored region state without interfering with the reconvergence of these other threads.
(127) At step S280, processing of the RESTORE_REGION instruction completes.
Other Embodiments
(128) In one embodiment, those threads with the highest depth counter are scheduled in preference to those with a lower depth counter. In those embodiments where the parent pointer is omitted then the depth counter may be inherited by a child node from the parents node and then incremented. Again, with this arrangement the threads with the higher depth counter are scheduled in preference to those with a lower depth counter.
(129) In one embodiment, a parent-child relationship may be determined such that a child's parent is the thread set that originally contained all of the child's threads and the fewest additional threads. This requires each entry in the thread set memory 205 to store the thread set field 260 prior to further divergence occurring. This can be in the form of a mask which indicates the thread set value for this entry prior to any further divergence occurring, the pre-divergence mask. For example, when determining the parent for the executing thread set 0001 with three other entries in the thread set memory 205 with the pre-divergence masks of 1000, 0111 and 1111, the entry with pre-divergence mask 0111 is the parent as it contains the thread set 0001 and the fewest additional threads (i.e. 0111 contains 0001 and 0111 contains two additional threads).
(130) In one embodiment, at least some of the content of the thread set memory may be stored within the pipeline stages of the execution unit 203 itself.
(131) In one embodiment, a region ID is stored on each divergence. With this approach, EXIT_REGION instructions where the executing thread set depth counter is set to 0 are only executed if the EXIT_REGION instruction has a matching ID.
(132) In one embodiment, a block bit field is provided within the thread set memory to indicate that a thread set is not schedulable.
(133) Although illustrative embodiments of the invention have been disclosed in detail herein, with reference to the accompanying drawings, it is understood that the invention is not limited to the precise embodiment and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims and their equivalents.
(134) Aspects and embodiments of the invention are set out in the following numbered paragraphs. It is to be understood that the invention encompasses these aspects
(135) Paragraph 1. A data processing apparatus comprising: a. a plurality of parallel lanes for parallel processing of sets of threads, each lane comprising a plurality of pipelined stages, said pipelined stages of each lane being operable to process instructions from said sets of threads; and b. scheduling logic operable to schedule instructions for processing by said lanes, said scheduling logic being operable to identify that one of said sets of threads being processed is to be split into a plurality of sub-sets of threads and to schedule at least two of said plurality of sub-sets of threads for processing by different pipelined stages concurrently.
(136) Paragraph 2. The data processing apparatus of paragraph 1, wherein each thread within said set of threads scheduled for processing shares a common instruction pointer value identifying at least one common instruction for parallel processing by said lanes.
(137) Paragraph 3. The data processing apparatus of paragraph 1 or 2, wherein said scheduling logic comprises storage operable to store an indication of said set of threads and each of said sub-sets of threads.
(138) Paragraph 4. The data processing apparatus of paragraph 3, wherein said indication of said set of threads and each of said sub-sets of threads comprises an instruction pointer value associated with each thread.
(139) Paragraph 5. The data processing apparatus of any preceding paragraph, wherein said scheduling logic is operable to identify that said set of threads is to be split into said plurality of sub-sets of threads when different instructions are identified for parallel processing by the same stage within said lanes.
(140) Paragraph 6. The data processing apparatus of any preceding paragraph, wherein said scheduling logic is operable to identify that said set of threads is to be split into said plurality of sub-sets of threads when a divergence in instruction pointer value occurs for at least one thread of said set of threads.
(141) Paragraph 7. The data processing apparatus of any preceding paragraph, wherein said scheduling logic is operable to include, in each sub-set of threads, those threads sharing a common instruction pointer value identifying at least one common instruction for parallel processing by said lanes.
(142) Paragraph 8. The data processing apparatus of any preceding paragraph, wherein said scheduling logic is operable to identify that a possible divergence in instruction pointer value may occur which is identified by a divergence indicator within a thread.
(143) Paragraph 9. The data processing apparatus of paragraph 8, wherein said scheduling logic is operable to perform an adjustment of a reconvergence counter for said set of threads from an initial value on an occurrence of said divergence indicator.
(144) Paragraph 10. The data processing apparatus of any preceding paragraph, wherein said scheduling logic is operable to identify recursively that one of said sub-sets of threads being processed is to be split into a further plurality of sub-sets of threads and to schedule at least two sub-sets of threads for processing by different pipelined stages concurrently.
(145) Paragraph 11. The data processing apparatus of any preceding paragraph, wherein said scheduling logic is operable to schedule any at least two sub-sets of threads which have not themselves been split recursively into further sub-sets for processing by different pipelined stages concurrently.
(146) Paragraph 12. The data processing apparatus of any preceding paragraph, wherein said scheduling logic is operable to prevent said set of threads from being scheduled for processing until said plurality of sub-sets of threads have reconverged again to share a common instruction pointer value.
(147) Paragraph 13. The data processing apparatus of any one of paragraphs 10 to 12, wherein said scheduling logic is operable to reform a sub-set of threads when every further sub-set split from that sub-set of threads has reconverged again to share a common instruction pointer value.
(148) Paragraph 14. The data processing apparatus of any one of paragraphs 9 to 13, wherein said scheduling logic is operable to reverse said adjustment of said reconvergence counter for said sub-set of threads when a possible reconvergence identified by a reconvergence indicator occurs within that sub-set of threads.
(149) Paragraph 15. The data processing apparatus of any one of paragraphs 9 to 14, wherein said scheduling logic is operable to determine that a sub-set of threads has reached a possible reconvergence when said reconvergence counter returns to said initial value.
(150) Paragraph 16. The data processing apparatus of any one of paragraphs 8 to 15, comprising logic operable to annotate an instruction stream comprising said instructions to provide said divergence and reconvergence indicators.
(151) Paragraph 17. The data processing apparatus of any one of paragraphs 8 to 16, wherein said divergence and reconvergence indicators identify at least one of a single-entry single-exit region and a region of unstructured code.
(152) Paragraph 18. The data processing apparatus of any one of paragraphs 8 to 17, wherein said scheduling logic is operable, on occurrence of a store indicator within a set of threads, to store in alternative storage contents of storage associated with said set of threads.
(153) Paragraph 19. The data processing apparatus of any one of paragraphs 8 to 18, wherein said scheduling logic is operable, on occurrence of a restore indicator, to determine whether said set of threads within which said restore indicator occurred matches said set of threads whose contents are stored in said alternative storage and, if so, to overwrite contents in said storage associated with said set of threads with said contents from said alternative storage except for said instruction pointer value.
(154) Paragraph 20. The data processing apparatus of any one of paragraphs 8 to 19, wherein said scheduling logic is operable, on occurrence of a restore indicator, to determine whether said set of threads within which said restore indicator occurred matches said set of threads whose contents are stored in said alternative storage and, if not, to remove an indication of said set of threads within which said restore indicator occurred from said contents and from contents associated with sub-sets of threads of said set of threads within which said reconvergence restore indicator occurred prior to overwriting contents in said storage associated with said set of threads with said contents from said alternative storage except for said instruction pointer value and said indication of said set of threads.
(155) Paragraph 21. A data processing method of scheduling instructions for processing by a data processing apparatus comprising a plurality of parallel lanes for parallel processing of sets of threads, each lane comprising a plurality of pipelined stages, said pipelined stages of each lane being operable to process instructions from said sets of threads, said method comprising: c. identifying that one of said sets of threads being processed is to be split into a plurality of sub-sets of threads; and d. scheduling at least two of said plurality of sub-sets of threads for processing by different pipelined stages concurrently.
(156) Paragraph 22. A computer program product operable, when executed on a computer, to perform the method steps of paragraph 21.