OPTIMIZATION OF CAPTURED LOOPS IN A PROCESSOR FOR OPTIMIZING LOOP REPLAY PERFORMANCE
20230205535 · 2023-06-29
Inventors
Cpc classification
International classification
Abstract
Optimization of captured loops in a processor for optimizing loop replay performance, and related methods and computer-readable media are disclosed. The processor includes a loop buffer circuit configured to detect loops. In response to a detected loop, the loop buffer circuit is configured to capture loop instructions in the detected loop and replay the captured loop instructions in the instruction pipeline to be processed and executed for subsequent iterations of the loop. The loop buffer circuit is configured to determine if loop optimizations are available to be made based on a captured loop to enhance performance of loop replay. If the loop buffer circuit determines loop optimizations are available to be made based on a captured loop, the loop buffer circuit is configured to perform such loop optimizations so that such loop optimizations can be realized when the captured loop is replayed to enhance replay performance of the captured loop.
Claims
1. A processor comprising, an instruction processing circuit configured to process an instruction stream comprising a plurality of instructions in an instruction pipeline; and a loop buffer circuit configured to: detect a loop comprising a plurality of loop instructions among the plurality of instructions in the instruction stream; in response to detection of the loop in the instruction stream: capture the plurality of loop instructions of the detected loop as a captured loop; determine, based on the captured loop, if a loop optimization is available to be made for the captured loop; and in response to determining the loop optimization is available to be made for the captured loop, modify the captured loop to produce an optimized loop; determine if the captured loop is to be replayed in the instruction pipeline; and in response to determining the captured loop is to be replayed in the instruction pipeline, insert the optimized loop in the instruction pipeline to be replayed.
2. The processor of claim 1, wherein the loop buffer circuit comprises: a loop detection circuit configured to detect the loop comprising the plurality of loop instructions among the plurality of instructions in the instruction stream in the instruction pipeline to be executed; a loop capture circuit configured to capture the plurality of loop instructions of the detected loop as the captured loop; a loop optimization circuit configured to: determine if the loop optimization is available to be made for the captured loop, based on the captured loop; and in response to determining the loop optimization is available to be made for the captured loop, modify the captured loop to produce the optimized loop; and a loop replay circuit configured to, in response to determining the captured loop is to be replayed in the instruction pipeline, insert the optimized loop in the instruction pipeline to be replayed.
3. The processor of claim 1, further comprising a loop buffer memory comprising a plurality of instruction entries each configured to store an instruction among the plurality of instructions; wherein the loop buffer circuit is configured to: capture the plurality of loop instructions of the detected loop as the captured loop by being configured to: store each loop instruction among the plurality of loop instructions in an instruction entry among the plurality of instructions entries in the loop buffer memory; determine if the loop optimization is available to be made based on the captured loop by being configured to: access the plurality of loop instructions for the captured loop in the plurality of instruction entries in the loop buffer memory; and determine, based on the accessed plurality of loop instructions for the captured loop in the plurality of instruction entries in the loop buffer memory, if the loop optimization is available to be made for the captured loop; in response to determining the loop optimization is available to be made for the captured loop, modify at least one instruction entry among the plurality of instruction entries in the loop buffer memory for the captured loop to produce the optimized loop; and in response to determining the captured loop is to be replayed in the instruction pipeline, insert the optimized loop from the loop buffer memory in the instruction pipeline to be replayed.
4. The processor of claim 1, wherein the loop buffer circuit is configured to: determine if the loop optimization is available to be made for the captured loop, based on the captured loop by being configured to: determine if at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed while maintaining the same function of the at least one loop instruction when executed; and in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed while maintaining the same function of the at least one loop instruction when executed, transform the at least one loop instruction among the plurality of loop instructions in the captured loop to produce the optimized loop.
5. The processor of claim 3, wherein the loop optimization circuit is configured to: determine if the loop optimization is available to be made for the captured loop by being configured to determine if at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed while maintaining the same function of the at least one loop instruction when executed; and in response to determining at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed while maintaining the same function of the at least one loop instruction when executed, modify the at least one instruction entry among the plurality of instruction entries in the loop buffer memory to produce the optimized loop.
6. The processor of claim 4, wherein the loop buffer circuit is configured to: determine if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed by being configured to determine if at least two loop instructions among the plurality of loop instructions in the captured loop can be fused into at least one fused instruction that has the same function of the at least two loop instructions when executed; and in response to determining the at least two loop instructions among the plurality of loop instructions can be fused into the at least one fused instruction that has the same function of the at least two loop instructions when executed, fuse the at least two loop instructions among the plurality of loop instructions in the captured loop to produce the optimized loop.
7. The processor of claim 4, wherein the loop buffer circuit is configured to: determine if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed by being configured to determine if at least one loop instruction among the plurality of loop instructions in the captured loop can be fused with itself in the captured loop when the captured loop is executed in at least one subsequent iteration of the captured loop; and in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop can be fused with itself in the captured loop when the captured loop is executed in at least one subsequent iteration of the captured loop, identify the at least one loop instruction among the plurality of loop instructions in the captured loop to not be replayed on at least one subsequent iteration of the execution of captured loop to produce the optimized loop.
8. The processor of claim 4, wherein the loop buffer circuit is configured to: determine if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed by being configured to determine if the at least one loop instruction among the plurality of loop instructions in the captured loop is loop invariant to the captured loop; and in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop is loop invariant to the captured loop, remove the at least one loop instruction among the plurality of loop instructions determined to be loop invariant from the captured loop to produce the optimized loop.
9. The processor of claim 4, wherein the loop buffer circuit is configured to: determine if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed by being configured to determine if the at least one loop instruction among the plurality of loop instructions in the captured loop can be modified to at least one alternative instruction with the same function as the at least one loop instruction and executed in less clock cycles than the at least one loop instruction; and in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop can be modified to at least one alternative instruction with the same function as the at least one loop instruction and can be executed in less clock cycles than the at least one loop instruction, transform the at least one loop instruction among the plurality of loop instructions in the captured loop to the at least one alternative instruction to produce the optimized loop.
10. The processor of claim 4, wherein the loop buffer circuit is configured to: determine if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed by being configured to determine if the at least one loop instruction among the plurality the loop instructions in the captured loop is a critical instruction; and in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop is a critical instruction, set a scheduling priority indicator associated with the critical instruction to cause the critical instruction to be scheduled for execution at a higher priority in the instruction pipeline when the optimized loop is inserted in the instruction pipeline to be replayed as the optimized loop.
11. The processor of claim 4, further comprising a loop buffer memory comprising a plurality of instructions entries each configured to store an instruction among the plurality of instruction, each instructions entry among the plurality of instructions entries comprising a scheduling priority indicator; wherein the loop buffer circuit is configured to: capture the plurality of loop instructions of the detected loop as the captured loop by being configured to: store each loop instruction among the plurality of loop instructions in an instruction entry among the plurality of instructions entries in the loop buffer memory; determine if the at least one loop instruction among the plurality the loop instructions in the captured loop is a critical instruction by being configured to: access the plurality of loop instructions for the captured loop in the plurality of instruction entries in the loop buffer memory; and determine, based on the accessed plurality of loop instructions for the captured loop in the plurality of instruction entries in the loop buffer memory, if the instruction among the plurality of loop instructions for the captured loop is the critical instruction; and in response to determining the instruction among the plurality of loop instructions for the captured loop is the critical instruction, set the scheduling priority indicator in the instruction entry associated with the critical instruction among the plurality of instruction entries in the loop buffer memory to cause the critical instruction to be scheduled for execution at a higher priority in the instruction pipeline when the optimized loop is inserted in the instruction pipeline to be replayed as the optimized loop.
12. The processor of claim 10, wherein the loop buffer circuit is configured to determine if the at least one loop instruction among the plurality the loop instructions in the captured loop is a critical instruction, by being configured to determine if the at least one loop instruction among the plurality the loop instructions in the captured loop is a critical load instruction.
13. The processor of claim 10, wherein the loop buffer circuit is configured to determine if the at least one loop instruction among the plurality the loop instructions in the captured loop is a critical instruction, by being configured to determine if the at least one loop instruction among the plurality the loop instructions in the captured loop is an unlocking instruction.
14. The processor of claim 1, wherein the loop buffer circuit is configured to: determine if the loop optimization is available to be made for the captured loop, based on the captured loop by being configured to: determine if an instruction execution slice is present among the plurality of loop instructions in the captured loop; and in response to determining the instruction execution slice is present among the plurality of loop instructions in the captured loop, create the optimized loop by being configured to: identify the instruction execution slice among the plurality of loop instructions in the captured loop; and in response to determining the captured loop is to be replayed in the instruction pipeline, insert the optimized loop in the instruction pipeline to be replayed by being configured to: create at least one pre-fetch instruction representing the identified instruction execution slice in the captured loop; insert the at least one pre-fetch instruction in a pre-fetch stage in the instruction pipeline to be executed; and insert the other plurality of instructions in optimized loop not identified as the instruction execution slice in the instruction pipeline to be executed.
15. The processor of claim 14, further comprising a loop buffer memory comprising a plurality of instructions entries each configured to store an instruction among the plurality of instructions; wherein the loop buffer circuit is configured to: capture the plurality of loop instructions of the detected loop as the captured loop by being configured to: store each loop instruction among the plurality of loop instructions in an instruction entry among the plurality of instructions entries in the loop buffer memory; determine if the instruction execution slice is present among the plurality of loop instructions in the captured loop by being configured to: access the plurality of loop instructions for the captured loop in the plurality of instruction entries in the loop buffer memory; and determine, based on the accessed plurality of loop instructions for the captured loop in the plurality of instruction entries in the loop buffer memory, if the instruction execution slice is present among the plurality of loop instructions in the captured loop in the loop buffer memory.
16. The processor of claim 15, wherein: each instruction entry among the plurality of instruction entries in the loop buffer entry comprises a pointer field configured to store a pointer; and the loop buffer circuit is configured to: in response to determining the instruction execution slice is present among the plurality of loop instructions in the captured loop in the loop buffer memory, create the optimized loop by being configured to: identify the instruction execution slice among the plurality of loop instructions in the captured loop to create the optimized loop, by being configured to set a pointer in a pointer field in at least one instruction entry among the plurality of instruction entries in the loop buffer memory associated with the instruction execution slice; and in response to determining the captured loop is to be replayed in the instruction pipeline, insert the optimized loop in the instruction pipeline to be replayed by being configured to: create at least one pre-fetch instruction representing the instruction execution slice in the captured loop based on accessing a pointer in a pointer field for at least one instruction of the instruction execution slice in the at least one instruction entry among the plurality of instruction entries in the loop buffer memory; insert the at least one pre-fetch instruction in a pre-fetch stage in the instruction pipeline to be executed; and insert the other plurality of instructions in the optimized loop not identified as the instruction execution slice in the instruction pipeline to be executed.
17. The processor of claim 14, wherein the loop buffer circuit is further configured to: determine if the captured loop is to be replayed in the instruction pipeline in a regular replay mode; and in response to determining the captured loop is to be replayed in the instruction pipeline in a regular replay mode: insert the optimized loop in the instruction pipeline to be replayed.
18. The processor of claim 14, wherein the instruction processing circuit is further configured to execute the inserted at least one pre-fetch instruction in the instruction pipeline as at least one non-architectural instruction.
19. The processor of claim 1, wherein the instruction processing circuit further comprises: an instruction fetch circuit configured to fetch the plurality of instructions into the instruction pipeline as the instruction stream to be executed; and an execution circuit configured to execute the plurality of instructions in the instruction stream.
20. A method of replaying an optimized loop based on a captured loop in an instruction pipeline in a processor, comprising: detecting a loop comprising a plurality of loop instructions among the plurality of instructions in an instruction stream comprising a plurality of instructions in an instruction pipeline; in response to detection of the loop in the instruction stream: capturing the plurality of loop instructions of the detected loop as a captured loop; determining, based on the captured loop, if a loop optimization is available to be made for the captured loop; and modifying the captured loop to produce an optimized loop, in response to determining the loop optimization is available to be made for the captured loop; determining if the captured loop is to be replayed in the instruction pipeline; and inserting the optimized loop in the instruction pipeline to be replayed, in response to determining the captured loop is to be replayed in the instruction pipeline.
21. The method of claim 20, wherein capturing the plurality of loop instructions of the detected loop as the captured loop comprises store each loop instruction among the plurality of loop instructions in an instruction entry among a plurality of instructions entries in a loop buffer memory; determining if the loop optimization is available to be made based on the captured loop comprising: accessing the plurality of loop instructions for the captured loop in the plurality of instruction entries in the loop buffer memory; and determining, based on the accessed plurality of loop instructions for the captured loop in the plurality of instruction entries in the loop buffer memory, if the loop optimization is available to be made for the captured loop; modifying at least one instruction entry among the plurality of instruction entries in the loop buffer memory for the captured loop to produce the optimized loop, in response to determining the loop optimization is available to be made for the captured loop; and inserting the optimized loop from the loop buffer memory in the instruction pipeline to be replayed, in response to determining the captured loop is to be replayed in the instruction pipeline.
22. The method of claim 20, wherein: determining if the loop optimization is available to be made for the captured loop, based on the captured loop comprises: determining if at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed while maintaining the same function of the at least one loop instruction when executed; and modifying at least one instruction entry among the plurality of instruction entries in the loop buffer memory for the captured loop to produce the optimized loop comprises transforming the at least one loop instruction among the plurality of loop instructions in the captured loop to produce the optimized loop, in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed while maintaining the same function of the at least one loop instruction when executed.
23. The method of claim 22, wherein: determining if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed comprises determining if at least two loop instructions among the plurality of loop instructions in the captured loop can be fused into at least one fused instruction that has the same function of the at least two loop instructions when executed; and transforming the at least one loop instruction among the plurality of loop instructions in the captured loop to produce the optimized loop comprises fusing the at least two loop instructions among the plurality of loop instructions in the captured loop to produce the optimized loop, in response to determining the at least two loop instructions among the plurality of loop instructions can be fused into the at least one fused instruction that has the same function of the at least two loop instructions when executed.
24. The method of claim 22, wherein: determining if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed comprises determining if at least one loop instruction among the plurality of loop instructions in the captured loop can be fused with itself in the captured loop when the captured loop is executed in at least one subsequent iteration of the captured loop; and transforming the at least one loop instruction among the plurality of loop instructions in the captured loop to produce the optimized loop comprises identifying the at least one loop instruction among the plurality of loop instructions in the captured loop to not be replayed on at least one subsequent iteration of the execution of captured loop to produce the optimized loop, in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop can be fused with itself in the captured loop when the captured loop is executed in at least one subsequent iteration of the captured loop.
25. The method of claim 22, wherein: determining if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed comprises determining if the at least one loop instruction among the plurality of loop instructions in the captured loop is loop invariant to the captured loop; and transforming the at least one loop instruction among the plurality of loop instructions in the captured loop to produce the optimized loop comprises removing the at least one loop instruction among the plurality of loop instructions determined to be loop invariant from the captured loop to produce the optimized loop, in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop is loop invariant to the captured loop.
26. The method of claim 22, wherein: determining if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed comprises determining if the at least one loop instruction among the plurality of loop instructions in the captured loop can be modified to at least one alternative instruction with the same function as the at least one loop instruction and executed in less clock cycles than the at least one loop instruction; and transforming the at least one loop instruction among the plurality of loop instructions in the captured loop to produce the optimized loop comprises transforming the at least one loop instruction among the plurality of loop instructions in the captured loop to the at least one alternative instruction to produce the optimized loop, in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop can be modified to at least one alternative instruction with the same function as the at least one loop instruction and can be executed in less clock cycles than the at least one loop instruction.
27. The method of claim 22, wherein the loop buffer circuit is configured to: determining if the at least one loop instruction among the plurality of loop instructions in the captured loop can be transformed comprises determining if the at least one loop instruction among the plurality the loop instructions in the captured loop is a critical instruction; and transforming the at least one loop instruction among the plurality of loop instructions in the captured loop to produce the optimized loop comprises setting a scheduling priority indicator associated with the critical instruction to cause the critical instruction to be scheduled for execution at a higher priority in the instruction pipeline when the optimized loop is inserted in the instruction pipeline to be replayed as the optimized loop, in response to determining the at least one loop instruction among the plurality of loop instructions in the captured loop is a critical instruction.
28. The method of claim 20, wherein: determining if the loop optimization is available to be made for the captured loop, based on the captured loop comprises determining if an instruction execution slice is present among the plurality of loop instructions in the captured loop; modifying the captured loop to produce the optimized loop comprises identifying the instruction execution slice among the plurality of loop instructions in the captured loop, in response to determining the instruction execution slice is present among the plurality of loop instructions in the captured loop; and in response to determining the captured loop is to be replayed in the instruction pipeline, inserting the optimized loop in the instruction pipeline to be replayed by: creating at least one pre-fetch instruction representing the identified instruction execution slice in the captured loop; inserting the at least one pre-fetch instruction in a pre-fetch stage in the instruction pipeline to be executed; and inserting the other plurality of instructions in optimized loop not identified as the instruction execution slice in the instruction pipeline to be executed.
29. The method of claim 28, wherein: determining if the captured loop is to be replayed in the instruction pipeline comprises determining if the captured loop is to be replayed in the instruction pipeline in a regular replay mode; and comprising inserting the optimized loop in the instruction pipeline to be replayed, in response to determining the captured loop is to be replayed in the instruction pipeline in the regular replay mode.
30. A non-transitory computer-readable medium having stored thereon computer executable instructions which, when executed by a processor, cause the processor to replay an optimized loop based on a captured loop in an instruction pipeline in a processor, by causing the processor to: detect a loop comprising a plurality of loop instructions among the plurality of instructions in an instruction stream comprising a plurality of instructions in an instruction pipeline; in response to detection of the loop in the instruction stream: capture the plurality of loop instructions of the detected loop as a captured loop; determine, based on the captured loop, if a loop optimization is available to be made for the captured loop; and modify the captured loop to produce an optimized loop, in response to determining the loop optimization is available to be made for the captured loop; determine if the captured loop is to be replayed in the instruction pipeline; and insert the optimized loop in the instruction pipeline to be replayed, in response to determining the captured loop is to be replayed in the instruction pipeline.
Description
BRIEF DESCRIPTION OF THE DRAWING FIGURES
[0015] The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
DETAILED DESCRIPTION
[0032] Aspects disclosed herein include optimization of captured loops in a processor for optimizing loop replay performance. Related methods and computer-readable media are also disclosed. The processor includes an instruction processing circuit configured to fetch computer program instructions (“instructions”) into an instruction stream in an instruction pipeline(s) to be processed and executed. Loops can be contained in the instruction stream. A loop is a sequence of instructions in the instruction stream that repeats sequentially in a back-to-back arrangement. The instruction processing circuit includes a loop buffer circuit that is configured to detect loops. In response to a detected loop, the loop buffer circuit is configured to capture loop instructions in the detected loop and insert (i.e., replay) the captured loop instructions in the instruction pipeline to be processed and executed for subsequent iterations of the loop. In this manner, the instructions in the loop may have not have to be re-fetched and re-processed, for example, for the subsequent iterations of the loop. In exemplary aspects, the loop buffer circuit is also configured to determine if loop optimizations are available to be made based on a captured loop to enhance performance of the replay of the loop, and to perform such loop optimizations if available. Because the captured loop may contain more instructions for a captured loop than would otherwise be present in the instruction pipeline or a particular pipeline stage for processing at a given time, the processor can use this enhanced visibility of a larger number of instructions in a loop captured in the loop buffer circuit to determine loop optimizations for the loop. These loop optimizations may not be possible to determine otherwise at compile time and/or at run-time based only on the knowledge of the presence of certain instructions of the loop within the instruction pipeline. In this regard, if the loop buffer circuit determines that, if loop optimizations are available to be made based on a captured loop, the loop buffer circuit is configured to modify at least one instruction in the captured loop to produce an optimized loop. The optimized loop can then be replayed in the instruction pipeline when the loop is to be re-processed and re-executed in the instruction pipeline in an iteration(s) so that the loop optimization is realized by the processor.
[0033]
[0034] The processor 200 in
[0035] With continued reference to the processor 200 in
[0036] With continuing reference to
[0037] The instructions 206 in the instruction stream 208 may contain loops. A loop is a sequence of instructions 206 in the instruction stream 208 that repeat (i.e., processed) sequentially in a back-to-back arrangement. A loop can be present in the instruction stream 208 as a result of a programmed software construct that is compiled into a loop among the instructions 206. A loop can also be present in the instruction stream 208 even if not part of a higher-level, programmed, software construct, such as based on binary instructions resulting from compiling of a higher-level, programmed, software construct. If the instructions 206 that are part of a loop could be detected when such instructions 206 are processed in an instruction pipeline I.sub.0-I.sub.N, these instructions 206 could be captured and replayed into the instruction stream 208 in processing stages in an instruction pipeline I.sub.0-I.sub.N without having to re-fetch and/or re-decode such instructions 206, for example, for the subsequent iterations of the loop. Note that a loop can include further internal loops. Thus, a sequence of instructions 206 that is detected and captured as a captured loop can capture one path of a loop and thus appear to be a branch-free loop body that does not have further internal branches. For example, if loop has alternating conditions of branch taken and not taken, two (2) loops can be captured to represent the overall loop.
[0038] In this regard, the instruction processing circuit 204 in this example includes the loop buffer circuit 210 to perform loop buffering. As discussed in more detail below, the loop buffer circuit 210 is configured to detect a loop in instructions 206 fetched into an instruction pipeline I.sub.0-I.sub.N as an instruction stream 208 to be processed and executed. The loop buffer circuit 210 is configured to detect loops among the instructions 206 in the instruction stream 208. In response to a detected loop, the loop buffer circuit 210 is configured to capture (i.e., loop buffer) the instructions 206 in the detected loop to be replayed to avoid or reduce the need to re-fetch the instructions 206 in the detected loop, since the processing of these instructions 206 is repeated in the instruction pipeline I.sub.0-I.sub.N. In this regard, the loop buffer circuit 210 is configured to insert (i.e., replay) the captured loop instructions 206 in an instruction pipeline I.sub.0-I.sub.N for iterations of the loop. In this manner, the instructions 206 in the captured loop do not have to be re-fetched and/or re-decoded, for example, for the subsequent iterations of the loop. Thus, loop buffering can conserve power by the instruction fetch circuit 212 not having to re-fetch the instructions 206 in a detected loop for subsequent iterations of the loop. Loop buffering can also conserve power by the instruction decode circuit 220 not having to re-decode the instructions 206 in a detected loop for subsequent iterations of the loop.
[0039] As discussed in more detail below, the loop buffer circuit 210 is also configured to determine if loop optimizations are available to be made in run-time based on a captured loop to enhance performance of the replay of the loop, and to perform such loop optimizations if available. Because the captured loop may contain more instructions 206 for a captured loop than would otherwise be present in an instruction pipeline I.sub.0-I.sub.N or a particular pipeline stage for processing at a given time, the processor can use this enhanced visibility of a larger number of instructions 206 in a loop captured in the loop buffer circuit 210 to determine loop optimizations for the loop. These loop optimizations may not be possible to determine otherwise at compile time and/or at run-time based only on the knowledge of the presence of certain instructions 206 of the loop within an instruction pipeline I.sub.0-I.sub.N. In this regard, if the loop buffer circuit 210 determines that, if loop optimizations are available to be made based on a captured loop, the loop buffer circuit 210 is configured to modify at least one instruction 206 in the captured loop to produce an optimized loop. The optimized loop can then be replayed in an instruction pipeline I.sub.0-I.sub.N when the loop is to be re-processed and re-executed in the instruction pipeline I.sub.0-I.sub.N in an iteration(s) so that the loop optimization is realized by the processor 200. To effectuate loop optimizations, the loop buffer circuit 210 is configured to cause an optimized loop to be replayed that is injected into the instruction pipeline I.sub.0-I.sub.N in one of a number of stages, including the rename/allocate circuit 222 (e.g., instruction replay), the instruction fetch circuit 212 (e.g., for controlling/pausing new instruction 206 fetching during replay), and the issue circuit 230 (for providing scheduling hints to schedule issuance of replayed instructions 206D).
[0040]
[0041] In response to the loop detection circuit 302 detecting a loop of stored instructions 206D in the instruction stream 208 as a loop (block 404 in
[0042] With continuing reference to
[0043] For example, as discussed in more detail below, certain loop optimizations may be available to be made by the loop optimization circuit 318 based on the captured loop 308 that reduce the number of instructions 206D required to be replayed in the captured loop 308 to still achieve the same functionality of the captured loop 308 when processed in a replay of the captured loop 308 in the instruction processing circuit 204. Also, as discussed in more detail below, other loop optimizations may be available to be made by the loop optimization circuit 318 based on the captured loop 308 that reduce the number of clock cycles required to process and execute a replay of the captured loop 308 in the instruction processing circuit 204, as compared to the number of clock cycles required to execute the replay of the original captured instructions 206D of the captured loop 308 with the same functionality. Also, as discussed in more detail below, other loop optimizations may be available to be made by the loop optimization circuit 318 based on the captured loop 308 that provide for critical instructions, such as timing critical instructions (e.g., load or instructions that are unlocking instructions to unlock dependence flow paths, to be indicated with scheduling hints to be scheduled for execution at a higher priority when replayed in the instruction processing circuit 204). In this manner, such critical instructions may be executed earlier thus making their produced results ready earlier to be consumed by other consumer instructions in the captured loop 308 that are replayed. This can increase the throughput of replaying captured loops 308 in the instruction processing circuit 204.
[0044] Also, as discussed in more detail below, yet other loop optimizations may be available to be made by the loop optimization circuit 318 based on the captured loop 308 that can identify instructions that are load/store operations that can separated from the captured loop 308 as an instruction execution slice. An instruction execution slice in a captured loop is a set of instructions 206D in the captured loop 308 that compute load/store memory addresses needed for memory load/store instructions to be executed in replay of the captured loop 308. The loop optimization circuit 318 can be configured to convert an identified extracted instruction execution slice from a captured loop 308 into a software prefetch instruction(s) that can then be injected into a pre-fetch stage(s) in the instruction pipeline I.sub.0-I.sub.N when the captured loop 308 is replayed to perform the loop optimization for the captured loop 308. The processing of the software prefetch instruction(s) for the instruction execution slice will cause the instruction processing circuit 204 to perform the extracted instructions 206D in the instruction execution slice earlier in the instruction pipeline I.sub.0-I.sub.N as pre-fetch instructions 206. Thus, any resulting cache misses from the memory operations performed by processing the extracted execution slice instructions as pre-fetch instructions 206 can be recovered earlier for consumption by the dependent instructions in the captured loop 308 when the captured loop 308 is replayed.
[0045] With continued reference to
[0046] The loop replay circuit 314 is also coupled to the instruction fetch circuit 212 in this example. This is so that when the loop replay circuit 314 replays a loop, the loop replay circuit 314 can send a loop replay indicator 316 to the instruction fetch circuit 212. The instruction fetch circuit 212 can discontinue fetching of instructions 206D for the captured loop 308 while they are being replayed (inserted) into the instruction pipeline I.sub.0-I.sub.N of the instruction processing circuit 204.
[0047] As discussed above, some captured loops 308 may have an available optimization where instructions 206D in the captured loops 308 can be modified by being removed or combined to optimize the captured loop 308 into an optimized loop 3080 for replay. In this regard,
[0048] The loop optimization circuit 318 in
[0049]
[0050] In this regard, the process steps 602, 604, 606 are the same as process steps 402, 404, 406 in the process 400 in
[0051] Note that the loop buffer circuit 300 can be configured to find producer and consumer pair instructions 206D in a captured loop 308 that can be fused in an optimized loop 3080 to provide a loop optimization. Also note that the loop buffer circuit 300 can also be configured to find producer and consumer pair instructions 206D that occur across different iterations of a captured loop 308 when replayed. For example, the same instruction 206D in captured loop 308 may be both a producer and consumer instruction. Such an instruction 206D be a producer instruction for itself as a consumer instruction in a subsequent iteration of replay of the captured loop 308. Thus, the loop buffer circuit 300 can be configured to identify instructions 206D in a captured loop 308 that can be fused with itself to produce an optimized loop 3080 for replay.
[0052]
[0053] Note that there are other examples of instructions 206D that can be in a captured loop 308 that can be transformed to reduced strength instructions so that the captured loop 308 can be replayed faster and more efficiently. For example, an instruction 206D in a capture loop 308 determined to be an add by zero function could be replaced with a move instruction in an optimized loop 3080.
[0054] As another example, the captured loop 308 may contain an instruction 206D that is loop invariant, meaning that the produced value of execution of such instruction 206D will always be the same for any iteration of the replayed loop. For example, such a loop invariant instruction may be an instruction that stores a constant value to a target register, wherein the target register is not modified by any other producer instruction. In this example, to optimize a captured loop 308 with such a loop invariant instruction 206D, the loop optimization circuit 318 in
[0055] In another exemplary aspect, the loop buffer circuit 300, and its loop optimization circuit 318, in
[0056]
[0057] Thus, in this example, the loop optimization circuit 318 can be configured to determine if the load instruction 800(2) is a producer instruction that is a critical timing instruction to the consumer conditional branch instruction 800(7). The loop optimization circuit 318 can be configured to provide a scheduling hint SH in scheduling priority indicator 802(2) associated with the instruction entry 310(2) that contains the load instruction 800(2) as the optimized loop 3080(3) as shown in
[0058] As another example, the captured loop 308 may contain a critical instruction 206D that is critical as an unlocking instruction 206D between parallel dependence chains within a captured loop 308. For example, a captured loop 308 may contain many independent load instructions 206D or longer-latency instructions 206D that are producer instructions to other consumer instructions. These load instructions 206D or longer-latency instructions 206D that are producer instructions to other consumer instructions are known as critical “unlocking” instructions. Thus, these unlocking instructions 206D could be prioritized to be executed earlier in a replay of a captured loop 308 to realize additional performance from other consumer instructions being able to be issued sooner by the issue circuit 230 in
[0059] In another exemplary aspect, the loop buffer circuit 300, and its loop optimization circuit 318, in
[0060] Thus, as discussed in more detail below, the loop buffer circuit 300 can be configured to extract an identified instruction execution slice identified in the instructions 206D of a captured loop 308. The loop buffer circuit 300 can be configured to convert an identified extracted instruction execution slice into a software prefetch instruction(s) that can then be injected into a pre-fetch stage(s) in the instruction pipeline, such as an instruction pipeline I.sub.0-I.sub.N in the processor 200 in
[0061] In this regard,
[0062] Thus, the loop optimization circuit 318 in
[0063] This is shown in the example processor 1000 in the processor-based system 1002 in
[0064] Note that in one example, the instructions 900(1), 900(3) of the prefetch slice 902 can be removed by the loop optimization circuit 318 from the loop buffer memory 312 altogether such that the remaining instructions 206 to be replayed as normal instructions in the optimized loop 3080(4) are instructions 900(2) and 900(4)-900(6). Alternatively, the loop optimization circuit 318 can leave the instructions 900(1), 900(3) of the instruction execution slice 902 remaining the loop buffer memory 312 as shown in
[0065] For example, as shown in
[0066] Note that the software prefetch instructions 206 of the instruction execution slice 902 can be noted as non-architectural instructions, meaning that the instruction processing circuit 1004 will not allocate resources for the processing of such instructions, such as positions in a reorder buffer, committed mapping table, etc. Thus, work performed in the instruction pipeline I.sub.0-I.sub.N of the instruction processing circuit 1004 in
[0067]
[0068] In this regard, the process steps 1102, 1104, 1106 are the same as process steps 402, 404, 406 in the process 400 in
[0069]
[0070] The processor-based system 1200 may be a circuit or circuits included in an electronic board card, such as a printed circuit board (PCB), a server, a personal computer, a desktop computer, a laptop computer, a personal digital assistant (PDA), a computing pad, a mobile device, or any other device, and may represent, for example, a server, or a user's computer. In this example, the processor-based system 1200 includes the processor 1202. The processor 1202 represents one or more processing circuits, such as a microprocessor, central processing unit, or the like. The processor 1202 is configured to execute processing logic in instructions for performing the operations and steps discussed herein. Fetched or prefetched instructions from a memory, such as from a system memory 1210 over a system bus 1212, are stored in an instruction cache 1208. The instruction processing circuit 1204 is configured to process instructions 1205 fetched into the instruction cache 1208 and process the instructions for execution. These instructions 1205 fetched from the instruction cache 1208 to be processed can include loops that are detected by the loop buffer circuit 1206 for replay based on prediction of one or more loop characteristics as loop characteristic predictions.
[0071] The processor 1202 and the system memory 1210 are coupled to the system bus 1212 and can intercouple peripheral devices included in the processor-based system 1200. As is well known, the processor 1202 communicates with these other devices by exchanging address, control, and data information over the system bus 1212. For example, the processor 1202 can communicate bus transaction requests to a memory controller 1214 in the system memory 1210 as an example of a slave device. The instructions 1205 can also be stored in the system memory 1210 and retrieved from system memory 1210 for execution by the instruction processing circuit 1204. Although not illustrated in
[0072] Other devices can be connected to the system bus 1212. As illustrated in
[0073] The processor-based system 1200 in
[0074] While the non-transitory computer-readable medium 1232 is shown in an exemplary embodiment to be a single medium, the term “computer-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that stores the one or more sets of instructions. The term “computer-readable medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the processing device and that causes the processing device to perform any one or more of the methodologies of the embodiments disclosed herein. The term “computer-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical medium, and magnetic medium.
[0075] The embodiments disclosed herein include various steps. The steps of the embodiments disclosed herein may be formed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware and software.
[0076] The embodiments disclosed herein may be provided as a computer program product, or software, that may include a machine-readable medium (or computer-readable medium) having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the embodiments disclosed herein. A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes: a machine-readable storage medium (e.g., ROM, random access memory (“RAM”), a magnetic disk storage medium, an optical storage medium, flash memory devices, etc.); and the like.
[0077] Unless specifically stated otherwise and as apparent from the previous discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing,” “computing,” “determining,” “displaying,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data and memories represented as physical (electronic) quantities within the computer system's registers into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices.
[0078] The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatuses to perform the required method steps. The required structure for a variety of these systems will appear from the description above. In addition, the embodiments described herein are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the embodiments as described herein.
[0079] Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the embodiments disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer-readable medium and executed by a processor or other processing device, or combinations of both. The components of the distributed antenna systems described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends on the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
[0080] The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic device, a discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. Furthermore, a controller may be a processor. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
[0081] The embodiments disclosed herein may be embodied in hardware and in instructions that are stored in hardware, and may reside, for example, in RAM, flash memory, ROM, Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
[0082] It is also noted that the operational steps described in any of the exemplary embodiments herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary embodiments may be combined. Those of skill in the art will also understand that information and signals may be represented using any of a variety of technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips, that may be references throughout the above description, may be represented by voltages, currents, electromagnetic waves, magnetic fields, or particles, optical fields or particles, or any combination thereof.
[0083] Unless otherwise expressly stated, it is in no way intended that any method set forth herein be construed as requiring that its steps be performed in a specific order. Accordingly, where a method claim does not actually recite an order to be followed by its steps, or it is not otherwise specifically stated in the claims or descriptions that the steps are to be limited to a specific order, it is in no way intended that any particular order be inferred.
[0084] It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the spirit or scope of the invention. Since modifications, combinations, sub-combinations and variations of the disclosed embodiments incorporating the spirit and substance of the invention may occur to persons skilled in the art, the invention should be construed to include everything within the scope of the appended claims and their equivalents.