Opportunistic consumer instruction steering based on producer instruction value prediction in a multi-cluster processor
11327763 · 2022-05-10
Assignee
Inventors
- Arthur Perais (Morrisville, NC)
- Shivam Priyadarshi (Morrisville, NC)
- Yusuf Cagatay Tekmen (Raleigh, NC, US)
- Rami Mohammad Al Sheikh (Morrisville, NC)
- Vignyan Reddy KOTHINTI NARESH (Redmond, WA, US)
Cpc classification
G06F9/3836
PHYSICS
G06F9/3806
PHYSICS
G06F9/3828
PHYSICS
International classification
Abstract
Opportunistic consumer instruction steering based on producer instruction value prediction in a multi-cluster processor is disclosed. A processor provides producer instructions and consumer instructions to a steering circuit that steers the program instructions to clusters of instruction execution circuits. An input value provided to a consumer instruction may be a produced value of a producer instruction, creating a dependency. The steering circuit steers a producer instruction to a first cluster and, in response to receiving the consumer instruction and the predicted value of the producer instruction, provides the predicted value to at least a second cluster and steers the consumer instruction to the second cluster for execution with the predicted value as the input value. A consumer instruction can be executed in a different cluster than a producer instruction without a cluster-to-cluster latency penalty, which allows the instruction loads to be better balanced among the clusters for higher processor throughput.
Claims
1. A multi-cluster processor, comprising: a plurality of clusters, each cluster comprising a plurality of instruction execution circuits configured to execute program instructions comprising producer instructions and consumer instructions; a value predictor circuit configured to generate a predicted value of a producer instruction, the predicted value comprising a prediction of a produced value of the producer instruction; and a steering circuit configured to: receive the producer instruction; in response to receiving the producer instruction: steer the producer instruction to a first cluster among the plurality of clusters for execution; receive a predicted value comprising a prediction of the produced value of the producer instruction; determine one or more second cluster among the plurality of clusters to which to make the predicted value available; and make the predicted value available to the one or more second cluster among the plurality of clusters; receive a consumer instruction that depends on the produced value of the producer instruction as an input value; and in response to receiving the consumer instruction: determine to steer the consumer instruction to the one or more second cluster among the plurality of clusters; and steer the consumer instruction to the one or more second cluster of the plurality of clusters for execution using the predicted value as the input value.
2. The multi-cluster processor of claim 1, wherein the steering circuit is further configured to make the predicted value available to the first cluster.
3. The multi-cluster processor of claim 1, wherein the steering circuit is configured to: receive a second consumer instruction that depends on the produced value as the input value; determine to steer the second consumer instruction to the one or more second cluster among the plurality of clusters; and steer the second consumer instruction to the one or more second cluster among the plurality of clusters.
4. The multi-cluster processor of claim 1, further configured to receive the consumer instruction in a same cycle as the producer instruction.
5. The multi-cluster processor of claim 4, wherein: each cluster of the plurality of clusters further comprises a scheduler circuit configured to schedule instructions to the plurality of instruction execution circuits in the cluster; and the steering circuit configured to make the predicted value available to the one or more second cluster is further configured to store the predicted value in the scheduler circuit of the second cluster.
6. The multi-cluster processor of claim 1, further configured to receive the consumer instruction in a later cycle after a first cycle in which the producer instruction is received.
7. The multi-cluster processor of claim 1, wherein: each cluster of the plurality of clusters further comprises a plurality of physical registers; and the steering circuit configured to make the predicted value available to the one or more second cluster is further configured to store the predicted value in one of the plurality of physical registers of the first cluster and in one of the plurality of physical registers of each of the one or more second cluster.
8. The multi-cluster processor of claim 7, wherein the steering circuit is further configured to store the predicted value in one of the plurality of physical registers in a third cluster among the plurality of clusters.
9. The multi-cluster processor of claim 7, further comprising: a register alias table (RAT) configured to associate an architected register with one of the plurality of physical registers in each of the plurality of clusters, wherein the steering circuit is further configured to update the RAT to associate an architected register corresponding to the input value of the consumer instruction with the one of the plurality of physical registers in which the predicted value is stored in the first cluster and the one or more second cluster among the plurality of clusters.
10. The multi-cluster processor of claim 9, wherein the steering circuit is further configured to: access the RAT to identify clusters among the plurality of clusters in which the predicted value is stored in the physical register associated with the architected register corresponding to the input value; and determine that the one or more second cluster is among the identified clusters.
11. A method of a steering circuit in a multi-cluster processor comprising a value predictor circuit, the method comprising: receiving, in the steering circuit, a producer instruction; in response to receiving the producer instruction: steering the producer instruction to a first cluster among a plurality of clusters for execution; receiving a predicted value comprising a prediction of the produced value of the producer instruction; determining one or more second clusters among the plurality of clusters to which to make the predicted value available; making the predicted value available to the determined one or more second cluster among the plurality of clusters; receiving, in the steering circuit, a consumer instruction that depends on the produced value of the producer instruction as an input value; and in response to receiving the consumer instruction: determining to steer the consumer instruction to the one or more second cluster among the plurality of clusters; and steering the consumer instruction to the one or more second cluster of the plurality of clusters for execution using the predicted value as the input value.
12. The method of claim 11, further comprising making the predicted value available to the first cluster.
13. The method of claim 11, further comprising: receiving a second consumer instruction that depends on the produced value as the input value; determining to steer the second consumer instruction to the one or more second clusters among the plurality of clusters; and steering the second consumer instruction to the one or more second cluster among the plurality of clusters.
14. The method of claim 11, further comprising receiving the consumer instruction in a same cycle as the producer instruction.
15. The method of claim 14, wherein: each cluster of the plurality of clusters comprises a scheduler circuit configured to schedule instructions to the cluster; and making the predicted value available to the second cluster further comprises storing the predicted value in the scheduler circuit of the second cluster.
16. The method of claim 11, further comprising receiving the consumer instruction in a later cycle after a first cycle in which the producer instruction is received.
17. The method of claim 16, wherein: each cluster of the plurality of clusters further comprises a plurality of physical registers; and making the predicted value available to the second cluster further comprises storing the predicted value in one of the plurality of physical registers of the first cluster and in one of the physical registers of each of the one or more second cluster.
18. The method of claim 17, further comprising storing the predicted value in one of the plurality of physical registers in a third cluster among the plurality of clusters.
19. The method of claim 17, further comprising: updating a register alias table (RAT) to associate an architected register corresponding to the input value of the consumer instruction with the one of the plurality of physical registers in which the predicted value is stored in the first cluster and in the one or more second cluster among the plurality of clusters.
20. The method of claim 19, further comprising: accessing the RAT to identify clusters among the plurality of clusters in which the predicted value is stored in the physical register associated with the architected register corresponding to the input value; and determining that the one or more second cluster is among the identified clusters.
21. The multi-cluster processor of claim 1, further comprising a central physical register file comprising a plurality of physical registers, wherein: each cluster of the plurality of clusters comprises access to the central physical register file; and the steering circuit configured to make the predicted value available to the one or more second cluster is further configured to store the predicted value in the central physical register file.
22. The method of claim 11, wherein: each cluster of the plurality of clusters comprises access to a central physical register file comprising a plurality of physical registers; and making the predicted value available to the one or more second cluster among the plurality of clusters further comprises storing the predicted value in the central physical register file.
Description
BRIEF DESCRIPTION OF THE DRAWING FIGURES
(1) The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.
DETAILED DESCRIPTION
(10) Exemplary aspects disclosed herein include opportunistic consumer instruction steering based on producer instruction value prediction in a multi-cluster processor. The processor provides groups of program instructions to a steering circuit that steers the program instructions to a plurality of clusters in the processor for execution. Each of the clusters includes a plurality of instruction execution circuits or pipelines for executing program instructions. The program instructions include producer instructions that generate produced values and consumer instructions that require an input value for execution. An input value provided to a consumer instruction may be a produced value of a producer instruction, making the consumer instruction dependent on the producer instruction. The consumer instruction may be steered to a different cluster than the producer instruction on which it depends to balance cluster loads, but there is a cluster-to-cluster latency when passing the produced value from one cluster to another. The processor also includes a value predictor circuit for generating a predicted value, which is a prediction of the produced value of the producer instruction, before the producer instruction is executed. The steering circuit steers a producer instruction to a first cluster and, in response to receiving the consumer instruction and the predicted value of the producer instruction, provides the predicted value to at least a second cluster and steers the consumer instruction to the second cluster for execution with the predicted value as the input value. In this manner, a consumer instruction can be executed in a different cluster than a producer instruction without a cluster-to-cluster latency penalty, and this allows the instruction loads to be better balanced among the clusters for higher processor throughput.
(11) Before discussing an exemplary multi-cluster processor that includes a steering circuit configured to steer a producer instruction to a first cluster and opportunistically steer a consumer instruction to a second cluster in response to receiving the consumer instruction and a predicted value of an input value starting at
(12) In this regard,
(13) The fetched instructions 106 include instructions that use (“consume”) output values generated (“produced”) by previous instructions and also produce output values that will be consumed by subsequent instructions. An instruction may be referred to as both a producer instruction if it generates a produced value and a consumer instruction if it consumes produced values of producer instructions. In this context, however, the designation of a producer instruction and a consumer instruction identifies a relationship between two instructions.
(14) The instruction pipelines I.sub.0-I.sub.N are provided across different processing circuits or stages of the instruction processing circuit 104 to pre-process and process the fetched instructions 106 in a series of steps that can be performed concurrently to increase throughput prior to execution of the fetched instructions 106 by the functional units 110(0)-110(U). A control flow prediction circuit 120 (e.g., a branch prediction circuit) is also provided in the instruction processing circuit 104 in the processor 102 in
(15) In this example, the decoded instructions 106 are placed in one or more of the instruction pipelines I.sub.0-I.sub.N and are next provided to a rename circuit 124 in the instruction processing circuit 104. The rename circuit 124 is configured to determine if any register names in the decoded instructions 106 need to be renamed to break any register dependencies that would prevent parallel or out-of-order processing. The instruction processing circuit 104 includes a value predictor circuit 126 used for dataflow speculation to make predictions of produced values that will be produced by producer instructions. Dataflow speculation generates predicted values to improve performance by allowing a consumer instruction to be executed sooner based on a level of confidence in the predicted value. Value predictions may be employed in clustered and non-clustered processors.
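The text does not say how the value predictor circuit 126 forms its predictions. As a minimal sketch only, assuming a simple last-value scheme gated by a saturating confidence counter, the idea of releasing a predicted value only at a sufficient level of confidence might look as follows. The class names, the table keyed by program counter, and the threshold values are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    value: int = 0       # last value produced at this program counter
    confidence: int = 0  # saturating confidence counter


class LastValuePredictor:
    """Hypothetical last-value predictor; an illustration, not circuit 126."""

    def __init__(self, threshold=3, max_confidence=7):
        self.threshold = threshold            # assumed confidence threshold
        self.max_confidence = max_confidence  # assumed saturation point
        self.table = {}                       # program counter -> Entry

    def predict(self, pc):
        """Return a predicted value, or None when confidence is too low."""
        entry = self.table.get(pc)
        if entry is not None and entry.confidence >= self.threshold:
            return entry.value
        return None

    def train(self, pc, produced):
        """Update the entry after the producer has actually executed."""
        entry = self.table.setdefault(pc, Entry(value=produced))
        if entry.value == produced:
            # Same value seen again: grow confidence up to the maximum.
            entry.confidence = min(entry.confidence + 1, self.max_confidence)
        else:
            # Different value: remember it and start over.
            entry.value = produced
            entry.confidence = 0
```

In this sketch a producer must yield the same value several times in a row before `predict` returns anything, so a consumer would only be released early when the prediction has a track record.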
(16) In examples herein, the rename circuit 124 identifies a physical register 128 to be associated with a logical destination register of a producer instruction in a rename alias table 130. When a predicted value of the produced value of the producer instruction is available, an RACC circuit 132 writes the predicted value to the identified physical register 128 associated with the logical destination register. The RACC circuit 132 then allows the predicted value to be obtained from the physical registers 128 by a consumer instruction that can use the predicted value as an input value. Using the predicted value as the input value, rather than waiting for the producer instruction to generate the produced value, the consumer instruction may be executed out of order in one of the functional units 110(0)-110(U) with a high degree of confidence.
(17) The rename circuit 124 is configured to call upon a rename alias table 130 to rename a logical source register operand and/or write a destination register operand of a decoded instruction 106 to available physical registers P.sub.0, P.sub.1, . . . , P.sub.X in physical registers 128 of a physical register file. The rename alias table 130 contains a plurality of register mapping entries 134(0)-134(P) each mapped to (i.e., associated with) a respective logical register R.sub.0-R.sub.P which are architected registers of the processor 102. The register mapping entries 134(0)-134(P) are each configured to store respective mapping information for the corresponding logical registers R.sub.0-R.sub.P to a physical register P.sub.0-P.sub.X in the physical registers 128. Each physical register P.sub.0-P.sub.X is configured to store a data entry 136(0)-136(X) for the source and/or destination register operand of a decoded instruction 106.
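The mapping described above can be sketched in software, under the assumption that a free list supplies physical registers in index order. The class and method names below are hypothetical, and the sketch models only the mapping entries 134 and data entries 136, not the rename circuit 124 itself.

```python
class RenameAliasTable:
    """Hypothetical model of logical-to-physical register mapping."""

    def __init__(self, num_logical, num_physical):
        self.mapping = [None] * num_logical    # logical index -> physical index
        self.free = list(range(num_physical))  # assumed free list, in order
        self.data = [None] * num_physical      # data entry per physical register

    def rename_destination(self, logical):
        """Allocate a fresh physical register for a destination operand."""
        if not self.free:
            raise RuntimeError("no free physical register")
        physical = self.free.pop(0)
        self.mapping[logical] = physical
        return physical

    def rename_source(self, logical):
        """Look up the physical register currently mapped to a source operand."""
        return self.mapping[logical]


rat = RenameAliasTable(num_logical=4, num_physical=8)
first = rat.rename_destination(1)   # first writer of R1
rat.data[first] = 99                # e.g. a predicted value written early
reader = rat.rename_source(1)       # a consumer of R1 reads the same register
second = rat.rename_destination(1)  # a later writer of R1 gets a new register
```

Because the second writer of the same logical register receives a different physical register, the earlier value stays readable, which is how renaming removes false register dependencies.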
(18)
(19)
(20) In the example in
(21) In the example in
(22) The instruction processing circuit 314 may be the instruction processing circuit 104 in
(23) In other examples, the number and/or capabilities of the functional units 312 in each cluster 302 may vary, which will affect the policies used by the steering circuit 304 for distributing instructions 306, but such variations are within the scope of the exemplary aspects disclosed herein.
(24) In the example in
(25) The steering circuit 304 in
(26)
(27) With further reference to
(28)
(29)
(30) In the example shown, the processor 300 may include multiple clusters 302A-302C (not shown). Upon receiving the predicted value 308 corresponding to the producer instruction I0.sup.P1, the steering circuit 304 may steer the producer instruction I0.sup.P1 to cluster 302A, and provide the predicted value 308 to each of clusters 302B and 302C.
(31) In the second cycle, a producer instruction I6.sup.P2 and a consumer instruction I7.sup.C1 which depends on the producer instruction I0.sup.P1 are received in steering group 2. The steering circuit 304 determines that the predicted value 308 for the producer instruction I0.sup.P1 is already available to clusters 302B, 302C and steers the consumer instruction I7.sup.C1 to, for example, cluster 302B. As a result, the consumer instruction I7.sup.C1 is able to begin execution immediately using the predicted value 308 as an input value. This avoids the need to wait for the producer instruction I0.sup.P1 to complete execution in cluster 302A, which can take several cycles depending on the instruction type, and avoids the cluster-to-cluster latency that would be incurred if the consumer instruction I7.sup.C1, executing in a different cluster than the producer instruction I0.sup.P1, had to wait for the produced value to be passed from cluster 302A.
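The timing benefit can be illustrated with a small calculation. This is a sketch under stated assumptions: the function name, the cycle numbers, and the two-cycle cluster-to-cluster latency are hypothetical, and recovery from an incorrect prediction is not modeled.

```python
def earliest_consumer_start(steer_cycle, producer_done_cycle, same_cluster,
                            has_prediction, cross_cluster_latency=2):
    """Earliest cycle at which a consumer's input value is available.

    All cycle counts are illustrative assumptions, not figures from the text.
    """
    if has_prediction:
        # The predicted value is already in the consumer's cluster.
        return steer_cycle
    # Otherwise wait for the produced value, plus forwarding delay if the
    # producer executed in a different cluster.
    forwarding = 0 if same_cluster else cross_cluster_latency
    return max(steer_cycle, producer_done_cycle + forwarding)


with_prediction = earliest_consumer_start(2, 5, same_cluster=False,
                                          has_prediction=True)
other_cluster = earliest_consumer_start(2, 5, same_cluster=False,
                                        has_prediction=False)
same_cluster = earliest_consumer_start(2, 5, same_cluster=True,
                                       has_prediction=False)
```

With these assumed numbers the consumer can start in cycle 2 when the predicted value is present, in cycle 5 when it waits in the producer's own cluster, and in cycle 7 when it waits in another cluster.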
(32) Also, in steering group 2, the producer instruction I6.sup.P2 may be steered to cluster 302A, 302B, or 302C because each cluster 302 is capable of receiving multiple instructions 306 per cycle. In response to receiving the predicted value 308 for the producer instruction I6.sup.P2, the steering circuit 304 provides the predicted value 308 to at least one, and up to all, of the clusters 302A, 302B, and 302C in anticipation of consumer instructions 306 that depend on producer instruction I6.sup.P2. Instructions I4 and I5 are not dependent on producer instruction I0.sup.P1 or producer instruction I6.sup.P2.
(33) In a third cycle, steering group 3 includes another consumer instruction I9.sup.C1 that is a consumer instruction 306C dependent on the producer instruction I0.sup.P1. The steering circuit 304 is able to determine that the predicted value 308 for producer instruction I0.sup.P1 is available in any of clusters 302A-302C and steers the consumer instruction to one of those clusters 302 for execution using the predicted value 308. Instructions I8, I10, and I11 are not dependent on producer instruction I0.sup.P1 or producer instruction I6.sup.P2.
(34) In a fourth cycle, steering group 4 includes another consumer instruction I12.sup.C1 that is a consumer instruction 306C dependent on the producer instruction I0.sup.P1 and also includes consumer instruction I14.sup.C2 that is a consumer instruction 306C dependent on the producer instruction I6.sup.P2. The steering circuit 304 is able to determine that the predicted value 308 for producer instruction I0.sup.P1 is available in clusters 302A, 302B, and 302C and steers the consumer instruction I12.sup.C1 to one of these clusters 302 for execution using the predicted value 308. The steering circuit 304 is able to determine that the predicted value 308 for producer instruction I6.sup.P2 is available in clusters 302A, 302B, and 302C and steers the consumer instruction I14.sup.C2 to one of these clusters 302 for execution using the predicted value 308. Instructions I13 and I15 are not dependent on producer instruction I0.sup.P1 or producer instruction I6.sup.P2.
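The four steering groups described above can be traced with a short sketch. Only the producer and consumer instructions named in the text are modeled; the remaining instructions in each group are omitted, the set of clusters receiving each predicted value is passed in as a parameter, and the data layout and function name are hypothetical.

```python
# Steering group number -> {instruction: (role, producer it depends on)}.
GROUPS = {
    1: {"I0": ("producer", None)},
    2: {"I6": ("producer", None), "I7": ("consumer", "I0")},
    3: {"I9": ("consumer", "I0")},
    4: {"I12": ("consumer", "I0"), "I14": ("consumer", "I6")},
}


def trace(groups, broadcast_to):
    """List each consumer with the clusters already holding its input value."""
    available = {}  # producer name -> clusters given its predicted value
    steered = []
    for cycle in sorted(groups):
        # Consumers see predicted values provided in earlier cycles.
        for name, (role, dep) in groups[cycle].items():
            if role == "consumer":
                steered.append((cycle, name, dep, available.get(dep, ())))
        # Producers in this group make their predicted values available.
        for name, (role, _) in groups[cycle].items():
            if role == "producer":
                available[name] = tuple(broadcast_to)
    return steered


result = trace(GROUPS, broadcast_to=("302B", "302C"))
```

Each tuple in `result` names the cycle, the consumer, its producer, and the clusters to which the consumer could be steered without waiting for the produced value.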
(35) Although the steering circuit 304 can determine which clusters 302 have been provided a predicted value 308 and avoid a cluster-to-cluster latency by steering a consumer instruction 306C to one of such clusters 302, the steering circuit 304 may also choose to steer the consumer instruction 306C to a cluster 302 that has not been provided the predicted value 308, recognizing that the cluster-to-cluster latency penalty will be incurred.
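One possible steering policy consistent with this choice is sketched below. It is an assumption for illustration: the text does not specify how the steering circuit 304 weighs load against latency, and the function name, the load counts, and the capacity limit are hypothetical.

```python
def choose_cluster(loads, clusters_with_value, capacity):
    """Pick a cluster for a consumer instruction.

    loads: dict of cluster name -> instructions currently queued (assumed).
    Returns (cluster, has_prediction); cluster is None if every cluster is full.
    """
    # Prefer the least-loaded cluster that already holds the predicted value.
    candidates = [c for c in clusters_with_value if loads[c] < capacity]
    if candidates:
        return min(candidates, key=lambda c: (loads[c], c)), True
    # Otherwise fall back to any cluster with room, accepting that the
    # consumer will wait for the produced value there.
    others = [c for c in loads if loads[c] < capacity]
    if not others:
        return None, False
    return min(others, key=lambda c: (loads[c], c)), False


balanced = choose_cluster({"302A": 3, "302B": 1, "302C": 2},
                          {"302B", "302C"}, capacity=4)
fallback = choose_cluster({"302A": 3, "302B": 4, "302C": 4},
                          {"302B", "302C"}, capacity=4)
```

In the first call the consumer goes to the lightly loaded cluster 302B that holds the predicted value; in the second, both such clusters are full, so the sketch falls back to cluster 302A without the prediction.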
(36) As described above, the steering circuit 304 provides predicted values 308 to one or more clusters 302 in response to receiving the predicted value 308 for a producer instruction 306P. To do so, as shown in
(37) The illustration in
(38) In operation, when a producer instruction 306P is received, the RAT 600 may be updated by the steering circuit 304 to associate an architected register corresponding to the input value for the consumer instruction 306C with one of the plurality of physical registers 316 in which the predicted value 308 is stored in the clusters 302A-302D. When a consumer instruction 306C is received in a steering group, the RAT 600 may be accessed to retrieve the association of the architected register to a physical register 316 to determine the cluster 302 to which a consumer instruction 306C should be steered. Reclamation of architected registers occurs when instructions are committed, in accordance with conventional RAT operation.
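The lookup described above can be sketched as a table keyed by architected register, with one physical register entry per cluster that holds the predicted value. The class name, method names, register names, and physical register numbers are hypothetical, and reclamation at commit is not modeled.

```python
class ClusterRAT:
    """Hypothetical model of a RAT that tracks physical registers per cluster."""

    def __init__(self, clusters):
        self.clusters = list(clusters)
        self.entries = {}  # architected register -> {cluster: physical register}

    def record_prediction(self, arch_reg, placements):
        """Note where a producer's predicted value has been stored."""
        self.entries[arch_reg] = dict(placements)

    def clusters_holding(self, arch_reg):
        """Clusters whose physical registers hold the value for arch_reg."""
        held = self.entries.get(arch_reg, {})
        return [c for c in self.clusters if c in held]

    def physical_register(self, arch_reg, cluster):
        """Physical register a consumer in the given cluster would read."""
        return self.entries[arch_reg][cluster]


rat = ClusterRAT(["302A", "302B", "302C", "302D"])
rat.record_prediction("R3", {"302B": 12, "302C": 5})
targets = rat.clusters_holding("R3")
```

Here `targets` lists the clusters to which a consumer reading the assumed register R3 could be steered, and `physical_register` gives the register it would read in the chosen cluster.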
(39)
(40) In the example in
(41)
(42) The processor 802 and the system memory 810 are coupled to the system bus 812 and can intercouple peripheral devices included in the processor-based system 800. As is well known, the processor 802 communicates with these other devices by exchanging address, control, and data information over the system bus 812. For example, the processor 802 can communicate bus transaction requests to a memory controller 814 in the system memory 810 as an example of a slave device. Although not illustrated in
(43) Other devices can be connected to the system bus 812. As illustrated in
(44) The processor-based system 800 in
(45) While the computer-readable medium 832 is shown in an exemplary embodiment to be a single medium, the term “computer-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that stores the one or more sets of instructions. The term “computer-readable medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the processing device and that causes the processing device to perform any one or more of the methodologies of the embodiments disclosed herein. The term “computer-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical medium, and magnetic medium.
(46) The embodiments disclosed herein include various steps. The steps of the embodiments disclosed herein may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, the steps may be performed by a combination of hardware and software.
(47) The embodiments disclosed herein may be provided as a computer program product, or software, that may include a machine-readable medium (or computer-readable medium) having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the embodiments disclosed herein. A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes: a machine-readable storage medium (e.g., ROM, random access memory (“RAM”), a magnetic disk storage medium, an optical storage medium, flash memory devices, etc.); and the like.
(48) Unless specifically stated otherwise and as apparent from the previous discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing,” “computing,” “determining,” “displaying,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data and memories represented as physical (electronic) quantities within the computer system's registers into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices.
(49) The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatuses to perform the required method steps. The required structure for a variety of these systems will appear from the description above. In addition, the embodiments described herein are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the embodiments as described herein.
(50) Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the embodiments disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer-readable medium and executed by a processor or other processing device, or combinations of both. The components of the processor-based systems described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends on the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
(51) The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic device, a discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. Furthermore, a controller may be a processor. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
(52) The embodiments disclosed herein may be embodied in hardware and in instructions that are stored in hardware, and may reside, for example, in RAM, flash memory, ROM, Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
(53) It is also noted that the operational steps described in any of the exemplary embodiments herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary embodiments may be combined. Those of skill in the art will also understand that information and signals may be represented using any of a variety of technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips, that may be referenced throughout the above description, may be represented by voltages, currents, electromagnetic waves, magnetic fields, or particles, optical fields or particles, or any combination thereof.
(54) Unless otherwise expressly stated, it is in no way intended that any method set forth herein be construed as requiring that its steps be performed in a specific order. Accordingly, where a method claim does not actually recite an order to be followed by its steps, or it is not otherwise specifically stated in the claims or descriptions that the steps are to be limited to a specific order, it is in no way intended that any particular order be inferred.
(55) It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the spirit or scope of the invention. Since modifications, combinations, sub-combinations and variations of the disclosed embodiments incorporating the spirit and substance of the invention may occur to persons skilled in the art, the invention should be construed to include everything within the scope of the appended claims and their equivalents.