Architecture for long latency operations in emulated shared memory architectures

10127048 · 2018-11-13

Abstract

A processor architecture arrangement for emulated shared memory (ESM) architectures comprises a number of, preferably a plurality of, multi-threaded processors, each provided with an interleaved inter-thread pipeline, wherein the pipeline comprises a plurality of functional units arranged in series for executing arithmetic, logical and optionally further operations on data, wherein one or more functional units of lower latency are positioned prior to the memory access segment in said pipeline and one or more long latency units (LLU) for executing more complex operations associated with longer latency are positioned operatively in parallel with the memory access segment. In some embodiments, the pipeline may contain multiple branches in parallel with the memory access segment, each branch containing at least one long latency unit.

Claims

1. A processor architecture arrangement for emulated shared memory (ESM) architectures, comprising a number of multi-threaded processors each provided with interleaved inter-thread pipeline having a memory access segment and a plurality of functional units arranged in series for executing arithmetic and logical operations on data, wherein one or more of the functional units have lower latency and are positioned prior to the memory access segment in said pipeline, and wherein one or more of the functional units have long latency for executing more complex operations associated with the functional units requiring longer latency than the functional units requiring lower latency, and wherein the one or more functional units having long latency are positioned operatively in parallel with the memory access segment.

2. The processor architecture arrangement of claim 1, wherein a number of functional units are functionally positioned after the memory access segment in the pipeline.

3. The processor architecture arrangement of claim 1, wherein at least two long latency functional units are chained together, wherein a long latency functional unit is configured to pass operation result to a subsequent unit in the chain as an operand.

4. The processor architecture arrangement of claim 1, wherein one or more functional units of lower latency include at least one ALU for integer arithmetics.

5. The processor architecture of claim 1, wherein the pipeline incorporates at least two parallel branches, each branch including at least one long latency functional unit in parallel with the memory access segment.

6. The processor architecture arrangement of claim 5, wherein at least two branches extend beyond the memory access segment relative to the pipeline, the extensions preceding and/or following the memory access segment.

7. The processor architecture arrangement of claim 6, wherein a number of functional units in a branch are functionally positioned substantially before and/or after the memory access segment.

8. The processor architecture arrangement of claim 1, including at least two long latency functional units in parallel with the memory access segment that are mutually of different complexity in terms of operation execution latency.

9. The processor architecture arrangement of claim 8, wherein a long latency functional unit associated with longer latency is logically located operatively in parallel with an end portion of the memory access segment and after a long latency functional unit associated with shorter latency.

10. The processor architecture arrangement of claim 1, wherein one or more functional units are controlled through a number of operation selection fields of instruction words.

11. The processor architecture arrangement of claim 1, wherein a number of operands for a functional unit are determined in an operand select stage of the pipeline in accordance with a number of operand selection fields given in an instruction word.

12. The processor architecture arrangement of claim 1, wherein at least one long latency functional unit is designed for division, root calculation or an application-specific purpose.

13. The processor architecture arrangement of claim 1, including at least one long latency functional unit configured to execute one or more operations on input data and at least one other long latency functional unit configured to execute one or more other operations on input data.

Description

BRIEF DESCRIPTION OF THE RELATED DRAWINGS

(1) Next, the invention is described in more detail with reference to the appended drawings, in which:

(2) FIG. 1 is a block diagram of a feasible scalable architecture to emulate shared memory on a silicon platform.

(3) FIG. 2 is another representation of a feasible ESM architecture, essentially CMP ESM architecture.

(4) FIG. 3 is a high-level block diagram and pipeline representation of an embodiment of an MCRCW ESM processor.

(5) FIG. 4 illustrates an embodiment of the pipeline architecture in accordance with the present invention.

(6) FIG. 5 illustrates another embodiment of the pipeline architecture in accordance with the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

(7) FIGS. 1-3 were already contemplated hereinbefore in connection with the description of both background and historical data relating to the origin of the present invention.

(8) FIG. 4 illustrates one embodiment of the present invention incorporating an ESM pipeline architecture 400 with a plurality (N) of long latency units (LLU) 402b as well as other functional units (FU) such as (ordinary or lower latency) ALUs 402, 402c. These other functional units, such as ALUs, may be positioned before (402) and after (402c) the memory access segment 412 and thus also before and after the LLUs 402b.

(9) The layout of the functional units 402, 402b, 402c is merely exemplary in the figure and in other embodiments, the positioning, number and nature/latency thereof may diverge from the illustrated one. The functional units 402, 402b, 402c have been provided with unique identifiers in connection with general identifiers A(LU) and LLU to bring forward the fact that the units 402, 402b, 402c may mutually differ, also within the same general type (A/LLU) in terms of structure and/or functionality. However, at least some of the units 402, 402b, 402c may be mutually similar in terms of structure and/or operation.

(10) IF 408 refers to instruction fetch logic, MEM 412a refers to a single memory unit stage typically lasting for a clock cycle, and OS 406 refers to operand selection logic with register file read/write access actions. SEQ 410 refers to a sequencer.

(11) Generally, the operands are selected by the responsible logic 406 in the beginning of the pipeline according to the corresponding operand selection field(s) in the instruction words. The operands may be passed to the functional units via a number of register pipes.

(12) As mentioned hereinbefore, the long latency units 402b may have been designed for executing more complex operations, e.g. division and application-specific operations, and potentially organized as one or more chains of units residing in parallel with the memory unit wait segment and connected in the middle of the overall ALU chain or pipeline structure 414.

(13) By disposing at least some long latency units 402b functionally and temporally in parallel with the memory access segment 412 incorporating a plurality of memory (wait) stages 412a, as shown in the figure, the execution time of long latency operations may be scaled down to a single ESM step. The LLUs 402b may advantageously execute their tasks simultaneously with the memory access operation.
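The timing argument above can be illustrated with a minimal Python sketch that compares the pipeline depth when the long latency units sit in series with the memory access segment against the depth when they run in parallel with it. The function names and all stage counts are hypothetical values chosen for illustration; the disclosure does not specify them.

```python
def serial_depth(pre_alu, mem_stages, llu_latencies, post_alu):
    """Pipeline depth (in stages) if the LLUs were placed in series
    with the memory access segment."""
    return pre_alu + mem_stages + sum(llu_latencies) + post_alu


def parallel_depth(pre_alu, mem_stages, llu_latencies, post_alu):
    """Pipeline depth if the chained LLUs run alongside the memory
    access segment: the middle section lasts as long as the slower path."""
    return pre_alu + max(mem_stages, sum(llu_latencies)) + post_alu


# Hypothetical configuration: 2 ALU stages before the memory segment,
# 12 memory (wait) stages, two chained LLUs of 4 and 6 stages,
# and 2 ALU stages after the memory segment.
cfg = dict(pre_alu=2, mem_stages=12, llu_latencies=[4, 6], post_alu=2)
print(serial_depth(**cfg))    # 26
print(parallel_depth(**cfg))  # 16
```

In this assumed configuration the LLU chain (10 stages) is hidden entirely behind the 12 memory stages, so the parallel placement adds nothing to the pipeline depth.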

(14) Considering the execution process of instructions involving LLU(s) in more detail, the LLUs 402b are preferably controlled with dedicated field(s) in the instruction word just like the rest of the functional units in the ESM. Operands for these operations may be selected in the operand select (OS) stage 406 of the pipeline, or they can alternatively be inherited from the results produced by the ALUs 402 residing in the chain before the memory wait segment 412.

(15) The long latency operations are then executed in the order specified by the placement of and/or connections between the LLUs 402b.

(16) Generally, two or more functional units 402, 402b, 402c (mutually similar or different) such as LLUs or e.g. a combination of ALUs 402, 402c and LLUs 402b may be chained together such that data may be passed from one unit to another. The chained functional units may be configured to execute mutually different operations on the input data (operands).

(17) The results of long latency operations can, for example, be used as operands for the rest of the ALUs 402c or sequencer 410. In order to have full throughput, LLUs 402b shall be internally pipelined.
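The need for internal pipelining can be made concrete with a small Python sketch. It assumes, as a simplification not stated in the text, that one operation is offered to the unit on every clock cycle (for example, one per thread in an interleaved pipeline); the function name and the 6-cycle latency are hypothetical.

```python
def completion_cycles(latency, n_ops, pipelined):
    """Return the cycle at which each of n_ops back-to-back operations completes.

    Operation i is offered to the unit at cycle i (assumed: one per clock).
    """
    if pipelined:
        # Every internal stage accepts a new operation each cycle.
        return [i + latency for i in range(n_ops)]
    # A non-pipelined unit stays busy for `latency` cycles per operation,
    # so later operations must wait for it to finish.
    return [(i + 1) * latency for i in range(n_ops)]


# Hypothetical 6-cycle unit serving four consecutive operations.
print(completion_cycles(6, 4, pipelined=True))   # [6, 7, 8, 9]
print(completion_cycles(6, 4, pipelined=False))  # [6, 12, 18, 24]
```

Under these assumptions the pipelined unit delivers one result per cycle after the initial latency, whereas the non-pipelined unit delivers one result every six cycles.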

(18) As a result, a programmer can apply up to N long latency operations during a single step of execution. These operations can even be dependent on each other if they are placed into a chain of units accordingly. Notably, the suggested solution does not generally increase the length of the processor pipeline. Naturally, the executed memory operation shall be independent of the long latency operations executed meanwhile within a step of execution.
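As a sketch of two dependent long latency operations executed within one step, the following Python example chains a hypothetical divider LLU and a hypothetical root LLU to compute the square root of a quotient. The function `run_llu_chain`, the source label "prev", and the register names are illustrative assumptions and are not taken from the disclosure.

```python
import math


def run_llu_chain(chain, registers):
    """Execute a chain of LLU operations within one step.

    Each entry is (operation, operand_sources). The source "prev" selects
    the result of the preceding unit in the chain; any other string names
    a register chosen in the operand select stage.
    """
    prev = None
    for op, sources in chain:
        args = [prev if s == "prev" else registers[s] for s in sources]
        prev = op(*args)
    return prev


regs = {"r1": 81.0, "r2": 9.0}
chain = [
    (lambda a, b: a / b, ("r1", "r2")),  # divider LLU
    (math.sqrt, ("prev",)),              # root LLU fed by the divider
]
print(run_llu_chain(chain, regs))  # 3.0
```

The memory operation of the same step is deliberately left out of the model, since it must not depend on these results.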

(19) The varying physical dimensions of the depicted entities representing functional units such as LLUs 402b and ALUs 402, 402c indicate that the complexity or latency of the LLUs 402b and/or other functional units 402, 402c applied may mutually vary as well. The areas of the rectangles, or their lengths in the vertical direction, indicate the execution time or latency of the corresponding units, i.e. ALUs 402, 402c associated with shorter latency are depicted as shorter/smaller rectangles than the LLUs 402b.

(20) In some embodiments, a number of functional units such as LLUs 402b may be introduced to the pipeline such that the more complex/more latency-causing unit(s) are located later therein (considering the execution flow) than the unit(s) of lesser complexity/latency. With particular reference to the memory access segment 412, the more complex unit(s) may, for example, be disposed substantially in parallel with the end portion of the segment, preceded by the simpler ones.

(21) FIG. 5 illustrates another embodiment of the pipeline architecture 500 in accordance with the present invention.

(22) Also in this embodiment, IF 408 refers to instruction fetch logic, MEM 412a refers to a single memory unit stage typically lasting for a clock cycle, OS 406 refers to operand selection logic with register file read/write access actions, and SEQ 410 refers to a sequencer.

(23) In this embodiment, the pipeline comprises separate, functionally and logically parallel, branches 500a, 500b of long latency units (LLU) 502a, 502b, respectively, for performing related long latency operations. A branch 500a, 500b may comprise only one LLU or a plurality of LLUs and optionally one or more other functional units (FU) such as multiple ALUs for carrying out operations such as predetermined arithmetic and logical operations on the data provided thereto. The branches may be of limited length and be preceded and/or followed by common, shared, pipeline segment(s).

(24) The parallel branches 500a, 500b of the pipeline may exist in parallel with the memory access segment 412 only (the visualized case), or extend beyond that, thus potentially also preceding or following the memory access segment 412. In some embodiments, the branches may, on the other hand, define a pipeline segment that is shorter than the memory access segment 412. Accordingly, the actual functional units such as LLUs located in the parallel branches 500a, 500b may also be configured as mutually substantially (functionally/temporally) parallel as indicated in the figure, wherein LLUs 502a, 502b have been depicted as fully parallel relative to the pipeline.

(25) By way of example, latency or complexity of each particular functional unit is again depicted in the figure by the size, or length, of the corresponding block. In the shown embodiment, the branches contain an equal number of equally complex (same or similar latency-causing) LLUs 502a, 502b, but a person skilled in the art certainly realizes that in various other feasible embodiments, a number of LLUs within a branch and/or between branches may be of mutually different complexity/latency. In some embodiments, the LLUs positioned in the branches are selected so that the latencies caused by the parallel branches are substantially equal and/or remain within the overall duration of the memory access segment 412. In some embodiments, an LLU with longer latency is disposed later in the pipeline than an LLU with shorter latency.
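The selection criteria mentioned above (branch latency within the duration of the memory access segment, and shorter-latency LLUs placed before longer-latency ones) can be expressed as a simple configuration check. The Python sketch below is illustrative only: the function name and the latency figures are hypothetical, and the branch labels merely reuse the reference numerals of FIG. 5 as dictionary keys.

```python
def validate_branches(branches, mem_stages):
    """Check each parallel branch against the memory access segment.

    `branches` maps a branch label to the latencies (in stages) of its
    LLUs, listed in pipeline order.
    """
    report = {}
    for name, latencies in branches.items():
        total = sum(latencies)
        report[name] = {
            "total": total,
            # Branch completes within the memory access segment.
            "fits": total <= mem_stages,
            # Shorter-latency LLUs precede longer-latency ones.
            "ordered": list(latencies) == sorted(latencies),
        }
    return report


# Hypothetical latencies for two branches and an 8-stage memory segment.
report = validate_branches({"500a": [3, 5], "500b": [4, 4]}, mem_stages=8)
print(report["500a"])  # {'total': 8, 'fits': True, 'ordered': True}
print(report["500b"])  # {'total': 8, 'fits': True, 'ordered': True}
```

Both example branches have equal total latency, which corresponds to the case where the latencies caused by the parallel branches are substantially equal.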

(26) Two or more, optionally all, LLUs 502a, 502b disposed in each branch 500a, 500b may have been chained according to the principles set forth hereinbefore to pass data therebetween, etc. Chaining may increase the obtained performance through exploitation of available virtual instruction-level parallelism.

(27) Generally, the functional units 402, 402b, 402c, 502a, 502b may be controlled by VLIW-style sub-instruction operation fields, for instance. After the target operation has been executed by a functional unit, the result may be made available to the functional unit(s) situated after that unit in the respective chain via elements including e.g. multiplexers controlled by the current instruction word.
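A minimal Python sketch of this control scheme follows. It models an instruction word as a sequence of sub-instructions, each carrying an operation selection field and two multiplexer select indices. The class `SubInstruction`, the operation table `OPS`, and the index-based encoding are assumptions made for illustration and do not reflect an actual instruction format from the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical operation table shared by the modeled functional units.
OPS = {
    "nop": lambda a, b: a,
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a // b,
}


@dataclass(frozen=True)
class SubInstruction:
    op: str               # operation selection field for one functional unit
    src: Tuple[int, int]  # multiplexer selects: indices into available values


def execute(word, operands):
    """Run one instruction word through a chain of functional units.

    `values` starts with the operands from the operand select stage; each
    unit appends its result so later units can pick it via their selects.
    """
    values = list(operands)
    for sub in word:
        a, b = (values[i] for i in sub.src)
        values.append(OPS[sub.op](a, b))
    return values[len(operands):]


word = [
    SubInstruction("add", (0, 1)),  # unit 0: 6 + 7
    SubInstruction("div", (3, 2)),  # unit 1: result of unit 0 divided by 2
    SubInstruction("mul", (4, 4)),  # unit 2: result of unit 1 squared
]
print(execute(word, [6, 7, 2]))  # [13, 6, 36]
```

Here indices 0 to 2 refer to the initially selected operands and indices 3 onward to the results of earlier units in the chain, mimicking multiplexers steered by the current instruction word.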

(28) Ultimately, a skilled person may, on the basis of this disclosure and general knowledge, apply the provided teachings in order to implement the scope of the present invention as defined by the appended claims in each particular use case with necessary modifications, deletions, and additions, if any. Generally, the various principles set forth herein could also be utilized, at least selectively, in processor architectures not falling under the ESM definition adopted herein, as is readily understood by persons skilled in the art.