Reconfigurable computer accelerator providing stream processor and dataflow processor
11853244 ยท 2023-12-26
Assignee
Inventors
- Karthikeyan Sankaralingam (Madison, WI, US)
- Anthony Nowatzki (Madison, WI, US)
- Vinay Gangadhar (Madison, WI, US)
Cpc classification
G06F13/4022
PHYSICS
International classification
G06F9/38
PHYSICS
G06F9/30
PHYSICS
Abstract
A reconfigurable hardware accelerator for computers combines a high-speed dataflow processor, having programmable functional units rapidly reconfigured in a network of programmable switches, with a stream processor that may autonomously access memory in predefined access patterns after receiving simple stream instructions. The result is a compact, high-speed processor that may exploit parallelism associated with many application-specific programs susceptible to acceleration.
Claims
1. A data flow computer architecture comprising: a dataflow processor providing set of functional units and programmable switches interconnecting the functional units between input ports receiving input values and output ports providing output values, the functional units providing programmable arithmetic functions and the interconnection providing paths from input ports through functional units to output ports determined by the switch programming; a clock requiring synchronous movement of data among functional units and programmable switches by one step for each clock cycle, a step being from a functional unit to a switch or from a switch to a functional unit; and a configuration store holding data configuring the interconnection of the functional units and the arithmetic functions of the functional units to execute a predetermined program in which data received at the input ports is clocked through the functional units and programmable switches to the output ports to implement a sequence of arithmetic functions on the data; wherein the functional units operate so that calculations occur as soon as operands are available at the functional units and so that memories for storing operands at the functional units are not required, and wherein the configuration store defines paths of data through the dataflow processor ensuring corresponding operands arrive at the same time at each functional unit according to the program by adjusting the path of data through the dataflow processor without a need for additional buffer storage elements.
2. The dataflow computer architecture of claim 1 further including a set of buffers associated with input ports of the dataflow processor, the buffers synchronized with the clock to release data to the input ports at times adapted to ensure corresponding operands arrive at the same time at each functional unit according to the program and the configuration store.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
(7) Referring now to
(8) The processor 12 may include an L1 cache 14 for communication with a memory system 16 providing a standard memory hierarchy including but not limited to additional levels of cache 18 coupled with one or more layers of larger scale memory 20, for example, composed of random access memory (RAM), disk memory and the like.
(9) The memory system 16 may store a program 22 for execution by the computer system 10 such as may benefit from hardware acceleration, for example, including vision processing, machine learning, graph processing or the like.
(10) The memory system 16 and the processor 12 may communicate with a reconfigurable hardware accelerator 24, for example, by control lines 26 as well as address and data lines 23 allowing the processor 12 to enlist the hardware accelerator 24 for execution of portions of the program 22 amenable to acceleration. Using the control lines 26 and/or data transferred through the memory system 16 by the address and data lines 23, the processor 12 may offload intense calculations having simple memory access patterns to the hardware accelerator 24 for independent execution. In this regard the processor 12 coordinates the beginning and conclusion of that execution but may shut down or be used for other tasks during that calculation. During operation, the hardware accelerator 24 may independently access the memory system 16 at the L2 cache in the manner of a multicore processor autonomously without assistance of the processor 12.
(11) The memory system 16 may include a set of configuration files 25 providing configuration images 27 that will be used to program a specific application-specific calculation to be performed by the hardware accelerator 24 for the desired portions of the program 22. By loading different configuration images 27, different application-specific calculations may be optimized. These configuration images 27 may be developed and standardize for particular applications, for example, to provide different functionalities of conventional application-specific accelerators using the current design of the hardware accelerator 24. Generally the hardware accelerator 24 will be invoked using special instructions that will be described below which may be generated by a compiler.
(12) Referring now also to
(13) Generally, the lightweight core 30 will be a von Neumann, single-issue, in-order core without speculative execution. Nevertheless, this lightweight core 30 will be able to handle a wider range of different types of arithmetic and logical instructions than the dataflow processor 32 and for this reason may be used by the processor 12 for some types of acceleration without involvement of the remainder of the processing unit 28 including, for example, the data flow processor 32. The lightweight core 30 will require much less integrated circuit area than the processor 12 and will use much less power. It will be appreciated that the lightweight core may be any general purpose processor capable of arithmetic and logical functions.
(14) During typical operation, the lightweight core 30 will issue instructions to the stream processor 36 to load a configuration image 27 from the memory system 16 to dataflow processor 32 that will configure the dataflow processor 32 for the necessary calculations. The lightweight core 30 will then issue instructions to the stream processor 36 which in turn will control the memory access interface 40 to obtain information necessary for calculation by the dataflow processor 32 sending this data either directly to the dataflow processor 32 or to a scratchpad memory 34.
(15) The instructions provided by the lightweight core 30 to the stream processor 36 will include: (1) configuration instructions for configuring the dataflow processor 32 by obtaining and loading and appropriate configuration image 27; (2) stream instructions for providing a stream of data to the dataflow processor 32 without involvement of the lightweight core 30 or the processor 12; and (3) barrier instructions used to enforce some degree of serialization of the instructions executed by the processing unit 28 as will be discussed below.
(16) Referring now to
(17) The particular direction of data flow provided by the switch 44 may be determined by a bit value in a mesh 33 configuration switch register 45 associated with the switches 44 determined by a particular configuration image 27 being loaded. The data paths provided by the mesh 33 from an input point 46 through successive switches 44 and functional units 42 to an output point 48 will generally be equal to the width of a computer word, for example, thirty-two or sixty-four bits.
(18) Each of the functional units 42 may implement one of several arithmetic or logical functions but generally fewer functions than provided by the lightweight core 30. For example, a given functional unit 42 may implement one or more of integer or floating-point multiplication, subtraction, addition, etc.; and/or logical functions such as shift, compare, bit wise AND, OR, etc.; and/or special-purpose functions such as sigmoid function, transcendental functions, etc. In addition, the functional units 42 may have a low-power or off state when they are not being used drastically reducing their power consumption. The functions that may be implemented by each functional unit 42 may be different for different functional units 42. This particular function provided by a functional unit 42 is determined by a bit value in a mesh 33 configuration function register 47 associated with each of the functional units 42 as set by a loaded configuration image 27.
(19) While generally the dataflow processor 32 may execute independently from and asynchronously with respect to the lightweight core 30, the data passing through the dataflow processor 32 will be clocked, for example using a self-contained clock element 35, to provide predictable execution. Specifically, data may flow through the mesh 33 of functional units 42 and switches 44 to move generally horizontally and/or downwardly by one step for each clock cycle where a step may be data flow from an input point 46 to a switch 44, or from a switch 44 to a second switch 44, or from a switch 44 to an output point 48, or from a switch 44 to a functional unit 42, or from a functional unit 42 to a switch 44. In this way, the coordination of operands to arrive at functional units 42 as required by a calculation may be controlled by the interposition of switches (or no-op functional units) in the data path in an amount necessary to obtain the desired delay. The necessary routing may be predetermined and incorporated into the configuration image 27 either manually or through use of a special program (such as a complier) for generating configuration images 27.
(20) Generally, the dataflow processor 32 does not provide a program counter or control flow instructions but rather the control flow is determined by the interconnection of the switches 44 and functional units 42. In addition, access to register files or memories by the functional units 42 is not required. Calculations occur as soon as operands are available within the constraint of the clocking which may occur at high speed. The functional units 42 may be implemented with dataflow circuitry or with iterating circuitry operating at sufficient speed to complete calculations within one clock cycle. The dataflow processor 32 thus provides extremely fast calculation.
(21) Each of the input points 46 and output points 48 of the dataflow processor 32 are associated with a first-in, first-out buffer 50 that may be filled asynchronously or emptied asynchronously to the processing performed by the dataflow processor 32 under the control of the stream processor 36. The buffers 50 thus provide for parallel data acquisition and data processing. In one embodiment, each buffer 50 may be provide eight, sixty-four bits words, thus being 864 wide and have an arbitrary depth. The invention also contemplates that the different widths may be employed as desired. Additional similar buffers 50 independent of input points 46 and output points 48 may be used for storing streaming addresses for indirect loads and stores as will be discussed. The input points 46 connect to the respective buffers 50 through an interconnect 41 providing fixed connections allowing given input buffers 50 to communicate with one or more of the first row of switches 44, with each switch 44 receiving data from only one of any of the buffers 50 according to a predefined interconnection pattern. Accordingly, different 64-bit words from a given buffer 50 may be forwarded to different switches 44.
(22) In addition, the output points 48 connect to respective buffers 50 through an interconnect 49 providing fixed connections allowing given output points 48 to connect to one or more output buffers 50, each output buffer receiving data from only one of any of the output points 48 according to a predefined interconnection pattern.
(23) The stream processor 36 provides a simple state machine that can move data autonomously between the memory system 16 and another storage location once it receives program instructions from the lightweight core 30. Generally the stream processor 36 will move input data from the memory system 16 to either the scratchpad memory 34 or the buffers 50, or from the scratchpad memory 34 to the input buffers 50, or may move output data from the scratchpad memory 34 to the memory system 16, or from buffers 50 to the scratchpad memory 34 or the memory system 16 or another buffer 50 according to a predefined pattern. In this regard, the stream processor 36 may provide for three separate circuits, one for memory, one for scratchpad, and one for controlling re-cycling of data from output port to input port and also the generation of constant values. These three circuits may operate independently (but for synchronization through the memory access interface 40) for high speed operation.
(24) The stream processor 36 may also provide for the movement of the data of a configuration image 27 to the mesh 33 configuration registers 45 and 47 of the dataflow processor 32 as is discussed below for configuration.
(25) More specifically, and as mentioned briefly above, the stream processor 36 operates according to configuration instructions, stream instructions, and barrier instructions that may be issued by the lightweight core 30. A configuration instruction format is shown in Table I below.
(26) TABLE-US-00001 TABLE I Configuration Instruction Command Name Parameters Description SD_Config Configuration image Set dataflow processor address, Size configuration from configuration image at address
(27) This instruction provides the stream processor 36 with the starting address and size of a configuration image 27 in the memory system 16 and operates to load the configuration image 27 into the mesh 33 configuration registers 45 and 47. This process will provide the desired configuration of the mesh 33 of the dataflow processor 32 and the functions of the functional units 42 needed for acceleration of the program 22, for example, as triggered by the processor 12 communicating over the control lines 26 to the accelerator 24.
(28) The stream instructions (shown in Table II) provided by the lightweight core 30 to the stream processor 36 generally identify a source of data, destination data, and the data pattern as follows:
(29) TABLE-US-00002 TABLE II Stream Instructions Command Name Parameters Description SD_Mem_Scr Source Memory Read from memory Address, Access system 16 to the Size, Stride scratchpad memory 34 Length, Number of using the indicated Strides, Destination access pattern Scratchpad Address SD_Scr_Port Source Scratchpad Read from scratchpad Address, Access Size, memory 34 to the Stride Length, Number designated input of Strides, Input point 46 using to the Port Number indicated pattern SD_Mem_Port Source Memory Read from memory system Address, Access 16 to the designated Size, Stride input point 46 using the Length, Number of indicated pattern Strides, Input Port Number SD_Const_Port Constant Value, Send a series of constant Number of Elements, values to the designated Destination Port input point 46 Number SD_Chuck_Port Number of Elements, Eject a defined series of Source Port Number values from a buffer 50 of the designated output point 48 SD_Port_Port Source Port Number, Recirculate a defined Number of elements, series of values from the Destination Port designated output point 48 Number to the designated input point 46. SD_Port_Scr Source Port Number, Write a defined series of Number of elements, values from the designated Destination Scratchpad output point 48 to address. scratchpad memory 34 SD_Port_Mem Source Port Number, Write from the designated Access Size, Stride output point 48 to memory Length, Number of system 16 using the Strides, Destination indicated pattern Memory Address. SD_IndPort_Port Indirect Port Number, Indirect load from memory Offset Address, system 16 based on address Destination Port data in designated indirect Number output point 48 for storage in designated destination port SD_IndPort_Mem Indirect Port Number, Indirect store to memory Offset Address, system 16 based on address Destination Port in indirect port from Number designated output port
(30) These instructions transfer data between storage locations autonomously using a designated pattern as will be discussed below.
(31) Indirect addressing is possible using stored data (for example, in a buffer 50) as an address value. In indirect addressing, data, for example, from the streaming pattern, is used as the address to obtain further data that is operated on by the functional units 42. This indirect addressing effects pointers, useful, for example, when accessing the rows of a sparse matrix. The stream processor 36 may provide capability to facilitate indirect access by chaining two streams together, the first stream for accessing a contiguous or strided pattern of pointers, and subsequent streams to load those pointers' values from the memory system 16 and deliver them to the reconfigurable dataflow processor 32. Additional instructions are provided to generate constant values (rather than loading these from memory) and to discard unused output values (as opposed to loading them into nonfunctional memory areas).
(32) Generally each of these instructions may be issued directly by the processor 12 as part of the instruction set architecture of the accelerator and the data in these instructions used with minimal processing by the lightweight core 30 to control other components of the accelerator.
(33) Referring now to
(34) Alternatively, the stream processor 36 may be programmed to use a strided pattern 65 by setting the stride length equal to a nonzero value which describes a gap or stride 66 in addresses between access portions 67 defined by the access size.
(35) Similarly, an overlapped axis pattern 68 may be invoked by setting the access size to greater than the stride size which signals an overlapping pattern. A repeated pattern 69 is easily obtained by setting the stride length to zero with the repetition being provided by the number of strides.
(36) The lightweight core 30 may also provide for barrier instructions to the stream processor 36 which block the issuance of new memory access instructions until certain previous instructions associated with a data storage resource are complete. For example, a barrier instruction (shown in Table III below) associated with a writing to the scratchpad memory 34 will block subsequent writing to the scratchpad memory 34 until all writings to the scratchpad memory 34 before the barrier instruction are completed. Barriers can also be used to signal completion of the calculation to the lightweight core 30.
(37) TABLE-US-00003 TABLE III Barrier Instructions Command Name Parameters Description SD_Bar_Scr_Rd Barrier for Scratchpad Reads SD_Bar_Scr_Wr Barrier for Scratchpad Writes SD_Bar_All Barrier to wait for all commands completion
(38) Referring now to
(39) Upon completion of the addition at functional unit 42b, the output passes to output buffer 50 designated G.
(40) Generally this process will be repeated for multiple data value stored in the input ports A and B. Each multiplication at functional unit 42a being performed concurrently with additions at functional unit 42b in the manner of a pipeline and providing for high throughput.
(41) Referring now to
(42) The first instruction (C1) provides a transfer from memory system 16 to the scratchpad memory 34 of data that will ultimately used to load the buffer 50 of port A. This instruction begins executing immediately after it is enqueued. Is important that the scratchpad memory not be read until it is fully loaded and accordingly the next instruction (C2) provides a scratchpad memory read barrier ensuring that there is no reading of the scratchpad memory 34 until instruction (C1) is complete. Accordingly instruction (C3), which provides a reading of the scratchpad memory 34 into port A, is delayed until completion of instruction (C1) at time 74. The barrier instructions may be simply enforced by stalling subsequent commands from the dataflow processor 32 related to the barrier condition allowing all previous commands to proceed in parallel.
(43) The barrier instruction (C2) does not block instruction (C4) reading memory to the buffer of port B because there is no conflicted resource. Accordingly this transfer process may begin before and continue in parallel with the transfer process of instruction (C3). Similarly transfer from memory system 16 to port D may be performed shortly after this instruction is enqueued.
(44) At time 76 values will be present in each of ports A, B, and D allowing the dataflow processor 32 to begin calculation and these values to be released from their buffers 50. These calculations performed by the dataflow processor 32 will be repeated using successive values in each of the buffers 50 of the input points 46 and provide new calculated values into the output buffers 50 of the output points 48 as indicated by processing cycles 78.
(45) Once the first processing cycle 78 is completed, at time 80, a writing from the output buffer of port G to memory system 16 may begin.
(46) After the conclusion of all processing cycles 78 for the data held in the buffers 50, at time 82, the writing from port G to memory system 16 concludes releasing the barrier of instruction C8 and signaling to the lightweight core 30 that the calculation is complete so that new instructions may be received from the lightweight core 30. The all barrier of instruction C8 is released when all of the buffers 50 are empty detected by hardware.
(47) During the processing of instruction (C4), after loading port B from memory, a second instruction (C7) may be enqueued also loading from memory system 16 to Port B. A natural barrier is created in the circumstances by the stream processor 36 which serializes loading of buffer 50 intrinsically.
(48) Also note that the second stream for instruction (C7) for loading data into buffer B may not have the same access pattern as the previous one. Also, its type (e.g., source or destination) can be different as well. More generally, the stream commands for a given buffer 50 can change while the dataflow architecture and other stream commands are actively being processed. This leads to more programming flexibility and parallelism.
(49) It will be appreciated that substantial parallelism is obtained in this processing provided by the dataflow processor 32 based on the overlapping line segments indicated in
(50) Referring now to
(51) The lightweight core 30 may also expose hardware parameters of the hardware accelerator 24 including a number and type of functional units 42, a depth of buffers 50 and 50, the size of the scratchpad memory 34, and the longest recurrence (recycling) of data through the dataflow processor 32 for use by a compiler as is generally understood in the art.
(52) As used herein, predefined memory access pattern means a limited number of patterns that may be defined prior to the calculation for which the memory accesses require to be performed as opposed to memory access patterns that are a function of calculations made. Autonomous as is used herein means without necessary further guidance by the micro core or the data fabric.
(53) Certain terminology is used herein for purposes of reference only, and thus is not intended to be limiting. For example, terms such as upper, lower, above, and below refer to directions in the drawings to which reference is made. Terms such as front, back, rear, bottom and side, describe the orientation of portions of the component within a consistent but arbitrary frame of reference which is made clear by reference to the text and the associated drawings describing the component under discussion. Such terminology may include the words specifically mentioned above, derivatives thereof, and words of similar import. Similarly, the terms first, second and other such numerical terms referring to structures do not imply a sequence or order unless clearly indicated by the context.
(54) When introducing elements or features of the present disclosure and the exemplary embodiments, the articles a, an, the and said are intended to mean that there are one or more of such elements or features. The terms comprising, including and having are intended to be inclusive and mean that there may be additional elements or features other than those specifically noted. It is further to be understood that the method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order discussed or illustrated, unless specifically identified as an order of performance. It is also to be understood that additional or alternative steps may be employed.
(55) References to microcontroller should be understood to include any circuit capable of executing the functions described herein including but not necessarily limited to von Neumann architectures.
(56) It is specifically intended that the present invention not be limited to the embodiments and illustrations contained herein and the claims should be understood to include modified forms of those embodiments including portions of the embodiments and combinations of elements of different embodiments as come within the scope of the following claims. All of the publications described herein, including patents and non-patent publications, are hereby incorporated herein by reference in their entireties.