Hardware architecture for a neural network accelerator
11704535 · 2023-07-18
Assignee
Inventors
- Kumar S. S. Vemuri (Hyderabad, IN)
- Mahesh S. Mahadurkar (Wani, IN)
- Pavan K. Nadimpalli (Gudivada, IN)
- Venkat Praveen K. Kancharlapalli (Vijayawada, IN)
Cpc classification
G06F3/0655
PHYSICS
G06F3/0679
PHYSICS
G06F3/0635
PHYSICS
G06N3/06
PHYSICS
International classification
G06F5/16
PHYSICS
G06N3/06
PHYSICS
Abstract
Examples herein describe hardware architecture for processing and accelerating data passing through layers of a neural network. In one embodiment, a reconfigurable integrated circuit (IC) for use with a neural network includes a digital processing engine (DPE) array, each DPE having a plurality of neural network units (NNUs). Each DPE generates different output data based on the currently processing layer of the neural network, with the NNUs parallel processing different input data sets. The reconfigurable IC also includes a plurality of ping-pong buffers designed to alternate storing and processing data for the layers of the neural network.
Claims
1. An integrated circuit (IC), comprising: a digital processing engine (DPE) array having a plurality of DPEs configured to execute one or more layers of a neural network; reconfigurable integrated circuitry configured to include: an input/output (IO) controller configured to receive input data to be processed by the DPE array based on the one or more layers of the neural network; a feeding controller configured to feed the input data from the IO controller to the DPE array executing the one or more layers of the neural network; a weight controller configured to provide weight parameters used for processing the input data through the one or more layers of the neural network to the DPE array; an output controller configured to receive processed data from the DPE array based on the one or more layers of the neural network; and configurable buffers configured to communicate with the IO controller, the feeding controller, the weight controller, and the output controller to facilitate data processing between the IO controller, the feeding controller and the weight controller by alternating between storing and processing data in the configurable buffers; wherein a first one of the DPEs comprises a neural network unit (NNU) configured to process the input data; and wherein the NNU comprises digital signal processors (DSPs) configured to process the input data at a frequency that is at least double a frequency at which the programmable logic is configured to operate.
2. The IC of claim 1, wherein the DPE array comprises a subset of operating DPEs to process the input data for a layer of a neural network model, wherein each of the subset of operating DPEs is independently configurable.
3. The IC of claim 2, wherein the configurable buffers comprises multiple instances of weight buffers managed by the weight controller, and a number of the multiple instances of the weight buffers corresponds to a number of operating DPEs, wherein each instance of weight buffers comprises weight data for a corresponding operating DPE.
4. The IC of claim 2, wherein the DPE array further comprises output buffers corresponding to a number of the subset of operating DPEs, each output buffer comprising processed data generated from the subset of operating DPEs.
5. The IC of claim 1, wherein the first DPE comprises multiple independently configurable NNUs.
6. The IC of claim 1, wherein the configurable buffers are configured to provide data to the NNU by alternating between a first configurable buffer and a second configurable buffer for data processing by the NNU.
7. The IC of claim 1, wherein the configurable buffers are configured to send data for at least two multiply-accumulate operations in a single clock cycle.
8. The IC of claim 1, wherein the configurable buffers comprise a first IO buffer and a second IO buffer, wherein at a first stage, the IO controller stores the input data into the first IO buffer while the feeding controller processes the input data in the second IO buffer and at a second stage, the IO controller stores the input data into the second IO buffer while the feeding controller processes the input data in the first IO buffer.
9. The IC of claim 1, wherein the configurable buffers comprise a first feeding buffer and a second feeding buffer, wherein at a first stage, the feeding controller stores the input data into the first feeding buffer and the DPE array processes the input data in the second feeding buffer and at a second stage, the feeding controller stores the input data into the second feeding buffer while the DPE array processes the input data in the first feeding buffer.
10. The IC of claim 1, wherein the configurable buffers comprise a first weight buffer and a second weight buffer, wherein at a first stage, the weight controller stores weight data into the first weight buffer and the DPE array processes the weight data in the second weight buffer, and at a second stage, the weight controller stores the input data into the second weight buffer while the DPE array processes the input data in the first weight buffer.
11. The IC of claim 1, wherein the configurable buffers comprise a first output buffer and a second output buffer, wherein at a first stage, the output controller stores the processed data into the first output buffer while the IO controller writes the processed data in the second output buffer to external memory, and at a second stage, the output controller stores the processed data into the second output buffer while the IO controller writes the processed data in the first output buffer to the external memory.
12. The IC of claim 1, wherein: the first DPE comprises multiple NNUs; and the IC is configured by a host to identify first and second subsets of the NNUs to process respective first and second input data sets of the input data, and wherein the first input data set and the second data set are different from each other.
13. A method for operating an integrated circuit (IC), the method comprising: storing input data into a first input buffer of input ping-pong buffers while data stored in a second input buffer of the input ping-pong buffers is processed; transmitting the input data through feeding ping-pong buffers to a digital processing engine (DPE) array by storing the input data into a first feeding buffer of the feeding ping-pong buffers while data stored in a second feeding buffer of the feeding ping-pong buffers is processed by one or more layers executing in the DPE array to generate output data; storing weight data in a first weight buffer of weight ping-pong buffers while data stored in a second weight buffer of the weight ping-pong buffers is processed by the one or more layers executing in the DPE array; and storing the output data in a first output buffer of output ping-pong buffers while data stored in a second output buffer of the output ping-pong buffers is outputted to a host computing system communicatively coupled to the IC; wherein a first one of the DPEs comprises a neural network unit (NNU) configured to process the input data; and wherein the NNU comprises digital signal processors (DSPs) configured to process the input data at a frequency that is double a frequency at which the programmable logic is configured to operate.
14. The method of claim 13, wherein during a ProcessStage (PS) state, the DPE array processes the input data to generate the output data and populates a buffer of the output ping-pong buffers with the output data.
15. The method of claim 14, wherein during the PS state, a ReadInput (RI) state begins for one of the input ping-pong buffers, and during RI state, an input/output (IO) controller receives the input data from the interconnect and stores the input data into the one of the input ping-pong buffers.
16. The method of claim 14, wherein during the PS state, a WriteOutput (WO) state begins for another buffer of the output ping-pong buffers, and during the WO state, the output controller outputs the output data in the another buffer of the output ping-pong buffers to the host computing system.
17. The method of claim 14, wherein during the PS state, a LoadFeedingBuffer (LF) state begins for one of the feeding ping-pong buffers, and during the LF state, a feeding controller stores the input data from an IO controller into the one of the feeding ping-pong buffers.
18. The method of claim 14, wherein during the PS state, a ProcessFeedingBuff (PF) state begins for one of the feeding ping-pong buffers, and during the PF state, the DPE array reads the input data in the one of the feeding ping-pong buffers.
19. The method of claim 18, wherein during the PF state, a ReadWeights (RW) state begins for one of the weight ping-pong buffers, and during the RW state, a weight controller stores the weight data into one of the weight ping-pong buffers; and a Compute-and-Store (CS) state begins for one of the weight ping-pong buffers, and during the CS state, the DPE array reads the weight data in the one of the weight ping-pong buffers.
20. An integrated circuit (IC), comprising: a digital processing engine (DPE) array having a plurality of DPEs configured to execute one or more layers of a neural network; programmable logic comprising: an input/output (IO) controller configured to receive input data to be processed by the DPE array based on the one or more layers of the neural network; a feeding controller configured to feed the input data from the IO controller to the DPE array executing the one or more layers of the neural network; a weight controller configured to provide weight parameters used for processing the input data through the one or more layers of the neural network to the DPE array; an output controller configured to receive processed data from the DPE array based on the one or more layers of the neural network; and configurable buffers configured to communicate with the IO controller, the feeding controller, the weight controller, and the output controller to facilitate data processing between the IO controller, the feeding controller and the weight controller by alternating between storing and processing data in the configurable buffers; wherein the configurable buffers are configured to send data for at least two multiply-accumulate operations in a single clock cycle.
21. An integrated circuit (IC), comprising: a digital processing engine (DPE) array having a plurality of DPEs configured to execute one or more layers of a neural network; a plurality of controllers that include, an input/output (IO) controller configured to receive input data to be processed by the DPE array based on the one or more layers of the neural network, a feeding controller configured to feed the input data from the IO controller to the DPE array executing the one or more layers of the neural network, a weight controller configured to provide weight parameters used for processing the input data through the one or more layers of the neural network to the DPE array, and an output controller configured to receive processed data from the DPE array based on the one or more layers of the neural network; and first and second buffers coupled to an output of a first one of the controllers, wherein the first one of the controllers is configured to write data to the first buffer while data is read out of the second buffer, and to write data to the second buffer while data is read out of the first buffer.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) So that the manner in which the above recited features can be understood in detail, a more particular description, briefly summarized above, may be had by reference to example implementations, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical example implementations and are therefore not to be considered limiting of its scope.
(22) To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements of one example may be beneficially incorporated in other examples.
DETAILED DESCRIPTION
(23) Various features are described hereinafter with reference to the figures. It should be noted that the figures may or may not be drawn to scale and that the elements of similar structures or functions are represented by like reference numerals throughout the figures. It should be noted that the figures are only intended to facilitate the description of the features. They are not intended as an exhaustive description of the description or as a limitation on the scope of the claims. In addition, an illustrated example need not have the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular example is not necessarily limited to that example and can be practiced in any other examples even if not so illustrated, or if not so explicitly described.
(24) Embodiments herein describe a reconfigurable integrated circuit (IC) with hardware architecture for neural network acceleration. In one embodiment, a user can scale and configure the hardware architecture for the neural network inference accelerator core presented herein. The reconfigurable IC can process and accelerate various layers used in neural networks, including, but not limited to, convolution, max-pool, batch norm, scale, ReLU, fully-connected, and ElementWise layers. Intermediate data and buffers between layers are written and read to and from external memory (e.g., dynamic random access memory (DRAM)). To mitigate the impact of memory access latencies on the overall performance of the reconfigurable IC, the reconfigurable IC also allows for fusion of layers. As used herein, the neural network accelerator core is referred to as a “reconfigurable IC.”
(25) The reconfigurable IC described herein accelerates processing of data passing through layers of neural networks by configuring a configurable number of digital processing engines (DPEs) and corresponding neural network units (NNUs) to process the data in parallel. By processing data in parallel, the reconfigurable IC can more quickly generate output data for the layers of the neural networks. The reconfigurable IC also accelerates data processing by leveraging a ping-pong scheme for the storage structures of the reconfigurable IC. By implementing a ping-pong scheme for the storage structures, the reconfigurable IC hides memory access and data transfer latencies behind concurrent data processing.
(26) One type of reconfigurable IC that may work for processing and accelerating data passing through the layers of neural networks is a field-programmable gate array (FPGA), which has many lookup arrays, available on-chip storage, and digital signal processing units. Using these FPGA components, an exemplary logic hardware design that connects these components to implement the functionality of different layer types of a neural network is described herein. While the present disclosure discusses a hardware design for processing and accelerating data passing through a neural network, the present disclosure is not limited to neural networks or deep neural networks (DNNs) and can include other types of machine learning frameworks.
(28) The reconfigurable IC 120 includes programmable logic 122 to configure a digital processing engine (DPE) array 130. For example, using a received bitstream that contains configuration data, control logic 150 can configure the programmable logic 122 (which can include a plurality of configurable logic blocks) to use any number of DPEs (132.sub.1-132.sub.N) that have any number of neural network units (NNUs) (134.sub.1-134.sub.N) in each of the DPEs. For example, the programmable logic 122 can be programmed to include look up tables, function generators, registers, multiplexers, and the like. In some embodiments, the programmable logic implements controllers of the reconfigurable IC, which are described in reference to
(29) In
(30) The DPE array 130 of the reconfigurable IC 120 has any number of DPEs (also referred to as kernel processors), and these DPEs of the DPE array 130 perform operations on the input data (e.g., data points of input feature maps) to generate output data (e.g., data points of output feature maps). In one embodiment, based on the configuration data, only a subset of DPEs perform operations on the input data. In some embodiments, each DPE is an array of NNUs 134.sub.1-134.sub.N (also referred to as pixel processors when the NNUs are used to process pixels in a captured image) and comprises specialized circuitry to connect the array of NNUs 134.sub.1-134.sub.N. Although
(31) NNUs 134.sub.1-134.sub.N process the incoming input data and generate output data for layers of the neural network. In some embodiments, because the DPEs process input data for a single layer of the neural network at any given time, the NNUs 134.sub.1-134.sub.N of each DPE take in the input data and generate different output data points of the output data for the layer of the neural network currently being processed. Further details about the DPEs and the NNUs are provided below.
(32) In some embodiments, NNUs 134.sub.1-134.sub.N comprise non-programmable logic, i.e., are hardened specialized processing elements. In such embodiments, the NNUs comprise hardware elements including, but not limited to, program memories, an instruction fetch/decode unit, fixed-point vector units, floating-point vector units, arithmetic logic units (ALUs), and multiply accumulators (MACs). Although the NNUs 134.sub.1-134.sub.N may be hardened, this does not mean the NNUs are not programmable. That is, the NNUs 134.sub.1-134.sub.N can be configured to perform different operations based on the configuration data. In one embodiment, the NNUs 134.sub.1-134.sub.N are identical. That is, each of the NNUs 134.sub.1-134.sub.N may have the same hardware components or circuitry. In other embodiments, the NNUs can comprise digital signal processing engines, cryptographic engines, Forward Error Correction (FEC) engines, or other specialized hardware for performing one or more specialized tasks.
(33) Although
(36) Each IFM 302.sub.1-302.sub.N comprises any number of input patches 304.sub.1-304.sub.N, and each input patch can have any size (e.g., 3x3). The DPE array 310 contains any number of DPEs (also known as kernels or kernel processors) 312.sub.1-312.sub.N, and each DPE contains filters 314.sub.1-314.sub.N of a size matching the size of the input patch of the IFM (e.g., 3x3). The number of filters 314.sub.1-314.sub.N corresponds to the number of input feature maps. For example, if the convolution layer has four input feature maps, then each DPE would have four filters. The number of OFMs 320.sub.1-320.sub.N corresponds to the number of DPEs 312.sub.1-312.sub.N, and each DPE 312.sub.1-312.sub.N generates an OFM 320.sub.1-320.sub.N.
(37) In this exemplary convolution layer, the DPE array 310 takes in the IFMs 302. Each DPE 312.sub.1-312.sub.N of the DPE array 310 processes each IFM 302.sub.1-302.sub.N. For example, DPE 312.sub.1 takes in and processes IFM 302.sub.1, IFM 302.sub.2, and so on. Also DPE 312.sub.2 takes in and processes IFM 302.sub.1, IFM 302.sub.2, and so on until each DPE 312.sub.1-312.sub.N of the DPE array 310 takes in and processes each IFM 302.sub.1-302.sub.N.
(38) In the processing of each IFM 302.sub.1-302.sub.N, the DPE 312.sub.1-312.sub.N convolves the input patch 304.sub.1-304.sub.N of each IFM 302.sub.1-302.sub.N with each filter 314.sub.1-314.sub.N and accumulates the output with the previous convolution output. For example, where there are four IFMs, the DPE 312.sub.1 convolves the input patch 304.sub.1 of the first IFM 302.sub.1 with the filter 314.sub.1, the input patch 304.sub.1 of the second IFM 302.sub.2 with the filter 314.sub.2, the input patch 304.sub.1 of the third IFM 302.sub.3 with the filter 314.sub.3, and the input patch 304.sub.1 of the fourth IFM 302.sub.4 with filter 314.sub.4. In the example, after the second convolution, the DPE 312.sub.1 accumulates the second output with the first; after the third convolution, the DPE array 310 accumulates the third output with the first and second; and after the fourth convolution, the DPE array 310 accumulates the fourth output with the first, second, and third. In the example, each of the four IFMs results in 9 multiply-accumulates for a 3x3 filter size, and thus 4 x 9 = 36 multiply-accumulates generate one output data point in the OFM 320.sub.1. To get the next output data point in the OFM 320.sub.1, the DPE 312.sub.1 takes in the next input patch 304.sub.2 of the IFM 302.sub.1 and repeats the above processing.
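By way of a non-limiting illustration, the accumulation described above can be sketched in Python. The function name output_point, the nested-list data layout, and the sample values are assumptions made for this sketch and do not appear in the disclosure; the sketch models only the arithmetic, not the hardware.

```python
def output_point(patches, filters):
    """Accumulate one OFM data point from one input patch per IFM.

    patches and filters are lists with one k x k nested list per IFM
    (a hypothetical layout chosen for illustration).
    Returns the accumulated value and the number of multiply-accumulates.
    """
    acc = 0
    macs = 0
    for patch, filt in zip(patches, filters):      # one (patch, filter) pair per IFM
        for patch_row, filt_row in zip(patch, filt):
            for a, w in zip(patch_row, filt_row):
                acc += a * w                       # accumulate onto previous outputs
                macs += 1
    return acc, macs


# Four IFMs, one 3x3 patch each, holding the values 0..35.
patches = [[[i * 9 + r * 3 + c for c in range(3)] for r in range(3)]
           for i in range(4)]
# Four 3x3 filters of all ones, so the result is the sum of the patch values.
filters = [[[1] * 3 for _ in range(3)] for _ in range(4)]

value, macs = output_point(patches, filters)
print(value, macs)
```

With four IFMs and a 3x3 filter size, the sketch performs the 36 multiply-accumulates described above for a single output data point.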
(39) In embodiments where the neural network involves image processing, the input data comprises pixels of input feature maps and the output data comprises pixels of the output feature maps. In such embodiments, the input data points for processing by the NNUs of the DPE array are pixels from the input images, and the output data points are pixels of the output images.
(41) In the exemplary embodiment, the reconfigurable IC 400 comprises an internal interconnect 402; programmable logic 122 implementing an IO controller 406, a feeding controller 410, a weight controller 412, and an output controller 420; IO buffers 408, feeding buffers 414, and a DPE array 430. The DPE array 430 comprises weight buffers 416.sub.1-416.sub.N, DPEs 432.sub.1-432.sub.N, and output buffers 418.sub.1-418.sub.N. The examples herein can be useful in any type of hardware architecture of reconfigurable IC.
(42) In the exemplary embodiment, the internal interconnect 402 (e.g., an Advanced Extensible Interface (AXI) interconnect) of the reconfigurable IC 400 connects the reconfigurable IC 400 with the other components of the reconfigurable IC (not illustrated in
(43) In the exemplary embodiment, the IO controller 406 of the reconfigurable IC 400 accesses and receives data, including input image data, IFMs, and/or activation outputs stored in external memory (e.g., DRAM) through the internal interconnect 402. The IO controller 406 stores the data from external memory in the IO buffers 408. The reconfigurable IC 400 partitions the IO buffers 408 into ping-pong buffers: iStage buffers, which host input data, and oStage buffers, which host output data from the reconfigurable IC 400. Ping-pong buffers (also referred to herein as double-buffers) are discussed in further detail below.
(44) In the exemplary embodiment, once the IO controller 406 has stored data in the IO buffers 408, the feeding controller 410 reads the iStage buffers of the IO buffers 408 and populates the feeding buffers 414. The feeding buffers 414 feed the DPE array 430 with the input data.
(45) Similarly, the weight controller 412 reads in weights and parameters (collectively referred to herein as “weight data”) stored in external memory (e.g., DRAM) and stores this data in weight buffers 416.sub.1-416.sub.N (also referred to as kbuff or kBuff herein) of the DPE array 430. In some embodiments, the weight data includes filters used by the DPE array in processing the input data.
(46) In the exemplary embodiment, the DPE array 430 performs multiply-accumulate (MAC) operations on the input data from the feeding buffers 414 and the weight data from the weight buffers 416.sub.1-416.sub.N to generate the output data points of the OFMs of layers of the neural network. The DPE array 430 is organized so that the MACs and/or compute operations can, in parallel, process multiple data points across multiple output feature-maps.
(47) The DPE array 430 comprises a plurality of DPEs 432.sub.1-432.sub.N, each DPE comprising a plurality of neural network units (NNUs) 434.sub.1-434.sub.N. In one embodiment, the NNUs within a DPE 432.sub.1-432.sub.N work in parallel on different data points corresponding to one output feature-map. Accordingly, each DPE works on different output feature-maps in parallel. Each DPE writes its output to the DPE output buffers 418.sub.1-418.sub.N, each DPE output buffer corresponding to one DPE.
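As a hedged illustration of this organization, the following Python sketch assigns output feature-maps to DPEs and data points to NNUs. The function name schedule, the step ordering, and the sizes used are hypothetical choices for this sketch; the disclosure does not prescribe a particular scheduling order.

```python
def schedule(num_dpes, nnus_per_dpe, num_ofms, points_per_ofm):
    """Return a list of steps; each step maps (dpe, nnu) -> (ofm, point).

    Within a step, every DPE works on a different output feature-map and
    every NNU of that DPE works on a different data point of that map.
    """
    steps = []
    for ofm_base in range(0, num_ofms, num_dpes):
        for pt_base in range(0, points_per_ofm, nnus_per_dpe):
            step = {}
            for dpe in range(num_dpes):
                ofm = ofm_base + dpe
                if ofm >= num_ofms:
                    break                      # fewer OFMs left than DPEs
                for nnu in range(nnus_per_dpe):
                    point = pt_base + nnu
                    if point < points_per_ofm:
                        step[(dpe, nnu)] = (ofm, point)
            steps.append(step)
    return steps


# Illustrative sizes only: 2 DPEs with 4 NNUs each, 4 OFMs of 8 points.
steps = schedule(num_dpes=2, nnus_per_dpe=4, num_ofms=4, points_per_ofm=8)
for index, step in enumerate(steps):
    print(index, step)
```

In this sketch, the 32 output data points are produced in four steps of eight points each, because the two DPEs and their four NNUs all work concurrently within a step.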
(48) The output controller 420 writes the contents of the DPE output buffers 418.sub.1-418.sub.N to the oStage buffers of the IO buffers 408, and the IO controller 406 writes the contents of the oStage buffers of the IO buffers 408 to external memory (e.g., DRAM) through the internal interconnect 402.
(49) In a further embodiment, the reconfigurable IC 400 comprises two features to optimize fetches by the DPEs. With the first feature, the reconfigurable IC 400 determines the burst length of the fetch requests from the weight controller 412 based on the available storage in the weight buffer (both the weight ping buffer 412.sub.1 and the weight pong buffer 412.sub.2) and the size of the filters of the DPEs. The reconfigurable IC 400 calibrates the number of filters of the DPEs that can be pre-fetched with each request from the weight controller 412. The reconfigurable IC 400 then uses these parameters to determine the burst length of the fetch requests by the DPE. The burst length in turn determines the efficiency of the memory subsystem.
(50) With the second feature, where the available storage in the weight buffers 416.sub.1-416.sub.N is sufficient to store the corresponding weight data for the DPEs of a layer, the reconfigurable IC 400 fetches the corresponding weight data for the DPEs only once and then suppresses repeated fetches.
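The two fetch features can be sketched, under stated assumptions, as a small planning function in Python. The function name plan_weight_fetch, the byte-based units, the word_bytes transfer width, and the rule of filling a request with as many whole filters as fit are all assumptions introduced for illustration; the disclosure does not give a formula, and no protocol-specific limit on burst length is modeled.

```python
def plan_weight_fetch(buffer_bytes, filter_bytes, filters_in_layer, word_bytes=8):
    """Hypothetical planner for weight pre-fetch requests.

    buffer_bytes     -- available storage in one weight buffer (assumed bytes)
    filter_bytes     -- storage needed for one filter
    filters_in_layer -- number of filters the current layer needs
    word_bytes       -- assumed width of one transfer on the memory interface
    Returns (filters_per_request, burst_length, fetch_once).
    """
    if min(buffer_bytes, filter_bytes, filters_in_layer, word_bytes) <= 0:
        raise ValueError("all sizes must be positive")
    # First feature: how many whole filters can be pre-fetched per request.
    filters_per_request = min(buffer_bytes // filter_bytes, filters_in_layer)
    if filters_per_request == 0:
        raise ValueError("one filter does not fit in the weight buffer")
    request_bytes = filters_per_request * filter_bytes
    burst_length = -(-request_bytes // word_bytes)   # ceiling division
    # Second feature: if the whole layer fits, fetch once and reuse.
    fetch_once = filters_per_request == filters_in_layer
    return filters_per_request, burst_length, fetch_once


# Illustrative numbers: 4096-byte buffer, 36-byte filters (e.g., 4 x 3 x 3 bytes).
small_layer = plan_weight_fetch(4096, 36, filters_in_layer=64)
large_layer = plan_weight_fetch(4096, 36, filters_in_layer=256)
print(small_layer, large_layer)
```

In the first call the whole layer fits in the buffer, so the sketch flags that repeated fetches can be suppressed; in the second call the layer must be fetched in several requests.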
(52) Operations 500 begin, at 502, when the reconfigurable IC receives configuration data. This configuration data can come from a host computer, such as the host computer 102 of
(53) At 504, operations 500 continue with the reconfigurable IC configuring a subset of DPEs of an array of DPEs using a host based on the configuration data to process input data for a layer of the neural network model. As mentioned, the reconfigurable IC can have any number of DPEs and any number of NNUs hardwired, and the configuration data allows for a subset of the DPEs to be used and for a subset of the NNUs of the subset of DPEs to be used.
(54) At 506, operations 500 continue with the reconfigurable IC configuring a subset of NNUs for each DPE of an array of DPEs using a host based on the configuration data to process a portion of the input data based on the layer of the neural network model. As mentioned, the reconfigurable IC can have any number of DPEs and any number of NNUs hardwired, and the configuration data allows for a subset of the DPEs to be used and for a subset of the NNUs of the subset of DPEs to be used.
(55) At 508, operations 500 continue with operating each DPE using the configured subset of NNUs of each DPE. In some embodiments, operations 500 continue with processing, using the selected NNUs of each configured DPE, a portion of the different input data sets to generate a portion of output data for the layer of the neural network model. In such embodiments, the portions of output data from the selected NNUs together form the output data.
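Operations 500 can be illustrated with the following minimal Python sketch. The ArrayConfig class, its field names, the round-robin split of output data points, and the run_layer function are hypothetical constructs for illustration only; the actual format of the configuration data is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayConfig:
    """Hypothetical configuration data: which part of the hardwired array to use."""
    total_dpes: int      # DPEs physically present
    nnus_per_dpe: int    # NNUs physically present in each DPE
    active_dpes: int     # subset of DPEs selected for this layer
    active_nnus: int     # subset of NNUs selected in each active DPE

    def __post_init__(self):
        if not 0 < self.active_dpes <= self.total_dpes:
            raise ValueError("active_dpes must be within the hardwired DPEs")
        if not 0 < self.active_nnus <= self.nnus_per_dpe:
            raise ValueError("active_nnus must be within the hardwired NNUs")

    def active_units(self):
        return [(d, n) for d in range(self.active_dpes)
                for n in range(self.active_nnus)]


def run_layer(cfg, num_outputs, compute_point):
    """Split output points across the active NNUs, then merge the portions."""
    units = cfg.active_units()
    portions = {unit: {} for unit in units}
    for i in range(num_outputs):
        # Round-robin assignment is an assumption made for this sketch.
        portions[units[i % len(units)]][i] = compute_point(i)
    merged = {}
    for portion in portions.values():
        merged.update(portion)
    return [merged[i] for i in range(num_outputs)], portions


cfg = ArrayConfig(total_dpes=8, nnus_per_dpe=16, active_dpes=2, active_nnus=3)
output, portions = run_layer(cfg, 10, lambda i: i * i)
print(output)
```

The portions computed by the six selected NNUs together form the full output, mirroring the statement above that the portions of output data together form the output data.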
(57) Operations 600 begin, at 602, by receiving first data into a ping buffer from a data controller. The data controller can be the IO controller 406, the feeding controller 410, the weight controller 412, or the output controller 420 of the reconfigurable IC 400 of
(58) At 604, operations 600 continue by concurrently processing the first data in the ping buffer while receiving second data into a pong buffer. The ping buffers and the pong buffers are discussed in further detail below.
(59) At 606, operations 600 continue by transmitting the first data processed at 604 from the ping buffer to a second data controller.
(60) At 608, operations 600 continue by concurrently processing the second data in the pong buffer while receiving third data into the ping buffer from the data controller.
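The alternation in operations 600 can be sketched in Python as follows. The function name ping_pong and the trace format are assumptions for this sketch. In hardware, the load of one buffer and the processing of the other overlap in time; this sequential model only shows which buffer plays which role at each step.

```python
def ping_pong(chunks, process):
    """Process chunks through two buffers that swap roles every step.

    Returns the processed results and a trace of
    (buffer being processed, buffer being loaded or None).
    """
    if not chunks:
        return [], []
    names = ("ping", "pong")
    buffers = [None, None]
    buffers[0] = chunks[0]                 # 602: first data into the ping buffer
    results, trace = [], []
    for step in range(len(chunks)):
        busy = step % 2                    # buffer whose data is processed
        idle = 1 - busy                    # buffer that receives the next data
        has_next = step + 1 < len(chunks)
        # The next two actions are concurrent in hardware (604 and 608).
        buffers[idle] = chunks[step + 1] if has_next else None
        results.append(process(buffers[busy]))
        trace.append((names[busy], names[idle] if has_next else None))
    return results, trace


results, trace = ping_pong([[1, 2], [3, 4], [5, 6]], sum)
print(results)
print(trace)
```

Each step pairs processing of one buffer with loading of the other, which is how the scheme hides transfer latency behind computation.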
Example Data Flow of Hardware Architecture of a Neural Network Reconfigurable IC
(62) In the exemplary embodiment, storage structures (e.g., buffers) of the reconfigurable IC are ping-pong-buffered to allow for processing of one buffer while the IO controller writes to the other buffer or reads from the external memory (e.g., DRAM) to the other buffer. This scheme hides the external memory access latencies and data transfer latencies between on-chip buffers behind compute processes of the reconfigurable IC. This ping-pong-buffering of each storage structure results in a ping buffer and a pong buffer for each storage structure. As illustrated in
(63) In the exemplary embodiment, data first passes through the internal interconnect 702 from external memory (not illustrated in
(64) The IO controller 706 stores the input data in iStage ping buffer 708.sub.1 and iStage pong buffer 708.sub.2. In the exemplary embodiment, when the IO controller 706 stores the input data in these two buffers, the IO controller 706 stores only a subset of the input data into iStage ping buffer 708.sub.1 and stores the rest of the input data into the iStage pong buffer 708.sub.2. In some embodiments, the IO controller 706 is implemented on programmable logic 122.
(65) The feeding controller 710 reads the contents of the iStage ping buffer 708.sub.1 and the iStage pong buffer 708.sub.2 and passes the contents to the feeding ping buffer 714.sub.1 and the feeding pong buffer 714.sub.2. In one embodiment, the data from the iStage ping buffer 708.sub.1 can pass to the feeding ping buffer 714.sub.1, and the data from the iStage pong buffer 708.sub.2 can pass to the feeding pong buffer 714.sub.2. In another embodiment, the data from the iStage ping buffer 708.sub.1 can pass to the feeding pong buffer 714.sub.2, or the data from the iStage pong buffer 708.sub.2 can pass to the feeding ping buffer 714.sub.1. In some embodiments, the feeding controller 710 is implemented on programmable logic 122.
(66) The reconfigurable IC multiplexes the contents of the feeding ping buffer 714.sub.1 and feeding pong buffer 714.sub.2 via a feeding multiplexer 722 for the DPE array 730. The feeding multiplexer 722 passes to the DPE array 730 the contents of one of the ping-pong buffers, thereby emptying that buffer, while withholding the contents of the other. Then, while the reconfigurable IC fills the emptied buffer of the ping-pong buffers, the feeding multiplexer 722 passes on to the DPE array 730 the contents of the other buffer. This alternating multiplexing pattern continues between the LF state and the PF state, discussed in further detail below.
(67) In the exemplary embodiment, when the feeding controller 710 transmits the input data to the feeding ping buffer 714.sub.1 and feeding pong buffer 714.sub.2, the weight controller 712 receives weight data (including parameter data) from external memory through the internal interconnect 702. As with the other controllers and buffers, the weight controller 712 feeds the weight data to the weight ping buffer 716.sub.1 and the weight pong buffer 716.sub.2. In some embodiments, the weight controller 712 is implemented on programmable logic 122.
(68) The reconfigurable IC then multiplexes the weight data via a weight multiplexer 724 and passes the data to the DPE array 730. The weight multiplexer 724 acts in a similar fashion as the feeding multiplexer 722 with an alternating multiplexing pattern between the RW state and the CS state.
(69) The DPE array 730 takes in the input data from the feeding multiplexer 722 and the weight data from the weight multiplexer 724 and performs computations on the input data and the weight data to generate output data. In one embodiment, the DPE array 730 generates data points of the output feature-maps as output data. The DPE array 730 stores the output data in the output buffers comprising the oStage ping buffer 718.sub.1 and oStage pong buffer 718.sub.2, which host the output data to be sent to external memory via the internal interconnect 702. In the exemplary embodiment, the DPE array comprises N number of DPEs, and each DPE comprises M number of NNUs.
(70) In one embodiment, because of ping-pong buffering, the oStage buffers comprise the oStage ping buffer 718.sub.1 and the oStage pong buffer 718.sub.2. The reconfigurable IC multiplexes the output data in the oStage ping buffer 718.sub.1 and oStage pong buffer 718.sub.2 via an output multiplexer 726 to pass to the output controller 720. The output multiplexer 726 acts in a similar fashion to the feeding multiplexer 722 and the weight multiplexer 724, with an alternating multiplexing pattern between the PS state and the WO state. The output controller 720 transmits the accumulated output data to external memory via the internal interconnect 702. In some embodiments, the output controller 720 is implemented on programmable logic 122.
(71) In one embodiment, the reconfigurable IC can configure its buffers (e.g., iStage buffers and oStage buffers) using various design and performance requirements. Accordingly, based on the available internal storage on the reconfigurable IC, the reconfigurable IC may be unable to store the rows of the IFMs and OFMs in internal storage (e.g., iStage buffers and oStage buffers). Where the reconfigurable IC cannot store the rows of the IFMs and the OFMs, the reconfigurable IC generates the entire set of OFMs in multiple iterations using the data-flow described herein. In each iteration, the reconfigurable IC fetches only a few rows of the IFMs and thereby generates partial rows of output. The reconfigurable IC then writes these partial rows of the OFMs to external memory. To mitigate the impact of memory access latencies due to the iterative approach, the reconfigurable IC uses the hierarchical double-buffering (ping-pong) scheme. Therefore, the storage structures of the reconfigurable IC are ping-pong-buffered. As mentioned, in this ping-pong scheme, either the ping structure is processed and the pong structure is busy with memory accesses or the ping structure is busy with memory accesses and the pong structure is processed.
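The iterative generation of partial OFM rows can be sketched as a simple tiling plan. This is an illustration only: the function name `plan_iterations` and the assumption that every iteration produces a fixed number of output rows (except possibly the last) are not taken from the disclosure.

```python
import math


def plan_iterations(ofm_height, rows_per_iteration):
    """Split the OFM rows into half-open (start, stop) ranges, one per iteration."""
    if ofm_height < 0 or rows_per_iteration <= 0:
        raise ValueError("ofm_height must be >= 0 and rows_per_iteration > 0")
    count = math.ceil(ofm_height / rows_per_iteration)
    return [
        (i * rows_per_iteration, min((i + 1) * rows_per_iteration, ofm_height))
        for i in range(count)
    ]


# Ten output rows, four rows per iteration: three iterations, the last one partial.
plan = plan_iterations(10, 4)
```

Each range in the plan corresponds to one fetch, process, and write-out pass, which is the unit that the ping-pong scheme overlaps with memory accesses.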
(72) In an example data flow of the hardware architecture of a neural network reconfigurable IC,
(73) Based on the number of output rows that the oStage buffers can hold and certain other parameters (e.g., filter size and filter stride), the host computer programs the reconfigurable IC with the number of rows of the IFM to be fetched by the IO controller 706 into the iStage ping buffer 708.sub.1 instance or the iStage pong buffer 708.sub.2 instance.
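The disclosure does not give the formula the host computer uses. As a hedged sketch, the standard relationship for an unpadded convolution can serve: producing a given number of output rows requires (output rows - 1) x stride + filter height input rows. The function name and parameters below are hypothetical.

```python
def ifm_rows_to_fetch(output_rows, filter_height, stride):
    """Input rows needed to produce `output_rows` rows of an unpadded convolution.

    Assumption: no padding and no dilation; the source does not state the
    formula actually programmed by the host.
    """
    if output_rows <= 0 or filter_height <= 0 or stride <= 0:
        raise ValueError("all arguments must be positive")
    return (output_rows - 1) * stride + filter_height


rows_stride1 = ifm_rows_to_fetch(output_rows=4, filter_height=3, stride=1)
rows_stride2 = ifm_rows_to_fetch(output_rows=4, filter_height=3, stride=2)
```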
(74) In one embodiment, the feeding controller 710 then fetches a rectangular block of data points in the IFMs needed for processing 32 output pixels (assuming there are 32 NNUs and each NNU generates an output pixel) across the OFMs and loads the rectangular block of data points into the feeding buffer instance of each NNU. Because there are 32 NNUs assumed in the hardware configuration, there are 32 feeding buffer instances or 16 feeding buffer instances if the buffer instances are dual-ported.
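The instance count stated above follows from simple arithmetic, sketched here under the assumption that a dual-ported instance serves exactly two NNUs. The function name is hypothetical.

```python
def feeding_buffer_instances(num_nnus, dual_ported=False):
    """Number of feeding buffer instances needed to serve `num_nnus` NNUs."""
    ports = 2 if dual_ported else 1
    if num_nnus <= 0 or num_nnus % ports:
        raise ValueError("num_nnus must be positive and divisible by the port count")
    return num_nnus // ports


single_ported = feeding_buffer_instances(32)
dual_ported = feeding_buffer_instances(32, dual_ported=True)
```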
(75) While the feeding controller 710 loads the data into the feeding buffers, the weight controller 712 fetches weight data corresponding to the first four OFMs and loads the data into the weight ping buffer 716.sub.1 instance. In one embodiment, the weight ping buffer 716.sub.1 and weight pong buffer 716.sub.2 are organized in banks based on the number of DPEs configured for processing the neural network.
(76) The DPE array 730 then reads and processes the contents of the feeding ping buffer 714.sub.1 instance and the weight ping buffer 716.sub.1 to generate the first 32 data points (e.g., pixels) of the first four OFMs (OFM0, OFM1, OFM2, and OFM3). For example, the OFMs 320.sub.1-320.sub.n stored in the oStage ping buffer 718.sub.1 include the first four OFMs (OFM0, OFM1, OFM2, and OFM3). In the example data flow, each DPE processes the input data to generate the first 32 data points of its corresponding OFMs, i.e., the first DPE generates OFM0, OFM4, OFM8, and OFM12; the second DPE generates OFM1, OFM5, OFM9, and OFM13; and so on.
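The assignment of OFMs to DPEs in this example can be sketched as a round-robin (modulo) mapping, assuming four DPEs and sixteen OFMs with zero-based DPE indices. The function name is hypothetical.

```python
def ofms_for_dpe(dpe_index, num_ofms=16, num_dpes=4):
    """OFM indices handled by one DPE under a round-robin assignment."""
    if not 0 <= dpe_index < num_dpes:
        raise ValueError("dpe_index out of range")
    return list(range(dpe_index, num_ofms, num_dpes))


first_dpe = ofms_for_dpe(0)
second_dpe = ofms_for_dpe(1)
```

Under this mapping, each group of four consecutive OFMs (for example OFM0 through OFM3) is spread across all four DPEs and processed in the same pass.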
(77) While the DPE array processes the first 32 data points of the first 4 OFMs, the weight controller 712 fetches the weight data for the next four OFMs into the weight pong buffer 716.sub.2. In the C state, the DPE array 730 then reads the feeding ping buffer 714.sub.1 instance again and processes it using the data in the weight pong buffer 716.sub.2 to generate the first 32 data points of the OFM4, OFM5, OFM6, and OFM7. Like with the previous OFMs, the OFMs 320.sub.1-320.sub.n stored in the oStage ping buffer 718.sub.1 include the second four OFMs (OFM4, OFM5, OFM6, and OFM7).
(78) While the DPE array 730 processes the first 32 data points of OFM4, OFM5, OFM6, and OFM7, the weight controller 712 fetches the weight data for the next four OFMs into the weight ping buffer 716.sub.1 instance. In the C state, the DPE array then processes the feeding ping buffer 714.sub.1 instance again with the contents of the weight ping buffer 716.sub.1 to generate the first 32 data points of OFM8, OFM9, OFM10, and OFM11. Like with the previous OFMs, the OFMs 320.sub.1-320.sub.n stored in the oStage ping buffer 718.sub.1 include the third four OFMs (OFM8, OFM9, OFM10, and OFM11).
(79) Similar to the above steps, the DPE array generates the first 32 data points of OFM12, OFM13, OFM14, and OFM15 using the contents of the feeding ping buffer 714.sub.1 instance and the weight pong buffer 716.sub.2. Like with the previous OFMs, the OFMs 320.sub.1-320.sub.n stored in the oStage ping buffer 718.sub.1 include the fourth four OFMs (OFM12, OFM13, OFM14, and OFM15).
(80) The DPE array 730 repeats the previous read-and-process operations on the contents of the feeding pong buffer 714.sub.2 instance to generate the next 32 data points of the 16 OFMs. Also, while the DPE array 730 repeats the previous read-and-process operations, the output controller 720 reads the contents of the oStage ping buffer 718.sub.1 instance and writes the contents out to external memory over the internal interconnect 702.
(81) Once the second set of 32 data points for the 16 OFMs is generated and written to the oStage pong buffer 718.sub.2, the output controller 720 writes the contents of the oStage pong buffer 718.sub.2 to the external memory over the internal interconnect 702.
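The example data flow can be summarized as a schedule of compute passes. The sketch below is a reading of the steps above under stated assumptions: four DPEs, sixteen OFMs, weight buffers that alternate on every pass, and one oStage buffer per feeding block (inferred from the write-out steps, where the oStage ping buffer is drained first and the oStage pong buffer second). The function and field names are hypothetical.

```python
def compute_schedule(num_ofms=16, num_dpes=4, feeding_blocks=2):
    """List the compute passes with the buffers each pass reads and writes."""
    names = ("ping", "pong")
    steps = []
    weight_select = 0  # the weight buffers alternate on every pass
    for block in range(feeding_blocks):
        for first_ofm in range(0, num_ofms, num_dpes):
            steps.append({
                "feeding": names[block % 2],   # reused for every OFM group
                "weights": names[weight_select],
                "ofms": list(range(first_ofm, min(first_ofm + num_dpes, num_ofms))),
                "ostage": names[block % 2],    # assumption: one oStage buffer per block
            })
            weight_select ^= 1
    return steps


steps = compute_schedule()
```

The schedule makes the reuse visible: each feeding buffer is read four times, once per group of four OFMs, while the weight buffers swap on every pass so that the next group's weights can be fetched during the current computation.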
(82) In some embodiments, the reconfigurable IC can configure the storage structures, the number of DPEs, and the number of NNUs per DPE. This configurability scales down to smaller configurations based on the performance and area requirements. The reconfigurable IC can configure the depth of various buffers, such as the iStage buffers and the oStage buffers, to 1k, 2k, 4k, or 8k. The reconfigurable IC can also configure the depth of the weight buffers to 2k or 4k. The reconfigurable IC can configure the number of DPEs and can therefore have 4, 8, or 16 DPEs. Additionally, the reconfigurable IC can configure the number of NNUs per DPE, such that the reconfigurable IC comprises 8, 16, 32, 40, 48, 56, or 64 NNUs per DPE. In one embodiment, the reconfigurable IC can comprise any combination of the above configurations.
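The configuration space listed above can be captured in a small validation sketch. The class name, the field names, and the choice to express depths in units of "k" (without assuming whether k means 1000 or 1024 entries) are hypothetical.

```python
from dataclasses import dataclass

# Allowed values as listed in the paragraph above; depths are in units of "k".
ALLOWED = {
    "istage_depth_k": (1, 2, 4, 8),
    "ostage_depth_k": (1, 2, 4, 8),
    "weight_depth_k": (2, 4),
    "num_dpes": (4, 8, 16),
    "nnus_per_dpe": (8, 16, 32, 40, 48, 56, 64),
}


@dataclass(frozen=True)
class AcceleratorConfig:
    istage_depth_k: int
    ostage_depth_k: int
    weight_depth_k: int
    num_dpes: int
    nnus_per_dpe: int

    def __post_init__(self):
        # Reject any value outside the listed options.
        for name, allowed in ALLOWED.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value} is not one of {allowed}")

    @property
    def total_nnus(self):
        return self.num_dpes * self.nnus_per_dpe


config = AcceleratorConfig(
    istage_depth_k=4, ostage_depth_k=4, weight_depth_k=2, num_dpes=4, nnus_per_dpe=32
)
```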
(83)
(84) In the exemplary embodiment, during the RI state 802, the IO controller, such as the IO controller 706 from
(85)
(86) During the RI state 802, the reconfigurable IC fetches a few rows of the IFMs from external memory and writes these rows to the iStage ping buffer. After this, the PS state 804 begins and the DPE array, such as the DPE array 730 of
(87)
(88) In the exemplary embodiment, during the LF state 902, the feeding controller loads the contents of the iStage buffers (either the iStage ping buffer 708.sub.1 or the iStage pong buffer 708.sub.2) into the feeding ping buffer. After the LF state 902, in which the feeding controller writes data into the feeding ping buffer, the DPE array reads data from the feeding ping buffer for processing in the PF state 904. Also, while the reconfigurable IC processes the feeding ping buffer in the PF state 904, the feeding controller loads the feeding pong buffer in the LF state 902. This cycle between the LF state 902 and the PF state 904 of the ping and pong buffers continues until the number of data points is equal to the number of partial output rows to be generated by the reconfigurable IC multiplied by the height of the OFM to be generated.
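The interleaving of the LF and PF states across the two feeding buffers can be sketched as a per-slot timeline. This is an illustration under simplifying assumptions: each load and each process step takes one time slot, and an unused buffer is labeled "idle". The function name is hypothetical.

```python
def feeding_timeline(num_blocks):
    """Return one {"ping": state, "pong": state} dict per time slot."""
    names = ("ping", "pong")
    timeline = []
    for slot in range(num_blocks + 1):
        states = {"ping": "idle", "pong": "idle"}
        if slot < num_blocks:
            # Block `slot` is being loaded (LF) into its buffer.
            states[names[slot % 2]] = "LF"
        if slot >= 1:
            # Block `slot - 1`, loaded in the previous slot, is processed (PF).
            states[names[(slot - 1) % 2]] = "PF"
        timeline.append(states)
    return timeline


timeline = feeding_timeline(3)
```

In every slot of this timeline the ping and pong buffers are in different states, which is the property the ping-pong scheme relies on.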
(89) As with
(90)
(91)
(92) In one embodiment, where the weight data for the OFMs can fit in the weight buffers (both the weight ping buffer and the weight pong buffer), the weight controller fetches the data only once from external memory, thereby avoiding the memory latencies incurred by repeated fetches of weight data from memory.
(93) As with
(94)
(95) As with the previous figures, using the ping-pong buffering scheme, the states of the ping buffers and the pong buffers are mutually exclusive, so that the ping buffers and the pong buffers are never in the same state. For example, the ping buffer cannot be in the C state 1102 at the same time as the pong buffer, as illustrated in the data flow 1110.
Example Neural Network Unit of Neural Network Reconfigurable IC
(96)
(97) In the exemplary embodiment, the double-pumped DSP scheme doubles the throughput of the reconfigurable IC. One configuration of the DSP48E2 hard-macros allows for the performance of 2 MACs at 6b fixed-point precision (both input data and weight data at 6b precision). Accordingly, this int6 scheme overlaid on the double-pump scheme quadruples the throughput of the reconfigurable IC.
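The throughput claim above is a product of two factors, which the following sketch makes explicit. It assumes a baseline of one MAC per DSP per fabric clock cycle and a DSP clock running at exactly twice the fabric clock when double-pumped; the function name and parameters are hypothetical.

```python
def macs_per_fabric_cycle(num_dsps, double_pumped, macs_per_dsp_op):
    """MAC operations completed per programmable-logic clock cycle."""
    clock_ratio = 2 if double_pumped else 1  # DSP clock relative to fabric clock
    return num_dsps * clock_ratio * macs_per_dsp_op


baseline = macs_per_fabric_cycle(1, double_pumped=False, macs_per_dsp_op=1)
double_pump_only = macs_per_fabric_cycle(1, double_pumped=True, macs_per_dsp_op=1)
double_pump_int6 = macs_per_fabric_cycle(1, double_pumped=True, macs_per_dsp_op=2)
```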
(98)
Example Data Organization for Neural Network Reconfigurable IC
(99)
(100)
(101) In the exemplary embodiment, the reconfigurable IC organizes the iStage buffers as eight banks of storage with each bank consisting of four sub-banks. In one embodiment, the iStage buffers can be viewed as an 8x4 set of block or bridging random access memory (BRAM) on an IC. The reconfigurable IC configures the depth of each sub-bank based on the performance and area requirements. Because each output element in an OFM requires input elements across the IFMs, the reconfigurable IC fetches a set of rows across the IFMs into the iStage buffers. Depending on the size of the iStage buffers, the number of IFMs, and the resolution of the IFMs, the set of rows can represent a subset of rows in the IFMs or the complete IFMs.
(102) In an exemplary embodiment, because the input to the first layer in a neural network is typically an image which comprises 3 planes (IFMs), the data organization in the iStage buffers for the first layer is different. For the first layer, the reconfigurable IC fills the first bank of RAMs with a few rows of the input image, with the R, G, and B planes (IFMs) residing in separate sub-banks. The fourth sub-bank is loaded with zeros. The second bank of RAMs is filled with the next few rows of the input image, and so on.
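The first-layer organization can be sketched with nested lists. This is an illustration only: the function name, the `image[plane][row]` indexing, and the choice to stop after the configured number of banks are assumptions, since the disclosure does not say what happens once all banks are full.

```python
def first_layer_layout(image, rows_per_bank, num_banks=8):
    """Arrange an RGB image into banks of four sub-banks each.

    image[plane][row] is a list of pixel values, with planes ordered R, G, B.
    Sub-banks 0..2 hold the R, G, and B rows; sub-bank 3 holds zeros.
    """
    width = len(image[0][0])
    height = len(image[0])
    banks = []
    for bank in range(num_banks):
        start = bank * rows_per_bank
        rows = range(start, min(start + rows_per_bank, height))
        sub_banks = [[image[plane][r] for r in rows] for plane in range(3)]
        sub_banks.append([[0] * width for _ in rows])  # zero-filled fourth sub-bank
        banks.append(sub_banks)
    return banks


# Hypothetical 3-plane, 4-row, 2-column image with recognizable values.
image = [[[p * 100 + r * 10 + c for c in range(2)] for r in range(4)] for p in range(3)]
banks = first_layer_layout(image, rows_per_bank=2, num_banks=2)
```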
(103) In one embodiment, the organization of the oStage buffers, such as the oStage ping buffer 718.sub.1 and oStage pong buffer 718.sub.2 of
(104)
(105) In the exemplary data organization 1600, “P.sub.n” represents the nth-plane (IFM), “E” represents an element within a plane, “W” represents the width of an IFM, and “NP” represents the number of NNUs configured for each DPE of the reconfigurable IC.
(106) In one exemplary embodiment, the reconfigurable IC has configured banks 1602.sub.1-1602.sub.N to have dual-ported sub-banks. Bank 1602.sub.1 feeds the zero-th NNU and the (NP/2)-th NNU. Bank 1602.sub.2 feeds the first NNU and the (NP/2+1)-th NNU. This pattern continues until, at the end, Bank 1602.sub.P-1 feeds the (NP/2-1)-th NNU and the (NP-1)-th NNU.
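The bank-to-NNU pattern can be written as a short mapping. The sketch uses zero-based bank indices (so index 0 corresponds to Bank 1602.sub.1) and assumes NP is even; the function name is hypothetical.

```python
def bank_to_nnus(np_per_dpe):
    """Map each dual-ported bank index to the pair of NNUs it feeds."""
    if np_per_dpe <= 0 or np_per_dpe % 2:
        raise ValueError("NP must be a positive even number")
    half = np_per_dpe // 2
    # Bank k feeds NNU k through one port and NNU (NP/2 + k) through the other.
    return {bank: (bank, half + bank) for bank in range(half)}


mapping = bank_to_nnus(32)
```

With NP = 32, sixteen banks cover all thirty-two NNUs, matching the count of dual-ported feeding buffer instances given earlier.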
(107)
(108)
(109) This configuration shown with DPE 1832.sub.2 and DPE 1832.sub.3 can extend to a batch size of 8 as the iStage buffers and the oStage buffers are organized as eight banks.
(110)
(111) The exemplary data flow 1900 illustrates the RI state, the PS state, and the WO state for a few exemplary images, such as Image 0 (“*_0”), Image 1 (“*_1”), and Image 2 (“*_2”).
(112)
(113) In some FPGAs, each programmable tile can include at least one programmable interconnect element (“INT”) 43 having connections to input and output terminals 48 of a programmable logic element within the same tile, as shown by examples included at the top of
(114) In an example implementation, a CLB 33 can include a configurable logic element (“CLE”) 44 that can be programmed to implement user logic plus a single programmable interconnect element (“INT”) 43. A BRAM 34 can include a BRAM logic element (“BRL”) 45 in addition to one or more programmable interconnect elements. In one embodiment, the BRAM 34 is a part of memory 140 which can retain stored data during reconfigurations as described above. Typically, the number of interconnect elements included in a tile depends on the height of the tile. In the pictured example, a BRAM tile has the same height as five CLBs, but other numbers (e.g., four) can also be used. A DSP block 35 can include a DSP logic element (“DSPL”) 46 in addition to an appropriate number of programmable interconnect elements. An IOB 36 can include, for example, two instances of an input/output logic element (“IOL”) 47 in addition to one instance of the programmable interconnect element 43. As will be clear to those of skill in the art, the actual IO pads connected, for example, to the IO logic element 47 typically are not confined to the area of the input/output logic element 47.
(115) In the pictured example, a horizontal area near the center of the die (shown in
(116) Some FPGAs utilizing the architecture illustrated in
(117) Note that
(118) In the preceding, reference is made to embodiments presented in this disclosure. However, the scope of the present disclosure is not limited to specific described embodiments. Instead, any combination of the described features and elements, whether related to different embodiments or not, is contemplated to implement and practice contemplated embodiments. Furthermore, although embodiments disclosed herein may achieve advantages over other possible solutions or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the scope of the present disclosure. Thus, the preceding aspects, features, embodiments and advantages are merely illustrative and are not considered elements or limitations of the appended claims except where explicitly recited in a claim(s).
(119) As will be appreciated by one skilled in the art, the embodiments disclosed herein may be embodied as a system, method or computer program product. Accordingly, aspects may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
(120) Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium is any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus or device.
(121) A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
(122) Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
(123) Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user’s computer, partly on the user’s computer, as a stand-alone software package, partly on the user’s computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user’s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
(124) Aspects of the present disclosure are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments presented in this disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
(125) These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
(126) The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
(127) The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various examples of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
(128) While the foregoing is directed to specific examples, other and further examples may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.