METHOD FOR IMPLEMENTING A HARDWARE ACCELERATOR OF A NEURAL NETWORK
20230004775 · 2023-01-05
Inventors
- Alban BOURGE (LES CLAYES SOUS BOIS, FR)
- Meven MOGNOL (LES CLAYES SOUS BOIS, FR)
- Emrick SINITAMBIRIVOUTIN (LES CLAYES SOUS BOIS, FR)
Abstract
The invention relates to a method for implementing a hardware accelerator for a neural network, comprising: a step of interpreting an algorithm of the neural network in binary format, converting the neural network algorithm in binary format into a graphical representation, selecting building blocks from a library of predetermined building blocks, creating an organization of the selected building blocks, and configuring internal parameters of the building blocks of the organization so that the organization of the selected and configured building blocks corresponds to said graphical representation; a step of determining an initial set of weights for the neural network; a step of completely synthesizing the organization of the selected and configured building blocks on the one hand in a preselected FPGA programmable logic circuit (41) in a hardware accelerator (42) for the neural network, and on the other hand in a software driver for this hardware accelerator (42), this hardware accelerator (42) being specifically dedicated to the neural network so as to represent the entire architecture of the neural network without needing access to a memory (44) external to the FPGA programmable logic circuit (41) when passing from one layer to another layer of the neural network; and a step of loading (48) the initial set of weights for the neural network into the hardware accelerator (42).
Claims
1. A method for implementing a hardware accelerator for a neural network, comprising: interpreting an algorithm of the neural network in binary format; converting the neural network algorithm in binary format (25) into a graphical representation by: selecting building blocks from a library (37) of predetermined building blocks; creating (33) an organization of the selected building blocks; and configuring internal parameters of the building blocks of the organization; where the organization of the selected and configured building blocks corresponds to said graphical representation; determining an initial set (36) of weights for the neural network; completely synthesizing (13, 14) the organization of the selected and configured building blocks on the one hand in a preselected FPGA programmable logic circuit (41) in a hardware accelerator (42) for the neural network and on the other hand in a software driver for the hardware accelerator (42), the hardware accelerator (42) being specifically dedicated to the neural network so as to represent an entire architecture of the neural network without needing access to a memory (44) external to the FPGA programmable logic circuit (41) when passing from one layer (71 to 75) to another layer (72 to 76) of the neural network; and loading (48) the initial set of weights for the neural network into the hardware accelerator (42).
2. The method for implementing a hardware accelerator for a neural network according to claim 1, further comprising, before the interpretation step (6, 30), binarizing (4, 20) the neural network algorithm, including an operation of compressing a floating point format to a binary format.
3. The method for implementing a hardware accelerator for a neural network according to claim 1, further comprising, before the interpretation step (6, 30), selecting from a library (8, 37) of predetermined models of neural network algorithms already in binary format.
4. The method for implementing a hardware accelerator for a neural network according to claim 1, wherein the internal parameters comprise a size of the neural network input data.
5. The method for implementing a hardware accelerator for a neural network according to claim 1, wherein the neural network is convolutional, and the internal parameters also comprise sizes of the convolutions of the neural network.
6. The method for implementing a hardware accelerator for a neural network according to claim 1, wherein the neural network algorithm in binary format (25) is in an ONNX format.
7. The method for implementing a hardware accelerator for a neural network according to claim 1, wherein the organization of the selected and configured building blocks is described by a VHDL code (15) representative of an acceleration kernel of the hardware accelerator (42).
8. The method for implementing a hardware accelerator for a neural network according to claim 1, wherein the synthesizing step (13, 14) and the loading step (48) are carried out by communication between a host computer (46, 47) and an FPGA circuit board (40) including the FPGA programmable logic circuit (41), this communication advantageously being carried out by means of the OpenCL standard through a PCI Express type of communication channel (49).
9. The method for implementing a hardware accelerator for a neural network according to claim 1, wherein the neural network is a neural network configured for an application in computer vision.
10. The method for implementing a hardware accelerator for a neural network according to claim 9, wherein the application in computer vision is an application in a surveillance camera, or an application in an image classification system, or an application in a vision device embedded in a motor vehicle.
11. A circuit board (40) comprising: an FPGA programmable logic circuit (41); a memory external to the FPGA programmable logic circuit (44); and a hardware accelerator for a neural network that is fully implemented in the FPGA programmable logic circuit (41), and specifically dedicated to the neural network so as to be representative of an entire architecture of the neural network without requiring access to a memory (44) external to the FPGA programmable logic circuit when passing from one layer (71 to 75) to another layer (72 to 76) of the neural network, the hardware accelerator comprising: an interface (45) to the external memory; an interface (49) to an exterior of the circuit board; and an acceleration kernel (42) successively comprising: an information reading block (50); an information serialization block (60) with two output channels (77, 78) including a first output channel (77) to send input data to the layers (70-76) of the neural network, and a second output channel (78) to configure weights at the layers (70-76) of the neural network; the layers (70-76) of the neural network; an information deserialization block (80); and an information writing block (90).
12. The circuit board according to claim 11, wherein the information reading block (50) comprises a buffer memory (55), and the information writing block (90) comprises a buffer memory (95).
13. An embedded device, comprising a circuit board according to claim 11.
14. The embedded device according to claim 13, wherein the embedded device is an embedded device for computer vision.
Description
BRIEF DESCRIPTION OF DRAWINGS
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
[0059] The architecture has three layers: the models layer 1, the software stack layer 5, and the hardware stack layer 8.
[0060] The models layer 1 comprises a library 2 of binary models of neural network algorithms already in ONNX format, and a set 3 of models of neural network algorithms which are pre-trained but in a floating point format (32-bit), in particular in the TENSORFLOW, PYTORCH, and CAFFE2 formats. The method for implementing a hardware accelerator for a neural network has two possible inputs: either a model already present in the library 2, or a model conventionally pre-trained in a non-fixed software architecture format (“framework”) belonging to the set 3. For a software architecture to be easily compatible with this implementation method, it is sufficient that a converter from this software architecture to the ONNX format exists, ONNX being a representation shared transversally across software architectures.
[0061] This set 3 of models of neural network algorithms which are pre-trained but in a floating point format can be binarized by a binarizer 4, possibly equipped with an additional function of re-training the neural network (for example for PYTORCH), which transforms the neural network algorithm models from a floating point format to a binary format, preferably the binary ONNX format.
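By way of illustration, the compression of floating point weights to a binary format performed by the binarizer 4 can be sketched in a few lines of Python. This is a minimal sketch assuming a common sign-plus-scale scheme; the function name and the per-tensor scaling factor are illustrative assumptions, not the actual operation of the binarizer 4:

```python
def binarize_weights(weights):
    """Compress floating-point weights to a 1-bit sign representation.

    Each weight keeps only its sign; a single per-tensor scaling
    factor (here the mean absolute value) preserves the overall
    magnitude, as in common binarized-network schemes (assumption).
    """
    flat = [w for row in weights for w in row]
    alpha = sum(abs(w) for w in flat) / len(flat)   # scaling factor
    w_bin = [[1 if w >= 0 else -1 for w in row] for row in weights]
    return w_bin, alpha

# A 32-bit float weight matrix shrinks to 1 bit per weight plus one scale.
w_bin, alpha = binarize_weights([[0.7, -0.3], [-1.2, 0.05]])
```

The re-training function mentioned above would then fine-tune the network against these binarized weights to recover accuracy lost in the compression.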
[0062] The hardware stack layer 8 comprises a library 9 of components, more precisely a library 9 of predetermined building blocks which will be selected, assembled together, and individually parameterized by the software stack layer 5, more precisely by the constructor block 6 of the software stack layer 5.
[0063] The software stack layer 5 comprises, on the one hand, the constructor block 6 which will generate both the hardware accelerator for the neural network and the software driver for this hardware accelerator for the neural network, and on the other hand the driver block 7 which will use the driver software to drive the hardware accelerator for the neural network.
[0064] More precisely, the constructor block 6 comprises several functions, including: a graph compilation function which starts with a binarized neural network algorithm; a function for generating code in VHDL format (from “VHSIC Hardware Description Language”, VHSIC meaning “Very High Speed Integrated Circuit”), this code in VHDL format containing information both for implementing the hardware accelerator for the neural network and for the driver software of this hardware accelerator; and a synthesis function enabling actual implementation of the hardware accelerator for the neural network on an FPGA programmable logic circuit. The two inputs of the method for implementing the neural network accelerator come together in the constructor block 6, which studies the neural network input algorithm and converts it into a clean graph representation. After this conversion into a graph, two products are generated: the VHDL code describing the hardware accelerator, including the acceleration kernel as well as the driver software of this hardware accelerator, which remains to be synthesized using synthesis tools; plus the corresponding neural network configuration weights.
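The graph compilation and VHDL generation functions of the constructor block 6 can be illustrated by the following Python sketch, which maps each node of a graph representation to a parameterized building block from a library and emits one VHDL instantiation per node. All names (BLOCK_LIBRARY, the node attributes, the entity names) are hypothetical, and the emitted instantiations are simplified (no port maps):

```python
# Hypothetical block library: node operation -> VHDL entity name.
BLOCK_LIBRARY = {"Conv": "conv_block", "MaxPool": "maxpool_block",
                 "Dense": "dense_block"}

def compile_graph(nodes):
    """Translate a graph of network nodes into VHDL instantiations.

    Each node selects a predetermined building block from the library
    and configures its generics (internal parameters, e.g. input data
    size or convolution size) from the node's attributes.
    """
    lines = []
    for i, node in enumerate(nodes):
        entity = BLOCK_LIBRARY[node["op"]]
        generics = ", ".join(f"{k} => {v}" for k, v in node["params"].items())
        lines.append(f"u{i}: entity work.{entity} generic map ({generics});")
    return "\n".join(lines)

vhdl = compile_graph([
    {"op": "Conv", "params": {"IN_SIZE": 32, "KERNEL": 3}},
    {"op": "MaxPool", "params": {"IN_SIZE": 30}},
])
```

The same traversal would also record, for each node, the routing information needed by the driver software to deliver weights to the right block.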
[0065] More precisely, the driver block 7 comprises several functions, including: a function of loading the VHDL code; a programming interface; and a function of communication between the host computer and the FPGA programmable logic circuit based on the technology of the OpenCL (“Open Computing Language”) software infrastructure. Once the hardware accelerator has been synthesized on the chosen target FPGA programmable logic circuit, for example using a suite of tools specific to the manufacturer of the FPGA programmable logic circuit, the driver block 7, which incorporates an application programming interface (API), for example in the Python programming language and the C++ programming language, is used to drive the hardware accelerator. Communication between the host computer and the FPGA is based on OpenCL technology, which is a standard.
[0066] Consequently, great freedom is offered to the user, who can create his or her own program after the generation of the acceleration kernel and the configuration. If the user wishes to target a particular FPGA programmable logic circuit not provided for by the method for implementing a hardware accelerator for a neural network according to the invention, this remains possible: it is sufficient that this model of FPGA programmable logic circuit be supported by the suite of tools from the vendor of this FPGA programmable logic circuit.
[0067] One of the advantageous features of the method for implementing a hardware accelerator for a neural network proposed by the invention is to be compatible with complex neural network structures such as “ResNet” (for “Residual Network”) or “GoogLeNet”. These neural networks have the distinctive feature of divergent data paths, which are then merged or not merged according to various techniques (an “elt-wise” layer, for “element-wise”, being the most common).
[0068] The graph compiler located in the constructor block 6 recognizes these features and translates them correctly into a corresponding hardware accelerator architecture.
[0070] A model 12 of a convolutional neural network algorithm (CNN), for example in a TENSORFLOW, CAFFE2, or PYTORCH format, is transformed into a model 10 of a binarized convolutional neural network algorithm in ONNX format which is sent to an input of a training block 20. A set 11 of training data is sent to another input of this training block 20 to be transformed into trained weights 23 by interaction with a description 22 of the neural network. A conversion by internal representation 21 is made from the model 10 of the binarized convolutional neural network algorithm in ONNX format to the description 22 of the neural network, which, by interaction with the set 11 of training data, gives the trained weights 23 which will be sent to an input of the block 30 of the FPGA toolkit. After this, the description 22 of the neural network is again converted by internal representation 24 to a binarized convolutional neural network algorithm 25 in ONNX format which in turn will be sent to another input of the FPGA toolkit block 30.
[0071] The binarized convolutional neural network algorithm 25 in ONNX format is converted by internal representation 32 and transformed by the cooperation of the construction function 33 and a data converter 34 having received the trained weights 23, in order to output an instantiation 35 of files (“.vhd” files) and a set of weights 36 (“.data” files), all using libraries 37 in the C and C++ programming languages. The data converter 34 puts the trained weights in the proper format and associates them, in the form of a header, with guides, so that they reach the correct destinations in the correct layers of the neural network. The internal representation 32, the construction function 33, and the data converter 34 are grouped together in a sub-block 31.
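The header-plus-guides packing performed by the data converter 34 might look like the following Python sketch; the header layout (a 32-bit little-endian destination layer index followed by the payload length) is an assumption made for illustration, not the actual format:

```python
import struct

def pack_weights(layer_id, weights):
    """Prefix a packed weight payload with a routing header.

    The header carries the destination layer index and the payload
    length, acting as the "guide" that steers each weight packet to
    the correct layer of the neural network (hypothetical layout:
    two 32-bit little-endian fields, then 8-bit signed weights).
    """
    payload = struct.pack(f"<{len(weights)}b", *weights)
    header = struct.pack("<II", layer_id, len(payload))
    return header + payload

# Route four binarized weights to layer 2 of the network.
packet = pack_weights(layer_id=2, weights=[1, -1, -1, 1])
```

On the hardware side, the serialization block would read the header, strip it, and forward the payload down the configuration channel toward the addressed layer.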
[0072] At the output of the FPGA toolkit block 30, the pair formed by the instantiation 35 of the files and the set of weights 36 can then either be compiled by an FPGA compiler 14, which can however take a considerable amount of time, or, where appropriate, be associated with an already precompiled model in an FPGA precompiled library 13, which is much faster but of course requires that this pair correspond to an already precompiled model stored in the FPGA precompiled library 13. The obtained result, whether it comes from the FPGA precompiled library 13 or from the FPGA compiler 14, is an FPGA configuration stream 15.
[0074] A host computer integrating both a host processor 46 and a random access memory 47 (RAM), storing the data 48 required for the hardware accelerator for the neural network, communicates bidirectionally by means of a serial local bus 49, advantageously of the PCIe type (for “PCI Express”, with PCI for “Peripheral Component Interconnect”), with the FPGA circuit board 40 implementing the hardware accelerator for the neural network, and in particular its acceleration kernel 42.
[0075] The FPGA circuit board 40 comprises an FPGA chip 41. This FPGA chip 41 houses the acceleration kernel 42 as well as a BSP interface 43 (for “Board Support Package”). The FPGA chip 41, and in particular the acceleration kernel 42, communicates with a memory 44 integrated onto the FPGA circuit board 40 via a DDR bus 45. The memory 44 is a memory internal to the FPGA circuit board 40, but external to the FPGA chip 41; it is a high-speed memory. This memory 44 is advantageously a memory of the DDR or DDR-2 type (in fact DDR SDRAM, for “Double Data Rate Synchronous Dynamic Random Access Memory”).
[0076] When passing from one layer to another layer in the neural network, in the invention, neither the memory 47 external to the FPGA circuit board 40 nor the memory 44 internal to the FPGA circuit board 40 but external to the FPGA chip 41 is read from in order to load part of the hardware accelerator, unlike the prior art. Indeed, in the invention, the entire architecture of the neural network is loaded all at once at the start into the acceleration kernel 42 of the FPGA chip 41, whereas in the prior art each layer is loaded separately after the use of the previous layer, which it then replaces. The prior art therefore requires an exchange time and volume between the FPGA chip 41 and the exterior of this FPGA chip 41 that are much greater than those of the invention for the same type of operation of the implemented neural network, and therefore offers a much lower operating efficiency than that of the invention. It is because it is specifically dedicated to the neural network that the hardware accelerator can be loaded all at once; conversely, in the prior art, the hardware accelerator is general-purpose and must be loaded layer by layer in order to “reprogram” it for each new layer, a loading all at once not being possible in the prior art without resorting to a very large size for the hardware accelerator. In the specific (and dedicated) hardware accelerator of the invention, the topology is multi-layered, which allows it to be entirely implemented all at once without requiring too large a size for the hardware accelerator, whereas in the prior art the general-purpose hardware accelerator implements different topologies, one topology for each layer.
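The efficiency argument above can be made concrete with a small illustrative model in Python: if the prior art reloads each layer's configuration from outside the FPGA chip for every inference, while the invention loads the whole network once, the external traffic differs by a factor equal to the number of inferences. The figures used here are arbitrary, chosen only to show the shape of the comparison:

```python
def external_transfers(n_layers, layer_config_bytes, n_inferences, load_once):
    """Bytes of accelerator configuration exchanged with memory outside
    the FPGA chip, comparing a one-shot load of the whole network
    against per-layer reloading at every inference (illustrative model,
    not measured data)."""
    if load_once:
        # Invention: the entire multi-layer topology is loaded once.
        return n_layers * layer_config_bytes
    # Prior art: every layer is reloaded for every inference.
    return n_layers * layer_config_bytes * n_inferences

once = external_transfers(10, 1_000_000, 1000, load_once=True)
per_layer = external_transfers(10, 1_000_000, 1000, load_once=False)
```

With these arbitrary figures, the per-layer scheme moves a thousand times more configuration data across the chip boundary for the same thousand inferences.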
[0079] The acceleration kernel 42 communicates with the BSP interface 43 (also based on the “OpenCL” communication standard), this communication being represented more specifically in the figure.
[0080] The acceleration kernel 42 successively comprises, in series, firstly the reading unit 50, then the serialization block 60, then the layers 70 of the neural network itself, then the deserialization block 80, and finally the writing block 90. The signals reach the reading unit 50 via the read interface 52, and exit from the writing block 90 via the write interface 92 after passing successively through the serialization block 60, the layers 70 of the neural network, and the deserialization block 80. Packet management is ensured from start to end, from packet management 54 in the reading unit 50 to packet management 94 in the writing block 90, travelling successively (dotted lines) through the serialization block 60, the layers 70 of the neural network, and the deserialization block 80.
[0081] The reading unit 50 comprises a read interface 52 at its input, and comprises at its output a line 53 for sending input data (for the next serialization block 60) confirmed as ready for use. The reading unit 50 comprises a buffer memory 55 including registers 56 and 57 respectively receiving the external parameters “pin” and “iter_i”.
[0082] The serialization block 60 transforms the data 53 arriving from the reading unit 50 into data 65 stored in registers 61 to 64, for example in 512 registers, although only 4 registers are represented in the figure.
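The role of the serialization block 60 can be illustrated by the following Python sketch, which splits an incoming stream into fixed-width register words (4 registers of 4 bytes here for readability, standing in for the 512 registers mentioned above). The function name and the zero-padding scheme are assumptions:

```python
def serialize(data: bytes, n_registers=4, reg_width=4):
    """Split an incoming byte stream into fixed-width register words,
    zero-padding the tail, as a stand-in for the serialization block
    filling its bank of registers (illustrative sketch)."""
    regs = []
    for i in range(n_registers):
        chunk = data[i * reg_width:(i + 1) * reg_width]
        regs.append(chunk.ljust(reg_width, b"\x00"))
    return regs

# A 10-byte input stream fills two and a half of the four registers.
regs = serialize(b"input data", n_registers=4, reg_width=4)
```

The deserialization block 80 would perform the inverse operation, reassembling the register words of the output layer into a contiguous stream for the writing block 90.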
[0083] The layers 70 of the neural network implement the multilayer topology of the neural network; here only 6 layers 71, 72, 73, 74, 75 and 76 are shown, but there may be more, or even significantly more, or slightly fewer. Preferably, the neural network comprises at least 2 layers, more preferably at least 3 layers, even more preferably at least 5 layers, advantageously at least 10 layers. It is preferably a convolutional neural network.
[0084] The deserialization block 80 stores in registers 81 to 84 the data 87 arriving by the inference path 77, for example in 512 registers, although only 4 registers are represented in the figure.
[0085] The writing block 90 at its output comprises a write interface 92, and at its input comprises a line 93 for receiving the output data (from the previous deserialization block 80) confirmed ready to be transmitted to outside the acceleration kernel 42. The writing block 90 comprises a buffer memory 95 including registers 96 and 97 respectively receiving the external parameters “pout” and “iter_o”.
[0086] An example of a possible use of the method for implementing a hardware accelerator for a neural network according to the invention is now presented. A user wishes to offload the inference of a “ResNet-50” type of network from a general-purpose microprocessor of the central processing unit type (CPU) to a more suitable hardware target, particularly from an energy performance standpoint. This user selects a target FPGA programmable logic circuit. He or she can use a pre-trained model of a neural network algorithm in a format such as “PyTorch”, which can be found on the Internet. This model of a neural network algorithm contains the configuration weights, in a floating point representation, of the neural network trained on a particular data set (“CIFAR-10” for example). The user can then select this model of a neural network algorithm in order to use the method for implementing a hardware accelerator for a neural network according to the invention. The user then obtains as output an FPGA project, which he or she synthesizes before passing it on to a circuit board, as well as a binarized configuration compatible with the binary representation of the hardware accelerator for the neural network. This step requires the installation of proprietary tools corresponding to the target FPGA programmable logic circuit.
[0087] Next, the user runs the scripts of the constructor block 6 automatically generating the configuration of the target FPGA programmable logic circuit, in order to provide the hardware accelerator. When the user has this output, he or she uses the driver block 7 to load the description of the accelerator (“ResNet-50” network) into the target FPGA programmable logic circuit, provide the configuration of the pre-trained then binarized weights to the hardware accelerator, provide a set of input images, and retrieve the results of the neural network algorithm as output from the hardware accelerator.
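The driver-block workflow described above (load the accelerator description, provide the binarized weights, supply input images, retrieve results) can be sketched as a Python API shape. The class and method names are purely hypothetical, not the actual programming interface of the driver block 7, and inference is stubbed out:

```python
# Hypothetical sketch of the driver-block API shape; not the real interface.
class AcceleratorDriver:
    def __init__(self):
        self.loaded = False
        self.weights = None

    def load_bitstream(self, config_stream):
        """Load the synthesized FPGA configuration: the entire network
        architecture is loaded into the acceleration kernel at once."""
        self.bitstream = config_stream
        self.loaded = True

    def load_weights(self, weights):
        """Send the pre-trained then binarized configuration weights."""
        self.weights = weights

    def infer(self, images):
        """Run inference on a set of input images; stubbed here to
        return one placeholder class label per image."""
        assert self.loaded and self.weights is not None
        return [0 for _ in images]

driver = AcceleratorDriver()
driver.load_bitstream(b"<fpga config stream>")
driver.load_weights([1, -1, 1])
results = driver.infer(["img0.png", "img1.png"])
```

In a real deployment the constructor block's output would supply the configuration stream, and communication with the board would go over the OpenCL/PCIe channel described in paragraph [0074].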
[0088] It is possible to dispense with the relatively time-consuming portion of generating the hardware architecture from the “PyTorch” representation, provided that models are used from the library of precompiled networks. If the user chooses a hardware accelerator whose topology has already been generated (by the user or provided by the library of precompiled neural network algorithms), he or she only has to go through the step of model weight binarization, which is very fast, for example about a second.
[0089] Of course, the invention is not limited to the examples and to the embodiment described and shown, but is capable of numerous variants accessible to those skilled in the art.