DEEP LEARNING CONVOLUTION ACCELERATION METHOD USING BIT-LEVEL SPARSITY, AND PROCESSOR
20250284766 · 2025-09-11
Assignee
Inventors
CPC classification
G06F9/5027 (PHYSICS)
G06F17/16 (PHYSICS)
International classification
Abstract
The present application provides a deep learning convolution acceleration method using bit-level sparsity, and a processor. The method comprises: selecting the maximum sum of the exponents from all data pairs to be convolved as a maximum exponent; arranging mantissas of the original weights in a computation sequence to form a weight matrix, uniformly aligning each row of the weight matrix to the maximum exponent, and removing slack bits to obtain a reduced matrix; allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence; after removing null rows of the intermediate matrix, placing zeros at the remaining vacancies to obtain an interleaved weight matrix; sending the weight segments in each row of the interleaved weight matrix and the mantissas of the corresponding activations to an adder tree for summation; and obtaining a convolution result by shifting and adding the summation results.
Claims
1. A deep learning convolution acceleration method using bit-level sparsity, comprising: step 1, acquiring multiple groups of data pairs to be convolved, wherein each group of data pairs is formed of an activation and a corresponding original weight, and the activation and the original weight are both floating-point numbers; step 2, summing exponents of the activation and the original weight in each group of data pairs to obtain a sum of the exponents of each group of data pairs, and selecting the maximum sum of the exponents from all data pairs as a maximum exponent; step 3, arranging mantissas of the original weights in a computation sequence to form a weight matrix, and uniformly aligning each row of the weight matrix to the maximum exponent to obtain an alignment matrix; step 4, removing slack bits in the alignment matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to form an intermediate matrix, and after removing null rows in the intermediate matrix, placing zeros at vacancies of the matrix to obtain an interleaved weight matrix, wherein each row of the interleaved weight matrix serves as a necessary weight; and step 5, obtaining, according to a correspondence relationship between the activations and the essential bits in the original weights, positional information of the activation corresponding to each bit of the necessary weight, sending the necessary weight to a split accumulator, which divides the necessary weight by bit into multiple weight segments, sending the weight segments and the mantissa of the corresponding activation to an adder tree for processing summation according to the positional information, and obtaining an output feature map as a convolution result of the multiple groups of data pairs by means of executing shift-and-add on the processing result.
2. The deep learning convolution acceleration method using bit-level sparsity according to claim 1, wherein the activations are pixel values of an image.
3. A processor for carrying out the deep learning convolution acceleration method using bit-level sparsity according to claim 1.
4. The processor according to claim 3, comprising: a pre-process module for acquiring multiple groups of data pairs to be convolved, wherein each group of data pairs is formed of an activation and a corresponding original weight, and the activation and the original weight are both floating-point numbers; summing exponents of the activation and the original weight in each group of data pairs to obtain a sum of the exponents of each group of data pairs, and selecting the maximum sum of the exponents from all data pairs as a maximum exponent; an exponent alignment module for arranging mantissas of the original weights in a computation sequence to form a weight matrix, and uniformly aligning each row of the weight matrix to the maximum exponent to obtain an alignment matrix; a weight interleaved module for removing slack bits in the alignment matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to form an intermediate matrix, and after removing null rows in the intermediate matrix, placing zeros at vacancies of the matrix to obtain an interleaved weight matrix, wherein each row of the interleaved weight matrix serves as a necessary weight; and a circulating register for extracting essential bits in the necessary weight, and obtaining positional information of the activation corresponding to each bit of the necessary weight from the corresponding mantissa in the mantissas of all activations; and a split accumulator for dividing the necessary weight by bit into multiple weight segments, sending the weight segments and the mantissa of the corresponding activation to an adder tree for processing summation according to the positional information, and obtaining an output feature map as a convolution result of the multiple groups of data pairs by means of executing shift-and-add on the processing result.
5. The processor according to claim 4, wherein the activations are pixel values of an image.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0034] The weaknesses of the prior technique are mainly caused by exploiting only value-level sparsity. In the research of the present application, we find that bit-level sparsity is an inherent, finer-grained sparsity: it concerns the zero bits within each operand rather than coarse-grained zero values. Whether weights and activations are represented as floating-point or fixed-point numbers, the proportion of zero bits can reach 45% to 77% in different DNN models. Skipping zero bits in an operand does not affect the result, which means that if bit-level valid computation is strictly executed, acceleration can be obtained directly without any effort at the software level. Therefore, the present application uses the rich bit-level sparse parallelism to accelerate both the training and the inference phase, serving general-purpose deep learning at the cloud and the edge.
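For illustration only, the following Python sketch (not part of the claimed method; the helper names are chosen for this example) estimates the fraction of zero bits in the fp32 mantissas of a weight tensor, i.e. the bit-level sparsity that the present application exploits.

```python
# Illustrative sketch: estimating the fraction of zero bits in the fp32 mantissas
# of a set of weights (the bit-level sparsity discussed above).
import struct
import random

def mantissa_bits(x: float) -> str:
    """Return the 23 explicit mantissa bits of an fp32 value as a bit string."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    return format(raw & 0x7FFFFF, "023b")

def bit_level_sparsity(weights) -> float:
    """Fraction of zero bits among all mantissa bits of the given weights."""
    bits = "".join(mantissa_bits(w) for w in weights)
    return bits.count("0") / len(bits)

if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]  # typical DNN weight scale
    print(f"zero-bit ratio: {bit_level_sparsity(weights):.2%}")
```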
[0035] In Table 1, we classify the most advanced sparsity-based accelerators. In the early bit-parallel accelerators, e.g., Cambricon and SCNN, research on sparsity focused only on numerical values; pruning at the software level creates more zero-value sparsity to release the potential of these accelerators. Considering that bit sparsity is rich in both weights and activations, recent research on bit-serial accelerators has focused on bit-level sparsity. Laconic serially extracts the necessary bits as terms after Booth encoding, and proposes a low-cost LPE to limit the power increase caused by frequent encoding/decoding. Bit Tactical exploits the bit-level sparsity of activations together with the value sparsity of weights; its design concept is similar to that of Pragmatic, both eliminating invalid operations by skipping zero bits, but Tactical skips zero weights by relying on a front end that is independent of data types and on a software scheduler that maximizes the possibility of skipping weights. There are also designs following bit-serial computation without exploiting sparsity: Stripes and UNPU achieve bit serialization of fixed-point operands without using sparsity, and Bit Fusion supports fast spatial and temporal composition to accelerate bit serialization, but still cannot make good use of bit sparsity.
TABLE-US-00001
TABLE 1
Philosophy        Design               Sparsity Exploited                     Precision Variability   Training Support
bit-parallel      Eyeriss, DianNao     N/A                                    16 b                    No
                  Cambricon-S, EIE     A-/W-value                             16 b                    No
                  SCNN                 A&W-value                              16 b                    No
bit-serial        UNPU, Stripes        N/A                                    1~16 b                  No
                  Bit Fusion           N/A                                    2, 4, 8, 16 b           No
                  Pragmatic            A-/W-bit                               1~16 b                  No
                  Bit Tactical         A-bit & W-value                        1~16 b                  No
                  Laconic              A&W-bit                                1~16 b                  No
bit-interleaving  Bitlet (this work)   W-bit & W-value (or A-bit & A-value)   fp32/16, 1~24 b         Yes
[0036] Meanwhile, previous works have proved that bit-level sparsity is rich. However, they only explore strategies for skipping zero bits within individual weights, without exploring the sparsity across different weights.
[0037] As shown in
[0038] Although fixed-point precision has been successful for efficient DNN inference, accelerators designed only for fixed-point precision can perform inference only, so such designs are difficult to apply to general-purpose scenarios. For example, training of a DNN still depends on floating-point backpropagation to ensure that model adjustments reach floating-point precision, and it must still satisfy real-time requirements, in particular when fixed-point precision cannot provide the required accuracy. Ideally, an accelerator should suit most use cases and provide terminal users with sufficient convenience and flexibility.
[0039] Based on this exploration, the present application provides a parallel design mode based on bit-interleaved sparsity. The advantage of bit-serial accelerators is that they effectively utilize bit sparsity; however, the throughput they provide is lower than that of the corresponding bit-parallel accelerators. Building on both design concepts, the present application provides a bit-interleaved design that combines their advantages while avoiding their disadvantages, and such a design mode can significantly outperform the preceding bit-serial/bit-parallel modes. The Bitlet accelerator uses this bit-interleaved design concept and also supports several precisions, including floating point and fixed point. These configurable properties allow Bitlet to suit high-performance as well as low-power scenarios.
[0040] To make the above features and effects of the present application clearer, detailed explanations are given hereinafter with reference to examples and the accompanying drawings.
[0041] Hereinafter the present application is explained in detail:
1. Bit Interleaving
[0042] Without loss of generality, a floating-point operand consists of three parts, a sign bit, a mantissa and an exponent, and follows the IEEE 754 standard, which is also the most common floating-point standard in industry. For a single-precision floating-point number (fp32), the bit width of the mantissa is 23 bits, the bit width of the exponent is 8 bits, and the remaining bit is the sign bit. A single-precision floating-point weight may therefore be represented as fp = (−1)^s · 1.m · 2^(e−127), where e is the actual binary-point position of the floating-point number plus the bias 127. We compute the partial sum of a convolution using MACs over a series of single-precision floating-point (fp32) numbers.
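As a minimal illustration of the fp32 representation described above (assuming normal, non-denormal operands; the helper names are ours, not part of the design), the following sketch decomposes a single-precision value into its sign, biased exponent and mantissa and reconstructs (−1)^s · 1.m · 2^(e−127):

```python
# Minimal sketch of the fp32 decomposition described above: a single-precision value
# is split into sign s, biased exponent e and 23-bit mantissa m, and reconstructed
# as (-1)^s * 1.m * 2^(e-127). Normal (non-zero, non-denormal) inputs are assumed.
import struct

def decompose_fp32(x: float):
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    s = raw >> 31                     # sign bit
    e = (raw >> 23) & 0xFF            # 8-bit biased exponent
    m = raw & 0x7FFFFF                # 23 stored mantissa bits (implicit 1 not included)
    return s, e, m

def reconstruct_fp32(s: int, e: int, m: int) -> float:
    significand = 1.0 + m / 2**23     # implicit leading 1 -> 24-bit significand
    return (-1) ** s * significand * 2.0 ** (e - 127)

if __name__ == "__main__":
    x = 3.14159
    s, e, m = decompose_fp32(x)
    assert abs(reconstruct_fp32(s, e, m) - x) < 1e-6
    print(s, e, m)
```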
[0043] Formula 1: converting W_i into the fp32 representation:

W_i = (−1)^(s_Wi) · M_Wi · 2^(E_Wi − 127)   (Formula 1)

wherein M_Wi = 1.m_Wi is the 24-bit mantissa of W_i (the 23 stored bits together with the implicit leading 1) and E_Wi is its biased exponent; the activation A_i is represented in the same way by its sign s_Ai, mantissa M_Ai and exponent E_Ai.

[0044] wherein, substituting this representation into the multiply-accumulate, writing s_i = s_Ai ⊕ s_Wi for the sign of the i-th product, E_i = E_Ai + E_Wi for its exponent sum and E_max = max_i(E_i) for the maximum exponent, and aligning every weight mantissa to E_max by a right shift of (E_max − E_i) bit positions (the rightmost shifted-out bits are discarded), the partial sum of N products can be decomposed over the bits of the aligned weight mantissas:

Σ_{i=0..N−1} A_i·W_i ≈ 2^(E_max − 254) · Σ_{j=0..23} 2^(−j) · Σ_{i: m'_(i,j)=1} (−1)^(s_i) · M_Ai   (Formula 5)

wherein m'_(i,j) denotes the j-th bit (counted from the leading bit) of the aligned mantissa M_Wi >> (E_max − E_i).

[0045] According to formula 5, it can be inferred that the result of N fp32 MACs corresponds to a series of bit-level operations on the corresponding mantissas. Specifically, if m'_(i,j) = 0 (a slack bit), the activation mantissa M_Ai contributes nothing at bit position j and the operation can be skipped; if m'_(i,j) = 1 (an essential bit), M_Ai is accumulated into the partial sum at position j, and on such basis, the accumulated sums are left (right) shifted according to their bit positions and scaled by 2^(E_max − 254) to restore the final magnitude.
[0046] The analysis shows that, when sparsity is considered, partial sums of floating-point numbers can be converted into bit-level operations. Each product is essentially formed from the activation mantissa M_Ai, accumulated once for every essential bit of the corresponding aligned weight mantissa, so only the essential bits determine the amount of computation.
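The equivalence can be checked numerically with the following illustrative sketch. It is not the claimed hardware: it uses 24-bit integer mantissas, so the constant exponent offset is 2·(127+23) = 300 instead of 254, and the function names are ours.

```python
# Minimal sketch (illustrative, not the claimed hardware): computing sum(A_i * W_i)
# through bit-level operations on aligned weight mantissas, in the spirit of Formula 5.
import struct

def fields(x: float):
    """Sign, biased exponent and 24-bit mantissa (with implicit leading 1) of an fp32 value."""
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    return raw >> 31, (raw >> 23) & 0xFF, (raw & 0x7FFFFF) | (1 << 23)

def bit_level_mac(acts, weights):
    pairs = [(fields(a), fields(w)) for a, w in zip(acts, weights)]
    e_sums = [ea + ew for (_, ea, _), (_, ew, _) in pairs]
    e_max = max(e_sums)                                   # maximum exponent sum
    total = 0
    for ((sa, _, ma), (sw, _, mw)), ei in zip(pairs, e_sums):
        aligned = mw >> (e_max - ei)                      # exponent matching, low bits discarded
        sign = -1 if sa ^ sw else 1
        for j in range(24):                               # one term per essential (non-zero) bit
            if (aligned >> j) & 1:
                total += sign * (ma << j)
    # integer mantissas carry a factor 2^-23 each; exponents carry a bias of 127 each
    return total * 2.0 ** (e_max - 2 * 127 - 2 * 23)

if __name__ == "__main__":
    import random
    random.seed(1)
    A = [random.uniform(0.5, 2.0) for _ in range(16)]
    W = [random.uniform(-0.1, 0.1) for _ in range(16)]
    ref = sum(a * w for a, w in zip(A, W))
    # close to the reference; small differences come from fp32 rounding and truncated bits
    print(bit_level_mac(A, W), ref)
```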
[0047] The computing theory also covers fixed-point precision. In formula 5, E_max and the shifts by (E_max − E_i) are not necessary, because fixed-point numbers have no exponents. The present application explicitly describes how bit interleaving works for floating-point 32-bit weights, and this supports the design details of the multiple-precision Bitlet accelerator.
[0048]
1: Pre-Processing
[0049]
2: Dynamic Exponent Matching
[0050] The exponent represents the position of the binary point in the binary representation. Traditionally, floating-point addition involves a pairwise exponent matching step. In bit interleaving, however, we match by uniformly aligning a group of floating-point exponents to their maximum value (E_6 in the example), instead of processing them one by one. This step is referred to as dynamic exponent matching.
[0051] Reviewing formula 5, in actual execution the two summations can be executed in parallel: the external summation runs over the bit positions of the aligned mantissas, and the internal summation accumulates, at each bit position, the activation mantissas selected by the essential bits.
[0052] Since the final goal is to compute Σ_{i=0..N−1} A_i·W_i, the computation involves N weights and N activations. Therefore, all exponents are aligned to their maximum value in each execution, instead of being matched gradually pair by pair.
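A minimal sketch of dynamic exponent matching, for illustration only (helper names are ours; normal fp32 operands are assumed): all weight mantissas of a group are aligned to the single maximum exponent sum E_max in one step, producing the rows of the alignment matrix of step 3.

```python
# Illustrative sketch of dynamic exponent matching: all weight mantissas of a group
# are aligned to the single maximum exponent sum E_max in one step, rather than
# pairwise as in an ordinary floating-point addition.
import struct

def fp32_fields(x: float):
    raw = struct.unpack(">I", struct.pack(">f", x))[0]
    return raw >> 31, (raw >> 23) & 0xFF, (raw & 0x7FFFFF) | (1 << 23)

def align_group(acts, weights):
    """Return E_max and the aligned 24-bit weight mantissas (shifted-out low bits discarded)."""
    exps = [fp32_fields(a)[1] + fp32_fields(w)[1] for a, w in zip(acts, weights)]
    e_max = max(exps)
    aligned = [fp32_fields(w)[2] >> (e_max - e) for w, e in zip(weights, exps)]
    return e_max, aligned

if __name__ == "__main__":
    A = [1.5, 0.75, 2.0, 1.0]
    W = [0.5, -0.25, 0.125, 1.0]
    e_max, rows = align_group(A, W)
    for r in rows:
        print(format(r, "024b"))      # rows of the alignment matrix (step 3)
```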
3: Extraction of Necessary Bits
[0053] The key now is how to obtain the accurate partial sum using only the necessary bits, and thereby obtain better inference speed. Considering the sparse parallelism mentioned above, this step extracts the necessary bits using that feature, exactly as illustrated by step 2 in the drawings.
[0054] As shown in the drawings, the slack bits (zero bits) in the alignment matrix are removed to obtain a reduced matrix with vacancies; the essential bits in each column then fill the vacancies in the computation sequence, null rows are removed, and zeros are placed at the remaining vacancies to obtain the interleaved weight matrix, each row of which serves as a necessary weight.
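The interleaving step can be sketched as follows, purely for illustration (the function and variable names are ours, and a software loop stands in for the wiring of the hardware): slack bits are dropped, essential bits are packed upward within each column in the computation sequence, null rows are removed, and the index of the originating activation is recorded for every essential bit as the positional information of step 5.

```python
# Illustrative sketch of the bit-interleaving step: slack (zero) bits are removed,
# essential bits in each column are packed upward in computation order, null rows
# are dropped, and remaining vacancies are zero-filled. Alongside each interleaved
# row we keep, per bit position, the index of the activation that the essential
# bit originally belonged to (the positional information of step 5).
def interleave(aligned_rows, width=24):
    """aligned_rows: list of integers, each an aligned weight mantissa (one row)."""
    # For every bit column j, collect the row indices whose bit j is essential (1).
    columns = [[i for i, r in enumerate(aligned_rows) if (r >> j) & 1]
               for j in range(width)]
    depth = max((len(c) for c in columns), default=0)   # rows left after compaction
    interleaved, positions = [], []
    for k in range(depth):                              # k-th interleaved weight
        row_bits, row_pos = 0, [None] * width
        for j, col in enumerate(columns):
            if k < len(col):                            # an essential bit exists here
                row_bits |= 1 << j
                row_pos[j] = col[k]                     # which activation it pairs with
        interleaved.append(row_bits)
        positions.append(row_pos)
    return interleaved, positions

if __name__ == "__main__":
    rows = [0b1010, 0b0010, 0b1000, 0b0000]             # toy 4-bit alignment matrix
    inter, pos = interleave(rows, width=4)              # the null row disappears
    for r, p in zip(inter, pos):
        print(format(r, "04b"), p)
```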
2. Bitlet Accelerator
[0055] In order to execute bit interleaving, we design a new accelerator, named Bitlet. In this part, we set forth the key hardware design modules of Bitlet, including the microarchitecture supporting multiple-precision compute engines and the overall architecture for efficient memory access.
[0056] Key module 1: Pre-process module. Firstly, the present application designs a component covering the first two steps of the bit-interleaved operation. Bitlet takes as input multiple pairs of weights and activations, the number of which is denoted by N.
[0057] Key module 2: Wire orchestrator. After dynamic exponent matching, we obtain the shifted 24-bit mantissa of each weight, represented by M_Wi >> (E_max − E_i); the wire orchestrator rearranges the essential bits of these aligned mantissas to form the interleaved weights described above.
[0058] Key module 3: Circulating register (RR-reg). The RR-reg extracts the necessary bits with value 1 (the essential bits) in the interleaved weight, and selects the BCE's operands from the N activation mantissas. Each RR-reg has an internal clock and is connected to the clock tree of the accelerator.
[0059] The BCE has the following three features: ① the architecture brings no accuracy loss, because the dynamic exponent matching is the same as that of floating-point operations under IEEE 754; the rightmost bits shifted out are discarded, but their values are tiny and can be ignored without influence on accuracy. ② The BCE does not require any pre-processing of parameter sparsity.
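For illustration only, the following sketch emulates the split-accumulator / adder-tree stage in software (the names and the toy inputs are ours; the interleaved rows and positional information are assumed to come from the interleaving sketch above): each interleaved weight row is split by bit position, the activation mantissa selected by the positional information is accumulated for every essential bit, and the column sums are combined by shift-and-add.

```python
# Illustrative sketch of the split-accumulator / adder-tree stage: for every essential
# bit of each interleaved weight row, the activation mantissa named by the positional
# information is summed into that bit column, and the column sums are combined by
# shift-and-add before restoring the floating-point magnitude.
def split_accumulate(interleaved, positions, act_mantissas, signs, e_max, width=24):
    column_sums = [0] * width
    for row_bits, row_pos in zip(interleaved, positions):
        for j in range(width):
            if (row_bits >> j) & 1:                     # essential bit: select one activation
                i = row_pos[j]
                column_sums[j] += signs[i] * act_mantissas[i]
    total = 0
    for j, s in enumerate(column_sums):                 # shift-and-add across bit positions
        total += s << j
    return total * 2.0 ** (e_max - 2 * 127 - 2 * 23)    # restore the floating-point magnitude

if __name__ == "__main__":
    # Toy example: two interleaved rows over a 4-bit mantissa, three activations.
    interleaved = [0b1010, 0b1010]
    positions = [[None, 0, None, 0], [None, 1, None, 2]]
    act_mantissas = [3, 5, 7]        # stand-ins for 24-bit activation mantissas
    signs = [1, -1, 1]
    # e_max chosen so that the scale factor is 1 in this toy example
    print(split_accumulate(interleaved, positions, act_mantissas, signs,
                           e_max=300, width=4))
```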
3. Architecture of the Accelerator
[0060] PE: Bitlet is formed of mesh-connected PEs.
[0061] Memory system: in order to achieve high throughput, the Bitlet accelerator provides separate DMA lanes for activations and weights.
4. Flexibility of Bitlet
[0062] The Bitlet accelerator supports multiple-precision computation, can be conveniently configured into a fixed-point mode, and provides sufficient flexibility for terminal users. For example, when 16-bit fixed-point precision is used, the exponent matching and shifting performed by the pre-process module (the >> (E_max − E_i) operation) is bypassed, because fixed-point operands carry no exponents.
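A minimal sketch of the fixed-point mode, for illustration only (a 16-bit signed fixed-point format with a configurable number of fractional bits is assumed here; it is not a description of the actual Bitlet configuration): the exponent-related shift disappears and the weight bits are used directly.

```python
# Illustrative sketch of the fixed-point mode: with fixed-point operands there are no
# exponents, so the exponent-matching shift is bypassed and each essential weight bit
# directly selects a shifted activation for accumulation.
def fixed_point_mac(acts, weights, frac_bits=8):
    """sum(a*w) for values quantized to 16-bit fixed point with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    qa = [round(a * scale) for a in acts]
    qw = [round(w * scale) for w in weights]
    total = 0
    for a, w in zip(qa, qw):
        sign = -1 if (a < 0) ^ (w < 0) else 1
        a_mag, w_mag = abs(a), abs(w)
        for j in range(16):                    # one term per essential weight bit
            if (w_mag >> j) & 1:
                total += sign * (a_mag << j)
    return total / scale ** 2                  # undo the two quantization scalings

if __name__ == "__main__":
    A = [1.5, -0.75, 2.0]
    W = [0.5, 0.25, -1.25]
    print(fixed_point_mac(A, W), sum(a * b for a, b in zip(A, W)))
```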
[0063] Hereinafter, the system embodiment corresponding to the method embodiment is explained; this embodiment can be carried out in combination with the above embodiment. The relevant technical details mentioned in the above embodiment remain effective in this embodiment and, to reduce repetition, are not described again here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the above embodiment.
[0064] The present application further provides a processor for carrying out the deep learning convolution acceleration method using bit-level sparsity.
[0065] The processor comprises:
[0066] a pre-process module for acquiring multiple groups of data pairs to be convolved, wherein each group of data pairs is formed of an activation and a corresponding original weight, and the activation and the original weight are both floating-point numbers; summing exponents of the activation and the original weight in each group of data pairs to obtain a sum of the exponents of each group of data pairs, and selecting the maximum sum of the exponents from all data pairs as a maximum exponent;
[0067] an exponent alignment module for arranging mantissas of the original weights in a computation sequence to form a weight matrix, and uniformly aligning each row of the weight matrix to the maximum exponent to obtain an alignment matrix;
[0068] a weight interleaved module for removing slack bits in the alignment matrix to obtain a reduced matrix with vacancies, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to form an intermediate matrix, and after removing null rows in the intermediate matrix, placing zeros at vacancies of the matrix to obtain an interleaved weight matrix, wherein each row of the interleaved weight matrix serves as a necessary weight;
[0069] a circulating register for extracting essential bits in the necessary weight, and obtaining positional information of the activation corresponding to each bit of the necessary weight from the corresponding mantissa in the mantissas of all activations; and
[0070] a split accumulator for dividing the necessary weight by bit into multiple weight segments, sending the weight segments and the mantissa of the corresponding activation to an adder tree for processing summation according to the positional information, and obtaining an output feature map as a convolution result of the multiple groups of data pairs by means of executing shift-and-add on the processing result.
[0071] In the processor, the activations are pixel values of an image.
INDUSTRIAL APPLICABILITY
[0072] The present application provides a deep learning convolution acceleration method using bit-level sparsity, and a processor. The method comprises: acquiring multiple groups of data pairs to be convolved, summing exponents of the activation and the original weight in each group of data pairs to obtain a sum of the exponents of each group of data pairs, and selecting the maximum sum of the exponents from all data pairs as a maximum exponent; arranging mantissas of the original weights in a computation sequence to form a weight matrix, and uniformly aligning each row of the weight matrix to the maximum exponent to obtain an alignment matrix; removing slack bits in the alignment matrix to obtain a reduced matrix, allowing essential bits in each column of the reduced matrix to fill the vacancies according to the computation sequence to form an intermediate matrix, after removing null rows in the intermediate matrix, placing zeros at vacancies of the matrix to obtain an interleaved weight matrix, sending the weight segments in each row of the interleaved weight matrix and the mantissa of the corresponding activation to an adder tree for processing summation, and obtaining an output feature map as a convolution result of the multiple groups of data pairs by means of executing shift-and-add on the processing result.