Dynamic multi-mode CNN accelerator and operating methods
11645365 · 2023-05-09
Assignee
Inventors
CPC classification
G06F1/3203
PHYSICS
International classification
G06F9/38
PHYSICS
G06F1/3203
PHYSICS
G06F9/30
PHYSICS
Abstract
A convolutional neural network (CNN) operation accelerator comprising a first sub-accelerator and a second sub-accelerator is provided. The first sub-accelerator comprises I units of CNN processor cores, J units of element-wise & quantize processors, and K units of pool and nonlinear function processors. The second sub-accelerator comprises X units of CNN processor cores, Y units of element-wise & quantize processors, and Z units of pool and nonlinear function processors. The above variables I˜K and X˜Z are all greater than 0, and at least one of the three relations, namely, “I is different from X”, “J is different from Y”, and “K is different from Z”, is satisfied. A to-be-performed CNN operation comprises a first partial CNN operation and a second partial CNN operation. The first sub-accelerator and the second sub-accelerator perform the first partial CNN operation and the second partial CNN operation, respectively.
Claims
1. A convolutional neural network (CNN) operation accelerator configured to perform a convolutional neural network operation, which can be divided into a plurality of sub-partial operations at least comprising a first sub-partial operation and a second sub-partial operation, wherein the CNN accelerator comprises: at least two sub-accelerators, comprising: a first sub-accelerator, comprising: I units of first CNN processor cores; J units of first element-wise & quantize processors; and K units of first pool and nonlinear function processors; a second sub-accelerator, comprising: X units of second CNN processor cores; Y units of second element-wise & quantize processors; and Z units of second pool and nonlinear function processors, wherein the above variables I, J, K, X, Y, Z are all greater than 0; wherein the first sub-accelerator independently performs the first sub-partial operation; wherein the second sub-accelerator independently performs the second sub-partial operation; wherein the first sub-accelerator and the second sub-accelerator satisfy a relation of different numbers of cores, which refers to the establishment of at least one of the three relations, namely, “I is different from X”, “J is different from Y”, and “K is different from Z”, and wherein the convolutional neural network operation comprises the operation of T layers, the first sub-partial operation comprises the operation of the first M of the T layers, and the second sub-partial operation comprises the operation of the following N of the T layers; the first sub-accelerator performs the operation of the first M layers and then outputs an intermediate result to the second sub-accelerator, which then performs the operation of the following N layers according to the intermediate result.
2. The CNN accelerator according to claim 1, wherein the plurality of sub-partial operations further comprise a third partial convolutional neural network operation, and the CNN accelerator further comprises a third sub-accelerator, comprising: R units of third CNN processor cores; S units of third element-wise & quantize processors; and T units of third pool and nonlinear function processors, wherein the above variables R˜T are all greater than 0; wherein the third sub-accelerator independently performs the third partial convolutional neural network operation; wherein the first sub-accelerator and the third sub-accelerator satisfy a relation of different numbers of cores, which refers to the establishment of at least one of the three relations, namely, “I is different from R”, “J is different from S”, and “K is different from T”.
3. The CNN accelerator according to claim 1, wherein the element-wise & quantize processors are configured to process a scalar operation selected from a group of operations composed of addition, subtraction, multiplication, division, batch normalization, quantization, bias and scaling.
4. The CNN accelerator according to claim 1, wherein the pool and nonlinear function processors are configured to process a non-linear activation operation selected from a group of operations composed of rectified linear unit (ReLU), Sigmoid function and Tanh function.
5. The CNN accelerator according to claim 1, wherein the first sub-accelerator performs the first sub-partial operation for a first time interval T1, the second sub-accelerator performs the second sub-partial operation for a second time interval T2, and the first time interval T1 is substantially equivalent to the second time interval T2.
6. The CNN accelerator according to claim 1, wherein the convolutional neural network operation comprises a trunk operation and a first branch operation; the first sub-partial operation comprises the trunk operation, and the second sub-partial operation comprises the first branch operation; the first sub-accelerator performs the trunk operation, and the second sub-accelerator performs the first branch operation.
7. The CNN accelerator according to claim 6, wherein when the convolutional neural network operation finishes the trunk operation, the first sub-accelerator enters a first power saving mode; when the convolutional neural network operation performs the trunk operation, the first sub-accelerator exits the first power saving mode.
8. The CNN accelerator according to claim 6, wherein when the convolutional neural network operation performs the trunk operation, the second sub-accelerator enters a second power saving mode; when the convolutional neural network operation finishes the trunk operation and intends to perform the first branch operation, the second sub-accelerator exits the second power saving mode.
9. The CNN accelerator according to claim 1, wherein the convolutional neural network operation comprises a trunk operation, a first branch operation and a second branch operation; the first sub-partial operation comprises the trunk operation, and the second sub-partial operation selectively comprises one of the first branch operation and the second branch operation; the first sub-accelerator performs the trunk operation, and the second sub-accelerator selectively performs one of the first branch operation and the second branch operation; when the convolutional neural network operation intends to perform the first branch operation, the second sub-accelerator loads in a program code corresponding to the first branch operation; when the convolutional neural network operation intends to perform the second branch operation, the second sub-accelerator loads in a program code corresponding to the second branch operation.
10. An operating method adaptable to a convolutional neural network (CNN) operation accelerator configured to perform a convolutional neural network operation, which can be divided into a plurality of sub-partial operations at least comprising a first sub-partial operation and a second sub-partial operation, wherein the CNN accelerator comprises two sub-accelerators comprising a first sub-accelerator and a second sub-accelerator; the first sub-accelerator comprises I units of first CNN processor cores, J units of first element-wise & quantize processors, and K units of first pool and nonlinear function processors; the second sub-accelerator comprises X units of second CNN processor cores, Y units of second element-wise & quantize processors, and Z units of second pool and nonlinear function processors; the first sub-accelerator and the second sub-accelerator satisfy a relation of different numbers of cores, which refers to the establishment of at least one of the three relations, namely, “I is different from X”, “J is different from Y”, and “K is different from Z”; the operating method comprises the steps of: performing the first sub-partial operation by the first sub-accelerator; performing the second sub-partial operation by the second sub-accelerator; when the convolutional neural network operation finishes the first sub-partial operation, the first sub-accelerator enters a first power saving mode, and when the convolutional neural network operation performs the first sub-partial operation, the first sub-accelerator exits the first power saving mode.
11. The operating method according to claim 10, wherein the first sub-partial operation comprises a trunk operation, and the second sub-partial operation comprises a first branch operation; the operating method further comprises the step of: when the convolutional neural network operation finishes the first branch operation and intends to perform the trunk operation, the second sub-accelerator enters a second power saving mode; when the convolutional neural network operation finishes the trunk operation and intends to perform the first branch operation, the second sub-accelerator exits the second power saving mode.
12. The operating method according to claim 10, wherein the first sub-partial operation comprises a trunk operation, and the second sub-partial operation selectively comprises one of a first branch operation and a second branch operation; the operating method further comprises the step of: when the convolutional neural network operation finishes the trunk operation and intends to perform the first branch operation, the second sub-accelerator loads in a program code corresponding to the first branch operation; when the convolutional neural network operation finishes the trunk operation and intends to perform the second branch operation, the second sub-accelerator loads in a program code corresponding to the second branch operation.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(6) In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In other instances, well-known structures and devices are schematically shown in order to simplify the drawing.
DETAILED DESCRIPTION
(9) The three sub-accelerators have different numbers of operation processors.
(10) For example, the first sub-accelerator 14 comprises I units of first CNN processor cores, J units of first element-wise & quantize processors, and K units of first pool and nonlinear function processors; the second sub-accelerator 16 comprises X units of second CNN processor cores, Y units of second element-wise & quantize processors, and Z units of second pool and nonlinear function processors; the third sub-accelerator 18 comprises R units of third CNN processor cores, S units of third element-wise & quantize processors, and T units of third pool and nonlinear function processors, wherein the above variables I˜K, R˜T, X˜Z are all natural numbers greater than 0. The feature of the first sub-accelerator 14 and the second sub-accelerator 16 comprising different numbers of cores refers to the establishment of at least one of the three relations, namely, “I is different from X”, “J is different from Y”, and “K is different from Z”. Similarly, the feature of the first sub-accelerator 14 and the third sub-accelerator 18 comprising different numbers of cores refers to the establishment of at least one of the three relations, namely, “I is different from R”, “J is different from S”, and “K is different from T”.
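The heterogeneous core-count configuration and the "different numbers of cores" relation described above can be sketched as follows. This is a minimal illustration only; the class, field, and function names are our own, not the disclosure's, and the example counts (64 and 256 CNN cores) merely echo the core mixes used later in Table 3 and Table 4.

```python
from dataclasses import dataclass

# Hypothetical model of a sub-accelerator's processor counts (I/J/K, X/Y/Z, R/S/T above).
@dataclass(frozen=True)
class SubAccelerator:
    cnn_cores: int             # number of CNN processor cores
    ew_quantize_units: int     # number of element-wise & quantize processors
    pool_nonlinear_units: int  # number of pool and nonlinear function processors

def different_core_counts(a: SubAccelerator, b: SubAccelerator) -> bool:
    """True when at least one of the three count relations differs,
    i.e. the two sub-accelerators are heterogeneous."""
    return (a.cnn_cores != b.cnn_cores
            or a.ew_quantize_units != b.ew_quantize_units
            or a.pool_nonlinear_units != b.pool_nonlinear_units)

# Example: a 64-core and a 256-core sub-accelerator differ in CNN core count alone,
# which already satisfies the relation.
sub14 = SubAccelerator(64, 2, 1)
sub16 = SubAccelerator(256, 2, 1)
```

Note that a single differing count suffices; the remaining processor counts may match.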
(11) As indicated in
(12) The element-wise & quantize processors 14B are configured to process a scalar operation selected from a group of operations composed of addition, subtraction, multiplication, division, batch normalization, quantization, bias and scaling. The pool and nonlinear function processors 14C are configured to process a non-linear activation operation selected from a group of operations composed of rectified linear unit (ReLU), Sigmoid function and Tanh function.
First Embodiment
(14) Refer to
(15) In view of the timing sequence diagram: (1) within the time interval T1, the first sub-accelerator 14 performs the operation of the 1st layer of the first convolution, while the second sub-accelerator 16, the third sub-accelerator 18 and the fourth sub-accelerator 19 are all waiting; (2) within the time interval T2, the first sub-accelerator 14 performs the operation of the 1st layer of the second convolution, the second sub-accelerator 16 performs the operation of the 2nd to the 4th layers of the first convolution, and the third sub-accelerator 18 and the fourth sub-accelerator 19 are both waiting; (3) within the time interval T3, the first sub-accelerator 14 performs the operation of the 1st layer of the third convolution, the second sub-accelerator 16 performs the operation of the 2nd to the 4th layers of the second convolution, the third sub-accelerator 18 performs the operation of the 5th to the 7th layers of the first convolution, and the fourth sub-accelerator 19 is waiting; (4) within the time interval T4, the first sub-accelerator 14 performs the operation of the 1st layer of the fourth convolution, the second sub-accelerator 16 performs the operation of the 2nd to the 4th layers of the third convolution, the third sub-accelerator 18 performs the operation of the 5th to the 7th layers of the second convolution, and the fourth sub-accelerator 19 performs the operation of the 8th and the 9th layers of the first convolution. From the time interval T5 onward, a result of the nine-layer convolutional operation is output every unit time interval T. From the time interval T4, the four sub-accelerators of the multi-stage pipeline operation architecture operate simultaneously and achieve the effect of parallel processing.
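The staggered start-up of the four pipeline stages can be sketched with a small scheduling function. This is an illustrative model only, assuming the first allocation arrangement above (layers 1 / 2-4 / 5-7 / 8-9) and names of our own choosing:

```python
# Illustrative four-stage pipeline: stage s starts working on input i at interval i + s.
def pipeline_schedule(num_inputs, num_stages=4):
    """Return one row per unit time interval; row[s] is the 0-based input index
    that stage s processes during that interval, or None if the stage is waiting."""
    schedule = []
    t = 0
    while True:
        row = [t - s if 0 <= t - s < num_inputs else None for s in range(num_stages)]
        if all(job is None for job in row):
            break  # every stage has drained; pipeline is done
        schedule.append(row)
        t += 1
    return schedule

# With four inputs: interval T1 busies only stage 1; by interval T4 all four
# sub-accelerators operate simultaneously, matching the description above.
```

A total of `num_inputs + num_stages - 1` intervals are needed: the pipeline fills over the first three intervals and drains over the last three.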
(16) The unit time interval T is determined according to the longest time among the times required for each sub-accelerator to complete its allocated CNN operation. For example, suppose the first sub-accelerator 14 takes 3.61 nsec to complete the operation of the 1st layer of the convolution, the second sub-accelerator 16 takes 3.61 nsec to complete the operation of the 2nd to the 4th layers of the convolution, the third sub-accelerator 18 takes 3.21 nsec to complete the operation of the 5th to the 7th layers of the convolution, and the fourth sub-accelerator 19 takes 4.51 nsec to complete the operation of the 8th and the 9th layers of the convolution; then the unit time interval T will be set to be no shorter than 4.51 nsec.
(17) If the focus is to increase the efficiency of the operation of the multi-stage pipeline architecture, then more attention is paid to the longest time among the times required for each sub-accelerator to perform its corresponding CNN operation, so that the length of the unit time interval T for one pipeline stage can be reduced. During the design planning stage, a second allocation arrangement can be tried. For example, under the second allocation arrangement, suppose the first sub-accelerator 14 takes 4.7 nsec to perform the operation of the 1st to the 2nd layers of the convolution, the second sub-accelerator 16 takes 3.7 nsec to perform the operation of the 3rd to the 5th layers of the convolution, the third sub-accelerator 18 takes 3.31 nsec to perform the operation of the 6th to the 8th layers of the convolution, and the fourth sub-accelerator 19 takes 4.0 nsec to perform the operation of the 9th layer of the convolution; then the longest time, 4.7 nsec, exceeds 4.51 nsec. It shows the first allocation arrangement shown in the
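Since one pipeline step lasts as long as its slowest stage, the two candidate arrangements can be compared directly on their per-stage times. The values below are taken from the example above; the variable and function names are our own:

```python
# Per-stage times in nsec for the two trial allocations discussed above.
first_arrangement  = [3.61, 3.61, 3.21, 4.51]  # layers 1 / 2-4 / 5-7 / 8-9
second_arrangement = [4.70, 3.70, 3.31, 4.00]  # layers 1-2 / 3-5 / 6-8 / 9

def pipeline_step_time(stage_times_nsec):
    """The unit time interval T must cover the slowest stage."""
    return max(stage_times_nsec)

# The first arrangement wins: its slowest stage (4.51 nsec) beats the
# second arrangement's slowest stage (4.70 nsec).
```

The throughput of the filled pipeline is one complete nine-layer result per unit interval, so the smaller of the two maxima decides which arrangement is faster.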
(18) If the focus is to decrease the power consumption of the operation of the multi-stage pipeline architecture, then more attention is paid to the sum of the power consumption required for each sub-accelerator to perform the operation of the corresponding layer(s). During the design planning stage, the first allocation arrangement as indicated in
(19) The above disclosure is exemplified by four sub-accelerators. When the number of sub-accelerators changes, the difference between the longest and the shortest time intervals, the ratio of the longest time interval to the shortest time interval, or the sum of power consumption can still be used as a criterion for allocating operations to different sub-accelerators, so that a shorter time to complete the overall CNN operation is achieved.
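At design time, such criteria can drive a simple exhaustive search over contiguous layer-to-stage allocations. The sketch below is our own illustration, assuming (for simplicity) per-layer times that do not depend on which sub-accelerator runs the layer; it minimizes the slowest-stage time, and swapping in the max-minus-min, ratio, or power-sum criterion follows the same pattern:

```python
from itertools import combinations

def best_allocation(layer_times, num_stages, criterion=max):
    """Try every way to cut `layer_times` into `num_stages` contiguous groups and
    return (score, cut boundaries, per-stage sums) for the allocation that
    minimizes `criterion` over the per-stage sums."""
    n = len(layer_times)
    best = None
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = [0, *cuts, n]
        stage_sums = [sum(layer_times[a:b]) for a, b in zip(bounds, bounds[1:])]
        score = criterion(stage_sums)
        if best is None or score < best[0]:
            best = (score, bounds, stage_sums)
    return best

# Alternative criteria from the text:
#   criterion=lambda s: max(s) - min(s)   # longest minus shortest stage time
#   criterion=lambda s: max(s) / min(s)   # ratio of longest to shortest
```

For a nine-layer network and four stages the search space is only C(8, 3) = 56 cuts, so brute force is entirely practical during design planning.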
Second Embodiment
(21) To save the power consumption of the entire accelerator, in step 37, after the check point computation, whether the CNN accelerator should enable one of the branch operations 36 and 38 is determined. When the convolutional neural network operation completes the trunk operation 34 and intends to perform the first branch operation 36, the first sub-accelerator 14 enters the first power saving mode, and the process proceeds to step 39. In step 39, the corresponding second sub-accelerator 16 is enabled and exits the second power saving mode. When the convolutional neural network operation finishes the first branch operation 36 and intends to perform the trunk operation 34, the first sub-accelerator 14 exits the first power saving mode, and the second sub-accelerator 16 enters the second power saving mode. Thus, the effect of dynamic power saving is achieved. Similarly, when the convolutional neural network operation switches between the trunk operation 34 and the second branch operation 38, the third sub-accelerator 18 selectively exits and enters the third power saving mode.
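The enter/exit pattern above amounts to keeping only the sub-accelerator serving the active path awake and putting every other sub-accelerator to sleep. A hypothetical controller sketch follows; the shorthand names "sub14", "sub16", "sub18" stand for the numbered sub-accelerators and are our own labels, not an API of the disclosure:

```python
class BranchPowerController:
    """Sleep every sub-accelerator except the one serving the active path."""

    def __init__(self):
        # Path-to-sub-accelerator assignment from the embodiment above.
        self.assignment = {"trunk": "sub14", "branch1": "sub16", "branch2": "sub18"}
        # Initially the trunk runs and both branch sub-accelerators are asleep.
        self.power_saving = {"sub14": False, "sub16": True, "sub18": True}

    def switch_to(self, path):
        """Wake the sub-accelerator for `path`; all others enter power saving."""
        active = self.assignment[path]
        for accel in self.power_saving:
            self.power_saving[accel] = (accel != active)
        return active
```

Each switch is symmetric: returning from a branch to the trunk wakes the first sub-accelerator and puts the branch sub-accelerator back to sleep, which is the dynamic power-saving effect described above.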
(22) This embodiment is mainly used in the scenario of dynamic branch network operation. The to-be-recognized images can be divided into a primary category and a secondary category. The trunk operation 34 is a main image recognition which has a higher frequency or requires a longer time of continuous operation. The first branch operation 36 is a scenario image recognition which has a lower frequency or operates occasionally for a shorter time. The trunk operation 34 and the first branch operation 36 share the front-end feature map of the neural network in the trunk operation 34. For example, the trunk operation can be a main-image recognition operation for recognizing a factory object that is seen more frequently, and the first branch operation can be a scenario-image recognition operation for recognizing a customized object that is seen less frequently. Alternatively, the trunk operation can be a collision-sound recognition operation, and the branch operation can be an operation for recognizing and analyzing the audio source of the collision sound.
Third Embodiment
(24) To save the power consumption of the entire accelerator, in step 47, after the check point computation, whether the CNN accelerator should enable one of the branch operations 46 and 48 is determined. When the convolutional neural network operation finishes the trunk operation 44 and intends to perform the first branch operation 46, the first sub-accelerator 14 enters the first power saving mode, and the process proceeds to step 49. In step 49, the second sub-accelerator 16 is enabled and exits the second power saving mode. When the convolutional neural network operation finishes the first branch operation 46 and intends to perform the trunk operation 44, the first sub-accelerator 14 exits the first power saving mode, and the second sub-accelerator 16 enters the second power saving mode. Thus, the effect of dynamic power saving is achieved. Similarly, when the convolutional neural network operation finishes the trunk operation 44 and intends to perform the second branch operation 48, the process proceeds to step 51. In step 51, the second sub-accelerator 16 is enabled and exits the second power saving mode.
(25) This embodiment of the disclosure is mainly used in an event-triggered dynamic tree network operation. That is, the neural network in the trunk operation 44 performs object detection and roughly divides the detected objects into two types, A and B. When an object A is detected, the neural network in the first branch operation 46 performs fine recognition on the object A. When an object B is detected, the neural network in the second branch operation 48 performs fine recognition on the object B. A specific embodiment could be as follows: the neural network in the trunk operation 44 is used for vehicle detection in an autopilot application. When a person on the roadside is detected, the second sub-accelerator 16 loads in a program code corresponding to the neural network in the first branch operation 46 to perform a branch operation relevant to human recognition. When an adjacent vehicle is detected, the second sub-accelerator 16 loads in a program code corresponding to the neural network in the second branch operation 48 to perform a branch operation relevant to vehicle recognition. If the neural network in the trunk operation 44 does not detect any person or vehicle, the trunk operation 44 continues and there is no need to trigger the first branch operation 46 or the second branch operation 48.
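The event-triggered dispatch can be sketched as a lookup from the trunk detector's coarse result to the program code the branch sub-accelerator should load. The detection labels and program-code file names below are illustrative assumptions, not names from the disclosure:

```python
# Hypothetical mapping from the trunk network's coarse detection result to the
# branch program code the second sub-accelerator loads (names are illustrative).
BRANCH_PROGRAMS = {
    "person":  "human_recognition.bin",    # first branch operation
    "vehicle": "vehicle_recognition.bin",  # second branch operation
}

def dispatch(detection):
    """Return the branch program to load, or None when no object of interest is
    detected and only the trunk network keeps running."""
    return BRANCH_PROGRAMS.get(detection)
```

Because the same sub-accelerator serves both branches, only one branch program is resident at a time; a detection event simply selects which code to load before the branch operation begins.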
(26) Refer to Table 3 and Table 4, which show a comparison of operation time and power consumption between the dynamic multi-mode CNN accelerator of the first embodiment as indicated in
(27) As indicated in Table 3: based on the architecture of the single 2048-core accelerator of the prior art, a complete result of the convolutional neural network operation is output every 4.7 nsec; based on the four-sub-accelerator architecture of the disclosure, a complete result of the convolutional neural network operation is output every 4.51 nsec. As indicated in Table 4: based on the architecture of the single 2048-core accelerator of the prior art, the operation consumes a power of 420 mW; based on the four-sub-accelerator architecture of the disclosure, the operation consumes a power of 370 mW. The above comparison shows that the architecture of the dynamic multi-mode CNN accelerator of the disclosure comprising at least two sub-accelerators is superior to the architecture of the single 2048-core CNN accelerator of the prior art in terms of both operation time and power consumption.
(28) TABLE 3

CNN model                   Cycle of     Cycle of 64 cores,  Cycle of 256 cores,  Cycle of 256 cores,  Cycle of 64 cores,
                            2048 cores   2 GB/s memory       1 GB/s memory        1 GB/s memory        4 GB/s memory
                                         bandwidth           bandwidth            bandwidth            bandwidth
1-layer convolution CONV1   1.81         3.61
2-layer convolution CONV2   0.45                             1.81
3-layer convolution CONV3   0.23                             0.90
4-layer convolution CONV4   0.11                             0.90
5-layer convolution CONV5   0.11                                                  0.90
6-layer convolution CONV6   0.11                                                  0.90
7-layer convolution CONV7   0.35                                                  1.40
8-layer convolution CONV8   0.18                                                                       1.81
9-layer convolution CONV9   1.35                                                                       2.70
Sum                         4.70         3.61                3.61                 3.21                 4.51
(29) TABLE 4

                             Single 2048-core     Dual 64-core and dual 256-core,
                             accelerator          640 cores in total
Frequency                    600 MHz              All 600 MHz
Memory bandwidth             8 GB/s               Respectively 1, 1, 2, 4 GB/s (sum: 8 GB/s)
Internal memory              512 KB               512 KB (128 KB each)
Number of operating clocks   4.7 million          4.5 million
Simulated power consumption  About 420 mW         About 370 mW
Power saving function        Coarse grained       Fine grained
(30) It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed embodiments. It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims and their equivalents.