Fast vector multiplication and accumulation circuit
10908879 · 2021-02-02
CPC classification: G06F17/16 (Physics)
International classification: G06F17/17, G06F7/53 (Physics)
Abstract
A fast vector multiplication and accumulation circuit is applied to an artificial neural network accelerator and configured to calculate an inner product of a multiplier vector and a multiplicand vector. A scheduler is configured to arrange a plurality of multiplicands of the multiplicand vector into a plurality of scheduled operands according to a plurality of multipliers of the multiplier vector, respectively. A self-accumulating adder is signally connected to the scheduler and includes a compressor, at least two delay elements and at least one shifter. The compressor is configured to add the scheduled operands to generate a plurality of compressed operands. The at least two delay elements are connected to the compressor. The shifter is configured to shift one of the compressed operands. An adder is signally connected to the output ports of the compressor so as to add the compressed operands to generate the inner product.
Claims
1. A fast vector multiplication and accumulation circuit, which is applied to an artificial neural network accelerator and configured to calculate an inner product of a multiplier vector and a multiplicand vector, the fast vector multiplication and accumulation circuit comprising: a scheduler configured to arrange a plurality of multiplicands of the multiplicand vector into a plurality of scheduled operands according to a plurality of multipliers of the multiplier vector, respectively; a self-accumulating adder signally connected to the scheduler and comprising: a compressor having a plurality of input ports and a plurality of output ports, wherein one of the input ports sequentially receives the scheduled operands, the compressor is configured to add the scheduled operands to generate a plurality of compressed operands, and the compressed operands are transmitted via the output ports; at least two delay elements connected to other two of the input ports of the compressor, respectively, wherein one of the at least two delay elements is connected to one of the output ports; and at least one shifter connected between another one of the output ports and another one of the at least two delay elements, wherein the at least one shifter is configured to shift one of the compressed operands; and an adder signally connected to the output ports of the compressor so as to add the compressed operands to generate the inner product.
2. The fast vector multiplication and accumulation circuit of claim 1, further comprising: an activation unit signally connected to the adder, wherein the activation unit receives the inner product and implements a non-linear operation.
3. The fast vector multiplication and accumulation circuit of claim 2, wherein the non-linear operation comprises a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function or a tanh function.
4. The fast vector multiplication and accumulation circuit of claim 2, wherein the non-linear operation is implemented as a piecewise quadratic approximation.
5. The fast vector multiplication and accumulation circuit of claim 1, wherein the compressor is a full adder, the full adder has a first input port, a second input port, a third input port, a first output port and a second output port, one of the at least two delay elements is disposed between the first input port and the first output port, the other one of the at least two delay elements is disposed between the second input port and the second output port, and the third input port is signally connected to the scheduler.
6. The fast vector multiplication and accumulation circuit of claim 1, wherein the compressor is a 7-to-3 compressor, the 7-to-3 compressor has a first input port, a second input port, a third input port, a fourth input port, a fifth input port, a sixth input port, a seventh input port, a first output port, a second output port and a third output port, the at least two delay elements comprise a first delay element and a second delay element, the at least one shifter comprises a first shifter and a second shifter, the self-accumulating adder further comprises a third delay element, the first delay element is disposed between the first input port and the first output port, the second delay element and the second shifter are disposed between the second input port and the third output port, the third delay element and the first shifter are disposed between the third input port and the second output port, and the fourth input port, the fifth input port, the sixth input port and the seventh input port are signally connected to the scheduler.
7. The fast vector multiplication and accumulation circuit of claim 1, wherein the adder is implemented as a carry look-ahead adder, a carry propagate adder, a carry save adder or a ripple carry adder.
8. The fast vector multiplication and accumulation circuit of claim 1, further comprising: a controlling processor signally connected to the scheduler, the self-accumulating adder and the adder, wherein the controlling processor is configured to control the scheduler, the self-accumulating adder and the adder.
9. The fast vector multiplication and accumulation circuit of claim 1, wherein the scheduler comprises: at least one priority encoder sequentially receiving the multipliers of the multiplier vector, wherein the at least one priority encoder determines at least one valid bit position of each of the multipliers; and at least one barrel shifter sequentially receiving the multiplicands of the multiplicand vector, wherein the at least one barrel shifter is signally connected to the at least one priority encoder, and the at least one barrel shifter is configured to shift the multiplicands to arrange the multiplicands into the scheduled operands according to the at least one valid bit position.
10. The fast vector multiplication and accumulation circuit of claim 1, wherein the fast vector multiplication and accumulation circuit is implemented as an application specific integrated circuit (ASIC) on a semiconductor process, and the semiconductor process comprises a complementary metal-oxide-semiconductor (CMOS) process or a silicon on insulator (SOI) process.
11. The fast vector multiplication and accumulation circuit of claim 1, wherein the fast vector multiplication and accumulation circuit is implemented as a field programmable gate array (FPGA).
12. A fast vector multiplication and accumulation circuit, which is applied to an artificial neural network accelerator and configured to calculate an inner product of a multiplier vector and a multiplicand vector, the fast vector multiplication and accumulation circuit comprising: a scheduler configured to arrange a plurality of multiplicands of the multiplicand vector into a plurality of scheduled operands according to a plurality of multipliers of the multiplier vector, respectively; a self-accumulating adder signally connected to the scheduler, wherein the self-accumulating adder is configured to add the scheduled operands to generate a plurality of compressed operands; and an adder signally connected to the self-accumulating adder so as to add the compressed operands to generate the inner product.
13. The fast vector multiplication and accumulation circuit of claim 12, wherein the scheduler comprises: at least one priority encoder sequentially receiving the multipliers of the multiplier vector, wherein the at least one priority encoder determines at least one valid bit position of each of the multipliers; and at least one barrel shifter sequentially receiving the multiplicands of the multiplicand vector, wherein the at least one barrel shifter is signally connected to the at least one priority encoder, and the at least one barrel shifter is configured to shift the multiplicands to arrange the multiplicands into the scheduled operands according to the at least one valid bit position.
14. The fast vector multiplication and accumulation circuit of claim 12, wherein the adder is implemented as a carry look-ahead adder, a carry propagate adder, a carry save adder or a ripple carry adder.
15. The fast vector multiplication and accumulation circuit of claim 12, wherein the self-accumulating adder comprises: a compressor having a plurality of input ports and a plurality of output ports, wherein one of the input ports sequentially receives the scheduled operands, the compressor is configured to add the scheduled operands to generate the compressed operands, and the compressed operands are transmitted via the output ports; and at least two delay elements connected to other two of the input ports of the compressor, respectively, wherein one of the at least two delay elements is connected to one of the output ports.
16. The fast vector multiplication and accumulation circuit of claim 15, wherein the compressor is a full adder, the full adder has a first input port, a second input port, a third input port, a first output port and a second output port, one of the at least two delay elements is disposed between the first input port and the first output port, the other one of the at least two delay elements is disposed between the second input port and the second output port, and the third input port is signally connected to the scheduler.
17. The fast vector multiplication and accumulation circuit of claim 15, wherein the compressor is a 7-to-3 compressor, the 7-to-3 compressor has a first input port, a second input port, a third input port, a fourth input port, a fifth input port, a sixth input port, a seventh input port, a first output port, a second output port and a third output port, the at least two delay elements comprise a first delay element, a second delay element and a third delay element, the first delay element is disposed between the first input port and the first output port, the second delay element is disposed between the second input port and the third output port, the third delay element is disposed between the third input port and the second output port, and the fourth input port, the fifth input port, the sixth input port and the seventh input port are signally connected to the scheduler.
18. The fast vector multiplication and accumulation circuit of claim 12, wherein the fast vector multiplication and accumulation circuit is implemented as an application specific integrated circuit (ASIC) on a semiconductor process, and the semiconductor process comprises a complementary metal-oxide-semiconductor (CMOS) process or a silicon on insulator (SOI) process.
19. The fast vector multiplication and accumulation circuit of claim 12, wherein the fast vector multiplication and accumulation circuit is implemented as a field programmable gate array (FPGA).
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The present disclosure can be more fully understood by reading the following detailed description of the embodiments, with reference made to the accompanying drawings.
DETAILED DESCRIPTION
(18) The scheduler 200 is configured to arrange a plurality of multiplicands of the multiplicand vector M.sub.c into a plurality of scheduled operands M.sub.s according to a plurality of multipliers of the multiplier vector M.sub.r, respectively. For example, equation (1) represents an inner product computation of the multiplier vector M.sub.r and the multiplicand vector M.sub.c. Table 1 lists the results of the inner product computation of equation (1) accomplished by the fast vector multiplication and accumulation circuit 100.
(19) Z = M.sub.r · M.sub.c = 7×10 + 4×15 + 9×3 = 70 + 60 + 27 = 157  (1)
(20) TABLE 1

Line   Operand               Value
 1     M.sub.c[0]            00001010  (M.sub.s)
 2     M.sub.c[0] (<<1)      00010100  (M.sub.s)
 3     M.sub.c[0] (<<2)      00101000  (M.sub.s)
 4     S[0]                  00110110
 5     C.sub.out[0]          00010000
 6     M.sub.c[1] (<<2)      00111100  (M.sub.s)
 7     S[1]                  00011010
 8     C.sub.out[1]          00110100
 9     M.sub.c[2]            00000011  (M.sub.s)
10     S[2]                  01110001
11     C.sub.out[2]          00010100
12     M.sub.c[2] (<<3)      00011000  (M.sub.s)
13     S[3]                  01111101
14     C.sub.out[3]          00100000
15     S[3]+C.sub.out[3]     10011101 = 157.sub.dec = Z
(21) In equation (1) and Table 1, it is assumed that the multiplicand vector M.sub.c includes three multiplicands M.sub.c[0], M.sub.c[1] and M.sub.c[2]. The decimal representations of the three multiplicands M.sub.c[0], M.sub.c[1] and M.sub.c[2] are 10, 15 and 3, respectively; their binary representations are 00001010, 00001111 and 00000011, respectively. The multiplier vector M.sub.r includes three multipliers, whose decimal representations are 7, 4 and 9 and whose binary representations are 00000111, 00000100 and 00001001, respectively. When the first multiplicand M.sub.c[0] (i.e., 10.sub.dec and 00001010.sub.bin) is multiplied by the first multiplier (i.e., 7.sub.dec and 00000111.sub.bin), the scheduler 200 arranges the first multiplicand M.sub.c[0] into three scheduled operands M.sub.s according to the three 1 bits of the first multiplier (00000111.sub.bin). The three scheduled operands M.sub.s are 00001010, 00010100 and 00101000, respectively: the first is equal to the first multiplicand M.sub.c[0], the second is the first multiplicand M.sub.c[0] left-shifted by one bit, and the third is the first multiplicand M.sub.c[0] left-shifted by two bits, as shown in lines 1-3 of Table 1. Moreover, when the second multiplicand M.sub.c[1] (i.e., 15.sub.dec and 00001111.sub.bin) is multiplied by the second multiplier (i.e., 4.sub.dec and 00000100.sub.bin), the scheduler 200 arranges the second multiplicand M.sub.c[1] into one scheduled operand M.sub.s according to the single 1 bit of the second multiplier (00000100.sub.bin). The scheduled operand M.sub.s is 00111100; in other words, the second multiplicand M.sub.c[1] is left-shifted by two bits to form the scheduled operand M.sub.s, as shown in line 6 of Table 1. In addition, when the third multiplicand M.sub.c[2] (i.e., 3.sub.dec and 00000011.sub.bin) is multiplied by the third multiplier (i.e., 9.sub.dec and 00001001.sub.bin), the scheduler 200 arranges the third multiplicand M.sub.c[2] into two scheduled operands M.sub.s according to the two 1 bits of the third multiplier (00001001.sub.bin). The two scheduled operands M.sub.s are 00000011 and 00011000, respectively: the first is equal to the third multiplicand M.sub.c[2], and the second is the third multiplicand M.sub.c[2] left-shifted by three bits, as shown in lines 9 and 12 of Table 1.
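The scheduling rule above, one left-shifted copy of the multiplicand per 1 bit of the multiplier, can be sketched behaviorally in Python (the function name `schedule_operands` is illustrative, not from the patent):

```python
def schedule_operands(multiplicand, multiplier):
    """Return one scheduled operand per 1 bit of the multiplier:
    the multiplicand left-shifted by each valid bit position."""
    operands = []
    bit = 0
    while multiplier >> bit:          # stop once no 1 bits remain
        if (multiplier >> bit) & 1:
            operands.append(multiplicand << bit)
        bit += 1
    return operands
```

For the worked example, `schedule_operands(0b00001010, 0b00000111)` yields `[0b00001010, 0b00010100, 0b00101000]`, matching lines 1-3 of Table 1.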
(22) The self-accumulating adder 300 is signally connected to the scheduler 200. The self-accumulating adder 300 is configured to add the scheduled operands M.sub.s to generate a plurality of compressed operands S[n], C.sub.out[n], wherein n is an integer greater than or equal to 0. For example, the self-accumulating adder 300 sequentially performs four addition operations, which include a first addition operation, a second addition operation, a third addition operation and a fourth addition operation, as shown in equation (1) and Table 1. In the first addition operation, the self-accumulating adder 300 adds three scheduled operands M.sub.s (i.e., M.sub.c[0]=00001010, M.sub.c[0](<<1)=00010100 and M.sub.c[0](<<2)=00101000) to generate two compressed operands S[0], C.sub.out[0], as shown in lines 4 and 5 of Table 1. In the second addition operation, the self-accumulating adder 300 adds the two compressed operands S[0], C.sub.out[0] and a scheduled operand M.sub.s (i.e., M.sub.c[1](<<2)=00111100) to generate two compressed operands S[1], C.sub.out[1], as shown in lines 7 and 8 of Table 1. In the third addition operation, the self-accumulating adder 300 adds the two compressed operands S[1], C.sub.out[1] and a scheduled operand M.sub.s (i.e., M.sub.c[2]=00000011) to generate two compressed operands S[2], C.sub.out[2], as shown in lines 10 and 11 of Table 1. In the fourth addition operation, the self-accumulating adder 300 adds the two compressed operands S[2], C.sub.out[2] and a scheduled operand M.sub.s (i.e., M.sub.c[2](<<3)=00011000) to generate two compressed operands S[3], C.sub.out[3], as shown in lines 13 and 14 of Table 1.
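The self-accumulating behavior is a carry-save loop: each step compresses the running sum word, the running carry word and the next scheduled operand into a new sum/carry pair without propagating carries. The sketch below feeds one operand per step, so its intermediate pairs differ from the grouping shown in Table 1, but the invariant that sum plus carry equals the running total is preserved at every step:

```python
def accumulate_csa(operands):
    """3:2 carry-save accumulation: bitwise XOR gives the sum word,
    the bitwise majority function shifted left by one gives the carry
    word. No carry ripples until the final addition."""
    s, c = 0, 0
    for m in operands:
        s, c = s ^ c ^ m, ((s & c) | (s & m) | (c & m)) << 1
    return s, c
```

Feeding in the six scheduled operands of Table 1 leaves a final pair with s + c = 157, the inner product.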
(23) The adder 400 is signally connected to the output ports S, C.sub.out of the self-accumulating adder 300 so as to add the two compressed operands S[3], C.sub.out[3] to generate the inner product Z, as shown in line 15 of Table 1. The adder 400 may be implemented as a carry look-ahead adder, a carry propagate adder, a carry save adder or a ripple carry adder.
(24) In addition, a controlling processor 500 is disposed in the artificial neural network accelerator 110 and signally connected to the scheduler 200, the self-accumulating adder 300 and the adder 400. The controlling processor 500 is configured to control the scheduler 200, the self-accumulating adder 300 and the adder 400. The controlling processor 500 may be a central processing unit (CPU), a micro-control unit (MCU), or other control logic circuits. The artificial neural network accelerator 110 includes a plurality of layer processing modules (not shown). The controlling processor 500 is signally connected to the layer processing modules. The controlling processor 500 detects the layer processing modules. The controlling processor 500 generates a plurality of controlling signals and transmits the controlling signals to the scheduler 200, the self-accumulating adder 300 and the adder 400 according to a processed result of the layer processing modules so as to determine a schedule or stop an operation of the scheduler 200, the self-accumulating adder 300 and the adder 400. In another embodiment, the artificial neural network accelerator 110 includes a first layer processing module and a second layer processing module. The first layer processing module has a first layer output end. The second layer processing module has a second layer input end. The fast vector multiplication and accumulation circuit 100 is disposed between the first layer output end of the first layer processing module and the second layer input end of the second layer processing module to process an output signal of the first layer processing module. In addition, the fast vector multiplication and accumulation circuit 100 may be implemented as an application specific integrated circuit (ASIC) on a semiconductor process, and the semiconductor process includes a complementary metal-oxide-semiconductor (CMOS) process or a silicon on insulator (SOI) process. 
The fast vector multiplication and accumulation circuit 100 may be implemented as a field programmable gate array (FPGA). Therefore, the fast vector multiplication and accumulation circuit 100 of the present disclosure is suitable for use in the artificial neural network accelerator 110 and utilizes the self-accumulating adder 300 combined with application-specific integrated circuits (ASIC) to accomplish a fast inner product operation, thereby greatly reducing the computational complexity, latency and power consumption.
(26) The priority encoder 210 sequentially receives the multipliers of the multiplier vector M.sub.r. The priority encoder 210 determines at least one valid bit position of each of the multipliers; in other words, it determines each position at which a bit of the multiplier is equal to 1. The priority encoder 210 includes eight priority encoding input ports M.sub.0, M.sub.1, M.sub.2, M.sub.3, M.sub.4, M.sub.5, M.sub.6, M.sub.7, nine priority controlling signals P.sub.0, P.sub.1, P.sub.2, P.sub.3, P.sub.4, P.sub.5, P.sub.6, P.sub.7, P.sub.8, eight priority encoding output ports EP.sub.0, EP.sub.1, EP.sub.2, EP.sub.3, EP.sub.4, EP.sub.5, EP.sub.6, EP.sub.7 and a signal READY. The eight priority encoding input ports M.sub.0-M.sub.7 receive the multipliers of the multiplier vector M.sub.r. The nine priority controlling signals P.sub.0-P.sub.8 are internal signals of the priority encoder 210 and represent a priority status. The priority controlling signal P.sub.0 is equal to 1 (i.e., a logical true value). When one of the nine priority controlling signals P.sub.n is 0, the subsequent priority controlling signals P.sub.n+1-P.sub.8 cannot obtain the priority state. The priority encoder 210 includes nineteen AND gates and nine inverters.
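Behaviorally, the priority encoder picks out one valid bit position of the multiplier per cycle. A minimal sketch, assuming priority is resolved from the least significant bit (the text above does not state the scan direction, so LSB-first is an assumption):

```python
def priority_encode(m, width=8):
    """Return the position of the lowest-order 1 bit of m, or None
    when m is zero (modeling the READY signal being deasserted)."""
    for i in range(width):
        if (m >> i) & 1:
            return i
    return None
```

For the first multiplier 00000111, successive calls after clearing each found bit would report positions 0, 1 and 2, the three shift amounts used in Table 1.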
(27) The structure of the barrel shifter 220a is the same as the structure of the barrel shifter 220b. The barrel shifter 220a includes a plurality of tri-state buffers, eight barrel shifting input ports x.sub.0, x.sub.1, x.sub.2, x.sub.3, x.sub.4, x.sub.5, x.sub.6, x.sub.7, eight barrel shifting output ports y.sub.0, y.sub.1, y.sub.2, y.sub.3, y.sub.4, y.sub.5, y.sub.6, y.sub.7 and eight barrel shifting control ports w.sub.0, w.sub.1, w.sub.2, w.sub.3, w.sub.4, w.sub.5, w.sub.6, w.sub.7.
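For the scheduling described earlier, the barrel shifter's job reduces to a left shift of the multiplicand by the valid bit position, truncated to the 8-bit bus. A behavioral stand-in for the tri-state-buffer network (the truncating-shift behavior is an assumption inferred from Table 1, where all values stay 8 bits wide):

```python
def barrel_shift(x, amount, width=8):
    """Logical left shift by `amount`, keeping only `width` bits,
    as a one-cycle barrel shifter would."""
    return (x << amount) & ((1 << width) - 1)
```

For example, shifting the second multiplicand by its valid bit position gives `barrel_shift(0b00001111, 2)` = `0b00111100`, line 6 of Table 1.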
(28) The five delay elements 230 and the four switch elements 240 are controlled by the controlling processor 500. The controlling processor 500 can generate control signals to allow the input ports and output ports of the priority encoder 210 and the barrel shifters 220a, 220b to correctly correspond to each other in time, thereby improving the efficiency of the pipeline. The delay elements 230 are configured to delay signals. The switch elements 240 are configured to determine whether to load a new multiplier vector M.sub.r and a new multiplicand vector M.sub.c into the scheduler 200 or to use a feedback path in the scheduler 200 to shift output signals of the barrel shifters 220a, 220b.
(30) TABLE 2

X   Y   C.sub.in   S   C.sub.out
0   0      0       0      0
0   0      1       1      0
0   1      0       1      0
1   0      0       1      0
0   1      1       0      1
1   0      1       0      1
1   1      0       0      1
1   1      1       1      1
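Table 2 is the standard full-adder truth table: the sum bit is the three-way XOR of the inputs, and the carry-out is their majority function. A one-line check:

```python
def full_adder(x, y, cin):
    """One-bit full adder matching Table 2: S = X xor Y xor Cin,
    C.out = majority(X, Y, Cin)."""
    s = x ^ y ^ cin
    cout = (x & y) | (x & cin) | (y & cin)
    return s, cout
```

Every row of Table 2 satisfies the identity S + 2·C.out = X + Y + C.in, which is why chaining these cells in carry-save form preserves the running total.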
(36) TABLE 3

X.sub.0  X.sub.1  X.sub.2  X.sub.3  X.sub.4  X.sub.5  X.sub.6    Y.sub.0  Y.sub.1  Y.sub.2
   0        0        0        0        0        0        1          0        0        1
   0        0        0        0        0        1        0          0        0        1
   0        0        0        0        1        0        0          0        0        1
                                    . . .
   0        0        0        0        0        1        1          0        1        0
   0        0        0        0        1        0        1          0        1        0
                                    . . .
   0        0        0        0        1        1        1          0        1        1
   0        0        0        1        1        1        0          0        1        1
                                    . . .
   1        1        1        1        1        1        0          1        1        0
   1        1        1        1        1        0        1          1        1        0
                                    . . .
   1        1        1        1        1        1        1          1        1        1
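Every row of Table 3 fits one rule: the 3-bit output Y.sub.0 Y.sub.1 Y.sub.2 is the binary count of 1s among the seven inputs, with Y.sub.0 as the most significant bit. A behavioral model:

```python
def compress_7_to_3(bits):
    """7-to-3 compressor: output (Y0, Y1, Y2) is the population count
    of the seven input bits, MSB first, as in Table 3."""
    count = sum(bits)
    return ((count >> 2) & 1, (count >> 1) & 1, count & 1)
```

For instance, two set inputs give (0, 1, 0) and all seven set give (1, 1, 1), matching the listed rows.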
(38) The scheduling step S22 is for driving a scheduler 200 to arrange a plurality of multiplicands of the multiplicand vector M.sub.c into a plurality of scheduled operands M.sub.s according to a plurality of multipliers of the multiplier vector M.sub.r, respectively. In detail, the scheduling step S22 includes a priority encoding step S222 and a barrel shifting step S224. The priority encoding step S222 is for driving a priority encoder 210 to determine at least one valid bit position of each of the multipliers. The barrel shifting step S224 is for driving a barrel shifter 220a, 220b to shift the multiplicands so as to arrange the multiplicands into the scheduled operands M.sub.s according to the at least one valid bit position.
(39) The self-accumulating and adding step S24 is for driving a self-accumulating adder 300 to add the scheduled operands M.sub.s so as to generate a plurality of compressed operands S[n], C.sub.out[n].
(40) The adding step S26 is for driving an adder 400 or an adder 400a to add the compressed operands S[n], C.sub.out[n] to generate an inner product Z.
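The three steps S22, S24 and S26 compose into a complete inner-product computation. An end-to-end behavioral sketch (the function name and structure are illustrative, not the patent's hardware):

```python
def inner_product(multipliers, multiplicands, width=8):
    """Scheduling, carry-save accumulation, then one final add."""
    # Scheduling step: one left-shifted operand per 1 bit of each multiplier.
    operands = [mc << i
                for mr, mc in zip(multipliers, multiplicands)
                for i in range(width) if (mr >> i) & 1]
    # Self-accumulating step: 3:2 carry-save compression, no carry ripple.
    s, c = 0, 0
    for m in operands:
        s, c = s ^ c ^ m, ((s & c) | (s & m) | (c & m)) << 1
    # Adding step: a single carry-propagating addition produces Z.
    return s + c
```

Running it on the worked example reproduces equation (1): `inner_product([7, 4, 9], [10, 15, 3])` returns 157.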
(41) The activation step S28 is for driving an activation unit 600 to receive the inner product Z and implement a non-linear operation, such as a sigmoid function, a signum function, a threshold function, a piecewise-linear function, a step function or a tanh function.
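Claim 4 notes that the non-linear operation may be implemented as a piecewise quadratic approximation. One common approximation of that form for the sigmoid is sketched below; the breakpoints and coefficients are illustrative, since the text does not fix them:

```python
def sigmoid_pwq(x):
    """Piecewise quadratic sigmoid approximation: saturates outside
    |x| >= 4, uses one quadratic segment per sign, and is exact at
    x = 0 (value 0.5) and at the saturation points."""
    if x <= -4.0:
        return 0.0
    if x >= 4.0:
        return 1.0
    if x < 0.0:
        return 0.5 * (1.0 + x / 4.0) ** 2
    return 1.0 - 0.5 * (1.0 - x / 4.0) ** 2
```

The two segments are mirror images, so the approximation keeps the sigmoid's symmetry sigma(x) + sigma(-x) = 1 while needing only a shift, a multiply and an add per evaluation in fixed-point hardware.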
(42) TABLE 4

Hardware complexity                  Direct MAC     Present disclosure
# Full adders                        80 (64 + 16)   32 (16 + 16)
# FA time/matrix OP (worst case)     N              N
# FA time/matrix OP (equal 0/1)      N              N/2
(43) According to the aforementioned embodiments and examples, the advantages of the present disclosure are described as follows.
(44) 1. The fast vector multiplication and accumulation circuit and the fast vector multiplication and accumulation method of the present disclosure utilize the self-accumulating adder combined with application-specific integrated circuits (ASIC) to accomplish a fast inner product operation, thereby greatly reducing the computational complexity, latency and power consumption. In addition, the fast vector multiplication and accumulation circuit and the fast vector multiplication and accumulation method of the present disclosure utilize a multi-bit compressor of the self-accumulating adder and a binary arithmetic coding of the scheduler to greatly enhance a level of vector parallelism of a long vector inner product operation.
(45) 2. The fast vector multiplication and accumulation circuit and the fast vector multiplication and accumulation method of the present disclosure are suitable for use in an inner product operation of the artificial neural network.
(46) 3. The fast vector multiplication and accumulation circuit and the fast vector multiplication and accumulation method of the present disclosure utilize the scheduling step combined with the self-accumulating and adding step to accomplish a fast inner product operation, thereby not only greatly reducing the computational complexity, latency and power consumption, but also reducing the chip area and the cost of production.
(47) Although the present disclosure has been described in considerable detail with reference to certain embodiments thereof, other embodiments are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the embodiments contained herein.
(48) It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present disclosure without departing from the scope or spirit of the disclosure. In view of the foregoing, it is intended that the present disclosure cover modifications and variations of this disclosure provided they fall within the scope of the following claims.