Compute-In-Memory architecture using look-up tables
20250104762 ยท 2025-03-27
Assignee
Inventors
Cpc classification
G11C11/4096
PHYSICS
International classification
G11C11/4096
PHYSICS
Abstract
A Compute-In-Memory (CIM) architecture includes a plurality of memories and a plurality of processing elements. Each of the processing elements has a look-up table unit configured to store values of a look-up table of a k-cluster residue number system, and output one of the values of the look-up table according to a first remainder and a second remainder as a result of a residue calculation. The look-up table unit receives the first remainder and the second remainder from one of the memories.
Claims
1. A Compute-In-Memory (CIM) architecture, comprising: a plurality of memories; and a plurality of processing elements, each of the processing elements comprising: a look-up table unit for storing values of a look-up table of a k-cluster residue number system, and outputting one of the values of the look-up table according to a first remainder and a second remainder as a result of a residue calculation, wherein the look-up table unit receives the first remainder and the second remainder from one of the memories.
2. The Compute-In-Memory architecture of claim 1, wherein each of the processing elements further comprises: a local memory for temporarily storing data; an address register for reading the first remainder from the local memory according to a first address and reading the second remainder from the local memory according to a second address; a first register for storing the first remainder; and a second register for storing the second remainder.
3. The Compute-In-Memory architecture of claim 2, wherein each of the processing elements further comprises a zero detector, coupled between the look-up table unit and the first register and the second register, for detecting whether any of the first remainder and the second remainder is zero.
4. The Compute-In-Memory architecture of claim 1, wherein the plurality of processing elements is formed in a plurality of first layers, the plurality of memories are formed in a plurality of second layers, and the first layers and the second layers are alternatively staked.
5. The Compute-In-Memory architecture of claim 1, wherein the processing elements form a ten-way switch, and each central processing element of the ten-way switch is linked to the other ten processing elements of the processing elements.
6. The Compute-In-Memory architecture of claim 5, wherein the central processing element of the ten-way switch is connected to the other ten processing elements in directions of up, down, north, south, east, west, north-east, north-west, south-east, and south-west.
7. The Compute-In-Memory architecture of claim 1, wherein the look-up table unit is further used for storing an addition and subtraction look-up table, the addition and subtraction look-up table comprising 2 m.sub.i cells for recording values from zero to (m.sub.i1) in an ascending order twice, wherein m.sub.i is a coprime integer of a modular set of the k-cluster residue number system.
8. The Compute-In-Memory architecture of claim 7, wherein look-up table unit retrieves a value recorded in a cell at position R of the addition and subtraction look-up table when performing an addition operation on the first remainder and the second remainder, where Q=(rx+ry), rx is the first remainder, ry is the second remainder, and the first remainder and the second remainder are integers.
9. The Compute-In-Memory architecture of claim 7, wherein look-up table unit retrieves a value recorded in a cell at position R of the addition and subtraction look-up table when subtracting the first remainder by the second remainder, where R=(m.sub.i+rxry), rx is the first remainder, ry is the second remainder, and the first remainder and the second remainder are integers.
10. The Compute-In-Memory architecture of claim 1, wherein the look-up table unit is further used for storing a multiplication look-up table for a coprime integer m.sub.i of a modular set of the k-cluster residue number system, the coprime integer m.sub.i is not 2, and the multiplication look-up table is composed of S cells,
11. The Compute-In-Memory architecture of claim 1, wherein the look-up table unit is further used for storing an overflow table for overflow detection of addition-subtraction operations and multiplication operations.
12. The Compute-In-Memory architecture of claim 1, wherein each memory comprises a decoder configured to decode an analog signal outputted from each multi-value analog memory into at least two bits, and the decoder supports multiple bits decoding.
13. The Compute-In-Memory architecture of claim 12, wherein the decoder comprises: a first inverter, having an input end for receiving the analog signal; a second inverter, having an input end for receiving the analog signal; and a third inverter, having an input end for receiving the analog signal, wherein a first threshold voltage of the first inverter is less than a second threshold voltage of the second inverter, and the second threshold voltage of the second inverter is less than a third threshold voltage of the third inverter.
14. The Compute-In-Memory architecture of claim 13, wherein the first inverter comprises a PMOS transistor and an NMOS transistor, gates of the PMOS transistor and the NMOS transistor receive the analog signal, and drains of the PMOS transistor and the NMOS transistor are coupled to an output end of the first inverter; wherein the second inverter comprises a PMOS transistor and two cascaded NMOS transistors, gates of the two cascaded NMOS transistors and a gate of the PMOS transistor of the second inverter receive the analog signal, and a drain of the PMOS transistor of the second inverter and an end of the two cascaded NMOS transistors are coupled to an output end of the second inverter; and wherein the third inverter comprises a PMOS transistor and three or more cascaded NMOS transistors.
15. The Compute-In-Memory architecture of claim 13, wherein the decoder further comprises: a fourth inverter, having an input end coupled to an output end of the second inverter; a NOR gate, having a first input end coupled to an output end of the fourth inverter, and a second input end coupled to an output end of the first inverter; a fifth inverter, having an input end coupled to an output end of the NOR gate; and a NAND gate, having a first input end coupled to an output end of the third inverter, and a second input end coupled to an output end of the fifth inverter; wherein the fourth inverter is configured to output a first bit of the at least two bits outputted from the decoder, and the NAND gate is configured to output a second bit of the at least two bits outputted from the decoder.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0007]
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
DETAILED DESCRIPTION
[0022] The present invention provides a new Compute-In-Memory (CIM) architecture that uses the look-up table approach and can be fully integrated with memory. All computations are performed using the look-up table rather than logic gates. This approach also supports analog Compute-In-Memory and replaces the output ADC with a decoder to generate the memory address to access another memory look-up table. This is a true Compute-In-Memory architecture that not only simplifies the design but also improves overall performance with less power dissipation.
[0023] A look-up table (LUT) is a fast way to realize a complex function in digital logic. The address is the function input, and the value at that address is the function output. The advantage is that computing the function only takes a single memory look-up regardless of the complexity of the function, so it is very fast.
[0024] Please refer to
[0025] All computations of the CIM architecture 50 are performed by the look-up table units 76 of the processing elements 62 rather than logic gates. Accordingly, as compared with the conventional CIM architecture 10 shown in
[0026] Each of the processing elements 62 may further comprise a local memory 68, an address register 64, a first register 70, and a second register 72. The local memory 68 is configured to temporarily store data. The address register 64 is configured to read the first remainder rx from the local memory 68 according to a first address A1 and read the second remainder ry from the local memory 68 according to a second address A2. The first register 70 is configured to store the first remainder rx. The second register 72 is configured to store the second remainder ry.
[0027] The CIM architecture 50 is used to establish a Residue Number System (RNS). The RNS is a number system, which first defines a modular set and transforms the numbers to their integer remainders (also called residues) through modulo division, then performs the arithmetic operations (addition, subtraction, and multiplication) on the remainders only. For example, the modular set is defined as (7, 8, 9) with the numbers 13 and 17. The dynamic range is defined by the product of the modular set with the range 504, i.e., 789. It first transforms the numbers to their residue through modulo operations 13.fwdarw.(6, 5, 4) and 17.fwdarw.(3, 1, 8), then performs addition and multiplication on residues only, (6, 5, 4)+(3, 1, 8)=(9, 6, 12).fwdarw.(2, 6, 3), which is equal to 30. (6, 5, 4)*(3, 1, 8)=(18, 5, 32).fwdarw.(4, 5, 5), which is equal to 221. Since the remainder magnitude is much smaller, it only requires simple logic for parallel computations.
[0028] For the sake of clarity, the dynamic range of the RNS may be defined as the following equation (1):
where: [0029] M is the dynamic range of the RNS; and [0030] mi is the ith modulus of the modular set (m0, m1, . . . , mn) of the RNS.
[0031] All the arithmetic operations of the RNS of the CIM architecture 50 can be implemented using look-up tables 78 for parallel distributed computing. However, the memory requirement may be an important issue in using look-up tables in the RNS. The required size of memory is dependent on the square of each modulus as well as the number of bits of the modulus and can be presented as the following equation (2):
where: [0032] Mem is the required size of memory; [0033] mi is the ith modulus; and [0034] bi is the number of bits of the ith modulus.
[0035] For example, it chooses the RNS modular set as (15, 17) with the dynamic range M=1517=255. Since the first modulus (i.e., 15) has a 4-bit length and the second modulus (i.e., 17) has a 5-bit length, for all three arithmetic (i.e., addition, subtraction, and multiplication) operations, the total memory requirement is estimated to be 3(15.sup.24+17.sup.25)=7035 bits. The area is too large compared with the logic gate design (e.g., the processing elements (PEs)) of the RNS.
[0036] To represent an n-bit integer and its negative using a k-cluster residue number system (k-RNS) of the CIM architecture 50, it first defines a modular set of P coprime integers as (m.sub.0, m.sub.1, . . . , m.sub.p) where a dynamic range is generated according to the product of the modular set (m.sub.0, m.sub.1, . . . , m.sub.p). For example, when a modular set of 3 coprime integers is chosen to be (2.sup.n/21, 2, 2.sup.n/2+1), the dynamic range is set to [(2.sup.n1), (2.sup.n2)]. The modular set is not limited to 3 coprime integers.
[0037] The processing element 62 uses one of the look-up tables 78 to perform addition operations and subtraction operations. The look-up table 78 used by the processing element 62 to perform addition and subtraction operations may be called the addition and subtraction look-up table. According to the division algorithm, the following equation (3) in an integral domain could be transformed into the following equation (4) in a remainder domain:
where: [0038] X, Y, and Z are three integers; [0039] rx is equal to (X mod m.sub.i); [0040] ry is equal to (Y mod m.sub.i); [0041] rz is equal to (Z mod m.sub.i); and [0042] m.sub.i is selected from the modular set of the k-cluster residue number system.
[0043]
[0044] Similarly, according to the subtraction algorithm, the following equation (5) in the integral domain could be transformed into the following equation (6) in the remainder domain:
[0045] Therefore, the addition and subtraction look-up table 80 could be used for addition operations and subtraction operations. A sum of the remainder rx (i.e., the minuend) and the modulus m.sub.i may be used as the table entry, and the remainder ry (i.e., the subtrahend) may be used as the table index in descending order to retrieve the result. In the embodiment, a start position Ps would be set to be equal to the sum of the minuend rx and the modulus m.sub.i, i.e. (rx+m.sub.i), and the start position Ps would be shifted according to the subtrahend ry. In detail, when the subtraction operation on two integers X and Y is performed, a value recorded in cell 81 at position R of the addition and subtraction look-up table 80 is retrieved as the result of the subtraction operation, where R=(m.sub.i+ (X mod m.sub.i)(Y mod m.sub.i))=(m.sub.i+rx ry). Since m.sub.i=7, (X mod m.sub.i)=(X mod 7)=rx, and (Y mod m.sub.i)=(Y mod 7)=ry, R=(m.sub.i+rxry). For example, when we subtract 5 with 4, rx=5 and ry=4, the table entry becomes (5+7)=12, where 7 is equal to m.sub.i, then, it shifts to the left by 4, the value recorded in the cell 81 at position 8 is defined as 1, it matches the result of (rx-ry)=1.
[0046]
[0051] According to equations (2) and (7), the look-up table size is reduced from
to
Moreover, according to the multiplication commutative rule, the order of the multiplicand and multiplicator can be exchanged. Therefore, the products of the temporary multiplication look-up table 90 could be mirrored along a Top-Left and Bottom-Right (TL/BR) diagonal line 92 and along with a Top-Right and Bottom-Left (TR/BL) diagonal line 94, as shown in
to
[0052] Accordingly, the data amount of the multiplication look-up table 100 could be presented as the following equation (8):
where: [0053] Mem3 is the data amount of the multiplication look-up table 100; [0054] m.sub.i is the i.sup.th modulus; and [0055] b.sub.i is the number of bits of the i.sup.th modulus.
[0056] Please refer to
[0057] The look-up tables 78 of the look-up table unit 76 may include an overflow table, and the processing element 62 uses the overflow table to detect overflow of addition-subtraction operations and multiplication operations. Unlike the binary system, it is not required to extend the binary number to accommodate the overflow. This approach using the overflow table can keep the lookup table size. More information about the overflow detection may refer to U.S. patent application Ser. No. 17/878,235, filed on Nov. 27, 2022.
[0058] The approach of the present invention supports analog Compute-In-Memory and replaces the output ADC with a decoder to generate the memory address to access another memory look-up table. In an embodiment of the present invention, each of the memories 60 may be an analog memory and comprises a decoder 110 shown in
[0059]
TABLE-US-00001 Component of the decoder 118 124 112 114 116 (B1) 120 122 (B2) Vin > Vt3 0 0 0 1 0 1 1 Vt3 > Vin > Vt2 1 0 0 1 0 1 0 Vt2 > Vin > Vt1 1 1 0 0 1 0 1 Vt1 > Vin 1 1 1 0 0 1 0
[0060] Since the inverters 112, 114 and 116 have different threshold voltages Vt3, Vt2, and Vt1, the decoder 110 could decode the analog signal Vin output from one of the memories 60. The threshold voltages Vt3, Vt2, and Vt1 could be adjusted by changing the transistor size or cascading two or more transistors together to alter the saturation current thereof.
[0061] Since the NMOS transistors, Q21 and Q22 are cascaded, the threshold voltage Vt2 of the inverter 114 would be greater than the threshold voltage Vt1 of the inverter 116. Similarly, the inverter 112 of the decoder 110 may be implemented by replacing the NMOS transistor Q2 with three or more cascaded NMOS transistors, such that the threshold voltage Vt3 of the inverter 112 would be greater than the threshold voltages Vt1 and Vt2.
[0062] The processing elements 62 of the CIM architecture 50 may form a six-way or a ten-way neural network.
[0063] The present invention introduces a new Compute-In-Memory (CIM) architecture that utilizes a look-up table approach and can be fully integrated with memory. All computations are performed using look-up tables instead of logic gates. The CIM architecture of the present invention not only simplifies the design but also improves overall performance while reducing power dissipation.
[0064] Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.