Sub-cell, MAC Array and Bit-width Reconfigurable Mixed-signal In-memory Computing Module
20220351761 · 2022-11-03
Inventors
- Minhao Yang (Zürich, CH)
- Hongjie Liu (Shenzhen, Guangdong, CN)
- Alonso Morgado (Villach, AT)
- Neil Webb (Kilchberg, CH)
CPC classification
G11C11/41
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
H03M1/468
ELECTRICITY
H03M1/462
ELECTRICITY
G11C7/16
PHYSICS
International classification
G11C7/10
PHYSICS
Abstract
A mixed-signal in-memory computing sub-cell requires only 9 transistors for 1-bit multiplication. A computing cell is constructed from a plurality of such sub-cells that share a common computing capacitor and a common transistor. Also proposed is a MAC array for performing MAC operations, which includes a plurality of the computing cells, each activating the sub-cells therein in a time-multiplexed manner. Also proposed are a differential version of the MAC array with improved computation error tolerance and an in-memory mixed-signal computing module for digitizing the parallel analog outputs of the MAC array and for performing other tasks in the digital domain. An ADC block in the computing module makes full use of the capacitors in the MAC array, thus allowing the computing module to have a reduced area and suffer from fewer computation errors. Also proposed is a method of taking full advantage of data sparsity to lower the ADC block's power consumption.
Claims
1. An in-memory mixed-signal computing sub-cell, configured for 1-bit multiplication, the in-memory mixed-signal computing sub-cell comprising a conventional six-transistor (6T) static random-access memory (SRAM) cell, a complementary transmission gate, a first NMOS transistor and a computing capacitor, wherein the conventional 6T SRAM cell consists of MOS transistors M.sub.1, M.sub.2, M.sub.3, M.sub.4, M.sub.5, M.sub.6, in which a complementary metal-oxide-semiconductor (CMOS) inverter consisting of the MOS transistors M.sub.1, M.sub.2 is cross-coupled to a CMOS inverter consisting of the MOS transistors M.sub.3, M.sub.4, the cross-coupled two CMOS inverters store a 1-bit filter parameter, and the MOS transistors M.sub.5, M.sub.6 serve as control switches for bit lines for reading and writing the filter parameter, the CMOS inverter consisting of the MOS transistors M.sub.1, M.sub.2 in the conventional 6T SRAM cell comprises an output connected to an input of a complementary transmission gate, the complementary transmission gate comprising an output connected to a drain of the first NMOS transistor, the first NMOS transistor has a source that is grounded, and a drain connected to a bottom plate of the computing capacitor, the complementary transmission gate comprises an NMOS transistor having a gate connected to an input signal, and a PMOS transistor having a gate connected to a complementary input signal at a same voltage level during computation as a signal input to a gate of the first NMOS transistor, a multiplication result of the input signal and the filter parameter is stored as a voltage on the bottom plate of the computing capacitor, and a plurality of said sub-cells are arranged to form a computing cell, the first NMOS transistor and the computing capacitor are shared among all the computing sub-cells in the computing cell.
2. The in-memory mixed-signal computing sub-cell of claim 1, wherein the sub-cells in the computing cell are activated in a time-multiplexed manner where the complementary input signal at the gate of the PMOS transistor in the complementary transmission gate of the sub-cells that are active at a given time is at the same voltage level as the signal to which the gate of the first NMOS transistor is connected.
3. A multiply-accumulate (MAC) array, configured for performing MAC operations and comprising the in-memory mixed-signal computing sub-cells of claim 2, wherein the MAC array comprises a plurality of computing cells, in each of which the outputs of the complementary transmission gates of all the sub-cells are connected to the bottom plate of the shared computing capacitor, wherein top plates of the computing capacitors in all the computing cells of each column are connected to a respective accumulation bus, and wherein a voltage on each accumulation bus corresponds to an accumulated sum of multiplication operation results of the respective column of the MAC array.
4. The MAC array of claim 3, further comprising a plurality of differential computing cells each comprising a differential complementary transmission gate, a differential computing capacitor and a first PMOS transistor, wherein in each of the computing cells in the MAC array, the output of the CMOS inverter consisting of the MOS transistors M.sub.3, M.sub.4 in each conventional 6T SRAM cell is connected to an input of a respective differential complementary transmission gate, and all the differential complementary transmission gates connected to the respective CMOS inverters each consisting of the MOS transistors M.sub.3, M.sub.4 are connected at outputs thereof to a drain of a respective first PMOS transistor, the drain of the respective first PMOS transistor is connected to a bottom plate of a respective differential computing capacitor, the respective first PMOS transistor has a source connected to VDD, wherein differential multiplication results are stored as bottom plate voltages of the respective differential computing capacitors, and wherein top plates of the differential computing capacitors of the differential computing cells in each column are connected to a respective differential accumulation bus.
5. The MAC array of claim 3, further comprising first CMOS inverters and differential computing capacitors, wherein in each of the computing cells in the MAC array, the outputs of all the complementary transmission gates are connected to an input of a respective one of the first CMOS inverters, and an output of the first CMOS inverter is connected to a bottom plate of a respective one of the differential computing capacitors, wherein differential multiplication results are stored as bottom plate voltages of the respective differential computing capacitors, and wherein top plates of the differential computing capacitors in each column are connected to a respective differential accumulation bus.
6. A bit-width reconfigurable mixed-signal in-memory computing module, comprising: the MAC array of claim 3, wherein column-wise accumulation results of multiplication results in the MAC array are represented as analog voltages; a filter/ifmap block, configured for providing filter parameters or activations from computation in a previous layer, which are written into and stored in the MAC array; an ifmap/filter block, configured for providing an input to the MAC array, which is subject to MAC operations with the filter parameters or the activations from computation in the previous layer; an analog-to-digital conversion (ADC) block, configured for converting the analog voltages from the MAC array to digital representations; and a digital processing block, configured for performing at least multi-bit fusion, biasing, scaling or nonlinearity on an output of the ADC block, and wherein outputting results in a form of partial sums or activations are directly usable as an input to the next network layer.
7. The computing module of claim 6, wherein the ADC block is successive approximation register (SAR) ADCs of a binarily weighted capacitor array, each of the SAR ADCs comprising: a MAC digital-to-analog converter (DAC) consisting of the computing capacitors in a respective column of the MAC array; a SAR DAC, which is an array consisting of a plurality of binarily weighted capacitors and one redundant capacitor of a same capacitance as a least significant bit (LSB) capacitor therein; a comparator; a switching sequence; and SAR logic configured for controlling the switching sequence.
8. The computing module of claim 7, wherein an output voltage of the MAC DAC is taken as an input to one end of the comparator, and an output voltage of the SAR DAC is taken as an input to the other end of the comparator.
9. The computing module of claim 7, wherein an output voltage generated by the parallelly connected capacitors in the MAC DAC and in the SAR DAC is taken as an input to one end of the comparator, and a comparison voltage V.sub.ref is taken as an input to the other end of the comparator.
10. The computing module of claim 8, wherein one half-LSB capacitor is added to both positive V.sub.+ and negative V.sub.− input ends of the comparator, and wherein an output voltage of the MAC DAC and the half-LSB capacitor connected in parallel with the MAC DAC is taken as an input to the one end of the comparator, and an output voltage of the SAR DAC and the half-LSB capacitor connected in parallel with the SAR DAC is taken as an input to the other end of the comparator.
11. The computing module of claim 7, wherein the MAC DAC and a half-LSB capacitor are both connected to the switching sequence and reused as the SAR DAC, and an output voltage of the dual-use DAC is taken as an input to one end of the comparator, and wherein a comparison voltage V.sub.ref is taken as an input to the other end of the comparator.
12. The computing module of claim 7, wherein the SAR ADC further comprises a differential MAC DAC consisting of differential capacitors in a respective column of the MAC array.
13. The computing module of claim 12, wherein the MAC DAC and an additional LSB capacitor connected in parallel therewith are both connected to the switching sequence and reused as the SAR DAC, and an output voltage of the dual-use DAC is taken as an input to one end of the comparator, and wherein the differential MAC DAC and an additional differential LSB capacitor connected in parallel therewith are both connected to the switching sequence and reused as a differential SAR DAC, and an output voltage of the dual-use differential DAC is taken as an input to the other end of the comparator.
14. The computing module of claim 9, wherein a bit-width of the SAR ADC is determined in real time by the sparsity of input data and stored data in the computing array and expressed as ceil(log.sub.2(min(X,W)+1)), where ceil is a ceiling function, min is a minimum function, X is the number of 1 within a 1-bit input vector, W is the number of 1 stored in one column of the computing array, and the real-time bit-width calculation expression is equivalently implemented by digital combinatorial logic in circuitry.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0033]-[0051] (The descriptions of the drawings are not reproduced in this text.)
DETAILED DESCRIPTION
[0052] The objects, principles, features and advantages of the present invention will become more apparent from the following detailed description of embodiments thereof, which is to be read in connection with the accompanying drawings. It will be appreciated that the particular embodiments disclosed herein are illustrative and are not intended to limit the present invention, as explained elsewhere herein.
[0053] It is particularly noted that, for brevity of illustration, some connections or positional relationships that can be inferred from the text of this specification or the teachings disclosed herein are omitted from the figures, and not all positional changes are depicted. A positional change that is not explicitly described or illustrated should not be taken as not having occurred. This point is stated once here and, for conciseness, is not repeated in the detailed description below.
[0054] As a common application scenario, bit-width reconfigurable mixed-signal computing modules provided in embodiments of the present invention can be used in visual and acoustic DNN architectures, in particular in object detection, acoustic feature extraction with low power consumption, etc.
[0055] For example, in the case of feature extraction, a feature extractor convolves the data to be processed with a filter consisting of weights and outputs feature maps. Depending on the filter selected, different features may be extracted. In this process, the convolution of the data with the filter is the most power-consuming step, so it is necessary to avoid spending power on unconditional circuit driving or the like, in particular when the data to be processed is a sparse matrix.
[0056]
[0057] An output of the CMOS inverter consisting of the MOS transistors M.sub.1, M.sub.2 in the conventional 6T SRAM cell is connected to an input of the complementary transmission gate, and an output of the complementary transmission gate is connected to a drain of the first NMOS transistor.
[0058] A source of the first NMOS transistor is grounded, and the drain thereof is connected to a bottom plate of the computing capacitor.
[0059] During computation, in the complementary transmission gate of the sub-cell, a gate of an NMOS transistor is connected to an input signal and a gate of a PMOS transistor is at the same voltage level as a signal input to a gate of the first NMOS transistor.
[0060] A multiplication result of the input signal and the filter parameter is stored as a voltage on the bottom plate of the computing capacitor, and a plurality of such computing sub-cells are arranged to form a computing cell in such a manner that the first NMOS transistor and the computing capacitor are shared among all the sub-cells in the computing cell.
[0061] The input signals at the NMOS and PMOS gates of the complementary transmission gate are denoted as A and nA, respectively, and the signal at the gate of the first NMOS transistor is denoted as B. In particular, as shown in
[0062] Optionally, the sub-cell may follow the procedure below to perform a 1-bit multiplication operation:
[0063] 1. Reset a top plate voltage V.sub.top of the computing capacitor to V.sub.rst through a reset switch S.sub.rst on the accumulation bus.
[0064] 2. Turn on the first NMOS transistor in the sub-cell by raising the signal B at its gate to VDD, thus resetting the bottom plate voltage V.sub.btm of the capacitor to 0, keep the input signals A and nA of the complementary transmission gate of the sub-cell at 0 and VDD, respectively, and disconnect S.sub.rst after V.sub.btm is reset to 0.
[0065] 3. During computation, activate the input signals A and nA in the sub-cell according to the truth table of the 1-bit multiplication operation as shown in
[0066] 4. After the multiplication operation in the sub-cell is completed, either maintain the bottom plate voltage V.sub.btm of the computing capacitor at 0 or raise it to VDD, and output a result of the multiplication operation as the computing capacitor's bottom plate voltage V.sub.btm, expressed as VDD×w×A.
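The outcome of steps 1 to 4 can be illustrated with a minimal behavioral sketch. It models only the idealized result V.sub.btm = VDD×w×A stated above, not the transistor-level circuit; the function name `sub_cell_multiply` and the supply value of 1.0 V are assumptions chosen for illustration.

```python
VDD = 1.0  # assumed supply voltage in volts (illustrative value only)


def sub_cell_multiply(w: int, a: int, vdd: float = VDD) -> float:
    """Idealized bottom-plate voltage after a 1-bit multiplication.

    w is the 1-bit filter parameter stored in the SRAM cell and a is the
    1-bit input signal A. Returns V_btm = VDD * w * A.
    """
    if w not in (0, 1) or a not in (0, 1):
        raise ValueError("w and a must be 1-bit values (0 or 1)")
    # A = 0: the transmission gate is off and B (= nA = VDD) keeps V_btm at 0.
    # A = 1: the transmission gate conducts and V_btm follows the stored w.
    return vdd * w * a


# Enumerate all four input combinations of the 1-bit multiplication.
truth_table = {(w, a): sub_cell_multiply(w, a) for w in (0, 1) for a in (0, 1)}
```

Only the combination w = 1, A = 1 raises the bottom plate to VDD; the other three combinations leave it at 0.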
[0067] It is to be understood that the sub-cell accomplishes the 1-bit multiplication operation (of the filter parameter w and the input signal A) with only 9 transistors, thus having a reduced sub-cell area and higher energy efficiency. It would be appreciated that the first NMOS transistor is included in the sub-cell for the purpose of control, and that the 1-bit multiplication result of the input signal A and the filter parameter w stored in the SRAM cell is stored as the bottom plate voltage of the computing capacitor. For ease of description, the structure in which the SRAM cell is connected to the complementary transmission gate, which contains 8 transistors, is referred to as an 8T structure (or 8T sub-cell). This computing sub-cell is an extended version of the conventional 6T SRAM cell. Its standardized structure promises better economic benefits in practical applications and allows enhanced sub-cell scalability. Further, instead of connecting the complementary transmission gate to the top plate of the computing capacitor as conventionally practiced, connecting it to the bottom plate can minimize computation errors, in particular those caused by clock feedthrough introduced by MOS transistor switches, charge injection occurring during on-to-off switching, nonlinear parasitic capacitance at the drains/sources of the transistors in the complementary transmission gate, leakage of the transistors themselves, etc.
[0068] In order to further reduce the number of components in the sub-cell, in some embodiments, a plurality of the sub-cells may be arranged in a feasible shape such as 2×2, 4×2, etc., as shown in
[0069] Moreover, the area of a single capacitor is generally several times that of a 6T SRAM cell. Compared with equipping each 1-bit multiplication sub-cell with its own capacitor for storing a computation result, having multiple 1-bit multiplication sub-cells share a single capacitor can therefore greatly improve the storage capacity per unit area. That is, more filter parameters or weights can be stored per unit area than with the conventional techniques.
[0070] Additionally, the sub-cells in the computing cell may be activated in a time-multiplexed manner. That is, when any sub-cell is activated, all the other sub-cells are deactivated. The activated sub-cell can perform a 1-bit multiplication operation in the way as described above according to the truth table in
[0071] It will be appreciated that the above sharing arrangement is also applicable to any sub-cell structure other than the conventional 6T SRAM cell that can equally perform the functions of storing and reading 1-bit filter parameters.
[0072] In a second aspect, a multiply-accumulate (MAC) array for MAC operation is provided on the basis of the sub-cells according to the first aspect and possible implementations thereof. Referring to
[0073] In contrast to a MAC array constructed from individual sub-cells, the computing cells with shared capacitors and transistors enable the MAC array to store more neural network parameters or computation results from the previous network layer. Specifically, the results of 1-bit multiplication operations in the computing cells are stored in the computing capacitors, and the 1-bit multiplication results from the computing cells of each column in the MAC array are accumulated by the respective accumulation bus to which the top plates of the computing capacitors are connected.
[0074] For in-memory computing, reducing data movement between the inside and outside of the computing chip is a direct way to reduce power consumption. It will be appreciated that this design allows the MAC array to contain more SRAM cells per unit area, which can store more filter parameters compared to conventional techniques. In each cell, after an in-memory computation is completed in one sub-cell, another in-memory computation can be immediately initiated with a filter parameter stored in another sub-cell of the same cell, without waiting for the transfer of data from outside into the SRAM. This enhances throughput and reduces power and area consumption. As the area of a computing capacitor is typically several times that of a conventional 6T SRAM cell, reducing the number of capacitors in each computing cell can improve the array's throughput and reduce its power consumption.
[0075] Referring to
[0076] In some embodiments, the MAC array may follow “Procedure I” below to perform a MAC operation:
[0077] 1. First write filter parameters (or activations from computation in the previous network layer) into the sub-cells following the 6T SRAM write procedure.
[0078] 2. Reset the top plate voltage V.sub.top of the computing capacitors to V.sub.rst, which may be 0, through a reset switch S.sub.rst on the accumulation bus.
[0079] 3. Reset the bottom plate voltages V.sub.btmi of the computing capacitors to 0 by raising the signal B.sub.i in every computing cell to VDD, keep the signals A.sub.ij and nA.sub.ij in every computing cell at 0 and VDD, respectively, and disconnect S.sub.rst.
[0080] 4. During computation, activate the signals A.sub.ij and nA.sub.ij in a time-multiplexed manner. For example, when A.sub.0a and nA.sub.0a are activated, A.sub.0j and nA.sub.0j (j=b, c, d) are deactivated, i.e. kept at 0 and VDD, respectively. It is to be noted that, during computation, B.sub.0 in one computing cell is at the same voltage level as nA.sub.0j in the then activated sub-cell.
[0081] 5. After the multiplication in each computing cell in a column is completed, either maintain the bottom plate voltages V.sub.btmi of the computing capacitors at 0, or raise them to VDD. Charge redistribution occurs in the computing capacitors in the column, similar to the charge distribution in capacitors of a successive approximation register (SAR) digital-to-analog converter (DAC), and when not considering non-idealities such as parasitic capacitance and so on, the analog output voltage V.sub.top of the computing capacitors in the column represents the accumulation result expressed in the equation below, as shown in
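The charge redistribution in Procedure I can be sketched behaviorally. The model below assumes N equal computing capacitors in the column, one active sub-cell per computing cell, and no parasitic capacitance or other non-idealities. Under these assumptions, charge conservation on the floating accumulation bus suggests V.sub.top = V.sub.rst + (VDD/N)×Σ(w.sub.i×A.sub.i). The function and parameter names are hypothetical and the sketch is an idealization, not a circuit simulation.

```python
def procedure_i_top_voltage(weights, inputs, vdd=1.0, v_rst=0.0):
    """Idealized V_top of one column after Procedure I.

    weights[i] and inputs[i] are the 1-bit filter parameter and the 1-bit
    input of the active sub-cell in the i-th computing cell of the column.
    """
    n = len(weights)
    if n == 0 or n != len(inputs):
        raise ValueError("weights and inputs must be non-empty and equal in length")
    # Bottom-plate voltage of each computing capacitor after multiplication.
    v_btm = [vdd * w * a for w, a in zip(weights, inputs)]
    # The top node floats once S_rst is opened, so its charge is conserved:
    #   sum_i C * (V_top - V_btm_i) = sum_i C * (V_rst - 0)
    # which gives V_top = V_rst + mean(V_btm_i) for equal capacitors C.
    return v_rst + sum(v_btm) / n


# Four computing cells; two of the four products w_i * A_i equal 1.
v_top = procedure_i_top_voltage([1, 0, 1, 1], [1, 1, 0, 1])
```

With V.sub.rst = 0 and VDD = 1.0, two nonzero products out of four cells give V.sub.top = 0.5 in this model.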
[0082] In other embodiments, the MAC array may follow “Procedure II” below to perform an operation:
[0083] 1. Write filter parameters (or activations from computation in the previous network layer) into the sub-cell.
[0084] 2. Reset the top plate voltage V.sub.top of the computing capacitors to V.sub.rst through a reset switch S.sub.rst on the accumulation bus. S.sub.rst keeps the connection between V.sub.top and V.sub.rst.
[0085] 3. Reset the bottom plate voltages V.sub.btmi of the computing capacitors to 0 by raising the signal B.sub.i in every cell to VDD and keep the signals A.sub.ij and nA.sub.ij in every cell at 0 and VDD, respectively.
[0086] 4. During computation, activate the signals A.sub.ij and nA.sub.ij in a similar time-multiplexed manner.
[0087] 5. After the multiplication in each computing cell in a column is completed, either maintain the bottom plate voltages V.sub.btmi of the computing capacitors at 0, or raise them to VDD, and then disconnect S.sub.rst. With the bottom plate voltages V.sub.btmi being set to 0 or VDD, MOS switches in control means of the computing cells run a successive approximation algorithm for analog-to-digital conversion. As an example, if V.sub.btmi are all set to 0, the voltage V.sub.top can be expressed as:
[0088] where W.sub.ij represents the filter parameter in the j-th sub-cell in the i-th computing cell.
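Procedure II differs from Procedure I in that S.sub.rst stays closed while the products settle on the bottom plates, so the top plate is held at V.sub.rst during multiplication. A hedged sketch of the case in which all bottom plates are afterwards driven to 0 follows. It assumes N equal capacitors and no parasitics, under which charge conservation suggests V.sub.top = V.sub.rst − (VDD/N)×Σ(W.sub.ij×A.sub.ij). The function and parameter names are hypothetical.

```python
def procedure_ii_top_voltage(weights, inputs, vdd=1.0, v_rst=1.0):
    """Idealized V_top of one column after Procedure II with V_btm driven to 0."""
    n = len(weights)
    if n == 0 or n != len(inputs):
        raise ValueError("weights and inputs must be non-empty and equal in length")
    # Products are sampled on the bottom plates while V_top is held at V_rst.
    v_btm_sampled = [vdd * w * a for w, a in zip(weights, inputs)]
    # After S_rst opens, the top-node charge is fixed:
    #   sum_i C * (V_rst - V_btm_i) = sum_i C * (V_top - 0)
    # so driving every bottom plate to 0 lowers V_top by mean(V_btm_i).
    return v_rst - sum(v_btm_sampled) / n


v_top_ii = procedure_ii_top_voltage([1, 0, 1, 1], [1, 1, 0, 1], vdd=1.0, v_rst=1.0)
```

In this model the MAC result appears as a drop below V.sub.rst, which is the quantity a subsequent successive approximation would resolve.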
[0089] The MAC array may in particular be used in computation with multi-bit weights. In these cases, each column of computing cells performs a bit-wise MAC operation, and the multi-bit computation results can be obtained by performing shift-add operations on the digital representations resulting from analog-to-digital conversion. For example, in the case of k-bit weights or filter parameters, each column may perform a bit-wise MAC operation, e.g., the first column for the least significant bit (LSB) (i.e., performing a MAC operation between the 0-th bit values and the input signals) and the k-th column for the most significant bit (MSB) (i.e., performing a MAC operation between the (k-1)-th bit values and the input signals). It will be appreciated that each column separately performs a MAC operation for one bit of the multi-bit binary weights, and the MAC results of all the involved columns contain k elements, which are then subject to analog-to-digital conversion and shift-add operations in the digital domain.
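The shift-add recombination of bit-wise column results can be sketched as follows. The sketch assumes unsigned k-bit weights, 1-bit inputs, and an ideal analog-to-digital conversion that returns the exact count for each column; the function name is hypothetical.

```python
def bitwise_mac_shift_add(weights, inputs, k):
    """Multi-bit MAC computed from k bit-wise column MACs plus shift-add.

    weights: unsigned integers representable in k bits.
    inputs:  1-bit input signals (0 or 1), one per weight.
    """
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    if any(w < 0 or w >= (1 << k) for w in weights):
        raise ValueError("every weight must fit in k unsigned bits")
    # Column b stores bit b of every weight and accumulates bit-wise products.
    column_sums = [
        sum(((w >> b) & 1) * a for w, a in zip(weights, inputs))
        for b in range(k)
    ]
    # Digital shift-add: the result of column b is weighted by 2**b.
    return sum(s << b for b, s in enumerate(column_sums))


result = bitwise_mac_shift_add([5, 3, 6], [1, 0, 1], k=3)
```

Here the three columns yield the counts 1, 1 and 2 for bits 0, 1 and 2, and the shift-add gives 1 + 2 + 8 = 11, which equals 5×1 + 3×0 + 6×1.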
[0090] A differential version of the MAC array architecture may be used to reduce computation errors. In some embodiments, the MAC array further includes differential complementary transmission gates, differential computing capacitors and first PMOS transistors. In each computing cell of the MAC array, the output of the CMOS inverter consisting of the MOS transistors M.sub.3, M.sub.4 in each conventional 6T SRAM cell is connected to an input of a respective one of the differential complementary transmission gates, and all these differential complementary transmission gates connected to the respective CMOS inverters each consisting of the MOS transistors M.sub.3, M.sub.4 are connected at their outputs to a drain of a respective one of the first PMOS transistors. The drain of the respective first PMOS transistor is in turn connected to a bottom plate of a respective one of the differential computing capacitors, and a source thereof is connected to VDD. Differential multiplication results are stored as bottom plate voltages of the respective differential computing capacitors, and top plates of the differential computing capacitors of the differential computing cells in each column are connected to a respective differential accumulation bus. For ease of description, referring to
[0091] In some other embodiments of the differential MAC array architecture, the MAC array may further include first CMOS inverters and differential computing capacitors. In each computing cell of the MAC array, the outputs of all the complementary transmission gates are connected to an input of a respective one of the first CMOS inverters, and an output of the respective first CMOS inverter is connected to a bottom plate of a respective one of the differential computing capacitors. Likewise, referring to
[0092] It is to be noted that both the first and second differential cells are extensions of the above-discussed computing cells, and their naming is intended only to facilitate description of the circuit structures.
[0093] In a third aspect, there is provided a bit-width reconfigurable mixed-signal computing module. Referring to
[0094] It will be appreciated that the module described herein, when used in a neural network to perform MAC operations, is typically able to pre-load the necessary filter parameters (weights) at once because it contains more memory elements (i.e., 6T SRAM cells) per unit area. After computation in one layer is completed, the output partial sums or final activations (feature maps) directly usable in computation in the next network layer can be immediately subject to MAC operations with the filter parameters (weights) pre-loaded and stored in the module, saving both the time spent waiting for off-chip data movement and the power it consumes. In addition, the high throughput of the module can improve on-chip storage capabilities. For example, apart from the filter parameters, the memory cells in the MAC array can also be used to store the output activations (feature maps) of the same network layer.
[0095] It will be appreciated that, in addition to the sharing of transistors and computing capacitors within the computing cells and MAC array as described above in the first and second aspects, in fact, the computing cells also share some transistors and other devices involved in the analog-to-digital conversion and digital processing in other regions of the module than the MAC array.
[0096] According to the present invention, the ADC block may be parallel capacitive SAR ADCs for converting the top plate voltages V.sub.top, output column-wise from the computing cells, to their digital representations. Each of the SAR ADCs may include a MAC DAC, a SAR DAC, a comparator, a switching sequence and SAR logic for controlling the switching sequence. Compared to SAR ADCs of other types, such as resistive and hybrid resistive-capacitive, the parallel capacitive SAR ADCs make fuller use of the inventive structures, resulting in a reduced number of components and a reduced area. The MAC DAC is composed of the parallel capacitors in a respective column of computing cells in the MAC array. It will be appreciated that the output voltage of the MAC DAC is V.sub.top. The SAR DAC includes (B+1) parallel capacitors, where B=log.sub.2 N and N is the number of capacitors in the MAC DAC. These are B capacitors with capacitances binarily decreasing from an MSB one to an LSB one, and a redundant capacitor of the same capacitance as the LSB capacitor. As an example, when the number of capacitors in the MAC DAC is N=8, then B=3, the capacitance of the MSB capacitor C.sub.B-1 is C, the capacitance of the second MSB capacitor C.sub.B-2 is C/2, and the capacitance of the LSB capacitor C.sub.0 is C/4. In this case, a reference voltage of the SAR DAC is allocated to the MSB to LSB capacitors at ratios of 1/2, 1/4 and 1/8, respectively, and the capacitance of the redundant capacitor C.sub.U is C/4. The B capacitors and the redundant capacitor are connected in parallel at one end, and the other ends of the B capacitors are connected to the switching sequence, with the other end of the redundant capacitor being always grounded. A free end of the switching sequence includes a VDD terminal and a ground terminal. The SAR logic controls the switching sequence.
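The capacitor sizing in the N=8 example can be checked with a short sketch. It assumes that N is a power of two, so that B=log.sub.2 N is an integer, and it expresses capacitances in units of the MSB capacitance C; the function name is hypothetical.

```python
from fractions import Fraction


def sar_dac_capacitors(n, c_msb=Fraction(1)):
    """Binary-weighted SAR DAC capacitors for a MAC DAC with n capacitors.

    Returns (weighted, redundant, ratios): the B capacitances from MSB to LSB,
    the redundant capacitance, and each weighted capacitor's share of the
    total capacitance (its share of the reference voltage).
    """
    b = n.bit_length() - 1  # B = log2(N) when N is a power of two
    if n < 2 or (1 << b) != n:
        raise ValueError("n is assumed to be a power of two, n >= 2")
    weighted = [c_msb / (1 << i) for i in range(b)]  # C, C/2, ..., C/2**(B-1)
    redundant = weighted[-1]  # same capacitance as the LSB capacitor
    total = sum(weighted) + redundant
    ratios = [c / total for c in weighted]
    return weighted, redundant, ratios


weighted, redundant, ratios = sar_dac_capacitors(8)
```

For N=8 this reproduces the values in the paragraph above: capacitances C, C/2 and C/4, a redundant capacitor of C/4, and reference-voltage ratios of 1/2, 1/4 and 1/8.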
[0097] In one embodiment, as shown in
[0098] In another embodiment, referring to
[0099] In the two embodiments shown in
[0100] Another embodiment allows the reuse of the MAC DAC as the SAR DAC via bottom-plate sampling. As shown in
[0101]
[0102] In one embodiment, the SAR ADC for each column has a bit-width that is determined in real time by the sparsity of input data and values stored in the column. In this way, the number of capacitors in the binarily weighted capacitor array that need to be charged or discharged during analog-to-digital conversion may be greatly reduced on average, thus significantly reducing the power consumed during analog-to-digital conversion. In particular, as shown in
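The sparsity-dependent bit-width can be sketched using the expression recited in claim 14, ceil(log.sub.2(min(X,W)+1)), where X is the number of 1s in the 1-bit input vector and W is the number of 1s stored in the column. The sketch below is a software illustration only; the claim states that the expression is implemented by digital combinatorial logic, and the function name is hypothetical.

```python
import math


def sar_bit_width(input_bits, column_bits):
    """Bit-width needed by the SAR ADC for one column, given the data sparsity."""
    x = sum(input_bits)   # X: number of 1s in the 1-bit input vector
    w = sum(column_bits)  # W: number of 1s stored in the column
    # The column sum cannot exceed min(X, W), so this many bits suffice.
    return math.ceil(math.log2(min(x, w) + 1))


bits = sar_bit_width([1, 0, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0])
```

In this example X=3 and W=5, so the accumulated sum is at most 3 and 2 bits suffice instead of the 3 bits that B=log.sub.2 N gives for a column of N=8 capacitors. For a non-negative integer m, ceil(log.sub.2(m+1)) equals the number of bits in the binary representation of m (`m.bit_length()` in Python), which hints at why a purely combinatorial implementation is feasible.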
[0103] It is worth noting that the boundaries of the various blocks and modules included in the foregoing embodiments have been defined only based on their functional logic, and the present invention is not so limited, as alternate boundaries can be defined as long as the specified functions are appropriately performed. Also, specific names of the various functional components are intended to distinguish between these components rather than limit the scope of the present invention in any way.
[0104] The foregoing description presents merely preferred embodiments of the present invention and is not intended to limit the scope of the present invention in any way. Any and all changes, equivalent substitutions, modifications and the like made within the spirit and principles of the present invention are intended to be embraced in the scope thereof.