COMPACT AND PVT-ROBUST PROCESSING-IN-MEMORY MACRO WITH ACCURATE ANALOG SHIFT-AND-ADD
20250156149 · 2025-05-15
Assignee
Inventors
CPC classification
G06F17/16
PHYSICS
International classification
G06F7/527
PHYSICS
Abstract
A processing-in-memory (PIM) macro device and a method are disclosed. The PIM macro device includes a plurality of capacitor-based digital-to-analog converters (C-DACs) and a plurality of multiply-and-add (MAC) units. Each MAC unit includes a plurality of slices, where each slice comprises a plurality of clusters, and where each cluster includes a 6-transistor (6T) static random-access memory (SRAM) cell and a MAC module. Each MAC unit further includes a partial-sum combiner (P-Sum Combiner), an analog-to-digital converter (ADC), a Share Line, a MAC Line, a plurality of wordlines (WLs), and a local bitline (LBL). The PIM macro device further includes an array of metal-oxide-metal (MOM) capacitors, where the MOM capacitors are shared between the C-DACs and the MAC units, and an array of switches configured to be controlled to configure the MOM capacitors to perform a first operation and to reconfigure the MOM capacitors to perform a second operation.
Claims
1. A processing-in-memory (PIM) macro device comprising: a plurality of capacitor-based digital-to-analog converters (C-DACs), wherein the C-DACs transform a digital input into an analog voltage; a plurality of multiply-and-add (MAC) units, each MAC unit comprising: a plurality of slices, wherein each slice comprises a plurality of clusters, wherein each cluster in the plurality of clusters comprises a 6-transistor (6T) static random-access memory (SRAM) cell and a MAC module; a partial-sum combiner (P-Sum Combiner) that performs a shift-and-add operation across multiple slices within the MAC unit; an analog-to-digital converter (ADC) configured to convert a final output voltage from the P-Sum Combiner into a digital output; and a Share Line, a MAC Line, a plurality of wordlines (WLs), and a local bitline (LBL); an array of metal-oxide-metal (MOM) capacitors configured to store a charge, each capacitor comprising a top plate and a bottom plate, wherein the MOM capacitors are shared between the C-DACs and the MAC units; and an array of switches configured to be controlled to configure the MOM capacitors to perform a first operation and to reconfigure the MOM capacitors to perform a second operation.
2. The PIM macro device of claim 1, wherein the plurality of C-DACs are integrated in-situ with the plurality of MAC units, wherein the ADC comprises a time-domain ADC.
3. The PIM macro device of claim 1, wherein the array of switches reconfigure the MOM capacitors to perform a pre-charging operation comprising: setting the top plate of the MOM capacitors to a ground voltage; setting the MAC Line to a VDD voltage; and setting the Share Line to a ground voltage.
4. The PIM macro device of claim 1, wherein the array of switches reconfigure the MOM capacitors to perform a digital-to-analog operation comprising: if a bit value of the digital input is equal to 1: setting the top plate of the MOM capacitors to a VDD voltage; if a bit value of the digital input is equal to 0: setting the top plate of the MOM capacitors to a ground voltage; sharing a charge stored in the top plate of the MOM capacitors between one or more MAC modules using the Share Line; and setting the bottom plate of the MOM capacitors to a ground voltage using the MAC Line.
5. The PIM macro device of claim 1, wherein the array of switches reconfigure the MOM capacitors to perform a multiplication operation comprising: activating one of the plurality of WLs; and setting a voltage of the MOM capacitors based on a value of a weight stored in the 6T SRAM cell.
6. The PIM macro device of claim 1, wherein the array of switches reconfigure the MOM capacitors to perform an accumulation operation comprising: setting the top plate of the MOM capacitors to a ground voltage; and sharing a charge stored in the MOM capacitors between one or more MAC modules using the MAC Line.
7. The PIM macro device of claim 1, wherein the array of switches reconfigure the MOM capacitors to perform a shift-and-add operation comprising: disconnecting one or more MAC Lines; connecting one or more MAC modules using the P-Sum Combiner; and transmitting the final output voltage to the ADC.
8. The PIM macro device of claim 1, wherein the ADC comprises a voltage-to-time converter (VTC), a Time-to-Digital Converter (TDC), and a ring oscillator (RO).
9. The PIM macro device of claim 1, wherein the array of switches comprises: a first switch (S.sub.CH) shared across one or more MAC modules using the MAC Line; a second switch (S.sub.RT) shared across the one or more MAC modules using the Share Line; a third switch (S.sub.SL) disposed within the MAC module; a fourth switch (S.sub.SA) configured to disconnect the MAC Line; a fifth switch (K.sub.1) disposed within the MAC module and controlled by a bit value of the digital input; a sixth switch (M.sub.1) disposed within the MAC module and controlled by the LBL; and a seventh switch (S.sub.G) connected to the LBL and a global bitline (GBL).
10. The PIM macro device of claim 1, wherein the array of switches comprises an N-channel metal-oxide semiconductor (NMOS), a p-channel metal-oxide semiconductor (PMOS), or a transmission gate.
11. The PIM macro device of claim 1, wherein each MAC unit comprises a shift-and-add circuit.
12. The PIM macro device of claim 1, wherein each of the plurality of MAC units performs vector-vector multiplication, wherein the PIM macro device performs matrix-vector multiplication.
13. The PIM macro device of claim 1, wherein the PIM macro device comprises a global bit line (GBL), control line drivers, and SRAM read and write periphery circuits.
14. The PIM macro device of claim 1, wherein each cluster stores a weight in the 6T SRAM cell and activates one of the plurality of WLs during one or more operations.
15. The PIM macro device of claim 1, wherein each MAC unit comprises a dummy p-channel metal-oxide semiconductor (PMOS) with a drain and a source, wherein each MAC unit comprises a thin-cell layout, wherein the PIM macro device is fabricated using complementary metal-oxide semiconductor (CMOS) technology.
16. A method for operating a processing-in-memory (PIM) macro device, comprising: transforming a digital input into an analog voltage using a plurality of capacitor-based digital-to-analog converters (C-DACs), wherein the C-DACs comprise an array of metal-oxide-metal (MOM) capacitors configured to store a charge, each capacitor comprising a top plate and a bottom plate, wherein the MOM capacitors are shared between the C-DACs and a plurality of PIM multiply-and-add (MAC) units; controlling an array of switches to configure the MOM capacitors to perform a pre-charging operation comprising: setting the top plate of the MOM capacitors to a ground voltage; setting a MAC Line to a VDD voltage; and setting a Share Line to a ground voltage; and controlling the array of switches to reconfigure the MOM capacitors to perform a digital-to-analog operation comprising: setting the top plate of the MOM capacitors to a voltage determined based on a bit value of the digital input; sharing a charge stored in the top plate of the MOM capacitors between one or more MAC modules using the Share Line; and setting the bottom plate of the MOM capacitors to a ground voltage using the MAC Line.
17. The method of claim 16, further comprising: controlling the array of switches to reconfigure the MOM capacitors to perform a multiplication operation between the analog voltage and a weight stored in a 6-transistor (6T) static random-access memory (SRAM) cell, the multiplication operation comprising: activating one of a plurality of wordlines (WLs) in the 6T SRAM cell; and setting a voltage of the MOM capacitors based on a value of a weight stored in the 6T SRAM cell.
18. The method of claim 16, further comprising: controlling the array of switches to reconfigure the MOM capacitors to perform an accumulation operation comprising: setting the top plate of the MOM capacitors to a ground voltage; and sharing the charge stored in the MOM capacitors between one or more MAC modules using the MAC Line.
19. The method of claim 16, further comprising: controlling the array of switches to reconfigure the MOM capacitors to perform a shift-and-add operation comprising: disconnecting one or more MAC Lines; and connecting one or more MAC modules using a P-Sum Combiner.
20. The method of claim 19, further comprising: obtaining a final output voltage from the P-Sum Combiner; transmitting the final output voltage to an analog-to-digital converter (ADC); and converting the final output voltage into a digital output.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0007] Specific embodiments of the disclosed technology will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency.
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
DETAILED DESCRIPTION
[0025] In the following detailed description of embodiments of the disclosure, numerous specific details are set forth in order to provide a more thorough understanding of the disclosure. However, it will be apparent to one of ordinary skill in the art that the disclosure may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
[0026] Throughout the application, ordinal numbers (e.g., first, second, third, etc.) may be used as an adjective for an element (i.e., any noun in the application). The use of ordinal numbers is not to imply or create any particular ordering of the elements nor to limit any element to being only a single element unless expressly disclosed, such as using the terms before, after, single, and other such terminology. Rather, the use of ordinal numbers is to distinguish between the elements. By way of an example, a first element is distinct from a second element, and the first element may encompass more than one element and succeed (or precede) the second element in an ordering of elements.
[0027] It is to be understood that the singular forms a, an, and the include plural referents unless the context clearly dictates otherwise. For example, a capacitor may include any number of capacitors without limitation. Terms such as approximately, substantially, etc., mean that the recited characteristic, parameter, or value need not be achieved exactly, but that deviations or variations, including for example, tolerances, measurement error, measurement accuracy limitations and other factors known to those of skill in the art, may occur in amounts that do not preclude the effect the characteristic was intended to provide.
[0028] In the following description of
[0029] Analog processing-in-memory (PIM) in static random-access memory (SRAM) is promising for accelerating deep learning inference by circumventing the memory wall and exploiting ultra-efficient analog low-precision arithmetic. Recent analog PIM designs attempt bit-parallel schemes for multi-bit analog matrix-vector multiplication (MVM), aiming at higher energy efficiency, higher throughput, and simpler, more robust training than conventional bit-serial methods, which digitally shift-and-add multiple partial analog computing results. However, bit-parallel operations require more complex analog computations and are more sensitive to well-known analog PIM challenges, including large cell areas, inefficient and inaccurate multi-bit analog operations, and vulnerability to Process, Voltage, and Temperature (PVT) variations. Overall, an ideal PIM macro design should combine a compact cell array and periphery, achieve multi-bit MVM with high accuracy and PVT robustness, and eliminate power-consuming analog buffers.
[0030] Embodiments disclosed herein generally relate to a PVT-robust and compact PIM SRAM macro with charge-domain bit-parallel computation. Specifically, the PIM macro device adopts (1) a charge-domain 4-bit multiply-and-add (MAC) module with a 6T-thin-cell-compatible layout, (2) an accurate in-situ charge-domain shift-and-add circuit, (3) a PVT-robust in-situ capacitive DAC (C-DAC) without power-consuming analog buffers, and (4) a compact and low-power dual-threshold time-domain ADC with power gating of the continuous comparator and D-flip-flops (DFFs). The terms PVT-robust and PVT-insensitive as used herein mean the same and may be used interchangeably to refer to reusing the same set of capacitors embedded in the PIM macro. Further, the term in-situ as used herein may be interpreted to mean embedded and charge-sharing. All analog computing modules, including capacitor-based digital-to-analog converters (DACs), MAC units, analog shift-and-add circuits, and analog-to-digital converters (ADCs) disclosed herein reuse one set of local metal-oxide-metal (MOM) capacitors inside the array, performing in-situ computation to save area and enhance accuracy. A compact 8.5-bit dual-threshold time-domain ADC power gates the main path most of the time, leading to a significant energy reduction. Depictions of various configurations of the PIM macro and methods of its use are provided in
[0031]
[0032] As previously stated, the building block of the PIM macro (100) is the cluster (106).
[0033] The MAC module (204) performs charge-domain MAC of a 4-bit digital input (108) and a 4-bit weight (W.sub.i) and includes an array of switches: K.sub.1, M.sub.1, S.sub.G, S.sub.SL, S.sub.CH and S.sub.RT. The K.sub.1 switch is controlled by a bit from the 4-bit digital input (108). The multiplier switch M.sub.1 is controlled by a local bitline (LBL) (208). S.sub.CH and S.sub.RT are shared horizontally (i.e., row-wise) via a MAC Line (112) and vertically (i.e., column-wise) via a Share Line (210), respectively. In accordance with one or more embodiments, the array of switches may be implemented using an N-channel metal-oxide semiconductor (NMOS) transistor, a p-channel metal-oxide semiconductor (PMOS) transistor, or a transmission gate. For simplicity, the wordline and bitline for the access transistors on the right side of the 6T SRAM cells (202), which are only used for normal read/write, are omitted in
[0034] Continuing with
[0035] As previously stated, the PIM macro (100) adopts a multi-bit thin-cell MAC module (204) that shares the same transistor layout as the most compact 6T SRAM cell (202), differing only in metal connections.
[0036]
[0037]
[0038]
[0039] In
[0040] Embodiments disclosed herein achieve a total capacitance almost double that of bit-serial (BS) counterparts, significantly reducing the thermal noise and the current source noise from the VTC (602). Further, embodiments disclosed herein achieve superior voltage scalability (down to 0.65 V) and an ultra-compact area. In addition, with a shared RO (606), each ADC (116) occupies an area of 387.9 square micrometers (μm.sup.2), and the ADCs together account for only 4.6% of the PIM macro (100) area. Sharing the RO (606) also benefits the phase noise and linearity, since the stage delays can be upsized with little area or energy penalty. The local registers that dominate the TDC (604) area utilize a custom true single-phase clocked (TSPC) structure which is 65% smaller than a standard-cell DFF, leading to further area reduction.
[0041] As previously noted, the key to the embedded capacitive computation is the repeated use of a single set of MOM capacitors for all analog tasks, including the C-DAC, analog MAC, analog shift-and-add and ADC, without extra peripheral circuitry. Throughout the entire analog processing chain, transistors act only as switches for fully charge-domain operations, eliminating PIM macro sensitivity to PVT variations of transistors. This approach is crucial for reducing area, mitigating computing nonlinearity, and eliminating buffering and sampling circuits. Meanwhile, despite the various capacitor configurations for different tasks, the overhead of the computing circuitry in the array is kept to a minimum because it adopts a 6T-thin-cell-compatible layout.
[0042]
[0043] During the DAC phase 1 (DAC-P1) (804) and DAC phase 2 (DAC-P2) (806), the embedded C-DAC (110) takes advantage of all the MOM capacitors in a column, functioning as a reference generator, and samples the output voltage on the top plates of the MOM capacitors, while their bottom plates are grounded via MAC Lines (112). Specifically, during DAC-P1, S.sub.SL and S.sub.RT are set to a high (i.e., conducting) state to reset the MOM capacitors. Then, S.sub.SL and S.sub.RT are set to a low (i.e., non-conducting) state, and each bit of the 4-bit digital input (108) controls the switches K.sub.1 in its corresponding slice. The top plates of the MOM capacitors are either set to V.sub.IN, if the bit is 1, or kept at the reset voltage V.sub.R, if the bit is 0. During DAC-P2, with S.sub.SL set to a high (i.e., conducting) state, S.sub.RT set to a low (i.e., non-conducting) state, and the switches K.sub.1 turned off, the charge is shared through the Share Line (210) and the output voltage is sampled on the MOM capacitors.
[0044] As shown in
[0045] Keeping with
[0046] In accordance with one or more embodiments,
[0047] In accordance with one or more embodiments,
[0048] In accordance with one or more embodiments,
[0049]
[0050] All analog components in the computing path, including the C-DAC (110), MAC units (102), shift-and-add circuits (502), and ADC (116), contribute to the nonidealities of the PIM macro (100).
[0051]
[0052]
[0053] As previously stated, because its analog computing path is based solely on passive components, the PIM macro (100) disclosed herein achieves superior tolerance of PVT variations. The ADC (116) also scales well with supply voltage.
[0054] Embodiments disclosed herein achieve a weight storage density of 559 Kb/mm.sup.2 and exceptional robustness to temperature and voltage variations (40 to 105° C. and 0.65 to 1.2 V) among SRAM-based analog PIM designs. Further, including all the extra area for PIM, the memory density of the PIM macro (100) disclosed herein is only 31% lower than a logic-rule 6T SRAM cell (202), similar to that of an 8T SRAM. In addition, the PIM macro (100) achieves 3.6 memory density. In practice, embodiments disclosed herein are especially beneficial to PIM systems targeting fully on-chip weight storage for medium-sized models in ultra-low-power edge devices.
[0055]
[0056] In Block 1402, the 4-bit digital input (108) is transformed into an analog voltage using a plurality of C-DACs (110). The C-DAC (110) achieves a smaller area overhead by reusing MOM capacitors in the memory array as a capacitive voltage divider. The MOM capacitors are shared between the C-DACs (110) and the MAC units (102) and include a top plate and a bottom plate. Further, the MOM capacitors also sample the output voltage of the C-DAC (110) so that no extra analog output buffers are required. Inside the C-DAC (110), 32 clusters (106) combine into a column with a Share Line (210) connected together. To realize the embedded C-DAC (110), one memory column is divided into 4 slices (104). The switches K.sub.1 in each slice (104) are controlled by a different bit from the 4-bit digital input (108). The number of clusters in a slice (104) (i.e., 16, 8, 4, and 2) represents the weight of the corresponding digital input (108) bit.
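The capacitive voltage division described above can be illustrated with a minimal Python sketch. It is an idealized model, not the disclosed circuit: it assumes equal unit capacitors, ideal switches, no parasitics, and that every capacitor in the 32-cluster column that is not driven to V.sub.IN sits at the reset voltage V.sub.R before sharing. The function name `cdac_output` and the default voltages are hypothetical.

```python
from typing import Sequence

SLICE_SIZES = (16, 8, 4, 2)  # clusters per slice, MSB first (from the description)
COLUMN_CLUSTERS = 32         # clusters tied to one Share Line (from the description)


def cdac_output(bits: Sequence[int], v_in: float = 1.0, v_r: float = 0.0) -> float:
    """Ideal top-plate voltage after charge sharing on the Share Line.

    Assumptions (not from the source): equal unit capacitors, ideal switches,
    and every capacitor not driven to v_in stays at v_r before sharing.
    """
    if len(bits) != len(SLICE_SIZES) or any(b not in (0, 1) for b in bits):
        raise ValueError("expected four bits (0 or 1), MSB first")
    # Number of unit capacitors whose top plate was set to v_in via K1.
    driven = sum(size for bit, size in zip(bits, SLICE_SIZES) if bit)
    # Total charge (in units of C_unit) is conserved when the plates are shorted.
    total_charge = driven * v_in + (COLUMN_CLUSTERS - driven) * v_r
    return total_charge / COLUMN_CLUSTERS


if __name__ == "__main__":
    for code in range(16):
        bits = [(code >> shift) & 1 for shift in (3, 2, 1, 0)]
        print(code, cdac_output(bits))
```

Under these assumptions, and with V.sub.R at 0, the shared voltage equals the input code divided by 16, times V.sub.IN, because the slice sizes sum to twice the code value and the column holds 32 unit capacitors.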
[0057] In Block 1404, the array of switches configures the MOM capacitors to perform a pre-charging operation (PCH). During PCH (802), the top plates of the MOM capacitors, MAC Lines (112), and Share Lines (210) are initialized to ground, VDD, and ground, respectively.
[0058] In Block 1406, the array of switches reconfigures the MOM capacitors to perform a digital-to-analog operation (DAC-P1 and DAC-P2). During DAC-P1 (804) and DAC-P2 (806), the embedded C-DAC (110) takes advantage of all the MOM capacitors in a column, functioning as a reference generator, and samples the output voltage on the top plates of the MOM capacitors, while their bottom plates are grounded via MAC Lines (112). Specifically, during DAC-P1, S.sub.SL and S.sub.RT are set to a high (i.e., conducting) state to reset the MOM capacitors. Then, S.sub.SL and S.sub.RT are set to a low (i.e., non-conducting) state, and each bit of the 4-bit digital input (108) controls the switches K.sub.1 in its corresponding slice. The top plates of the MOM capacitors are either set to V.sub.IN, if the bit is 1, or kept at the reset voltage V.sub.R, if the bit is 0. During DAC-P2, with S.sub.SL set to a high (i.e., conducting) state, S.sub.RT set to a low (i.e., non-conducting) state, and the switches K.sub.1 turned off, the charge is shared through the Share Line (210) and the output voltage is sampled on the MOM capacitors.
[0059] In Block 1408, the array of switches reconfigures the MOM capacitors to perform a multiplication operation between the analog voltage and a weight stored in a 6-transistor (6T) static random-access memory (SRAM) cell. During the multiplication (Mul.) operation (808), one of the WLs (206) is activated to engage M.sub.1 and, depending on the data stored in the 6T SRAM cell (202), the MOM capacitors either discharge entirely or maintain their voltages. As a result, M.sub.1 will either be turned on to reset the MOM capacitor, which is equivalent to multiplying the input by logic 0, or remain off, which is equivalent to multiplying by logic 1.
[0060] In Block 1410, the array of switches reconfigures the MOM capacitors to perform an accumulation operation. During the accumulation (Acc.) operation (810), S.sub.SL and S.sub.RT are set to a high (i.e., conducting) state, grounding the top plates and causing charge sharing across the MOM capacitors connected to the same MAC Line (112) in a given row.
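The multiplication and accumulation operations can be sketched together in a few lines of Python. This is an idealized illustration only: it assumes equal unit capacitors and ideal switches, and it tracks only the magnitude of the shared voltage, ignoring the sign and any offset that the actual top-plate and bottom-plate arrangement may introduce. The function names `multiply` and `accumulate` are hypothetical.

```python
from typing import List, Sequence


def multiply(v_caps: Sequence[float], weight_bits: Sequence[int]) -> List[float]:
    """Mul. phase: a stored 0 discharges the capacitor, a stored 1 keeps its voltage."""
    if len(v_caps) != len(weight_bits):
        raise ValueError("one weight bit is required per capacitor")
    return [v if w else 0.0 for v, w in zip(v_caps, weight_bits)]


def accumulate(v_caps: Sequence[float]) -> float:
    """Acc. phase: equal capacitors on one MAC Line share charge.

    With equal capacitances (an assumption), charge conservation gives the
    arithmetic mean of the individual capacitor voltages.
    """
    if not v_caps:
        raise ValueError("at least one capacitor is required")
    return sum(v_caps) / len(v_caps)


if __name__ == "__main__":
    inputs = [0.5, 0.25, 1.0, 0.75]   # sampled C-DAC voltages (illustrative values)
    weights = [1, 0, 1, 0]            # one stored weight bit per cluster
    print(accumulate(multiply(inputs, weights)))  # 0.375
```

In this model the accumulated voltage is the dot product of the input voltages and the weight bits, scaled by the number of capacitors on the MAC Line.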
[0061] In Block 1412, the array of switches reconfigures the MOM capacitors to perform a shift-and-add operation. During the charge-domain shift-and-add (S.A.) operation (812), enabled by S.sub.SA, the shift-and-add circuit (502) reuses (i.e., reconfigures) the local MOM capacitors and conducts weighted charge-sharing across neighboring rows. In addition, during the S.A. operation (812), and after the charge-sharing-based accumulation is finished, the switches in the P-Sum Combiner (i.e., S.sub.SA) are in a high state (i.e., conducting) to turn the separation switches (
[0062] In Block 1414, a final output voltage is obtained from the P-Sum Combiner (114). In Block 1416, the P-Sum Combiner (114) transmits the final output voltage to the analog-to-digital converter (ADC) (116) for digitization.
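The weighted charge sharing behind the final output voltage can be illustrated with a short Python sketch. The description states only that the sharing is weighted; the binary capacitance ratio of 8:4:2:1 used here is an assumption chosen to mimic a 4-bit shift-and-add, and the function name `shift_and_add` is hypothetical. Capacitors and switches are treated as ideal.

```python
from typing import Sequence


def shift_and_add(partial_sums: Sequence[float],
                  cap_weights: Sequence[float] = (8, 4, 2, 1)) -> float:
    """Charge-share partial-sum voltages held on capacitor groups.

    cap_weights gives the relative capacitance of each group, MSB first.
    The 8:4:2:1 ratio is an assumption, not a value from the source.
    """
    if len(partial_sums) != len(cap_weights):
        raise ValueError("one partial sum is required per capacitor group")
    total_cap = sum(cap_weights)
    # Q = sum(C_k * V_k) is conserved; the shared voltage is Q / sum(C_k).
    total_charge = sum(c * v for c, v in zip(cap_weights, partial_sums))
    return total_charge / total_cap


if __name__ == "__main__":
    print(shift_and_add([0.3, 0.1, 0.2, 0.4]))
```

Under this assumption the shared voltage equals the digital shift-and-add result (8·p3 + 4·p2 + 2·p1 + p0) divided by 15, so the binary weighting is applied passively, without an active amplifier or digital adder.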
[0063] In Block 1418, the ADC (116) converts the final output voltage into a digital output. In some embodiments, the ADC (116) is an 8.5-bit dual-threshold time-domain (TD) ADC (116). The ADC (116) includes a voltage-to-time converter (VTC) (602), a Time-to-Digital Converter (TDC) (604), and a ring oscillator (RO) (606). In accordance with one or more embodiments, the RO (606) is a global 8-phase differential RO. The VTC (602) discharges the capacitors attached to the MAC Lines (112) until their voltage reaches the threshold voltage of the zero detector (Cmp1), thus converting the output voltage (Vcap) into a pulse. Due to the shift-and-add circuits (502), the integration capacitor of the VTC (602) is the combination of MOM capacitors from four slices (104). In one or more embodiments, the TDC (604) adopts a compact folding-flash TDC topology to avoid the exponentially increased area of conventional flash TDCs. The local registers sample the phases of the RO (606) to generate the 3-bit fine results, and the local counter triggered by one of the phases in the RO (606) generates the 6-bit coarse results. In some embodiments, the local registers that dominate the TDC (604) area utilize a custom true single-phase clocked (TSPC) structure. The RO (606) is free running to avoid a long settling time, while being synchronized to the ADC (116) start signal (S.sub.AD) to prevent an uncertain initial state. A safe-stop mechanism synchronizes the counter's Stop and Trigger signals, preventing possible MSB errors caused by a wrong count when the two signals collide. A second low-power comparator (Cmp2) is added to power gate Cmp1 and the TSPCs. Cmp2 is auto-zeroed by S.sub.AZ before conversion. Cmp2 has a slightly higher threshold (set by Vref) than Cmp1, so that the main path of the ADC (116) is disabled most of the time to save power.
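The voltage-to-time and time-to-digital steps can be sketched in Python as follows. This is a behavioral illustration under stated assumptions, not the disclosed circuit: the VTC is modeled as an ideal constant-current discharge, each RO period is assumed to be resolved into 8 equal fine steps (matching the 3-bit fine result), and the dual-threshold power gating and safe-stop mechanism are not modeled. All function names and numeric values are hypothetical.

```python
from typing import Tuple


def vtc_pulse_width(v_cap: float, v_th: float, c_int: float, i_dis: float) -> float:
    """Time for a constant current i_dis to discharge c_int from v_cap to v_th.

    Assumes an ideal current source, so t = C * (Vcap - Vth) / I.
    """
    if c_int <= 0 or i_dis <= 0:
        raise ValueError("capacitance and current must be positive")
    return max(v_cap - v_th, 0.0) * c_int / i_dis


def tdc_code(pulse_width: float, ro_period: float,
             phases: int = 8, coarse_bits: int = 6) -> Tuple[int, int, int]:
    """Quantize a pulse into (coarse, fine, combined) counts.

    coarse counts whole RO periods (the counter), fine counts the phase
    within the current period (the sampled RO phases).
    """
    lsb = ro_period / phases                       # one fine step
    ticks = int(pulse_width // lsb)
    ticks = min(ticks, (1 << coarse_bits) * phases - 1)  # saturate at full scale
    coarse, fine = divmod(ticks, phases)
    return coarse, fine, ticks


if __name__ == "__main__":
    width = vtc_pulse_width(v_cap=1.0, v_th=0.5, c_int=42.0, i_dis=1.0)  # 21.0
    print(tdc_code(width, ro_period=8.0))  # (2, 5, 21)
```

A 6-bit coarse count combined with a 3-bit fine count spans 512 codes in this sketch; the 8.5-bit resolution stated for the ADC (116) corresponds to a smaller usable range, which the sketch does not attempt to reproduce.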
[0064] Although only a few example embodiments have been described in detail above, those skilled in the art will readily appreciate that many modifications are possible in the example embodiments without materially departing from this invention. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the following claims.