PARALLEL PROCESSOR IN ASSOCIATIVE CONTENT ADDRESSABLE MEMORY
20210019147 · 2021-01-21
Inventors
CPC classification
G06F7/57
PHYSICS
G06F9/30036
PHYSICS
International classification
G06F9/30
PHYSICS
G06F7/57
PHYSICS
Abstract
A parallel processor in associative content-addressable memory (PPAC) is provided. Processing in memory (PIM) moves computation into memories with the goal of improving throughput and energy-efficiency compared to traditional von Neumann-based architectures. Most existing PIM architectures are either general-purpose but only support atomistic operations, or are specialized to accelerate a single task. The PPAC described herein provides a novel in-memory accelerator that supports a range of matrix-vector-product (MVP)-like operations that find use in traditional and emerging applications. PPAC is, for example, able to accelerate low-precision neural networks, exact/approximate hash lookups, cryptography, and forward error correction. The fully-digital nature of PPAC enables its implementation with standard-cell-based complementary metal-oxide-semiconductor (CMOS), which facilitates automated design and portability among technology nodes. A comparison with recent digital and mixed-signal PIM accelerators reveals that PPAC is competitive in terms of throughput and energy-efficiency, while accelerating a wide range of applications and simplifying development.
Claims
1. A parallel processor, comprising: a plurality of bit cells, each bit cell comprising: an addressable memory bit; and a first addressable logic operator connected to the addressable memory bit; and a first arithmetic logic unit (ALU) configured to compute a first function of outputs from the first addressable logic operators of the plurality of bit cells.
2. The parallel processor of claim 1, wherein the first function of the first ALU is configurable from a group of functions.
3. The parallel processor of claim 1, wherein the first ALU is further configured to compute a second function in parallel time with the first function.
4. The parallel processor of claim 1, wherein the first ALU is further configured to compute a second function in serial time with the first function.
5. The parallel processor of claim 1, wherein the first function of the first ALU comprises a programmable threshold function.
6. The parallel processor of claim 1, wherein: the plurality of bit cells comprises a first row of bit cells; the first ALU is connected to the first row of bit cells; and the first function of the first ALU comprises a matrix vector product operation.
7. The parallel processor of claim 6, wherein: the plurality of bit cells further comprises a second row of bit cells; and the parallel processor further comprises a second ALU connected to the second row of bit cells.
8. The parallel processor of claim 7, wherein the first ALU and the second ALU are configured to provide one of a single-bit matrix-vector product or a multi-bit matrix-vector product from a multi-bit matrix.
9. The parallel processor of claim 1, wherein each bit cell further comprises: a second addressable logic operator coupled to the addressable memory bit; and a multiplexer selectively providing a corresponding output from the first addressable logic operator or the second addressable logic operator.
10. The parallel processor of claim 9, wherein: the first addressable logic operator comprises XNOR; and the second addressable logic operator comprises XOR, OR, NOR, AND, or NAND.
11. The parallel processor of claim 9, wherein: the first addressable logic operator comprises an XNOR gate; the second addressable logic operator comprises an AND gate; and the multiplexer is coupled to the XNOR gate, the AND gate, and a control signal input.
12. The parallel processor of claim 11, wherein: the XNOR gate performs a first bitwise multiplication {−1, 1} between an input bit and a stored bit; and the AND gate performs a second bitwise multiplication {0, 1} between the input bit and the stored bit.
13. The parallel processor of claim 1, further comprising: a memory module comprising the addressable memory bits of the plurality of bit cells; and a processing module communicatively coupled to the memory module and comprising: the first ALU and additional ALUs; and the first addressable logic operators of the plurality of bit cells.
14. A memory device for performing logical operations, the memory device comprising: a plurality of rows of bit cells comprising content-addressable memory; and a plurality of row arithmetic logic units (ALUs), each row ALU being communicatively coupled to a respective one of the plurality of rows of bit cells; wherein each row ALU is configured to perform an operation on outputs from the respective one of the plurality of rows of bit cells.
15. The memory device of claim 14, wherein: each of the plurality of rows of bit cells is subdivided into a plurality of subrows; and each row ALU adds incoming local population counts of all subrows in the respective row and computes the operation based on a total population count of bit cell results for the respective row.
16. The memory device of claim 15, further comprising a plurality of subrow adders to perform a population count for each respective subrow of the plurality of subrows.
17. The memory device of claim 14, further comprising: a plurality of memory banks, each comprising a subset of the plurality of rows of bit cells; and a plurality of bank adders to sum an output from respective ALUs in each of the plurality of memory banks.
18. A method for performing logic functions in memory, the method comprising: performing a bit-wise logic operation on data stored on an array of addressable memory cells having M rows of N bits; and in parallel at each of the M rows, performing an arithmetic logic function on N corresponding outputs from the bit-wise logic operation.
19. The method of claim 18, wherein the bit-wise logic operation comprises a Hamming distance computation.
20. The method of claim 19, wherein: the bit-wise logic operation further comprises an inner-product computation; and the arithmetic logic function comprises a sum of the N corresponding outputs.
Description
BRIEF DESCRIPTION OF THE DRAWING FIGURES
[0030] The accompanying drawing figures incorporated in and forming a part of this specification illustrate several aspects of the disclosure, and together with the description serve to explain the principles of the disclosure.
DETAILED DESCRIPTION
[0042] The embodiments set forth below represent the necessary information to enable those skilled in the art to practice the embodiments and illustrate the best mode of practicing the embodiments. Upon reading the following description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure and the accompanying claims.
[0043] It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
[0044] It will be understood that when an element such as a layer, region, or substrate is referred to as being on or extending onto another element, it can be directly on or extend directly onto the other element or intervening elements may also be present. In contrast, when an element is referred to as being directly on or extending directly onto another element, there are no intervening elements present. Likewise, it will be understood that when an element such as a layer, region, or substrate is referred to as being over or extending over another element, it can be directly over or extend directly over the other element or intervening elements may also be present. In contrast, when an element is referred to as being directly over or extending directly over another element, there are no intervening elements present. It will also be understood that when an element is referred to as being connected or coupled to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being directly connected or directly coupled to another element, there are no intervening elements present.
[0045] Relative terms such as "below" or "above" or "upper" or "lower" or "horizontal" or "vertical" may be used herein to describe a relationship of one element, layer, or region to another element, layer, or region as illustrated in the Figures. It will be understood that these terms and those discussed above are intended to encompass different orientations of the device in addition to the orientation depicted in the Figures.
[0046] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" when used herein specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
[0047] Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms used herein should be interpreted as having a meaning that is consistent with their meaning in the context of this specification and the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
[0048] A parallel processor in associative content-addressable memory (PPAC) is provided. Processing in memory (PIM) moves computation into memories with the goal of improving throughput and energy-efficiency compared to traditional von Neumann-based architectures. Most existing PIM architectures are either general-purpose but only support atomistic operations, or are specialized to accelerate a single task. The PPAC described herein provides a novel in-memory accelerator that supports a range of matrix-vector-product (MVP)-like operations that find use in traditional and emerging applications. PPAC is, for example, able to accelerate low-precision neural networks, exact/approximate hash lookups, cryptography, and forward error correction.
[0049] The fully-digital nature of PPAC enables its implementation with standard-cell-based complementary metal-oxide-semiconductor (CMOS), which facilitates automated design and portability among technology nodes. To demonstrate the efficacy of PPAC, post-layout implementation results in 28 nm CMOS are described herein for different array sizes. A comparison with recent digital and mixed-signal PIM accelerators reveals that PPAC is competitive in terms of throughput and energy-efficiency, while accelerating a wide range of applications and simplifying development.
[0050] Section headings are used in the present document only for improving readability and do not limit the scope of techniques or embodiments described herein. As such, techniques and embodiments described under a given section heading may be combined together with teachings of different section headings.
I. Introduction
[0051] As described above, the ever-growing gap between computing performance and memory access times has led today's von Neumann-based computing systems to hit a so-called memory wall, where most of a system's bandwidth, energy, and time is consumed by memory operations. This problem is further aggravated with the rise of applications, such as machine learning, data mining, or Fifth Generation (5G) wireless systems, where massive amounts of data need to be processed at high rates and in an energy-efficient way.
A. Processing in Memory
[0052] PIM is an emerging computing paradigm that promises to tear down the memory wall. For example, PIM brings computation closer to memory, with the objective of reducing the time and energy of memory accesses, which ultimately increases a circuit's overall efficiency. The application of PIM to general-purpose processors has been explored recently. While such PIM-aided central processing units (CPUs) enable improved throughput and energy-efficiency for certain memory-intensive workloads, the supported PIM operations are typically limited to atomistic operations (such as bit-wise AND/NOR). As a consequence, executing even slightly more complex operations (such as multi-bit additions or multiplications) requires a repeated use of the supported PIM operations; this prevents such architectures from reaching the throughput and energy-efficiency required in many of today's applications.
[0053] Hence, a number of PIM-based application-specific integrated circuits (ASICs) have been explored recently. Such solutions generally excel in throughput and energy-efficiency, but have limited applicability, often accelerating a single task only. For example, a PIM-ASIC may be designed to accelerate neural network inference using mixed-signal techniques, but may suffer from effects caused by noise and process variation; this prevents its use in applications in which the least significant bit must be computed accurately (e.g., in cryptography, forward error correction, or locality-sensitive hashing).
B. Example Features of the Present Disclosure
[0054] While a range of PIM-based ASICs and CPUs have been proposed in recent years, no PIM-based solutions exist that simultaneously offer high flexibility and high efficiency. The PPAC described herein provides a PIM-based hardware solution (an in-memory accelerator that performs computation directly in the memory array, instead of moving data between memory and the data-path) with a novel, versatile in-memory parallel processor which supports a range of MVP-like operations. Embodiments of the PPAC are designed entirely in digital standard-cell-based CMOS, accelerate some of the key operations in a wide range of traditional and emerging applications, and achieve high throughput and energy-efficiency for the supported tasks.
[0055] In an exemplary aspect, the proposed architecture includes a two-dimensional array of latch-based bit cells that support two or more types of binary-valued operations (e.g., XNOR and AND); each row of the PPAC array is equipped with a row arithmetic-logic unit (ALU) that supports a variety of tasks, including content-addressable memory (CAM) functionality, Hamming-distance calculation, one- and multi-bit MVPs, Galois field of two elements (GF(2)) MVPs, and programmable logic array (PLA) functionality.
[0056] Embodiments described herein can be used (among many other examples) for (i) low-precision MVPs, which find use in spatial equalization of millimeter-wave (mmWave) or terahertz (THz) communications systems, (ii) low-precision neural networks, which find use in real-time inference on battery-powered devices or self-driving cars, (iii) cryptography acceleration, which finds use in secure processors, networking applications, or general-purpose processors, (iv) decoding of error-correcting codes, which are used in communications systems and storage devices (e.g., hard-drives or flash storage), (v) intelligent CAMs, which can be used for network routing or nearest neighbor search using locality sensitive hashing, and (vi) arbitrary Boolean functions, which find use in programmable hardware fabrics such as field-programmable gate arrays (FPGAs) or PLAs.
[0057] In Section II, example operating principles and architecture of PPAC are described. Section III details a number of operation modes and potential use cases for the PPAC. Section IV presents a post-layout implementation in 28 nanometer (nm) CMOS technology. Area, throughput, and energy-efficiency results of this implementation are compared with recent related accelerators. Section V describes an exemplary process for performing logic functions in memory according to embodiments of the PPAC. Section VI presents an embodiment of the PPAC implemented in a wireless communications system.
II. PPAC: Parallel Processor in Associative CAM
[0059] Similar to the traditional computing system 10 of
[0060] The processing device 28 represents one or more commercially available or proprietary general-purpose processing devices, such as a microprocessor, CPU, or the like. The processing device 28 is configured to execute processing logic instructions for performing operations and steps which may be stored in the memory 14. In this regard, the computing system 24 may be implemented with a microprocessor, FPGA, a digital signal processor (DSP), an ASIC, or other programmable logic device, a discrete gate or transistor logic, discrete hardware components, or any combination thereof.
[0061] The memory 14 may include non-volatile memory (e.g., read-only memory (ROM), erasable programmable ROM (EPROM), electrically EPROM (EEPROM), and the like) and volatile memory (e.g., random-access memory (RAM), such as dynamic RAM (DRAM) or synchronous DRAM (SDRAM)). In some examples, an operating system and/or any number of program modules or other applications can be stored in the memory 14, wherein the program modules represent a wide array of computer-executable instructions corresponding to programs, applications, functions, and the like that may implement the functionality described herein in whole or in part, such as through instructions on the processing device 28.
[0062] In an exemplary aspect, the memory 14 further includes a parallel processor, or PPAC 26, to flexibly perform computation directly in the memory 14, instead of moving all data between the memory and the processing device 28 for logic operations and/or other computation.
[0064] Below are described non-limiting examples of operating techniques of PPAC 26 and example PPAC 26 architectures. In what follows, the terms "word" and "vector" will be used interchangeably: an N-bit word can also be interpreted as a binary-valued vector of dimension N.
A. Examples of Operational Techniques
[0065] Embodiments of the PPAC 26 build upon CAMs, which are memory arrays that compare all of their M stored N-bit words a.sub.m, m=1, . . . , M, with an N-bit input word x to determine the set of stored words that match the input.
[0066] Conceptually, the functionality of a CAM can be described as a memory in which every bit cell contains an XNOR gate to determine whether the stored value a.sub.m,n matches the input bit x.sub.n, n=1, . . . , N. A match is then declared only if all the N bits in a.sub.m match with the N bits of the input x. Mathematically, the functionality of a CAM can be expressed in terms of the Hamming distance h(a.sub.m,x), which indicates the number of bits in which a.sub.m and x differ. A CAM declares a match between the stored word a.sub.m and the input word x if h(a.sub.m,x)=0.
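For illustration, the CAM matching rule just described can be modeled in a few lines of software (an explanatory sketch only; the function names are illustrative and do not correspond to any hardware signal):

```python
def hamming_distance(a: int, x: int, n_bits: int) -> int:
    """Number of bit positions in which two N-bit words differ."""
    return bin((a ^ x) & ((1 << n_bits) - 1)).count("1")

def cam_matches(stored_words: list[int], x: int, n_bits: int) -> list[bool]:
    """A CAM declares a match for every stored word a_m with h(a_m, x) == 0."""
    return [hamming_distance(a_m, x, n_bits) == 0 for a_m in stored_words]
```

In the model, every stored word is compared against the input in parallel, mirroring the per-row XNOR comparison of the CAM array.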
[0069] In the example of
[0070] It is important to realize that with the availability of the Hamming similarity s(a.sub.m,x)=N−h(a.sub.m,x), i.e., the number of bits in which a.sub.m and x agree, the PPAC 26 can compute the 1-bit inner product of a.sub.m and x as follows:

⟨a.sub.m,x⟩=Σ.sub.n=1.sup.N a.sub.m,n x.sub.n=2·s(a.sub.m,x)−N (Equation 1)

[0071] To see this, note that each partial product a.sub.m,n x.sub.n is +1 if a.sub.m,n=x.sub.n and −1 if a.sub.m,n≠x.sub.n; this partial product can be computed with an XNOR. If all N entries of a.sub.m and x differ, then ⟨a.sub.m,x⟩=−N. Otherwise, for each bit n for which a.sub.m,n=x.sub.n, the partial product a.sub.m,n x.sub.n changes from −1 to +1, increasing ⟨a.sub.m,x⟩ by 2. As the total number of bits that are equal between a.sub.m and x is given by the Hamming similarity s(a.sub.m,x), the inner product ⟨a.sub.m,x⟩ can be computed in accordance with Equation 1. Note that the PPAC 26 can compute ⟨a.sub.m,x⟩ in parallel for all the stored words a.sub.m, m=1, . . . , M, which is exactly a 1-bit MVP Ax between the matrix A (whose rows are the words a.sub.m) and the input vector x. Such MVPs can be computed in a single clock cycle.
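Equation 1 can be checked with a small software model in which the 0/1 logic levels stand for the values −1/+1 (an illustrative sketch; names are hypothetical):

```python
def xnor_popcount(a_bits, x_bits):
    """Hamming similarity: number of positions where logic levels agree (per-bit XNOR)."""
    return sum(1 for a, x in zip(a_bits, x_bits) if a == x)

def inner_product_pm1(a_bits, x_bits):
    """1-bit inner product of {-1,+1} vectors via Equation 1: <a, x> = 2*s(a, x) - N."""
    return 2 * xnor_popcount(a_bits, x_bits) - len(a_bits)
```

The result agrees with the direct product obtained by mapping logic level 0 to the value −1 and logic level 1 to +1.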
[0072] As will be further shown in Section III, embodiments of the PPAC 26 can compute multi-bit MVPs bit-serially over several clock cycles. Furthermore, while the XNOR gate 40 was used to multiply {−1, +1} entries, an AND gate can be included in each bit cell 32 to enable the multiplication of {0,1} entries. With this AND functionality, embodiments of the PPAC 26 can additionally perform (i) operations in GF(2), (ii) standard unsigned and 2's-complement signed arithmetic, and (iii) arbitrary Boolean functions in a similar fashion to a PLA.
B. Examples of Architecture Details
[0075] The addressable memory bits 34 of the bit cells 32 are written if the address addr corresponding to that row 46 and the write enable signal wrEn are asserted; clock gates 56 and a storage multiplexer 58 are used to implement this functionality. Once the addressable memory bits 34 are written and the control signal s.sub.n has been fixed for each column, different input vectors x can be applied to the PPAC 26. Then, the bit cell 32 operation results are passed to the row ALU 38.
[0077] A second accumulator 62 is used in applications where the matrix A has multi-bit entries. A programmable threshold δ.sub.m is then subtracted from the output of the second accumulator 62 to generate the output y.sub.m of the row ALU 38, whose interpretation depends on the operation mode. Note that the row ALU 38 contains two quantities that must be stored at configuration time: (i) the offset c used to correctly interpret r.sub.m and (ii) the threshold δ.sub.m. Finally, to increase the throughput of the PPAC 26, a pipeline stage is added after the row population count (e.g., using the bank adder 50).
III. Examples of PPAC Modes and Applications
[0078] With continuing reference to
A. Hamming Similarity
[0079] In this mode, the PPAC 26 computes the Hamming similarity between the M words a.sub.m, m=1, . . . , M, stored in each row 46 and the input word x. To this end, the bit cells 32 are configured to use the XNOR gate 40, so that the row population count r.sub.m corresponds to the Hamming similarity s(a.sub.m,x), i.e., the number of bit positions in which a.sub.m and x agree.
[0080] By setting δ.sub.m=N, the PPAC 26 can be used as a regular CAM. If all the bits of the stored word a.sub.m match the bits of x, then r.sub.m=N; hence, y.sub.m=0 and a match is declared. Otherwise, if r.sub.m<N, then y.sub.m<0. Thus, a complete-match can be declared by just looking at the MSB of the output y.sub.m. By setting 0≤δ.sub.m<N, the PPAC 26 declares a similarity-match whenever r.sub.m≥δ.sub.m, i.e., whenever a.sub.m and x agree in at least δ.sub.m bit positions.
[0081] In this operation mode, the PPAC 26 can be used for applications that rely on CAMs, including network switches and routers, computer caches, and content-addressable parallel processors (CAPPs). In this mode, the PPAC 26 can also be used for particle track reconstruction and for locality-sensitive hashing (LSH), which enables computationally efficient approximate nearest neighbor search.
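A sketch of the per-row decision in this mode follows (illustrative only; `delta` stands in for the programmable per-row threshold, and the names do not correspond to hardware signals):

```python
def similarity_match(a_bits, x_bits, delta):
    """Declare a match iff at least delta of the N bits agree.
    delta == N recovers exact CAM behavior (complete match)."""
    r = sum(1 for a, x in zip(a_bits, x_bits) if a == x)  # XNOR popcount r_m
    y = r - delta                                         # row ALU output y_m
    return y >= 0                                         # match iff sign bit (MSB) of y_m is clear
```

Lowering `delta` below N turns the exact CAM lookup into an approximate (similarity) lookup, which is the behavior exploited for locality-sensitive hashing.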
B. 1-Bit Matrix-Vector-Products
[0082] In this mode, the PPAC 26 computes one MVP y=Ax per clock cycle, where y.sub.m=⟨a.sub.m,x⟩, m=1, . . . , M, and a.sub.m and x are both N-dimensional vectors with 1-bit entries. Detailed below are example embodiments of the PPAC 26 able to support different 1-bit number formats.
1. Matrix and Vector with {±1} Entries
[0083] In this configuration, the LO and HI logical levels are interpreted as −1 and +1, respectively, for both the matrix A stored in the PPAC 26 and the input vector x. Multiplication between a bit in a.sub.m (the mth row 46 of A) and a bit in x can be computed via the XNOR gate 40 of the bit cell 32. However, the row population count r.sub.m is an unsigned number in the range [0,N]. To obtain the inner product ⟨a.sub.m,x⟩ from r.sub.m, Equation 1 is used, which can be implemented in the row ALU 38 by setting cEn=1, c=N, and popX2 to double the row population count (by left-shifting r.sub.m once).
2. Matrix and Vector with {0,1} Entries
[0084] In this configuration, the LO and HI logical levels are interpreted as 0 and 1, respectively, for both the matrix and input vector. Multiplication between a bit in a.sub.m and a bit in x will be 1 only if both entries are 1; this corresponds to using the AND gate 52 in each bit cell 32. Hence, the row population count satisfies r.sub.m=⟨a.sub.m,x⟩, which can be passed directly to the row ALU 38 output y.sub.m.
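For the {0,1} case the software model is even simpler, since the AND-gate population count is already the inner product (illustrative sketch):

```python
def inner_product_01(a_bits, x_bits):
    """{0,1} x {0,1}: the row population count of per-bit ANDs equals <a, x>."""
    return sum(a & x for a, x in zip(a_bits, x_bits))
```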
3. Matrix with {±1} Entries and Vector with {0,1} Entries
[0085] In this configuration, the vector x is expressed as x=0.5({circumflex over (x)}+1), where {circumflex over (x)} has {−1, +1} entries and 1 is the all-ones vector. Note that {circumflex over (x)} can be easily obtained by setting the entries of x that are 0 to −1; i.e., {circumflex over (x)} and x are equivalent in terms of logical LO and HI levels. Equation 1 is used to obtain the following equivalence:

⟨a.sub.m,x⟩=0.5(⟨a.sub.m,{circumflex over (x)}⟩+⟨a.sub.m,1⟩) (Equation 2)

[0086] This requires computation of ⟨a.sub.m,{circumflex over (x)}⟩ for all rows 46 m=1, . . . , M, which proceeds as in the {±1} case; the term ⟨a.sub.m,1⟩ depends only on the stored word a.sub.m and can therefore be determined at configuration time.
4. Matrix with {0,1} Entries and Vector with {±1} Entries
[0087] In this configuration, the vector x is expressed as x=2{tilde over (x)}−1, where {tilde over (x)} has {0,1} entries and, as above, has the same logical LO and HI levels as x. Noting that ⟨a.sub.m,x⟩=⟨a.sub.m,2{tilde over (x)}−1⟩, the inner product can be written as

⟨a.sub.m,x⟩=2⟨a.sub.m,{tilde over (x)}⟩−⟨a.sub.m,1⟩ (Equation 3)

[0088] As in Equation 2, this requires computation of ⟨a.sub.m,{tilde over (x)}⟩ for all rows 46 m=1, . . . , M of the PPAC 26, but this time with popX2, n0Z, and cEn set to 1, and c=N to complete Equation 3. As above, ⟨a.sub.m,1⟩ depends only on the stored word a.sub.m and can be determined at configuration time.
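The two mixed-format decompositions (Equations 2 and 3) can be verified numerically; the following sketch uses arbitrary example vectors and plain Python, not the hardware control signals:

```python
def inner(u, v):
    """Plain integer inner product."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Equation 2: matrix row a in {-1,+1}, vector x in {0,1}, with x_hat = 2x - 1
a = [1, -1, 1, 1]
x = [0, 1, 1, 0]
x_hat = [2 * xi - 1 for xi in x]
assert 2 * inner(a, x) == inner(a, x_hat) + sum(a)

# Equation 3: matrix row a in {0,1}, vector x in {-1,+1}, with x_tilde = (x + 1)/2
a2 = [1, 0, 1, 1]
x2 = [-1, 1, 1, -1]
x_tilde = [(xi + 1) // 2 for xi in x2]
assert inner(a2, x2) == 2 * inner(a2, x_tilde) - sum(a2)
```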
[0089] 1-bit {±1} MVPs can, for example, be used for inference of binarized neural networks. While 1-bit MVPs in the other number formats might have limited applicability, they are used for multi-bit operations as described next.
C. Multi-Bit Matrix-Vector-Products
[0090] In this mode, the PPAC 26 computes MVPs y=Ax where the entries of A and/or x have multiple bits. All of these multi-bit operations are carried out in a bit-serial manner, which implies that MVPs are computed over multiple clock cycles.
1. Multi-Bit Vector
[0091] Consider the case where A has 1-bit entries, while the vector x has L-bit entries. As a starting point:

x=Σ.sub.ℓ=1.sup.L 2.sup.ℓ−1 x.sub.ℓ (Equation 4)

where x.sub.ℓ is a 1-bit vector formed by the ℓth bit of all the entries of x. This decomposition enables rewriting of the MVP as follows:

y=Ax=Σ.sub.ℓ=1.sup.L 2.sup.ℓ−1 Ax.sub.ℓ (Equation 5)

[0092] The 1-bit MVP mode of the PPAC 26 with input x.sub.L (the MSB of the entries of x) is used to compute Ax.sub.L. The result is stored in the first accumulator 60 of the row ALU 38 by setting weV to 1. In the subsequent clock cycle, this value is doubled and added to Ax.sub.L−1 by setting vAcc to 1. By repeating this operation for ℓ=L, L−1, . . . , 1, the MVP y=Ax is computed bit-serially in L clock cycles.
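The bit-serial double-and-accumulate procedure of Equations 4 and 5 can be sketched as follows (an explanatory software model with hypothetical names, shown here for a {0,1} matrix and unsigned vector entries):

```python
def bit_serial_mvp(A, x, L):
    """y = A @ x computed bit-serially over L clock cycles, MSB of x first
    (Equations 4 and 5). A has 1-bit {0,1} entries; x has L-bit unsigned entries."""
    M = len(A)
    acc = [0] * M  # one accumulator per row (models the first accumulator per row ALU)
    for l in range(L - 1, -1, -1):          # bit-slice index: MSB down to LSB
        x_l = [(xi >> l) & 1 for xi in x]   # 1-bit vector of the l-th bits of x
        for m in range(M):
            row_product = sum(a * b for a, b in zip(A[m], x_l))  # 1-bit MVP for row m
            acc[m] = 2 * acc[m] + row_product  # double previous value, then accumulate
    return acc
```

Each outer iteration corresponds to one clock cycle of the 1-bit MVP mode; after L iterations the accumulators hold the exact multi-bit result.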
2. Multi-Bit Matrix
[0093] Consider the case where each entry of A has K bits. The matrix can be decomposed as A=Σ.sub.k=1.sup.K 2.sup.k−1 A.sub.k, where A.sub.k is a 1-bit matrix formed by the kth bit of all entries of A. In contrast to the multi-bit vector case, the addressable memory bits 34 of the PPAC 26 cannot be rewritten to contain a different matrix A.sub.k every clock cycle. Instead, different columns of the PPAC 26 are used for different bit-significance levels, so that all K bits of the entries of A are stored in the addressable memory bits 34. As a result, the PPAC 26 will now contain N/K different K-bit entries per row 46, instead of N different 1-bit entries per row 46. To ensure that only elements from A.sub.k are used, the columns with a different bit-significance are configured to use the AND gate 52, and the corresponding entry of x is set to 0, effectively nulling any contribution from these columns to the row population count r.sub.m. The rest of the columns are configured according to the used number format, and c in the row ALUs 38 is set to N/K for the number formats that use it, so that the PPAC 26 computes A.sub.kx for an input x that has N/K entries of L bits.
[0094] The PPAC 26 starts by computing A.sub.Kx (i.e., the MVP using the most significant bit of the entries of A) and saves the result in the second accumulator 62 of the row ALU 38 (by setting weM to 1), so that after L cycles (assuming each vector entry has L bits), it can double the accumulated result and add it to A.sub.K−1x by setting mAcc to 1. The new accumulated result is stored in the second accumulator 62, which will be written again L clock cycles later. By repeating this procedure, the multi-bit MVP y=Ax is computed bit-serially over K·L clock cycles.
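Extending the earlier sketch with a second (matrix) accumulator models the full K·L-cycle procedure (illustrative software model with hypothetical names, shown for unsigned entries):

```python
def multibit_mvp(A, x, K, L):
    """y = A @ x with K-bit unsigned matrix entries and L-bit unsigned vector
    entries, computed bit-serially over K*L cycles (matrix bit-planes MSB first)."""
    M = len(A)
    y = [0] * M                                  # models the second (matrix) accumulator
    for k in range(K - 1, -1, -1):               # matrix bit-planes, MSB first
        A_k = [[(a >> k) & 1 for a in row] for row in A]
        v = [0] * M                              # models the first (vector) accumulator
        for l in range(L - 1, -1, -1):           # vector bit-slices, MSB first
            x_l = [(xi >> l) & 1 for xi in x]
            for m in range(M):
                v[m] = 2 * v[m] + sum(a & b for a, b in zip(A_k[m], x_l))
        y = [2 * ym + vm for ym, vm in zip(y, v)]  # double-and-accumulate across planes
    return y
```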
3. Supported Number Formats
[0095] As detailed in Section III-B, the PPAC 26 is able to compute multi-bit MVPs with different number formats summarized in Table I.
TABLE I
L-BIT NUMBER FORMATS SUPPORTED BY PPAC

  Name         uint            int               oddint
  LO level     0               0                 −1
  HI level     1               1                 +1
  Signed?      No              Yes               Yes
  Min. value   0               −2.sup.L−1        −2.sup.L+1
  Max. value   2.sup.L−1       2.sup.L−1−1       2.sup.L−1
  E.g., L=2    {0, 1, 2, 3}    {−2, −1, 0, 1}    {−3, −1, +1, +3}
[0096] For example, by mapping the logical LO level to 0 and HI to 1, multi-bit MVPs between unsigned numbers (uint) are performed. To operate with signed numbers (int), embodiments may negate (in 2's-complement representation) the partial products Ax.sub.L (for signed multi-bit vectors) or A.sub.Kx (for signed multi-bit matrices), which are associated with the MSBs of the signed numbers in the vector x and matrix A, respectively. The row ALUs 38 can be configured to implement this behavior by setting vAccX1 and mAccX1 to 1 for a signed vector or matrix, respectively. The oddint number format arises from having a multi-bit number in which LO and HI get mapped to −1 and +1, respectively.
[0097] Then, by applying Equation 4, oddint represents signed odd numbers, as illustrated in Table I. Note that oddint cannot represent 0.
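The three number formats of Table I can be modeled by decoding an L-bit pattern under each interpretation (illustrative sketch; the function is hypothetical and not part of the hardware):

```python
def decode(bits, fmt):
    """Value of an L-bit pattern in each number format of Table I.
    bits[0] is the LSB; fmt is 'uint', 'int' (2's complement), or 'oddint'."""
    L = len(bits)
    if fmt == "uint":
        return sum(b << l for l, b in enumerate(bits))
    if fmt == "int":     # MSB carries weight -2^(L-1)
        return sum(b << l for l, b in enumerate(bits[:-1])) - (bits[-1] << (L - 1))
    if fmt == "oddint":  # every bit encodes -1 or +1, so 0 is unrepresentable
        return sum((2 * b - 1) << l for l, b in enumerate(bits))
    raise ValueError(fmt)
```

For L=2 this reproduces the example rows of Table I: {0, 1, 2, 3}, {−2, −1, 0, 1}, and {−3, −1, +1, +3}.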
[0098] Low-resolution multi-bit MVPs using different number formats find widespread use in practice. For example, neural network inference can be executed with matrices and vectors using low-precision int numbers, where the threshold δ.sub.m in the row ALU 38 can be used as the bias term of a fully-connected (dense) layer. A 1-bit oddint matrix multiplied with a multi-bit int vector can be used to implement a Hadamard transform, which finds use in signal processing, imaging, and communications applications.
D. GF(2) Matrix-Vector-Products
[0099] In this mode, the PPAC 26 is able to perform MVPs in GF(2), the finite field with two elements {0,1}. Multiplication in this field corresponds to an AND operation; addition corresponds to an XOR operation, which is equivalent to a simple addition modulo-2. GF(2) addition can then be performed by extracting the least significant bit (LSB) of a standard integer addition. To support MVPs in this mode, all of the columns of the PPAC 26 are set to use the AND gate 52 in the bit cells 32, and the row ALU 38 is configured so that r.sub.m=y.sub.m. Then, the result of ⟨a.sub.m,x⟩ in GF(2) can be extracted from the LSB of y.sub.m. Recent mixed-signal architectures that support MVPs are unable to support this mode as the LSBs of analog additions are generally not bit-true.
[0100] GF(2) MVPs find widespread application in the computation of substitution boxes of encryption systems, including the advanced encryption standard (AES), as well as in encoding and decoding of error-correction codes, such as low-density parity-check and polar codes.
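The GF(2) mode reduces to an AND for each bit-wise product and an LSB extraction of the integer row sum, which a short sketch makes explicit (illustrative model only):

```python
def gf2_mvp(A, x):
    """y = A x over GF(2): AND for multiplication, LSB of the integer row
    population count for modulo-2 addition."""
    return [sum(a & b for a, b in zip(row, x)) & 1 for row in A]
```

Used with a parity-check matrix as A, each output bit is one parity check, which is the building block of the error-correction use cases mentioned above.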
E. Programmable Logic Array
[0101] In this mode, each bank 44 of the PPAC 26 is able to compute a Boolean function as a sum of min-terms, similar to a PLA. To this end, the mth row 46 computes a min-term as follows: Each PPAC 26 column and entry of the input vector x correspond to a different Boolean variable X; note that the complement
[0102] Note that the PPAC 26 also supports different logic structures. For example, if the row threshold is set to 1, then each row 46 computes a max-term. If the result of the Boolean function is interpreted to be 1 only if p.sub.b is equal to the number of programmed max-terms in the bank, the PPAC 26 effectively computes a product of max-terms. In general, the PPAC 26 can execute a logic function with two levels: the first stage can be a multi-operand AND, OR, or majority (MAJ) gate of the Boolean inputs; the second stage can be a multi-operand AND, OR, or MAJ gate of the outputs of the first stage. With this, the PPAC 26 can be used as a look-up table or programmed as a PLA that computes Boolean functions.
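The two-level logic structure can be sketched in Python as follows; this is a hypothetical model (function and variable names are assumptions) in which each row pop-counts XNOR matches against its stored bits, a full-match threshold realizes a min-term (AND of literals) and a threshold of 1 realizes a max-term (OR of literals), and the bank-level pop-count p_b implements the second stage:

```python
def pla_bank(stored, x, threshold):
    """Hypothetical model of two-level logic on one bank.

    Each row XNORs its stored bits against the input bits x and pop-counts
    the matches. threshold == len(x) makes a row a min-term (all literals
    must match, a multi-operand AND); threshold == 1 makes it a max-term
    (any literal matching suffices, a multi-operand OR). The bank-level
    pop-count p_b then combines the row outputs in the second stage."""
    row_outputs = []
    for a_m in stored:
        matches = sum(1 ^ a ^ v for a, v in zip(a_m, x))  # XNOR pop-count
        row_outputs.append(1 if matches >= threshold else 0)
    p_b = sum(row_outputs)
    sum_of_terms = 1 if p_b >= 1 else 0              # second-level OR
    product_of_terms = 1 if p_b == len(stored) else 0  # second-level AND
    return sum_of_terms, product_of_terms

# Two rows programmed as min-terms over two variables; input x = [1, 0]
# matches the first row exactly, so the sum of min-terms evaluates to 1.
```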
IV. Implementation Results
[0103] Presented below are post-layout implementation results of various PPAC 26 array sizes in 28 nm CMOS and a comparison to existing in-memory accelerators and other related designs.
A. Post-Layout Implementation Results
[0104] Disclosed below are four different M×N PPAC 26 arrays in 28 nm CMOS. All of these PPAC 26 implementations have banks 44 formed by sixteen rows 46, each with V=16 bit cells 32 per subrow 48, and a row ALU 38 that supports multi-bit operations with L and K up to 4 bits. Table II, below, summarizes example post-layout implementation results.
TABLE II
POST-LAYOUT IMPLEMENTATION RESULTS FOR DIFFERENT PPAC ARRAY SIZES IN 28 NM CMOS

  Words M                    16       16       256       256
  Word-length N              16       256      16        256
  Banks B                    1        1        16        16
  Subrows B.sub.a            1        16       1         16
  Area [μm.sup.2]            14 161   72 590   185 283   783 240
  Density [%]                75.77    70.45    72.52     72.13
  Cell area [kGE]            17       81       213       897
  Max. clock freq. [GHz]     1.116    0.979    0.824     0.703
  Power [mW]                 6.64     45.60    78.65     381.43
  Peak throughput [TOP/s]    0.55     8.01     6.54      91.99
  Energy-eff. [fJ/OP]        12.00    5.69     12.03     4.15
[0105] The throughput is measured in operations (OP) per second, where both 1-bit multiplications and 1-bit additions are counted as one OP each. Since each row 46 performs an inner product between two N-dimensional 1-bit vectors, an M×N PPAC 26 performs M(2N−1) OP per clock cycle. Even though the clock frequency decreases as the dimensions of the PPAC 26 increase, the overall throughput increases up to 92 TOP/s for the 256×256 array; this occurs due to the massive parallelism of the disclosed PPAC 26 design. It is observed that increasing the number of words M results in a higher area and power consumption than increasing the number of bits per word N by the same factor. This behavior is due to the fact that adding a new row 46 implies including a new row ALU 38, whose area can be comparable to that of the row 46 memory. In contrast, increasing the number of bits per word N mainly modifies the datapath width of an existing row ALU 38, which scales only logarithmically in N, improving the energy-efficiency of the 256×256 PPAC 26 to 4.15 fJ/OP.
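The peak-throughput figures in Table II follow directly from the M(2N−1) OP-per-cycle count; a minimal sketch (the function name is an assumption for illustration):

```python
def peak_throughput_tops(M, N, f_clk_ghz):
    """Peak throughput in TOP/s for an M x N array: each of the M rows
    performs an N-dimensional 1-bit inner product per clock cycle, i.e.
    N multiplications and N - 1 additions, for M(2N - 1) OP per cycle."""
    return M * (2 * N - 1) * f_clk_ghz / 1e3

# 256 x 256 array at 0.703 GHz -> roughly 92 TOP/s, as in Table II
# 16 x 16 array at 1.116 GHz  -> roughly 0.55 TOP/s
```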
[0106] Table III, below, summarizes the throughput, power, and energy-efficiency for the different operation modes executed on a 256×256 PPAC 26.
TABLE III
THROUGHPUT, POWER, AND ENERGY-EFFICIENCY FOR DIFFERENT APPLICATIONS WITH A 256×256 PPAC ARRAY IN 28 NM CMOS

  Operation mode        Throughput [GMVP/s]   Power [mW]   Energy-eff. [pJ/MVP]
  Hamming similarity    0.703                 478          680
  1-bit {±1} MVP        0.703                 498          709
  4-bit {0,1} MVP       0.044                 226          5 137
  GF(2) MVP             0.703                 353          502
  PLA                   0.703                 352          501
[0107] Throughput and energy-efficiency are measured in terms of MVPs, where for the Hamming-similarity mode, an MVP corresponds to the computation of M=256 Hamming similarities; for the PLA mode, an MVP computes B=16 distinct Boolean functions. To extract power estimates, Cadence Innovus was used with stimuli-based post-layout simulations at 0.9 V and 25° C. in the typical-typical process corner. In the simulations, a randomly-generated matrix A was loaded into the addressable memory bits 34 of the PPAC 26, and then 100 random input vectors x were applied for the 1-bit operations, while for the 4-bit {0,1} MVP case, 100 different MVPs were executed. The dynamic and static power consumption of the PPAC 26 were simulated only while performing computations (i.e., the power consumption of initializing the matrix A was excluded), as this is the envisioned use case for the PPAC 26: applications in which the matrix A remains largely static but the input vectors x change at a fast rate. From Table III, it can be observed that operations that use the XNOR gate 40 (i.e., Hamming similarity and 1-bit {±1} MVP) exhibit higher power consumption than tasks relying on the AND gate 52; this is because the switching activity at the output of XNOR gates 40 is, in general, higher than that of AND gates 52.
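The energy-efficiency column of Table III is simply power divided by MVP throughput; the following sketch (names are illustrative) reproduces the reported figures from the unit identity mW / (GMVP/s) = pJ/MVP:

```python
def energy_per_mvp_pj(power_mw, throughput_gmvps):
    """Energy per MVP in pJ: power [mW] divided by throughput [GMVP/s].
    Since 1 mW = 1e-3 J/s and 1 GMVP/s = 1e9 MVP/s, the ratio is
    1e-12 J/MVP = 1 pJ/MVP."""
    return power_mw / throughput_gmvps

# Hamming-similarity mode: 478 mW at 0.703 GMVP/s -> about 680 pJ/MVP
# GF(2) MVP mode:          353 mW at 0.703 GMVP/s -> about 502 pJ/MVP
```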
B. Comparison with Existing Accelerators
[0108] In Table IV, the 256×256 PPAC 26 is compared with existing hardware accelerators that have been specialized for binarized neural network (BNN) inference and support fully-connected layers.
TABLE IV
COMPARISON WITH EXISTING BINARIZED NEURAL NETWORK (BNN) ACCELERATOR DESIGNS

                            Mixed-   Implemen-  Technology  Supply  Area       Peak TP   Energy-eff.  Peak TP.sup.a  Energy-eff..sup.a
  Design              PIM?  signal?  tation     [nm]        [V]     [mm.sup.2] [GOP/s]   [TOP/s/W]    [GOP/s]        [TOP/s/W]
  PPAC                yes   no       layout     28          0.9     0.78       91 994    184          91 994         184
  CIMA [6]            yes   yes      silicon    65          1.2     8.56       4 720     152          10 957         1 456
  Bankman et al. [19] no    yes      silicon    28          0.8     5.95       532       429          n/a            n/a
  BRein [10]          yes   no       silicon    65          1.0     3.9        1.38      2.3          3.2            15
  UNPU [23]           no    no       silicon    65          1.1     16         7 372     46.7.sup.b   17 114         376
  XNE [24]            no    no       layout     22          0.8     0.016      108       112          84.7           54.6

.sup.a Technology scaling to 28 nm CMOS at V.sub.dd=0.9 V assuming standard scaling rules: area A~1/ℓ.sup.2, delay t.sub.pd~1/ℓ, and dynamic power P.sub.dyn~1/(V.sub.dd.sup.2ℓ).
.sup.b Number reported in [23, FIG. 13]; note that the peak TP (7 372 GOP/s) divided by the reported power consumption (297 mW) yields 24.8 TOP/s/W.
n/a indicates data missing or illegible when filed.
[0109] The PPAC 26 was compared against these designs because their operation closely resembles that of the 1-bit {±1} MVP operation mode. In fact, all of the considered designs count 1-bit products and additions as one operation (OP) each, so that an inner product between two N-dimensional 1-bit vectors is 2N OPs. Some current designs are PIM accelerators in which part of the computation is carried out within the bit cells 32, or rely on mixed-signal techniques to compute MVPs.
[0110] By considering technology scaling, it was seen that the energy-efficiency (in terms of TOP/s/W) of the PPAC 26 is comparable to other fully-digital designs, but 7.9 times and 2.3 times lower than the two mixed-signal designs, respectively, where the latter is implemented in a comparable technology node as the PPAC 26. As noted in Section III-D, mixed-signal designs are particularly useful for tasks that are resilient to noise or process variation, such as neural network inference. However, mixed-signal designs cause issues in applications that require bit-true results, such as addition in GF(2), which requires the LSB of an integer addition to be exact.
[0111] It was also seen that the PPAC 26 achieved the highest peak throughput among the considered designs, which is due to its massive parallelism. It is to be emphasized, however, that the PPAC 26 performance was extracted from post-layout simulations, whereas all the other designs are silicon-proven. Furthermore, the other designs not only execute 1-bit MVPs, but also include other operations that are required to implement BNN inference, such as activation functions and batch normalization. The PPAC 26, in contrast, executes a 256×256 MVP followed by adding a bias vector, which is a large portion of the operations required to process a fully-connected BNN layer. As a result, the reported throughput and energy-efficiency for the PPAC 26 are based on simulations.
[0112] It is to be reiterated that the PPAC 26 is a massively-parallel PIM engine that can be used for a number of different MVP-like operations, where 1-bit MVP is just one of them. As such, the main purpose of the comparison in Table IV is to demonstrate that the 1-bit {±1} MVP operation mode holds promise, with an energy-efficiency that is comparable to that of other accelerators. While some of the hardware designs are specialized to carry out 1-bit MVPs and some designs are specialized to execute multi-bit MVPs for neural network inference, the PPAC 26 is programmable to perform not only these operations, but also GF(2) MVPs, Hamming-similarity computations, and PLA or CAM functionality, opening up its use in a wide range of applications. In this sense, the PPAC 26 replaces current implementations where PIM is used to accelerate multiple applications, such as database query processing, cryptographic kernels, and in-memory checkpointing. In some embodiments, the PPAC 26 will be integrated into a system. In some implementations used to compute MVPs, an element-wise multiplication between two vectors whose entries are L-bit requires L.sup.2+5L−2 clock cycles, which is a total of 34 clock cycles for 4-bit numbers. Then, the reduction (via sum) of an N-dimensional vector with L bits per entry requires L log.sub.2(N) clock cycles, which is at least 64 clock cycles for a 256-dimensional vector with 8-bit entries (as the product of two 4-bit numbers results in an 8-bit number). Hence, an inner product between two 4-bit vectors with 256 entries requires at least 98 clock cycles. By contrast, some embodiments of the PPAC 26 require only 16 clock cycles for the same operation.
This significant difference in the number of clock cycles is caused by the fact that while some conventional designs are geared towards data-centric applications in which element-wise operations are performed between high-dimensional vectors to increase parallelism, the PPAC 26 aims at accelerating a wide range of MVP-like operations, one of the reasons why embodiments herein included dedicated hardware (such as the row pop-count) to speed up element-wise vector multiplication and vector sum-reduction.
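The cycle counts quoted above can be checked with a small model; this sketch assumes L.sup.2+5L−2 cycles for the element-wise L-bit multiply and 2L·log.sub.2(N) cycles for the sum reduction of the 2L-bit products, as described in the text:

```python
import math

def conventional_inner_product_cycles(L, N):
    """Cycle-count model for the bit-serial design described in the text:
    an element-wise multiply of two N-dimensional L-bit vectors takes
    L^2 + 5L - 2 cycles, and sum-reducing the N products (each 2L bits
    wide, since two L-bit factors yield a 2L-bit product) takes
    2L * log2(N) cycles."""
    multiply = L * L + 5 * L - 2            # 34 cycles for L = 4
    product_bits = 2 * L                    # 8 bits for L = 4
    reduce_ = product_bits * int(math.log2(N))  # 64 cycles for N = 256
    return multiply + reduce_

# 4-bit, 256-entry inner product: 34 + 64 = 98 cycles,
# versus 16 clock cycles on some embodiments of the PPAC.
```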
V. Flow Diagram for Performing Logic Functions in Memory
[0113]
VI. Implementation for Wireless Communications
[0114]
[0115] In an exemplary aspect, the wireless communications system 66 includes an antenna array 70 (e.g., having B antennas), which can be operated for multi-user, multiple input, multiple output (MU-MIMO) communication. Accordingly, the wireless communications system 66 can be configured to perform spatial equalization in uplink communications (e.g., communications from a user equipment (UE) to a core network). The purpose of spatial equalization is to beamform received signals by collecting the signals at B antennas of the antenna array 70, while suppressing interference (e.g., inter-UE interference).
[0116] In an exemplary aspect, the baseband processing circuitry 68 is implemented as an ASIC in CMOS technology. The baseband processing circuitry 68 includes an array of successive-approximation-register (SAR) analog-to-digital converters (ADCs) 72 (e.g., connected to RF receivers 74), a fast Fourier transform (FFT) engine 76, a channel estimator 78, and a spatial equalizer. The spatial equalizer is implemented using a specialized variant of the PPAC 26 described herein.
[0117] The SAR ADCs 72 convert the analog signals sampled at the base-station (e.g., wireless communications system 66) antenna array 70. The channel estimator 78 obtains the information required to compute a spatial equalization matrix, which is stored in memory of the PPAC 26. The FFT engine 76 allows the PPAC 26 to operate in either beamspace or the conventional antenna domain, where the latter corresponds to taking the outputs directly from the SAR ADCs 72. The PPAC 26 then multiplies an equalization matrix stored in its memory with the signals coming from the SAR ADCs 72 (which might have been converted into beamspace by the FFT engine 76).
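As an illustration of the equalizer's core computation, the following sketch (hypothetical names; the actual hardware operates on low-precision fixed-point values inside the PPAC 26) multiplies a stored equalization matrix W with a received vector y from the ADCs or the FFT engine:

```python
def spatial_equalize(W, y):
    """Hypothetical model of the equalizer stage: the stored equalization
    matrix W (one row per UE stream, one column per antenna or beam)
    multiplies the received vector y, beamforming each UE's signal while
    suppressing inter-UE interference."""
    return [sum(w * v for w, v in zip(row, y)) for row in W]

# Toy example: two streams, two antennas.
W = [[1, 0],
     [0, 2]]
y = [3, 4]
```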
[0118]
[0119] The PPAC 26 offers superior area- and energy-efficiency when compared with traditional bit-parallel architectures that perform low-resolution MVPs. The implemented PPAC 26 of
[0120] Those skilled in the art will recognize improvements and modifications to the preferred embodiments of the present disclosure. All such improvements and modifications are considered within the scope of the concepts disclosed herein and the claims that follow.