Hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add processing element for machine learning applications
12242906 · 2025-03-04
Abstract
A hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add processing element (PE) for machine learning (ML) applications is disclosed. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications involves using Fin Field-Effect Transistors (FinFETs), which provide excellent sub-threshold operation, thereby reducing power requirements, and use variation minimization strategies to improve the overall accuracy. In this way, hybrid analog-digital mixed-mode matrix multiply-add calculations are efficient, low power, and accurate, with the processing element itself in a relatively small surface area.
Claims
1. A hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add processing element (PE) for machine learning (ML) applications comprising: a 4 bit multiply processing element (PE) that performs a multiplication function; a first circuit that implements the 4 bit multiply PE to calculate a first current result based on a first reference current; a second circuit that implements the 4 bit multiply PE to calculate a second current result based on a second reference current; and a connection between a first output of the first circuit and a second output of the second circuit to produce a combined current resulting from implementation of a multiply-add function that calculates a sum of the first current result and the second current result.
2. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 1 further comprising an analog-to-digital converter (ADC) that is configured to convert the combined current to a digital value based on the sum of the first current result and the second current result.
3. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 1, wherein the first circuit comprises a plurality of first circuit transistors, wherein the second circuit comprises a plurality of second circuit transistors.
4. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 3, wherein the plurality of first circuit transistors comprise a bottom stack of four first circuit transistors and a top stack of four first circuit transistors.
5. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 4, wherein the bottom stack of four first circuit transistors serves as a 4 bit digital-to-analog converter (DAC) for incoming data.
6. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 4, wherein the bottom stack of four first circuit transistors are connected to a current mirror configuration controlled by a voltage of a reference current.
7. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 6, wherein the bottom stack of four first circuit transistors comprises a first bottom transistor, a second bottom transistor, a third bottom transistor, and a fourth bottom transistor, wherein the top stack of four transistors comprises a first top transistor, a second top transistor, a third top transistor, and a fourth top transistor.
8. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 7, wherein a first size of the first bottom transistor is eight times a fourth size of the fourth bottom transistor, wherein a second size of the second bottom transistor is four times the fourth size, wherein a third size of the third bottom transistor is two times the fourth size, wherein the first top transistor is the first size, the second top transistor is the second size, the third top transistor is the third size, and the fourth top transistor is the fourth size.
9. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 8, wherein a first branch of transistors comprises the first bottom transistor and the first top transistor, wherein a second branch of transistors comprises the second bottom transistor and the second top transistor, wherein a third branch of transistors comprises the third bottom transistor and the third top transistor, wherein a fourth branch of transistors comprises the fourth bottom transistor and the fourth top transistor.
10. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 9, wherein current in the fourth branch is equal to the reference current, current in the third branch is two times the reference current, current in the second branch is four times the reference current, and current in the first branch is eight times the reference current.
11. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 10, wherein a connection of drains of the first top transistor, the second top transistor, the third top transistor, and the fourth top transistor provides a summation of the currents of the first branch, the second branch, the third branch, and the fourth branch, wherein the summation provides a binary representation of the incoming data in terms of current.
12. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 11, wherein the plurality of second circuit transistors comprise a lower PMOS, an upper PMOS, a lower stack of second circuit transistors, and an upper stack of second circuit transistors, wherein the summation of the currents serves as a second circuit reference current for a current mirror formed by the lower PMOS and the lower stack of second circuit transistors, wherein the upper stack of second circuit transistors serve as a DAC for incoming weight vector data.
13. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 12, wherein the lower stack of second circuit transistors comprises a first lower transistor, a second lower transistor, a third lower transistor, and a fourth lower transistor, wherein the upper stack of second circuit transistors comprise a first upper transistor, a second upper transistor, a third upper transistor, and a fourth upper transistor.
14. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 13, wherein a first branch of second circuit transistors comprises the first lower transistor and the first upper transistor, wherein a second branch of second circuit transistors comprises the second lower transistor and the second upper transistor, wherein a third branch of second circuit transistors comprises the third lower transistor and the third upper transistor, wherein a fourth branch of second circuit transistors comprises the fourth lower transistor and the fourth upper transistor.
15. The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of claim 14, wherein current in the fourth branch of second circuit transistors is equal to the second circuit reference current, current in the third branch of second circuit transistors is two times the second circuit reference current, current in the second branch of second circuit transistors is four times the second circuit reference current, and current in the first branch of second circuit transistors is eight times the second circuit reference current.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) Having described the invention in general terms, reference is now made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
DETAILED DESCRIPTION
(15) In the following detailed description of the invention, numerous details, examples, and embodiments of the invention are described. However, it will be clear and apparent to one skilled in the art that the invention is not limited to the embodiments set forth and that the invention can be adapted for any of several applications.
(16) Embodiments of the invention described in this specification provide a hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add processing element (PE) for machine learning (ML) applications. In some embodiments, the hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications employs analog add-multiply PEs built from FinFETs operating in the sub-threshold regime, which reduces overall power consumption, and applies variation minimization strategies to improve overall accuracy.
(17) As stated above, power and area efficient add-multiply implementation for ML applications are currently difficult to achieve due to the power issues and large silicon area of digital implementations, while in the case of analog implementation the issues are related to accuracy and relatively large area and power due to the required ADC. Consequently, existing options have not been able to provide lower power consumption in a smaller spatial footprint with better accuracy of results in the analog or digital implementations. Embodiments of the hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications described in this specification solve such problems by way of a hybrid analog-digital implementation of multiply add.
(18) Embodiments of the hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add described in this specification differ from and improve upon currently existing implementations of multiply-add processing elements. For instance, existing analog implementations have accuracy issues due to limited dynamic range, variation, and the use of many power-hungry ADCs. By contrast, the proposed hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications solves the accuracy and power issues of typical analog implementations, and is much more efficient than typical fully digital implementations. In particular, the hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of some embodiments is a very efficient current mirror based multiplier circuit implementation and utilizes FinFETs with excellent sub-threshold operation, thereby reducing overall power consumption. Furthermore, variation minimization strategies applied to the FinFETs improve the overall accuracy.
(19) The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of the present disclosure may be comprised of the following elements. This list of possible constituent elements is intended to be exemplary only and it is not intended that this list be used to limit the hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of the present application to just these elements. Persons having ordinary skill in the art relevant to the present disclosure may understand there to be equivalent elements that may be substituted within the present disclosure without changing the essential function or operation of the hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications.
1. 4-bit multiply processing element (PE)
2. 4 bit × 4 bit multiply-add
3. Sum of partial products
4. Variation analysis
5. Extending to 8-bit PEs
6. Extending to 16-bit PEs
7. 256 × 256 16-bit multiply-add array
(20) The hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of the present disclosure generally works as follows. The 4-bit multiply PE is a digitally controlled analog multiplier. Specifically, the 4-bit multiply PE provides cross-coupled, digitally controlled current mirrors to perform the analog multiplication. Simply connecting the outputs of such elements provides the sum for these elements. The 4-bit PE is the base cell used to create the 4 × 4 bit (or 4 bit × 4 bit), the 16 × 16 bit (or 16 bit × 16 bit), and the 256 multiply-add array.
(21) To make the hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications of the present disclosure, a person may start with designing the 4-bit element. This would involve adherence to standard layout practices to minimize variation and allow for abutments to minimize area overhead. The DAC at the input is formed by the input transistors in the PE stack. The data can come from nearby memory elements, such as SRAM bits, or RRAM, or other types of memory. These memory bits should be combined with the analog PE to provide an in-memory machine learning PE. These elements will be combined to form an in-memory full multiply-add hybrid ML PE array. Notably, the 4-bit element is the basic cell and, consequently, can be used to form different array combinations. Also, the ADC could be substituted with direct computation in the analog domain.
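To illustrate how a 4-bit base cell can serve wider operands, the following is a minimal sketch of the standard partial-product decomposition of an 8 bit × 8 bit product into four 4 bit × 4 bit products. The function names `mul4` and `mul8_from_4bit_cells` are hypothetical, and the sketch works on plain integers; how the shifts are realized in the actual circuit (for example, in the analog domain or after conversion) is not specified here and is left as an assumption.

```python
def mul4(a, b):
    """Ideal stand-in for one 4 bit x 4 bit PE (hypothetical helper, unitless)."""
    if not (0 <= a < 16 and 0 <= b < 16):
        raise ValueError("operands must be 4-bit unsigned values")
    return a * b


def mul8_from_4bit_cells(a, b):
    """Build an 8 bit x 8 bit product from four 4-bit partial products.

    With a = 16*a_hi + a_lo and b = 16*b_hi + b_lo:
        a*b = 256*a_hi*b_hi + 16*(a_hi*b_lo + a_lo*b_hi) + a_lo*b_lo
    """
    if not (0 <= a < 256 and 0 <= b < 256):
        raise ValueError("operands must be 8-bit unsigned values")
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    high = mul4(a_hi, b_hi) << 8                            # weight 256
    middle = (mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4     # weight 16
    low = mul4(a_lo, b_lo)                                  # weight 1
    return high + middle + low


print(mul8_from_4bit_cells(200, 123))  # same value as 200 * 123
```

The same decomposition applies recursively, so a 16-bit operand can be split into four 4-bit nibbles in the same way.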
(22) Machine learning applications typically require massive quantities of sum of product calculations for each successive node of the neural network. This is demonstrated in
(23) By way of example,
(24) Consequently, when these chips are combined in larger system clusters, the system clusters end up consuming a large amount of power, typically on the scale of kilowatts to megawatts. Therefore, with respect to AI and ML applications, it is imperative to implement the hardware PEs for these matrix-vector multiplication procedures in a very efficient way in terms of speed, area (footprint needed on-chip), and power in order to provide improved performance and cost savings.
(25) By way of example,
(26) Notably, the analog-implemented 4-bit × 4-bit multiplication PE 200 works by using cross-coupled current mirrors to perform the multiplication function. The bottom stack of the circuit 200 (i.e., NMOS transistors M6, M7, M8, and M9) serves as a 4 bit digital-to-analog converter (DAC) for the data <d3:d0>. The transistor M6 is eight times the size of the transistor M9. The transistor M7 is four times the size of the transistor M9. The transistor M8 is twice the size of the transistor M9. In the corresponding stack of transistors, each transistor (M1, M3, M4, and M5) has the same size as the corresponding transistor (M6, M7, M8, and M9, respectively). These transistors are connected to a current mirror configuration controlled by the voltage on node ref. The current in the branch M9, M5 will be equal to the reference current, while the current in the stack M1, M6 will be eight times the reference current. The connection of the drains of the transistors M1, M3, M4, and M5 provides the summation of the currents of the four branches and a binary representation of the data input <d3:d0> in terms of current.
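The binary-weighted current summation described in this paragraph can be sketched numerically. The following is an idealized behavioral model, not a circuit simulation: it assumes that bit d3 gates the eight-times branch and d0 gates the one-times branch, that mirroring is perfect, and that currents are expressed in units of the reference current. The function name `dac_current` is hypothetical.

```python
def dac_current(data_bits, i_ref=1.0):
    """Summed branch current for a 4-bit code given as (d3, d2, d1, d0).

    Branch gains of 8, 4, 2, 1 follow the transistor size ratios relative
    to the smallest device; a branch conducts only when its bit is 1.
    """
    branch_gains = (8, 4, 2, 1)
    if len(data_bits) != 4 or any(bit not in (0, 1) for bit in data_bits):
        raise ValueError("expected four bits, each 0 or 1")
    return i_ref * sum(gain * bit for gain, bit in zip(branch_gains, data_bits))


# Code 1011 enables the 8x, 2x, and 1x branches: 8 + 2 + 1 = 11 reference units.
print(dac_current((1, 0, 1, 1)))
```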
(27) This current now serves as a reference current for the current mirror formed by M17 and transistors M2, M10, M11, and M12. The PMOS transistors M13, M14, M15, and M16 serve as the DAC unit for the weights vector <w3_n:w0_n> (where the notation w_n means the inverse of w). In a similar fashion as in the bottom stack, the transistors M13, M2 are eight times the size of the transistors in branch M12, M16. Likewise, the transistors in branch M10, M14 are four times the M12, M16 size, while the transistors in the M15, M11 branch are two times the size of the M12, M16 transistors. Notably, the transistors M17, M18 are the same size as M12, M16. The ratio of sizing of the PMOS vs NMOS transistors follows the beta ratio of the process.
(28) The current in the M17, M18 branch will be the total current from the bottom NMOS stack, while the current in the M2, M13 branch will be eight times the current in the M17, M18 branch, based on the current mirror function. Similarly, the current in the M10, M14 branch is four times the current in the M17, M18 branch, and the current in the M11, M15 branch is two times the current in the M17, M18 branch. Finally, the current in the M12, M16 branch is the same as the reference current. The end result is that the current at the node out will be equal to the product of the total current of the bottom stack and the current of the top stack, corresponding to a 4 bit data × 4 bit weight multiplication operation. The role of M18 is to provide the same stack content as the rest of the top stack, offering the same virtual ground for all nodes in the middle of the top stack and helping M17 provide an accurate reference current.
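The two-stage mirroring described above, together with the summation obtained by tying PE outputs together, can be sketched as an ideal behavioral model. This is a sketch under stated assumptions: perfect current mirrors, currents in units of the reference current, and logical (non-inverted) weight bits as inputs, whereas the circuit receives the inverted weights w_n on PMOS gates. The names `code_value`, `pe_output_current`, and `multiply_add` are hypothetical.

```python
def code_value(bits):
    """Interpret (b3, b2, b1, b0) as an unsigned 4-bit integer."""
    if len(bits) != 4 or any(bit not in (0, 1) for bit in bits):
        raise ValueError("expected four bits, each 0 or 1")
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def pe_output_current(data_bits, weight_bits, i_ref=1.0):
    """Ideal output current of one 4 bit x 4 bit PE.

    Stage 1: the bottom stack sums branches of 8, 4, 2, 1 times i_ref
    selected by the data bits. Stage 2: that sum is the reference for the
    top mirror, whose branches of 8, 4, 2, 1 are selected by the weights.
    """
    stage1 = code_value(data_bits) * i_ref
    return code_value(weight_bits) * stage1


def multiply_add(pairs, i_ref=1.0):
    """Connecting PE outputs to one node adds their currents."""
    return sum(pe_output_current(d, w, i_ref) for d, w in pairs)


pairs = [
    ((0, 1, 0, 1), (0, 0, 1, 1)),  # 5 * 3
    ((1, 1, 1, 1), (1, 1, 1, 1)),  # 15 * 15
    ((0, 0, 0, 0), (1, 0, 1, 0)),  # 0 * 10 contributes no current
]
print(multiply_add(pairs))  # 15 + 225 + 0 = 240 reference units
```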
(29) Power consumption is very limited because the whole operation is in the sub-threshold regime of the transistor operation. Furthermore there is no power consumption when the data or the weights are zero, as in the case of sparse matrices, thereby automatically reducing the overall power without requiring special clock gating techniques.
(30) The larger transistors are implemented as copies of the identical small transistor to minimize diffusion layout-dependent effects (LOD) and to offer the same layout context for all transistors. This reduces the overall variability, as it now becomes a root mean square function of the variation of transistors that share the same layout. It is strongly recommended to use a FinFET process, which has a much better sub-threshold slope than older planar devices, to further reduce variation and improve accuracy. This is especially important since operation is in the sub-threshold regime of the transistor operation.
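The root-mean-square behavior mentioned above can be illustrated with a short calculation. The sketch assumes that each unit transistor contributes an independent, identically distributed relative current error; under that assumption, a device built from N parallel unit copies has a relative error that shrinks by a factor of the square root of N. The 5% unit mismatch is a hypothetical figure chosen for illustration, and `relative_sigma` is a hypothetical helper.

```python
import math


def relative_sigma(n_units, sigma_unit):
    """Relative current sigma of n_units identical, independent unit devices.

    The absolute sigma of the summed current grows as sqrt(n_units) while
    the mean grows as n_units, so the ratio falls as 1/sqrt(n_units).
    """
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return sigma_unit / math.sqrt(n_units)


SIGMA_UNIT = 0.05  # hypothetical 5% mismatch for a single unit transistor

for n_units in (1, 2, 4, 8):
    print(n_units, f"{relative_sigma(n_units, SIGMA_UNIT):.4f}")
```

Under these assumptions, the eight-times branch, which carries the most significant bit, also has the smallest relative error.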
(31) By way of another example,
(32) Now referring to some exemplary diagrams,
(33) By way of example,
(34) A visual comparison of the hand calculations to the simulation results is demonstrated in
(35) By way of example,
(36) By way of example,
(37) Now, turning to
(38) A key advantage of the proposed approach is that it requires only a single calibration point to remove variation across process, voltage, and temperature (PVT). Since the processing elements use the same reference current, and the top PE uses the bottom PE as its reference, a single calibration of the main reference current will compensate for the voltage and temperature variation and part of the process variation. This is demonstrated next, by reference to
(39) As noted, the process calibration occurs once. In particular, this single calibration occurs at the wafer probe stage and accounts for both the reference source trimming and the process-specific transistor beta ratio compensation. A single voltage measurement at the output, checked against a specific target output value, is used to calibrate the reference current through a comparator and a digital reference current search driven by a simple digital control loop.
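One possible form of such a comparator-driven digital search is a successive-approximation loop, sketched below. This is an assumption about how the loop could be organized, not a description of the disclosed control logic. It assumes the output rises monotonically with the trim code; the 6-bit trim width, the millivolt response model `toy_output_mv`, and the function name `calibrate_reference` are all hypothetical.

```python
def calibrate_reference(measure_output, target, n_bits=6):
    """Find the largest trim code whose measured output does not exceed target.

    measure_output(code) stands in for setting the reference-current trim
    code and reading the output; the comparison models the comparator.
    """
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        # Keep the trial bit only if the output is still at or below target.
        if measure_output(trial) <= target:
            code = trial
    return code


def toy_output_mv(code):
    """Hypothetical linear output response in millivolts."""
    return 10 * code + 50


# Target of 400 mV: 10 * 35 + 50 = 400, so the search settles on code 35.
print(calibrate_reference(toy_output_mv, target=400))
```

The loop needs one comparison per trim bit, so a 6-bit trim takes six measurements.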
(40) Furthermore, the beta ratio adjustment is performed based on the wafer probe basic process monitor data and is adjusted by controlling, in a binary digital fashion, the strength of the PMOS reference current, per row, by connecting or disconnecting parallel transistors appropriately to achieve the required strength and beta ratio through, again, a simple control loop.
(41) The proposed multiply-add implementation can be easily expanded by connecting several 4 bit × 4 bit PE cells together in a row covering sizes of 128 cells to 256 cells, as is typically used in NN applications. The limitation on how many cells to connect comes from the fact that each node needs to be able to drive the full capacitive load of all nodes connected to this summation node, and this affects speed. Buffer stages may be introduced to allow for larger and faster circuit implementations. Another limiting factor may be the maximum error allowed, as adding too many elements with a limited dynamic range will increase the quantization error of the analog-to-digital converter (ADC) needed to translate the results of the multiplication-addition to the digital domain for further calculations in that domain. An 8 bit accuracy can be achieved relatively easily and this is adequate for most applications. This is shown in detail in
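The dynamic-range trade-off described in this paragraph can be illustrated with a short calculation. The sketch assumes an ideal uniform ADC whose input range spans the worst-case row sum, with currents in units of the reference current; real designs may scale the range differently. The helper names `row_full_scale` and `adc_step` are hypothetical.

```python
def row_full_scale(n_cells, bits=4):
    """Worst-case summed output of n_cells PEs, in reference-current units.

    Each 4 bit x 4 bit cell contributes at most 15 * 15 = 225 units.
    """
    max_code = (1 << bits) - 1
    return n_cells * max_code * max_code


def adc_step(n_cells, adc_bits=8, bits=4):
    """Step size of an ideal ADC that spans the row's worst-case sum."""
    return row_full_scale(n_cells, bits) / ((1 << adc_bits) - 1)


for n_cells in (1, 128, 256):
    print(n_cells, row_full_scale(n_cells), f"{adc_step(n_cells):.2f}")
```

Under these assumptions, one 8-bit step for a 256-cell row is about as large as the full output of a single cell, which shows why the row length and the allowed error have to be chosen together.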
(42) Specifically,
(43) This modular approach can also be easily expanded to 16 bit × 16 bit multiply-add configurations, such as that shown in
(44) The scale of expansion possible is quite great. By way of example,
(45) Additionally, the hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications is adaptable for different designs. For instance, the hardware-implemented hybrid analog-digital mixed-mode matrix multiply-add PE for ML applications can be used to produce low power, area efficient GPUs, TPUs, edge computing applications, mobile device image processing, tensor unit accelerators, etc.
(46) The above-described embodiments of the invention are presented for purposes of illustration and not of limitation. While these embodiments of the invention have been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. Thus, one of ordinary skill in the art would understand that the invention is not to be limited by the foregoing illustrative details, but rather is to be defined by the appended claims.