N-POINT COMPLEX FOURIER TRANSFORM STRUCTURE HAVING ONLY 2N REAL MULTIPLIES, AND OTHER MATRIX MULTIPLY OPERATIONS
20220398295 · 2022-12-15
Inventors
Cpc classification
G01S7/2813
PHYSICS
H04B1/0057
ELECTRICITY
G06F17/16
PHYSICS
G01S13/87
PHYSICS
G01S7/026
PHYSICS
G01S7/023
PHYSICS
International classification
G06F17/16
PHYSICS
G06F17/14
PHYSICS
Abstract
An integrated circuit chip implementing multiplication of an M×N element matrix with an N-element vector to obtain an M-element product by combining the vector with rows of bits of the same significance selected from the matrix one bit-row at a time to form partial products, exploiting the fact that the same potential combinations are needed for all bit-rows and all matrix rows to precompute all of the combinations once and for all, and combining selected partial products for different bit place-significance with a shift-and-add operation only once for each of the M product elements, thereby effectively using only M multiply-equivalent structures. An N-point Complex Fourier Transform can therefore be claimed which only needs 2N real multiplies and the product of an N×N matrix with another N×N matrix requires only N.sup.2 multiplies.
Claims
1. An integrated circuit comprising a digital logic structure configured for performing an operation of multiplication of an M×N matrix of multi-digit coefficients with an N-element vector of multi-digit values to produce M output values, comprising: adder trees configured to add L groups of the N vector elements with all possible multiplicative weights combinations, wherein each weight takes on all possible values of a digit in the number base of said multi-digit matrix values to produce all possible weighted combinations of the L vector values in a group, and wherein a sum of the L groups is equal to N, wherein the combinations representing all possible partial products of the group of L vector values with digits of equal place significance from L corresponding matrix values; a criss-cross structure of conductors comprising a plurality of parallel conductors in one dimension corresponding to the number of combinations computed by the adder trees for all of the groups of vector values and a plurality of cross conductors in the other dimension, each of the latter joining a set of adder cells in a string to the binary tree, wherein the number of adder strings or trees are equal to the number of real output values to be computed multiplied by the word length in bits of the multi-digit values of the matrix, wherein the adder cells are placed at the crossings of the conductors to combine partial products for all groups of L vector values, wherein the placement of each adder selects the correct partial product for the actual set of L digits of the L matrix values in a group, wherein the output of an adder feeding down the crossing conductors to the input of the next adder in sequence in the same string or binary tree to obtain a final sum of partial products from the final adder in the string or tree; and a set of delay-and-add or shift-and-add circuits for each of the M output values for combining the outputs from the final adders of the adder strings or trees taking into account place significance of the matrix digits used to compute the selected partial products to produce the desired output value as the product of a matrix row with the N element vector.
2. The integrated circuit of claim 1, wherein the multi-digit values are binary values, and the number base is 2.
3. The integrated circuit of claim 1, wherein the multi-digit matrix values are binary and are positive or negative, and wherein the digital logic structure is configured to preconvert the multi-digit matrix values to be all positive by adding the largest value to all.
4. The integrated circuit of claim 1, wherein each of the multi-digit matrix values are binary and are positive or negative but with magnitudes less or equal to 1, and wherein the digital logic structure is configured to preconvert the multi-digit matrix values to be in the range 0 to +1 by adding 1 to all and dividing by 2.
5. The integrated circuit of claim 1, wherein the digital logic structure is configured to multiply a complex M×N matrix with a complex N-element vector to form M complex results, wherein the digital logic structure is configured to form precombinations of the real vector values and separately precombinations of the imaginary vector values; wherein the digital logic structure further comprises strings or binary trees of adders configured to add partial products of real matrix value digits multiplied by real vector parts and to subtract partial products of imaginary matrix value digits multiplied by imaginary vector parts to form a partial product of the desired real result value, and second strings or trees configured to add partial products of real matrix value digits multiplied by imaginary vector parts and to add partial products of imaginary matrix value digits multiplied by real vector parts to form a partial product of the desired imaginary result value; and wherein the digital logic structure is further configured to further combine the partial products for matrix value digits of different place significance by delay-and-add or shift-and-add operations to account for place significance.
6. The integrated circuit of claim 1, wherein the digital logic structure is configured to perform a fully parallel, N-point complex Fourier Transform using only 2N real-multiplier-equivalent structures.
7. The integrated circuit of claim 1, wherein an adder cell of the set of adder cells comprises a feedback carry delay.
8. The integrated circuit of claim 1, wherein the set of delay-and-add circuits comprise serial multipliers.
9. The integrated circuit of claim 8, wherein the set of shift-and-add circuits comprise registers and are configured to clock the partial products into the registers and add them with a relative shift.
10. The integrated circuit of claim 1, wherein the adder trees comprise adders configured as serial adders.
11. The integrated circuit of claim 10, wherein each of the serial adders are configured to stream in values LSB first on single lines, and wherein each adder is configured to add two bits plus a carry from its previous addition and to output one bit plus a new carry which is fed back through a delay element to the input of the same adder.
12. The integrated circuit of claim 11, wherein the delay element is a flip flop or an arrangement of switched capacitors.
13. A method of multiplying with a digital logic structure, an M×N matrix with a N-element vector to obtain an M-element result, comprising the steps of: expressing said matrix values as a set of place-significance-ordered values in a number base; grouping digits of like significance of the values in the same row of matrix coefficients to form groups of L digits; forming, with strings or binary trees of adders of the digital logic structure, precombinations of the L vector values to be multiplied by the corresponding L matrix values, by multiplicatively weighting and adding the L vector values using the values of the digits in the number base as weights, wherein the weights each take on all possible values of a digit in the number base to form partial products of L vector values with a digit of one significance of the corresponding L matrix coefficients; further combining, with strings or binary trees of adders of the digital logic structure, the partial products from different groups of L matrix values and corresponding vector values based on selecting digits of the same significance to obtain complete partial products of a row of N like-significant digits of said matrix values with said N vector values; further combining the complete partial products computed from digits of different significance with a delay-and-add or shift-and-add operation to take account of the different place significance to thereby obtain the product of a matrix row with the N-element vector; and repeating the above steps for each matrix row to obtain the product of the M×N matrix with the N-element vector.
14. The method of claim 13, wherein the matrix and vector values are complex values having a real and an imaginary part.
15. The method of claim 13, wherein the delay-and-add operation is performed using a set of delay-and-add circuits of the digital logic structure, and wherein the delay-and-add circuits comprise serial multipliers.
16. The method of claim 13, wherein the matrix and vector values are used in at least one of Fourier transforms and transmit/receive beamforming calculations.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0027] The present invention will now be described with reference to the accompanying figures, wherein numbered elements in the following written description correspond to like-numbered elements in the figures. Methods and systems of the present invention may include a logic structure suitable for chip integration that performs multiplication of an M×N matrix of multi-bit values to a vector of N multi-bit values in parallel, yielding all outputs at the same time. The exemplary structure treats a single bit of each element of one row of the matrix at a time, bits of like significance forming a row of single digits that in the binary case may be regarded as having values of (1 or 0), (1 or −1) or (1, 0, or −1). The latter ternary states arise if the matrix row values are in sign-magnitude form.
[0028] U.S. Pat. No. 6,219,365 to current inventor Paul W. Dent, filed 19 Jan. 1999 and entitled “Apparatus for Performing Multiplication of a Vector of Multi-Bit Values by a Matrix of Multi-Bit Coefficients” describes a “fast” matrix times vector method and apparatus when the matrix is fixed and the vector is variable by multiplying the matrix to a column of single bits of the vector elements at a time, the bits being of like place-significance. Since the bits take on only one of two values, for example (1 or 0), or (1 or −1), this multiplication generates only one of a limited number of all possible sums and differences of the matrix coefficients, which combinations can be precomputed and stored in look-up tables. In case the look-up tables become too large, the bit vector can be divided into smaller bit vectors that are multiplied by a correspondingly smaller number of matrix coefficients leading to smaller look-up tables. In the transmit beamforming case, it was disclosed that modulation of digital data on to a radio frequency carrier using linear modulation can be exchanged in order with the linear operation of transmit beamforming, such that the transmit beamforming only need operate on a single column of data bits at the data bit rate. The outputs of the beamformer were then subjected to the linear modulation operation, up-sampling and filtering after beamforming to produce spectrally-shaped I,Q samples at a sample rate of multiple samples per data bit. Thus, switching the order of modulation and beamforming resulted in the beamformer input vector having only a single column of binary values and the matrix multiplication with it takes place only at data bit rate instead of the higher I/O sample rate. Moreover, there are no multiplies to be performed. The U.S. Pat. No. 6,219,365 patent is hereby incorporated by reference herein in its entirety.
[0029] Multiplication is a more complex operation than addition, and thus, there is a strong motivation to reduce multiplication operations. Multiplier hardware structures take more chip area and power than adder structures so there is also strong motivation to reduce multiplier structures needed for a given speed of computation.
[0030] The above-incorporated '365 patent also discloses how to perform fully parallel multiplication of an N×N matrix of multi-bit values to a vector of N multi-bit values using only N multiplier structures compared to the N.sup.2 that would be needed with a conventional approach.
[0031] The matrix-times-multi-bit-vector method of the '365 patent comprises performing matrix multiplication with a single column of the vector's bits of like significance at a time using look-up tables precomputed as a function of the matrix coefficients, and then combining the results for bit columns of different place significance by shifting the results according to place significance and adding. The latter operation is analogous to a multiplier structure that adds partial products; however, there is only one such structure needed per output value computed.
[0032] In a current application, a matrix-times-vector operation is required in which the matrix is M×N and the vector is of length N, and the number of rows M of the matrix is much greater than the number of columns, N; for example, N=256 and M=8192. In that case, the method of the '365 patent results in an excessive number of look-up tables to be precomputed and stored in memory. Therefore, an alternative method is sought which is described herein.
[0033] Referring to
[0034] It may be deduced from
R1=1*S0+0.9*S1+0.9*S2−0.75*S3+0.8*S4+0.9*S5,
where * stands for fixed point multiplication.
[0035]
[0036] The selection of fewer than all N row values (N is only six in this case) results in a reduction of the number of combinations of the vector values that have to be formed. Using an exemplary L=3 and looking at the 3rd bit down, the first group 100 in
[0037] There can only be 3×3×3=27 possible combinations of three vector values needed whatever the matrix coefficients. The same 27 combinations suffice for all bit rows of all matrix rows, and thus need be formed once only.
[0038] For a larger N, it is desirable for L to be as large as possible, but if the single digits in the rows can take on ternary values, the number of combinations to be formed is 3.sup.L, which is 243 for L=5. For binary symbols, the value of L can go up to 8, so N can be divided into a smaller number of groups of 8. For example, if N=256, 32 groups of 8 can be used, and each group of 8 results in needing 256 precombinations of 8 vector values to be formed. There are 32 groups of 8 for N=256, so 32 times 256 combinations have to be formed. Alternatively, if L=4, 64 groups of 16 combinations would be needed. It is also possible to use different values of L for different groups if no one L divides into N. For example, if N=255, 31 groups of 8, and one group of 7, could be used, or 51 groups of 5. The greater the number of combinations that are precomputed at this stage, the fewer additions of partial products that have to be combined later, so there a tradeoff between the silicon area and power needed to form the precombinations and the later complexity. This tradeoff depends on how many output values are needed, and a larger number M of output values favors forming more precombinations early on.
[0039] Ternary values may be avoided by noting that in Fourier transform-like operations, such as DFTs or antenna beamforming, all the matrix values have real and imaginary parts that lie between −1 and +1. Consideration of the complex case occurs later herein but consider for now a real matrix comprising only cosines or sines with values between +1 and −1. These are all rendered positive by adding 1 to every matrix element, such that the values then lie between 0 and 2. The next step is dividing by 2 so that they all lie between 0 and 1. Adding 1 to all matrix values is equivalent to adding the sum of all the vector values to each result and is therefore compensated by subtracting the sum of all vector values from each of the final results. The division by 2 may be compensated if desired by first multiplying each result by 2 before subtracting the sum of the vector values; alternatively, half the sum of the vector values may be subtracted.
[0040]
[0041]
[0042]
[0043]
[0044] In ) 600 represents a 1-bit delay element, such as a flip flop (there are array of enclosed D's (
) 600 along the bottom of
[0045] “B7” represents the row bits just to the right of the binary point, that is, 011 011. The first 011 group signifies the addition of S1 and S2, therefore, a dot (.circle-solid.) (serial adder) 602 is place on the crossing of the “B7” vertical line with the horizontal line carrying the S1+S2 combination. The second group 011 corresponds to the addition of S4 and S5. Therefore, the “B7” vertical line also has a serial adder (.circle-solid.) 602 on the horizontal line corresponding to the combination S4+S5. The vertical lines thus join the output of one adder to the input of the next to form an adders string. Thus, having passed through all adders in the string, the result at the end of the string is the product of the vector with one digit-row of one matrix row, the digits in the row being of the same place significance. Vertical line “B8” carries the serial product of greatest place significance, “B7” is a factor 2 less significant, and so on, down to the least significant partial product on line Bo. These shall all now be added with shifts corresponding to their place significance, which is achieved by delaying bits of high significance in delay elements D () 600 so they match up with bits of equal significance in the next least significant partial product. The bit streams are LSB first, so later bits of higher significance. After adding the partial products with place-significant shifts, the final output Ro is the dot product of the first matrix row with the input vector.
[0046] The structure of
[0047] The physical size of the chip structure can be estimated. For example, each group of L bits creates 2.sup.Lcombinations of the input vector values. There are N/L such groups, therefore, the number of horizontal lines is N.2.sup.L/L−1.
[0048] The number of vertical lines is equal to the product of the number of matrix rows with the number of bits precision of each matrix coefficient. For example, if N=256 and L=8, there are 8,191 horizontal lines, and if the matrix coefficient precision is 9 bits and there are 8,192 matrix rows, there will be 73,728 vertical lines.
[0049] Modern semiconductor chips allow 50 nm line-spacing and have, for example, up to ten metal layers. Using only one layer of metallization for the horizontal and vertical lines, 73,728×50 nm=3.7 mm, and the 8,192 horizontal lines occupy 0.4 mm. Thus, the main part of the structure fabricated as
[0050] In an exemplary 5 nm silicon process, it is conceivable that a feasible serial bit rate through the serial adders is 16 GB/s. The benefit of serial adders is that there is no carry propagation to wait for—that being explicitly built into the carry feedback. Assuming a final word length growth to 32 bits, the circuit can perform one such matrix x vector operation every 2 ns. This is equivalent to over 10.sup.15 fixed-point multiply-accumulates per second.
[0051] A structure for the case where all values are complex will now be developed, using
[0052] As before, the N bit row corresponding to like-significant bits of the binary expanded real parts (101) is divided into groups of L bits, for example, where N=6 in
[0053] In ) 604 signifies a serial subtractor cell. The only difference between a serial subtractor and a serial adder is that the quantity to be subtracted is logically complemented on input and the carry-in is initialized to 1 rather than 0. Subtractors are necessary to form the real parts that comprise Rro, arising from the formula for the real part of the product of two complex numbers:
Real×Real−Imag×Imag,
[0054] but only adders are required to form the imaginary part Rio as the imaginary part of a complex product is:
Real×Imag+Imag×Real.
[0055] Also, to simplify ) 900. One string of delay-and-adds combines the real partial products to obtain Rro while a second string of delay-and-adds combines the imaginary partial products to obtain Rio. Subtraction of the sum of all real parts is not shown, but is performed to compensate for the original addition of 1 to all real parts as for the real case, using a vertical line having an adder to combine the precombinations SRo+SR1+SR2 and SR3+SR4+SR5.
[0056] Likewise, the final imaginary result is compensated by subtracting the sum of all imaginary vector values formed by a second vertical line having an adder to combine SIo+SI1+SI2 with SI3+SI4+SI5.
[0057] It may be mentioned that a “string” of adders in series may beneficially be replaced by a binary tree of adders, in which pairs of values at a time are added in a first rank of adders, then pairs of first rank adder outputs are added in second adders and so forth, the number of adders being the same, but leading to simpler carry-flushing in the serial adder case due to the tree depth being only Loge of the number of adders. Apart from the latter characteristic these two structures shall be regarded herein as functionally interchangeable.
[0058]
[0059] In
[0060] Exemplary embodiments can be used to efficiently implement common algorithms that can be expressed as Matrix×Vector. For example, the Discrete N-point Fourier Transform algorithm (also referred to as a complex Fourier Transform) can be expressed as the multiplication of an N×N complex matrix to an N-element complex vector. As the Fourier Transform Matrix is fixed but the vector to be transformed is variable, the inventive algorithm described herein is appropriate. The DFT would be computed with the equivalent of only 2N real multiply-equivalent operations instead of the 4N.sup.2 needed for a DFT or the 4N log.sub.2(N) real multiplies that are needed with the Fast Fourier Transform. For N=256, this is a factor of 512 times more efficient than the DFT and 16 times more efficient than an FFT. The efficiency gain may translate into lower power consumption when computing a large number of transforms continuously. The chip areas of 1.5 mm.sup.2 estimated previously for a 256-in, 8,192-out real matrix multiply becomes 6 mm.sup.2 for the complex case. A 256-point Fourier transform engine with 256 in and 256 out is 1/32nd of that size, which is about 0.2 mm.sup.2 and performs a transform perhaps every 2 ns.
[0061] Although the number base envisioned herein is principally binary, and in some cases ternary, the principle discussed herein is valid for any number base, such as decimal or hexadecimal, although not obviously as efficient for full custom chip implementation.
[0062] Accordingly, an exemplary logic structure suitable for chip integration performs multiplication of an M×N matrix of multi-bit values to a vector of N multi-bit values in parallel, yielding all outputs at the same time. The exemplary structure treats a single bit of each element of one row of the matrix at a time, with bits of like significance forming a row of single digits that in the binary case may be regarded as having values of (1 or 0), (1 or −1), or (1, 0, −1). The latter ternary states arise if the matrix row values are in sign-magnitude form. The row of single digits is then multiplied to the multi-bit vector. Since the digits are only +/−1 or 0, no multiplication is involved, and the result is simply sums and differences of the multi-bit vector. The exemplary structure forms all possible sums and differences of groups of the multibit vector elements where a group size L can be smaller than the vector length N to keep the number of sums and differences, which is either 2.sup.L or 3.sup.L within a reasonable number.
[0063] Changes and modifications in the specifically-described embodiments may be carried out without departing from the principles of the present invention, which is intended to be limited only by the scope of the appended claims as interpreted according to the principles of patent law including the doctrine of equivalents.