RANK-BASED DOT PRODUCT CIRCUITRY
20200125329 ยท 2020-04-23
Assignee
Inventors
Cpc classification
G06F7/5336
PHYSICS
International classification
Abstract
Integrated circuits with dot product circuitry are provided. The dot product circuitry may be configured to generate partial products of different ranks based on the inputs. The partial products may be organized into corresponding groups based on their ranks. Each group of partial products having the same rank can then be compressed using a compressor/reduction tree. At least some of the compressed partial product values may be shifted between the different groups to maintain the proper offset. Each partial product may have an associated one's to two's complement conversion bit. The conversion bits of the various partial product groups can be separately aggregated and then injected into the compressor tree at one or more locations.
Claims
1. An integrated circuit, comprising: a partial product generation circuit configured to receive input operands and to generate corresponding partial products; a first compressor circuit configured to receive a first group of the partial products all having a first rank and configured to output first vectors; and a second compressor circuit configured to receive a second group of the partial products all having a second rank that is different than the first rank and configured to output second vectors.
2. The integrated circuit of claim 1, wherein the first group of the partial products are not shifted relative to each other.
3. The integrated circuit of claim 2, wherein the second group of the partial products are not shifted relative to each other.
4. The integrated circuit of claim 1, further comprising: first shifting circuits configured to shift the second vectors relative to the first vectors.
5. The integrated circuit of claim 4, further comprising: a third compressor circuit configured to receive a third group of the partial products all having a third rank that is different than the first and second ranks and configured to output third vectors; and a fourth compressor circuit configured to receive a fourth group of the partial products all having a fourth rank that is different than the first, second, and third ranks and configured to output fourth vectors.
6. The integrated circuit of claim 5, further comprising: second shifting circuits configured to shift the fourth vectors relative to the third vectors.
7. The integrated circuit of claim 6, further comprising: a fifth compressor configured to compress the first vectors and the second shifted vectors and configured to output fifth vectors; and a sixth compressor configured to compress the third vectors and the fourth shifted vectors and configured to output sixth vectors.
8. The integrated circuit of claim 7, further comprising: third shifting circuits configured to shift the sixth vectors relative to the fifth vectors.
9. The integrated circuit of claim 8, wherein the third shifting circuits are selectively bypassable to support a plurality of input precisions.
10. The integrated circuit of claim 9, further comprising: a seventh compressor configured to compress the fifth vectors and the sixth vectors to output corresponding seventh vectors; and a carry-propagate adder configured to receive the seventh vectors and to output a corresponding dot product value.
11. The integrated circuit of claim 1, further comprising: an aggregation circuit configured to aggregate one's to two's complement conversion bits associated with the partial products.
12. The integrated circuit of claim 11, wherein the aggregation circuit is further configured to aggregate the conversion bits into a single vector.
13. The integrated circuit of claim 11, wherein the one's to two's complement conversion bit aggregation circuit is further configured to aggregate the conversion bits into at least two different vectors.
14. An integrated circuit, comprising: partial product generation circuitry configured to receive input signals and to generate a plurality of partial products; and a compressor tree divided into a plurality of compressor groups organized based on the rank of the partial products received at each of the plurality of compressor groups, and wherein the partial products in each of the plurality of compressor groups have identical ranks.
15. The integrated circuit of claim 14, further comprising: a one's to two's complement conversion bit aggregation circuit configured to generate at least one vector that is injected at a single point in the compressor tree.
16. The integrated circuit of claim 14, further comprising: a one's to two's complement conversion bit aggregation circuit configured to generate at least two vectors that are injected at two different points in the compressor tree.
17. The integrated circuit of claim 14, wherein the compressor tree comprises a set of shifting circuits that is switched into use when operating the compressor tree to support a first precision mode and that is switched out of use when operating the compressor tree to support a second precision mode different than the first precision mode.
18. The integrated circuit of claim 17, further comprising: a first one's to two's complement conversion bit aggregation circuit; a second one's to two's complement conversion bit aggregation circuit; and a multiplexer configured to select only the first one's to two's complement conversion bit aggregation circuit during the first precision mode and to select only the second one's to two's complement conversion bit aggregation circuit during the second precision mode.
19. An integrated circuit, comprising: dot product circuitry that is decomposed into a first dot group and a second dot group to reduce compressor word growth in the dot product circuitry, wherein the first dot group has a first number of multiplies, and wherein the second dot group has a second number of multiplies that is different than the first number of multiplies.
20. The integrated circuit of claim 19, further comprising: a first aggregation circuit configured to aggregate conversion bits associated with the first dot group; a second aggregation circuit configured to aggregate conversion bits associated with the second dot group; and a compressor configured to compress values received from the first and second aggregation circuit.
21. The integrated circuit of claim 19, wherein the dot product circuitry is further decomposed into a third dot group having a third number of multiplies that is different than the first and second numbers of multiplies.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0004]
[0005]
[0006]
[0007]
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
DETAILED DESCRIPTION
[0019] The present embodiments relate to dot product circuitry that group multiplier partial products according to their rank instead of the multiplier with which they are associated. Moreover, the ones and twos complement bits for sign extension may also be grouped by rank and processed on a rank basis prior to combining with the sum of the partial products. Dot product circuitry configured and operated in this way are at least 20% smaller in size at any given speed while consuming less power.
[0020] It will be recognized by one skilled in the art, that the present exemplary embodiments may be practiced without some or all of these specific details. In other instances, well-known operations have not been described in detail in order not to unnecessarily obscure the present embodiments.
[0021]
[0022] As shown in
[0023]
[0024] In practice, most ASIC multipliers are implemented using radix-4 Booth coding. In contrast to typical grade-school binary (i.e., radix-2) multiplication where the multiplicand is multiplied with each bit in the multiplier to generate a corresponding number of partial products that is equal to the total number of bits in the multiplier, radix-4 Booth boding generates one partial product for every two multiplier bits.
[0025]
[0026]
[0027]
[0028] Each set of four partial products associated with a multiplier may be combined or compressed using a respective compressor circuit 402. Since there are four partial products in each multiplier group, each compressor circuit 402 has four inputs each of which is configured to receive one of the four 9-bit partial products in each group and to generate a first sum/carry vector at a first output and a second sum/carry vector at a second output. In the example of
[0029] In this example, note that the word width of the output vectors is much higher than the width of each input vector. For instance, each output of 4-2 compressor 402 may be 16 bits, which is much wider than the 9-bit partial products. This increase in bit width from input to output (sometimes referred to as compressor word growth) is due to the fact that the various partial products are offset/shifted with respect to one another. As shown in
[0030] This word growth in multiplier redundant outputs of the compressors has a significant impact on the size of the overall dot product circuit 400. As the precision of inputs increases beyond 8 bits, which could increase the total number of partial products and would potentially exacerbate the word growth, and as number of multiplies increases beyond three, which would directly increase the total number of compressors if using the dot product architecture of
[0031] There is another consideration when building dot product circuits in this way. Since the inputs are signed numbers, the corresponding partial products may also need to be sign extended. In accordance with an embodiment, a negative number may be represented using two's complement, which requires first inverting all the bits to generate the one's complement and then converting to the two's complement by adding a 1 at the least significant bit (LSB) position. Thus, there needs to be a way to add a 1 to the LSB of each negative partial product after the multiplicand has first been inverted.
[0032] This is illustrated in
[0033] The 1s need to be added in case a partial product is negative. Most of these ones are relatively inexpensive but may still require decoding. For instance, the 1 associated with a negative first partial product PP1 can be hidden with the second partial product PP2. Similarly, the 1 associated with a negative PP2 can be hidden with the third partial product PP3. Moreover, the 1 associated with a negative PP3 can also be hidden with the fourth partial product PP4. However, the 1 associated with a negative PP4 is problematic and cannot be hidden like the others, so a separate compression function is required to account for this extra bit.
[0034] In accordance with an embodiment,
[0035] As shown in
[0036] Instead of shifting the partial products before the compression, dot product circuitry 600 may be configured to shift the compressed vectors of each rank. In
[0037] The compressed vectors output from compressor 608-2 may be further shifted by four bits to the left using (<<4) shifting circuits 610 relative to the compressed vectors output from compressor 608-1 since the left two groups are offset by four rank positions relative to the right two groups. The two output vectors from compressor 608-1 and the two (<<4) shifted output vectors from compressor 608-2 may then be compressed using a 4-2 compressor 612 until only two sum/carry vectors remain. This point, the two resulting sum/carry vectors may be summed together using adder circuit 614 to generate the final dot product value. For example, adder circuit 614 may be implemented as a carry propagate adder (CPA) or other suitable adder circuit.
[0038] Although only 3-2 and 4-2 compressors are shown in
[0039] Thus, in the example of
[0040] Another improvement that can be made is for the addition of the one's to two's complement LSBs. Rather than distributing the LSB 1s on a partial product by partial product basis as shown in the example of
[0041]
[0042] Depending on the structure and balance of the compressor tree (e.g., a compressor tree of the type shown in
[0043] The example of
[0044]
[0045] In this example, the first column 800-0 of LSBs represents the possible one's to two's complement conversion bits associated with the rank 0 partial products. The second column 800-2 of LSBs represents the possible one's to two's complement conversion bits associated with the rank 2 partial products. The third column 800-4 of LSBs represents the possible one's to two's complement conversion bits associated with the rank 4 partial products. The fourth column 800-6 of LSBs represents the possible one's to two's complement conversion bits associated with the rank 6 partial products. The various columns are still shifted/offset by 2-bit steps based on their respective ranks.
[0046] Since there are now six total partial products, the unary to binary coding of the 1s can now result in a value of up to 6, which now requires 3 bits. As a result, alternate column values will now potentially overlap (see, e.g., 1-bit overlapping portion 802 between the aggregated values of columns 800-0 and 800-2, 1-bit overlapping portion 804 between the aggregated values of columns 800-2 and 800-4, and 1-bit overlapping portion 806 between the aggregated values of columns 800-4 and 800-6. Due to this overlap, the sums of the four different rank groups cannot be simply appended together like shown in
[0047] One way of dealing with this overlap is to inject the two row values at two different points in the compressor tree (see, e.g.,
[0048] Another way of handling this overlap is to combine the two rows together and then add the combined value at a single point in the compressor tree (see, e.g.,
[0049]
[0050] Since there are now 12 total partial products in each group, the unary to binary coding of the 1s can now result in a value of up to 12, which now requires 4 bits. Encoding all the ones using only two rows of 4-bit chunks (as shown by arrows 930) may be very expensive. To save on cost, it is much more efficient to divide the 12-high columns into two 6-bit half columns and then compress the binary halves together or using some combination of addition and/or compression (e.g., carry-propagate addition, carry-save addition, or other suitable addition operation). As shown in portion 940, the four 12-high columns can be divided into eight 6-high half columns (each having a max value of up to 6), resulting in eight 3-bit values. To combine these eight values, the values of the same rank may first be added together. Thereafter, the conversion described above in connection with
[0051] Referring back to the example of
[0052]
[0053] Consider another example where a dot product circuit of the type shown in
[0054] The one's to two's conversion bit columns are handled slightly differently. The unary to binary combinatorial conversions will be bounded at the larger multipliers with shallower columns, so when the bits are aligned for the smaller multipliers with deeper columns, the binary values just stack on top of each other. There may be two different compression/addition circuits for the two multiplier precisions, with a multiplexer that can select between the two for the actual conversion vector used.
[0055]
[0056] In the INT4 case, the 20-high column can still be divided into 5-bit chunks with binary counts having a max value of 5 (e.g., a 3-bit value), which are still offset by 2 bits with respect to each other. The binary counts of the two upper columns will now be aligned directly underneath the two lower columns. Similarly, these values may be summed together using a variety of adder architectures as represented by summation circuitry 1104 to aggregate the total contribution of the conversion bits. As described above, a multiplexing circuit such as multiplexer 1106 may be used to select between the two aggregate values depending on the current precision (e.g., depending on whether the current mode is supporting INT8 or INT4). If desired, the dot product circuitry may also be dynamically configured to support INT2 operation, INT16 operation, INT32 operation, INT64 operation, etc. by optionally bypassing one or more shifting circuits in the compressor tree and/or by stacking partial products or conversion LSBs among two or more different groups.
[0057] In accordance with another suitable arrangement not mutually exclusive with any of the embodiments described in connection with
[0058] As discussed in connection with at least
[0059] Meanwhile, the conversion bits 1200-2 associated with the 8-element dot group may be aggregated by first dividing the 8-high columns of 1s into two 4-bit half columns and then combining the binary values together using some combination of addition and/or compression (e.g., carry-propagate addition, carry-save addition, or other suitable addition operation) as represented by adder 1202. The maximum aggregate LSB value for the group of 8 columns is 680 (i.e., 8+8<<2+8<<4+8<<6=680) A final 3-2 compressor 1204 may then compress the aggregate values from the two groups, and the resulting two vectors summed together using a final CPA 1206.
[0060] This example of splitting up a 20-element multiply into two groups of different sizes is merely illustrative. As another example, a 10 multiplier vector may be decomposed into 6 and 4 multiplier vectors. In general, a larger multiplier can be decomposed into two or more multiplier vectors of the same or different sizes (e.g., decomposed into three dot groups having multipliers of different sizes, into four dot groups with multipliers of different sizes, into more than four dot groups having multipliers of different sizes, etc.), which will allow the compressor structures to be tuned separately for optimal performance. In the 10-element multiplier mentioned above, the 6 multiplier vector reduction scheme can have two 3-2 compressors in the first level followed by a 4-2 compressor. The 4 multiplier vector reduction scheme can be handled using a 4-2 compressor.
[0061] The embodiments described here relating to radix-4 (R4) Booth coding is merely illustrative and is not intended to limit the scope of the present embodiments. If desired, these techniques for improving the partial product reduction/compression and the conversion bit aggregation may be extended to multiplier operations implemented using simple radix-2 multiplies (e.g., by multiplying the multiplicand with one bit of the multiplier at a time, which would double the number of partial products relative to R4), radix-8 Booth coding, radix-16 Booth coding, just to name a few. For radix-8 (R8) Booth coding, the offsets/shifting between the different partial product groups will be three bits instead of two bits. Thus, in the dot product architecture of
Examples
[0062] The following examples pertain to further embodiments.
[0063] Example 1 is an integrated circuit, comprising: a partial product generation circuit configured to receive input operands and to generate corresponding partial products; a first compressor circuit configured to receive a first group of the partial products all having a first rank and configured to output first vectors; and a second compressor circuit configured to receive a second group of the partial products all having a second rank that is different than the first rank and configured to output second vectors.
[0064] Example 2 is the integrated circuit of example 1, wherein the first group of the partial products are optionally not shifted relative to each other.
[0065] Example 3 is the integrated circuit of example 2, wherein the second group of the partial products are optionally not shifted relative to each other.
[0066] Example 4 is the integrated circuit of any one of examples 1-3, optionally further comprising: first shifting circuits configured to shift the second vectors relative to the first vectors.
[0067] Example 5 is the integrated circuit of example 4, optionally further comprising: a third compressor circuit configured to receive a third group of the partial products all having a third rank that is different than the first and second ranks and configured to output third vectors; and a fourth compressor circuit configured to receive a fourth group of the partial products all having a fourth rank that is different than the first, second, and third ranks and configured to output fourth vectors.
[0068] Example 6 is the integrated circuit of example 5, optionally further comprising: second shifting circuits configured to shift the fourth vectors relative to the third vectors.
[0069] Example 7 is the integrated circuit of example 6, optionally further comprising: a fifth compressor configured to compress the first vectors and the second shifted vectors and configured to output fifth vectors; and a sixth compressor configured to compress the third vectors and the fourth shifted vectors and configured to output sixth vectors.
[0070] Example 8 is the integrated circuit of example 7, optionally further comprising: third shifting circuits configured to shift the sixth vectors relative to the fifth vectors.
[0071] Example 9 is the integrated circuit of example 8, wherein the third shifting circuits are optionally selectively bypassable to support a plurality of input precisions.
[0072] Example 10 is the integrated circuit of example 9, optionally further comprising: a seventh compressor configured to compress the fifth vectors and the sixth vectors to output corresponding seventh vectors; and a carry-propagate adder configured to receive the seventh vectors and to output a corresponding dot product value.
[0073] Example 11 is the integrated circuit of any one of examples 1-10, optionally further comprising: an aggregation circuit configured to aggregate one's to two's complement conversion bits associated with the partial products.
[0074] Example 12 is the integrated circuit of example 11, wherein the aggregation circuit is optionally further configured to aggregate the conversion bits into a single vector.
[0075] Example 13 is the integrated circuit of example 11, wherein the one's to two's complement conversion bit aggregation circuit is optionally further configured to aggregate the conversion bits into at least two different vectors.
[0076] Example 14 is an integrated circuit, comprising: partial product generation circuitry configured to receive input signals and to generate a plurality of partial products; and a compressor tree divided into a plurality of compressor groups organized based on the rank of the partial products received at each of the plurality of compressor groups, and wherein the partial products in each of the plurality of compressor groups have identical ranks.
[0077] Example 15 is the integrated circuit of example 14, optionally further comprising: a one's to two's complement conversion bit aggregation circuit configured to generate at least one vector that is injected at a single point in the compressor tree.
[0078] Example 16 is the integrated circuit of example 14, optionally further comprising: a one's to two's complement conversion bit aggregation circuit configured to generate at least two vectors that are injected at two different points in the compressor tree.
[0079] Example 17 is the integrated circuit of example 14, wherein the compressor tree optionally comprises a set of shifting circuits that is switched into use when operating the compressor tree to support a first precision mode and that is switched out of use when operating the compressor tree to support a second precision mode different than the first precision mode.
[0080] Example 18 is the integrated circuit of example 17, optionally further comprising: a first one's to two's complement conversion bit aggregation circuit; a second one's to two's complement conversion bit aggregation circuit; and a multiplexer configured to select only the first one's to two's complement conversion bit aggregation circuit during the first precision mode and to select only the second one's to two's complement conversion bit aggregation circuit during the second precision mode.
[0081] Example 19 is an integrated circuit, comprising: dot product circuitry that is decomposed into a first dot group and a second dot group to reduce compressor word growth in the dot product circuitry, wherein the first dot group has a first number of multiplies, and wherein the second dot group has a second number of multiplies that is different than the first number of multiplies.
[0082] Example 20 is the integrated circuit of example 19, optionally further comprising: a first aggregation circuit configured to aggregate conversion bits associated with the first dot group; a second aggregation circuit configured to aggregate conversion bits associated with the second dot group; and a compressor configured to compress values received from the first and second aggregation circuit.
[0083] Example 21 is the integrated circuit of any one of examples 19-20, wherein the dot product circuitry is optionally further decomposed into a third dot group having a third number of multiplies that is different than the first and second numbers of multiplies.
[0084] For instance, all optional features of the apparatus described above may also be implemented with respect to the method or process described herein. The foregoing is merely illustrative of the principles of this disclosure and various modifications can be made by those skilled in the art. The foregoing embodiments may be implemented individually or in any combination.