MULTIPLY-ACCUMULATE CIRCUIT AND METHOD FOR PERFORMING MULTIPLY-ACCUMULATE OPERATIONS
20240296012 · 2024-09-05
Abstract
A multiply-accumulate circuit for processing numerical values that are present as input words, each of which is formed from at least two partial words. The circuit is configured, corresponding to a permutation selected from a plurality of permutation possibilities implemented by the multiply-accumulate circuit, to form product partial words as products of in each case one partial word of the first input word with one partial word of the second input word, wherein in the products, the partial words of the first input word are permutated relative to their original order corresponding to the selected permutation; and to add the product partial words with an accumulation word, which is formed from one or more partial words, to determine an updated accumulation word in which product partial words are in each case added to one of the one or more partial words of the accumulation word.
Claims
1. A multiply-accumulate circuit for processing numerical values that are present as input words, each of which is formed from at least two partial words, the circuit being configured to: form product partial words as products of each partial word of the first input word of the input words with a different partial word of the second input word of the input words, wherein the partial words of the first input word are permutated relative to their original order corresponding to a permutation selected from a plurality of permutation possibilities implemented by the multiply-accumulate circuit; and add the product partial words with an accumulation word, which is formed from one or more partial words, in order to determine an updated accumulation word in which product partial words are in each case added to one of the one or more partial words of the accumulation word.
2. The multiply-accumulate circuit according to claim 1, wherein the permutation selection is given by a permutation selection signal.
3. The multiply-accumulate circuit according to claim 1, wherein the multiply-accumulate circuit is configured to add each of the product partial words to a corresponding one of the partial words of the accumulation word, wherein a number of partial words of the accumulation word is equal to a number of product partial words.
4. The multiply-accumulate circuit according to claim 1, wherein: the multiply-accumulate circuit is configured to add each of the product partial words to a corresponding one of the partial words of the accumulation word, wherein a number of partial words of the accumulation word is equal to a number of product partial words, or to add a plurality of the product partial words to the same partial word of the accumulation word, wherein the accumulation word is formed from only one partial word.
5. The multiply-accumulate circuit according to claim 4, wherein the multiply-accumulate circuit is configured to add the product partial words with at least partially different weight factors to the corresponding partial word of the accumulation word.
6. The multiply-accumulate circuit according to claim 4, wherein the multiply-accumulate circuit is configured to add a plurality of the product partial words to the same partial word of the accumulation word, wherein the product partial words which are added to the same partial word of the accumulation word are added with at least partially different weight factors, wherein the accumulation word is formed from only one partial word.
7. The multiply-accumulate circuit according to claim 1, wherein the multiply-accumulate circuit is configured to form a summand word including the product partial words that are to be added with partial words of the accumulation word.
8. The multiply-accumulate circuit according to claim 2, wherein the multiply-accumulate circuit is configured to determine the product partial words in parallel for at least two of the plurality of implemented permutation possibilities, and, from the product partial words, to select in accordance with the permutation signal, the product partial words which correspond to one or more of the permutation possibilities and to use them in the addition with the accumulation word.
9. The multiply-accumulate circuit according to claim 8, comprising: a permutation unit that implements the at least two permutation possibilities and is configured to determine from a first input word a permutation word that is formed from the partial words of the first input word, wherein an order of the partial words of the first input word is optionally permutated according to the permutation selection; a multiplication unit which is configured to multiply partial words of the permutation word with partial words of the second input word in order to determine the product partial words, and to determine a summand word including the product partial words; and an accumulation unit configured to add each of the product partial words included in the summand word to one of the one or more partial words of the accumulation word to determine the updated accumulation word.
10. The multiply-accumulate circuit according to claim 9, wherein: the multiplication unit is configured to multiply the permutation word with the second input word as a whole in order to determine the summand word; and the accumulation unit is configured to add the summand word with an accumulation word as a whole in order to determine the updated accumulation word.
11. The multiply-accumulate circuit according to claim 9, wherein the accumulation unit includes an adder configured to add the summand word and the accumulation word to determine the updated accumulation word.
12. The multiply-accumulate circuit according to claim 9, further comprising a first input register configured to store the first input word, and/or a second input register configured to store the second input word, and/or an accumulation register configured to store the accumulation word.
13. The multiply-accumulate circuit according to claim 9, further comprising a bit shift unit configured to, corresponding to a shift signal, shift the accumulation word as a whole by a number of bits that can be predetermined by the shift signal in order to determine partial words of the accumulation word, and/or to shift each of the partial words of the accumulation word by a number of bits that can be predetermined by the shift signal.
14. The multiply-accumulate circuit according to claim 13, further comprising a result register configured to receive from the bit shift unit and store partial words of the accumulation word or parts of the partial words of the accumulation word.
15. A method for performing multiply-accumulate operations using a multiply-accumulate circuit configured to process numerical values that are present as input words, each of which is formed from at least two partial words, the circuit being configured to: form product partial words as products of each partial word of the first input word of the input words with a different partial word of the second input word of the input words, wherein the partial words of the first input word are permutated relative to their original order corresponding to a permutation selected from a plurality of permutation possibilities implemented by the multiply-accumulate circuit; and add the product partial words with an accumulation word, which is formed from one or more partial words, in order to determine an updated accumulation word in which product partial words are in each case added to one of the one or more partial words of the accumulation word; the method comprising the following steps: reading a first word to be processed and a second word to be processed from an external memory; controlling the multiply-accumulate circuit to perform a first multiply-accumulate operation, wherein the first word to be processed is used as the first input word, the second word to be processed is used as the second input word, and a first permutation selection is made from the plurality of permutation possibilities; reading a third word to be processed from the external memory; controlling the multiply-accumulate circuit to perform a second multiply-accumulate operation, wherein the first word to be processed is used as the first input word, the third word to be processed is used as the second input word, and a second permutation selection is made from the plurality of permutation possibilities that differs from the first permutation selection; and reading out the accumulation word or partial words of the accumulation word.
16. The method according to claim 15, further comprising resetting the accumulation word to a predetermined value which is equal to zero, before the first multiply-accumulate operation.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0016]
[0017]
[0018]
DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
[0019]
[0020] The multiply-accumulate circuit shown (also referred to in simplified form as a MAC circuit or MAC unit) comprises a first input register 10, a second input register 12, a permutation unit 14, a multiplication unit 16, an accumulation unit 18, a bit shift unit 20, and a result register 22.
[0021] The MAC circuit is used to process numerical values that are present as digital values (e.g., binary values) referred to as words or bit words. The words to be processed (referred to as input words) have a specific (first) bit number or length, which is, for example, equal to n. Each word includes two partial words (a.sub.1, a.sub.0 or b.sub.1, b.sub.0), i.e., is formed by two partial words. The partial words each have a specific bit number or length, which is equal to n/2, for example. The input words (if they are in each case regarded as a whole as numerical values) or their partial words (if these are in each case regarded as independent numerical values) can have any desired number format, for example binary values or integer numbers or floating-point numbers.
[0022] The input registers 10, 12 are configured to store or cache numerical values that are to be processed. Here, the first input register 10 stores a first input word (with partial words a.sub.1, a.sub.0) and the second input register 12 stores a second input word (with partial words b.sub.1, b.sub.0).
[0023] In the example shown, the input words are formed by two partial words with the same bit number. In general, the input words can be formed by more than two partial words and/or the partial words of an input word can have different bit numbers.
[0024] The permutation unit 14 is connected to the first input register 10 and is configured, optionally in accordance with a permutation selection 24 or a permutation signal (permutation configuration signal), to swap (or not to swap) the partial words a.sub.1, a.sub.0 of the first input word. That is to say, the permutation unit 14 determines a permutation word 26 which is formed from the partial words of the first input word, wherein the order of the partial words in the permutation word is optionally unchanged or changed compared with the order of the partial words in the first input word. In the example shown, with only two partial words, there are two selection possibilities or permutation possibilities: no swapping of the partial words or swapping of the partial words. In the case of more than two partial words, there may be more than two selection possibilities or permutation possibilities (for example, in the case of three partial words, up to six selection possibilities). The permutation signal or the permutation selection 24 thus results in a configuration of the permutation unit 14 corresponding to one of the respectively existing selection possibilities or permutation possibilities, i.e., corresponding to a permutation selection. The MAC circuit or its permutation unit 14 is configured such that there are at least two (mutually different) selection possibilities (permutation possibilities) or the permutation selection includes at least two selection possibilities (permutation possibilities).
[0025] The term permutation is used in the sense that the trivial case, i.e., no swapping takes place, can be included.
[0026] Accordingly, the case in which no swapping takes place can represent a permutation possibility. The permutation unit can be configured such that all mathematically possible permutations are realized or can be configured such that only a part of all mathematically possible permutations is realized (but, as mentioned, at least two permutation possibilities should be realized).
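The behavior of the permutation unit 14 for the case of two partial words can be sketched as follows. This is a minimal illustration and not the circuit itself: the function names `split_word`, `join_word` and `permute`, the representation of words as unsigned Python integers, and the Boolean `swap` standing in for the permutation signal are assumptions made for this example.

```python
def split_word(word, n):
    """Return the partial words (a1, a0) of an n-bit word, n/2 bits each."""
    half = n // 2
    mask = (1 << half) - 1
    return (word >> half) & mask, word & mask


def join_word(high, low, n):
    """Build an n-bit word from a high and a low partial word."""
    half = n // 2
    mask = (1 << half) - 1
    return ((high & mask) << half) | (low & mask)


def permute(word, n, swap):
    """Model of the permutation unit: swap=False is the trivial permutation."""
    a1, a0 = split_word(word, n)
    return join_word(a0, a1, n) if swap else join_word(a1, a0, n)
```

For n = 16, for example, `permute(0x1234, 16, True)` yields `0x3412`, whereas `permute(0x1234, 16, False)` leaves the word unchanged.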
[0027] The multiplication unit 16 is connected to the permutation unit 14 and the second input register 12 and is configured to multiply the permutation word 26 determined by the permutation unit 14 by the second input word stored in the second input register 12 in order to determine a summand word 28. In this case, optionally in accordance with a multiplication selection, for example a signal which is designated as multiplication signal 30, the permutation word 26 and the second input word are multiplied together as a whole or on a partial-word basis. Multiplication as a whole is intended to mean that the permutation word and the second input word are each regarded as a whole as numerical values that are multiplied by one another. In this case, the summand word 28 is accordingly the product of these two numerical values. Multiplication on a partial-word basis is intended to mean that mutually corresponding partial words of the permutation word and of the second input word are interpreted as (independent) numerical values and are multiplied together. In this case, the summand word 28 comprises a plurality of (in this case two) partial words (also referred to as product partial words) which correspond to the products of the corresponding partial words of the permutation word and of the second input word (or are these products). The summand word 28 can (for instance, if integer numbers are used), as shown, have a higher number of bits (here 2n) than the permutation word 26 or the second input word.
[0028] The case may also be provided in which the multiplication unit 16 always carries out a multiplication on a partial-word basis, i.e., is only configured to perform the multiplication of the permutation word by the second input word on a partial-word basis. The optional multiplication as a whole thus represents an optional embodiment. If this is not provided, the multiplication signal 30 or the multiplication selection can be omitted.
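The two multiplication modes can be illustrated by a short sketch. It assumes unsigned integer partial words of n/2 bits, a summand word of 2n bits, and a Boolean `partial_word_basis` standing in for the multiplication signal 30; the function name `multiply` is likewise chosen only for this example.

```python
def multiply(perm_word, b_word, n, partial_word_basis=True):
    """Model of the multiplication unit; returns a summand word of 2n bits."""
    if not partial_word_basis:
        # Multiplication as a whole: both words are single numerical values.
        return perm_word * b_word
    half = n // 2
    mask = (1 << half) - 1
    p1, p0 = (perm_word >> half) & mask, perm_word & mask
    b1, b0 = (b_word >> half) & mask, b_word & mask
    # Each product of two n/2-bit values fits into n bits, so the two
    # product partial words can be packed side by side into 2n bits.
    return ((p1 * b1) << n) | (p0 * b0)
```

With n = 16, the words `0x0203` and `0x0405` give the product partial words 2·4 = 8 and 3·5 = 15 on a partial-word basis, i.e., the summand word `0x0008000F`.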
[0029] The accumulation unit 18 illustrated comprises an adder 32 and an accumulation register 34. The adder 32 is configured to add the summand word 28 determined by the multiplication unit 16 and an accumulation word stored in the accumulation register 34, which accumulation word includes two or generally a plurality of partial words acc.sub.1, acc.sub.0, and to store the resulting sum as an updated accumulation word in the accumulation register 34. The summand word and the stored accumulation word can optionally be added, e.g., in accordance with an accumulation signal 31 or an accumulation selection, as a whole (the summand word and the accumulation word are interpreted as numerical values) or on a partial-word basis (product partial words or partial words of the summand word and partial words of the accumulation word are in each case interpreted as independent numerical values). In the latter case, mutually corresponding partial words of the summand word and of the stored accumulation word are added together (i.e., product partial words and partial words of the accumulation word which correspond to one another are added together), in order to determine the partial words acc.sub.1, acc.sub.0 of the updated accumulation word. As in the multiplication unit, the case may be provided in which the accumulation unit is configured such that an addition only on a partial-word basis takes place, i.e., that the addition as a whole and accordingly the configuration possibility with the accumulation signal is not provided. In addition, a reset signal can be provided with which the accumulation register 34, i.e., the accumulation word or its partial words, can be reset to a predetermined value, in particular zero.
[0030] The multiplication signal 30 and the accumulation signal 31 are regarded above as different control signals. However, it may also be the case that they coincide (e.g., as a processing signal) when there are clearly mutually corresponding selection possibilities in the multiplication unit and the accumulation unit.
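The difference between addition as a whole and addition on a partial-word basis can be sketched as follows. The sketch assumes an accumulation word of 2n bits with two partial words of n bits each and wrap-around on overflow; neither the overflow behavior nor the name `accumulate` is taken from the description.

```python
def accumulate(acc_word, summand_word, n, partial_word_basis=True):
    """Model of the adder 32; acc_word and summand_word are 2n bits wide."""
    if not partial_word_basis:
        # Addition as a whole: a carry may cross the partial-word boundary.
        return (acc_word + summand_word) & ((1 << (2 * n)) - 1)
    mask = (1 << n) - 1
    acc1, acc0 = (acc_word >> n) & mask, acc_word & mask
    s1, s0 = (summand_word >> n) & mask, summand_word & mask
    # Partial-word basis: no carry propagates from acc0 into acc1
    # (assumed wrap-around within each partial word).
    return (((acc1 + s1) & mask) << n) | ((acc0 + s0) & mask)
```

The two modes differ only when the lower partial word overflows: for n = 16, adding `0x0001FFFF` to `0xFFFF0001` gives `0x00010000` as a whole but `0x00000000` on a partial-word basis under the assumed wrap-around.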
[0031] Depending on whether or not the partial words of the first input word are swapped in the permutation unit 14, different operations are performed (for the same input words) or different accumulation values are determined.
[0032] If the partial words are swapped, the following results (wherein acc.sub.1, acc.sub.0 on the left in the equations are to be understood as the updated partial words and on the right as the partial words before the update):
[0033] If the partial words are not swapped, the following results:
[0034] It can be seen that the partial words a.sub.1, a.sub.0 in the expressions have been swapped. In order to obtain these expressions, it is assumed that the multiplication unit 16 and the accumulation unit 18 or its adder 32 perform a multiplication or addition on a partial-word basis.
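The two cases can be reproduced with a minimal sketch that follows the description of the units 14, 16 and 18 for operation on a partial-word basis. As read from that description, swapping leads to the updates acc.sub.1 + a.sub.0·b.sub.1 and acc.sub.0 + a.sub.1·b.sub.0, whereas no swapping leads to acc.sub.1 + a.sub.1·b.sub.1 and acc.sub.0 + a.sub.0·b.sub.0. The function name `mac_step` and the use of (high, low) tuples instead of packed bit words are assumptions for illustration.

```python
def mac_step(acc, a, b, swap):
    """One multiply-accumulate step on a partial-word basis.

    acc, a and b are (high, low) pairs: (acc1, acc0), (a1, a0), (b1, b0).
    """
    acc1, acc0 = acc
    a1, a0 = a
    b1, b0 = b
    if swap:
        a1, a0 = a0, a1  # the permutation unit swaps the partial words of a
    return acc1 + a1 * b1, acc0 + a0 * b0
```

For a = (2, 3) and b = (5, 7) and an accumulation word reset to zero, the call without swapping returns (10, 21) and the call with swapping returns (15, 14).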
[0035] The bit shift unit 20 (or shift unit) is connected to the accumulation unit 18 or to its accumulation register 34 and is configured to receive the accumulation word acc.sub.1, acc.sub.0, i.e., to read the accumulation register 34, to shift it optionally corresponding to a shift signal 36 or a shift selection, and to store a specific section (e.g., the n lowest-value or highest-value bits) of the optionally shifted accumulation word in the result register 22. By means of the shift signal 36, the mode in which the shift is to take place and/or a number of bits by which the shift is to be effected can be specified. The mode can, for example, be a shift of the entire accumulation word by a number of bits which, for example, is specified in the shift signal, or a (parallel) shift of each of the partial words by a number of bits which, for example, is specified in the shift signal (wherein partial words are not pushed into other partial words, i.e., for example, bits of a low-value partial word are not pushed into bits of a higher-value partial word). The number of bits by which the shift is effected can also be specified, for example, by the shift signal 36 via a selection from at least one predefined (i.e., implemented in hardware) number of bits. Depending on the design, embodiments are of course also conceivable in which only a single mode is implemented and/or the shift is always by the same number of bits; in this case or in these cases a shift signal can be dispensed with, or the shift signal can only trigger the shift.
[0036] In other words, partial words of the accumulation word can be extracted and stored in the result register 22 as the corresponding result word. The result word itself can in turn also be regarded as being formed from partial words c.sub.1, c.sub.0. For example, by sequentially controlling the bit shift unit 20 in accordance with the mode mentioned above as the first mode, with the accumulation word unchanged, to shift the accumulation word by a different number of bits in each case (in the MAC circuit illustrated, for example, once by 0 bits and once by n bits), the partial words of the accumulation word can be stored sequentially in the result register 22 and read out from there. In the mode mentioned above as the second mode, the partial words can be read out simultaneously or in parallel.
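The two shift modes can be sketched as follows. The sketch assumes a right shift, an accumulation word of 2n bits, a result register of n bits that keeps the lowest-value bits, and, in the second mode, a result word packed from the n/2 lowest-value bits of each shifted partial word. These choices and the names `shift_whole` and `shift_partial` are assumptions; the description leaves the shift direction and the selected section open.

```python
def shift_whole(acc_word, n, shift):
    """First mode: shift the 2n-bit accumulation word as a whole, keep n bits."""
    return (acc_word >> shift) & ((1 << n) - 1)


def shift_partial(acc_word, n, shift):
    """Second mode: shift acc1 and acc0 separately, keep n/2 bits of each."""
    half = n // 2
    full_mask = (1 << n) - 1
    half_mask = (1 << half) - 1
    acc1, acc0 = (acc_word >> n) & full_mask, acc_word & full_mask
    # Bits of one partial word are never pushed into the other partial word.
    c1 = (acc1 >> shift) & half_mask
    c0 = (acc0 >> shift) & half_mask
    return (c1 << half) | c0
```

With n = 16, calling `shift_whole` once with a shift of 0 bits and once with a shift of 16 bits extracts acc.sub.0 and acc.sub.1 one after the other, while a single call of `shift_partial` delivers a section of both partial words at once.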
[0037] The multiply-accumulate circuit can be controlled, for example, by means of a control circuit (not shown). The control circuit is configured in particular to write input words (which are read from a memory by the control circuit, for example) into the first and the second input register 10, 12; to generate control signals for the multiply-accumulate circuit and to thereby control them; and to read out results from the result register 22 (and store them, for example, in a memory). The control signals are, for example, the permutation signal (corresponding to the permutation selection 24) for the permutation unit 14, the processing signal (corresponding to the processing selection 30) for the multiplication unit 16 and the accumulation unit 18 or its adder 32, and the shift signal (corresponding to the shift selection 36) for the bit shift unit 20. The control circuit can in particular be configured to carry out a method according to the present invention, as shown, for example, in
[0038] In the MAC circuit of
[0039] In contrast, it is likewise possible for the functionalities or parts of the functionalities to be realized jointly or partially jointly in corresponding units, wherein even a different chronological sequence of functionalities or of parts of the functionalities can be implemented at least partially. For example, it could be provided that initially all possible products of partial words of the first input word with partial words of the second input word or at least all products of partial words of the first input word with partial words of the second input word that appear in the permutation possibilities implemented by the MAC circuit are determined and from these products the products that correspond to the permutation selection predetermined by the permutation signal are selected and added to the accumulation word. For this purpose, the MAC circuit could comprise corresponding execution units for the product formations. One or more intermediate registers could possibly be provided into which certain products are loaded in sequence in accordance with the permutation signal in order to be added to one partial word or in parallel to a plurality of partial words of the accumulation word. In this way, for example, different types of accumulation can be realized (wherein a selection of those to be used is made with the accumulation signal).
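The alternative in which all products are formed first and the permutation signal only selects among them can be sketched as follows. Lists are indexed by partial-word index (index 0 is the lowest-value partial word), and `perm[j]` names the partial word of the first input word that is paired with partial word j of the second input word; this encoding of the permutation signal and all function names are assumptions for illustration.

```python
def all_products(a, b):
    """products[i][j] = a[i] * b[j] for all pairs of partial words."""
    return [[ai * bj for bj in b] for ai in a]


def select_products(products, perm):
    """Pick, for position j, the product of a[perm[j]] with b[j]."""
    return [products[perm[j]][j] for j in range(len(perm))]


def mac_parallel(acc, a, b, perm):
    """Form all products, select by permutation, add on a partial-word basis."""
    selected = select_products(all_products(a, b), perm)
    return [acc_j + p_j for acc_j, p_j in zip(acc, selected)]
```

For a = [3, 2] and b = [7, 5], the trivial permutation [0, 1] selects the products 21 and 10, and the swap [1, 0] selects 14 and 15, which matches the result of permuting first and multiplying afterwards.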
[0040] Furthermore, as regards all embodiments, a different weighting of the product partial words can also be provided, wherein before accumulation the product partial words are multiplied by a weight factor that can be different for different product partial words.
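A weighted accumulation can be illustrated for the case in which the accumulation word is formed from only one partial word, so that all product partial words are added to the same value. The function name `weighted_accumulate` and the integer weight factors are assumptions chosen for this sketch.

```python
def weighted_accumulate(acc, products, weights):
    """Add all product partial words, each scaled by its weight factor."""
    return acc + sum(w * p for w, p in zip(weights, products))
```

With the product partial words 21 and 10, the weight factors 1 and 2, and an accumulation value of 5, the updated accumulation value is 5 + 1·21 + 2·10 = 46.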
[0041]
[0042]
[0043] The input arrays a.sub.0, a.sub.1 have, for example, in each case nine entries: a.sub.i={a.sub.i,j}, wherein i=0, 1 and j=0, 1, . . . , 8. Mutually corresponding entries in the input arrays (i.e., with the same subscript j) can be understood as partial words of words.
[0044] The weight arrays h.sub.0, h.sub.1, k.sub.0, k.sub.1 have, for example, in each case four entries: h.sub.i={h.sub.i,j} or k.sub.i={k.sub.i,j}, wherein i=0, 1 and j=0, 1, 2, 3. Here, mutually corresponding entries in the weight arrays h.sub.0 and k.sub.1 or weight arrays h.sub.1 and k.sub.0 can be understood as partial words of words. The storage of these words is shown in
[0045] The output arrays c.sub.0, c.sub.1 have, for example, in each case 4 entries: c.sub.i={c.sub.i,j}, wherein i=0, 1 and j=0, 1, 2, 3. Mutually corresponding entries in the output arrays can be understood as partial words of words. The storage of these words is again shown in
[0046] The entry c.sub.1,1 c.sub.0,1 is determined by the convolution as follows, wherein
With no swapping:
With swapping:
[0047] In order to determine, as desired, c.sub.0=a.sub.0*h.sub.0+a.sub.1*h.sub.1 and c.sub.1=a.sub.0*k.sub.0+a.sub.1*k.sub.1, it is therefore necessary to determine c.sub.0,1.sup.U+c.sub.0,1.sup.V and c.sub.1,1.sup.U+c.sub.1,1.sup.V. For this purpose, the proposed MAC circuit can expediently be used. First, the accumulation word or its partial words are reset to zero. Then, the word consisting of the partial words a.sub.1,0, a.sub.0,0 is loaded as the first input word and the word consisting of the partial words k.sub.1,0, h.sub.0,0 is loaded as the second input word and the multiplication and the addition to the accumulation value are performed on a partial-word basis (here the partial words of the first input word are not swapped).
[0048] Next, the partial words of the first input word are swapped (without reloading) and the word consisting of the partial words k.sub.0,0, h.sub.1,0 is loaded as the second input word, and the multiplication and the addition to the accumulation value are performed on a partial-word basis. As a result, the first summands of the above expressions for c.sub.0,1.sup.U, c.sub.1,1.sup.U, c.sub.0,1.sup.V, c.sub.1,1.sup.V are determined as well as suitably added. This is repeated analogously for the further words of the input array and of the weight arrays in order to determine the further summands of the above expressions for c.sub.0,1.sup.U, c.sub.1,1.sup.U, c.sub.0,1.sup.V, c.sub.1,1.sup.V (without resetting in the meantime the accumulation word or its partial word to zero).
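The sequence of operations for one output position can be sketched as follows. The sketch assumes that the input entries under the kernel window have already been extracted and aligned with the four weight entries (the index mapping of the convolution is not modelled), and it uses (high, low) tuples instead of packed words; `mac_step` and `output_entry` are hypothetical names.

```python
def mac_step(acc, a, b, swap):
    """One multiply-accumulate step on a partial-word basis."""
    acc1, acc0 = acc
    a1, a0 = a
    b1, b0 = b
    if swap:
        a1, a0 = a0, a1
    return acc1 + a1 * b1, acc0 + a0 * b0


def output_entry(a0_win, a1_win, h0, h1, k0, k1):
    """Return (c1, c0) for one output position.

    a0_win and a1_win hold the input entries under the kernel window,
    already aligned with the weight entries (an assumption of this sketch).
    """
    acc = (0, 0)  # reset once, before the first operation
    for j in range(len(h0)):
        a_word = (a1_win[j], a0_win[j])  # first input word, loaded once per j
        # No swap: acc1 += a1 * k1 and acc0 += a0 * h0.
        acc = mac_step(acc, a_word, (k1[j], h0[j]), swap=False)
        # Swap without reloading: acc1 += a0 * k0 and acc0 += a1 * h1.
        acc = mac_step(acc, a_word, (k0[j], h1[j]), swap=True)
    return acc
```

After the loop, the lower partial word of the accumulation word holds the sum of a.sub.0·h.sub.0 and a.sub.1·h.sub.1 contributions and the upper partial word holds the sum of a.sub.0·k.sub.0 and a.sub.1·k.sub.1 contributions, with each first input word loaded only once for two multiply-accumulate operations.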
[0049] By permuting the partial words of the first input word, memory accesses can be avoided. Overall, the calculation can thus be accelerated and the energy consumption associated with memory accesses can be reduced. The more partial words the input words have, the greater this effect will be. Partial words can describe different properties of data to be processed, for example different color channels of a digital image.
[0050]
[0051] In step 110, a first word to be processed and a second word to be processed are read from an external memory (i.e., a circuit-external memory). The words to be processed are each formed from partial words. In addition to the external memory, a local memory can be provided which, for example, is integrated with several MAC circuits on a chip and has a relatively short access time (in contrast to the external memory).
[0052] In an optional step 120, which can also be carried out before or at least partially simultaneously with step 110, the accumulation word or the accumulation register is initialized, i.e., the accumulation word or its partial words are reset to a predetermined value, which is in particular zero. In the further course of the method, the accumulation word is not reset.
[0053] In step 130 (which is optionally performed after step 120), the multiply-accumulate circuit is controlled to perform a first multiply-accumulate operation, wherein the first word to be processed is used as the first input word, the second word to be processed is used as the second input word, and a first permutation selection is made from the at least two permutation possibilities.
[0054] Next or at least after the multiplication on a partial-word basis has been carried out in step 130, a third word to be processed (which is formed from partial words) is read from the external memory in step 140.
[0055] In step 150, the multiply-accumulate circuit is controlled to perform a second multiply-accumulate operation, wherein the first word to be processed is used as the first input word (the first word to be processed can, for example, remain in the first input register or be loaded from the local memory), the third word to be processed is used as the second input word, and a second permutation selection is made from the at least two permutation possibilities that differs from the first permutation selection. As already mentioned, the accumulation word is not reset prior to step 150, unlike before step 130.
[0056] Steps 140 and 150 can be repeated for further words to be processed which are used in the relevant pass as the second input word, in particular if there are more than two permutation possibilities for the first input word, i.e., if this is formed from more than two partial words.
[0057] In step 160, the accumulation word or partial words of the accumulation word are read out in order to obtain the final result of the multiply-accumulate operations.
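Steps 110 to 160 can be summarized in a short sketch that also counts the accesses to the external memory. The class `ExternalMemory`, the functions `mac_step` and `run_method`, the addresses, and the (high, low) tuple representation of words are hypothetical and serve only to illustrate the control flow.

```python
class ExternalMemory:
    """Toy external memory that counts read accesses."""

    def __init__(self, words):
        self.words = dict(words)
        self.reads = 0

    def read(self, address):
        self.reads += 1
        return self.words[address]


def mac_step(acc, a, b, swap):
    """One multiply-accumulate step on a partial-word basis."""
    acc1, acc0 = acc
    a1, a0 = a
    b1, b0 = b
    if swap:
        a1, a0 = a0, a1
    return acc1 + a1 * b1, acc0 + a0 * b0


def run_method(memory, addr_first, addr_second, addr_third):
    first = memory.read(addr_first)                  # step 110
    second = memory.read(addr_second)                # step 110
    acc = (0, 0)                                     # step 120 (optional reset)
    acc = mac_step(acc, first, second, swap=False)   # step 130
    third = memory.read(addr_third)                  # step 140
    # Step 150: the first word is reused from the input register,
    # only the permutation selection changes.
    acc = mac_step(acc, first, third, swap=True)
    return acc                                       # step 160
```

In this sketch, two multiply-accumulate operations require three reads from the external memory instead of four, because the first word to be processed is not read a second time.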