CONFIGURABLE SIMD MULTIPLICATION CIRCUIT
20200057609 ยท 2020-02-20
Inventors
- Michael Alexander KENNEDY (Cambridge, GB)
- Neil Burgess (Cardiff, GB)
- Zichao Xie (Cambourne, GB)
- Karel Hubertus Gerardus Walters (Cambridge, GB)
Cpc classification
G06F9/30036
PHYSICS
G06F7/5336
PHYSICS
International classification
G06F15/80
PHYSICS
Abstract
A configurable SIMD multiplication circuit is provided to perform multiplication on a multiplicand operand M and multiplier operand R with varying data element sizes supported. For each result element generated based on corresponding elements of the multiplicand operand M and the multiplier operand R, the multiplication is performed according to radix-N modified Booth multiplication, where N=2.sup.P and P3. A Booth digit selection scheme is described for improving the efficiency with which higher radix modified Booth multiplication can be implemented in a configurable SIMD multiplier.
Claims
1. An apparatus comprising: a configurable SIMD multiplication circuit to perform multiplication on a multiplicand operand M and a multiplier operand R to generate a result value; and control circuitry responsive to a multiplication command specifying a selected element size from a plurality of element sizes supported by the configurable SIMD multiplication circuit, to control the configurable SIMD multiplication circuit to generate the result value in which each of one or more result elements within the result value has a value corresponding to the product of a corresponding multiplicand element of the multiplicand operand M and a corresponding multiplier element of the multiplier operand R, said corresponding multiplicand element having the selected element size; in which: for each of said plurality of element sizes supported by the configurable SIMD multiplication circuit, the configurable SIMD multiplication circuit is configured to generate each result element of the result value using radix-N modified Booth multiplication of the corresponding multiplicand element and the corresponding multiplier element, where N=2.sup.P and P3.
2. The apparatus according to claim 1, in which a minimum element size supported by the configurable SIMD multiplication circuit is E.sub.min bits, and E.sub.min modulo P is non-zero.
3. The apparatus according to claim 1, in which a minimum element size supported by the configurable SIMD multiplication circuit is E.sub.min bits; and in response to a multiplication command for which the selected element size is greater than E.sub.min bits, the configurable SIMD multiplication circuit is configured to perform said radix-N modified Booth multiplication with a Booth encoding applied separately to each E.sub.min-bit portion of the multiplicand operand M.
4. The apparatus according to claim 1, in which the configurable SIMD multiplication circuit comprises: Booth digit selection circuitry to select a given number A of Booth digits BD.sub.0 to BD.sub.A1, each Booth digit based on a respective bit portion of the multiplicand operand M; partial product generation circuitry to generate a plurality of partial products for each result element of the result value, each partial product comprising a multiple of the corresponding multiplier element selected based on a respective one of the Booth digits for which the bit portion used to select the Booth digit is within the corresponding multiplicand element; and an adder to generate each result element of the result value by adding the plurality of partial products generated for that result element by the partial product generation circuitry.
5. The apparatus according to claim 4, in which the bit portions of the multiplicand operand M used by the Booth digit selection circuitry to select the Booth digits are at the same bit positions within the multiplicand operand M regardless of which of the plurality of element sizes is the selected element size.
6. The apparatus according to claim 4, in which a mapping between bit values of said respective one of the Booth digits and which multiple of the corresponding multiplier element is selected by the partial product generation circuitry is independent of which of said plurality of element sizes is the selected element size.
7. The apparatus according to claim 4, in which a mapping between bit values of said respective one of the Booth digits and which multiple of the corresponding multiplier element is selected by the partial product generation circuitry is independent of a relative position between the bit portion of the multiplicand operand M used to select the Booth digit and an element boundary between respective elements of the selected element size within the multiplicand operand M.
8. The apparatus according to claim 4, in which the Booth digit selection circuitry is configured to select the Booth digits BD.sub.0 to BD.sub.A1, where: Booth digit BD.sub.0 is selected based on a least significant bit portion of the multiplicand operand M, Booth digit BD.sub.A1 is selected based on a most significant bit portion of the multiplicand operand M, Booth digit BD.sub.i, where 1iA1, is selected based on a bit portion of the multiplicand operand M having a most significant bit SH bit positions more significant than a most significant bit of the bit portion used to select Booth digit BD.sub.i1, and SH has a different value for at least two values of i less than A1.
9. The apparatus according to claim 8, in which the Booth digit selection circuitry is configured to select the Booth digits with SH<P for at least one value of i less than A1.
10. The apparatus according to claim 8, in which the Booth digit selection circuitry is configured to select the Booth digits with SH having one of two values P and E.sub.min modulo P for each value of i less than A1.
11. The apparatus according to claim 8, in which the multiplicand operand M comprises T bits, where T=q*E.sub.min and E.sub.min is a minimum element size supported by the configurable SIMD multiplication circuit; and the Booth digit selection circuitry is configured to select Booth digit BD.sub.k, where (A/q)kA1, based on a bit portion of the multiplicand operand M having a most significant bit E.sub.min bit positions more significant than a most significant bit of the bit portion used to select Booth digit BD.sub.kA/q.
12. The apparatus according to claim 4, in which the multiplicand operand M comprises T bits, where T=q*E.sub.min and E.sub.min is a minimum element size supported by the configurable SIMD multiplication circuit; and said given number A of Booth digits comprise q partitions of Booth digits, each partition of Booth digits selected based on bit portions in a corresponding sub-portion of size E.sub.min within the multiplicand operand M.
13. The apparatus according to claim 12, in which said given number A of Booth digits satisfies
14. The apparatus according to claim 12, in which for a least significant Booth digit of each partition, the Booth digit selection circuitry is configured to set a least significant bit of the least significant Booth digit to 0 regardless of which of said plurality of element sizes is the selected element size, and to set remaining bits of the least significant Booth digit based on a least significant bit portion of the corresponding sub-portion of the multiplicand operand M.
15. The apparatus according to claim 12, in which for a most significant Booth digit of each partition, the Booth digit selection circuitry is configured to generate the most significant Booth digit based on a sign extension or zero extension of a most significant bit portion of the corresponding sub-portion of the multiplicand operand M, and a bit position of sign-extended or zero-extended bits within the most significant Booth digit is the same regardless of which of said plurality of elements is the selected element size.
16. The apparatus according to claim 1, in which N=8.
17. An apparatus comprising: means for performing a configurable SIMD multiplication on a multiplicand operand M and a multiplier operand R to generate a result value; and means for controlling, in response to a multiplication command specifying a selected element size from a plurality of element sizes supported by the configurable SIMD multiplication circuit, the means for performing the configurable SIMD multiplication to generate the result value in which each of one or more result elements within the result value has a value corresponding to the product of a corresponding multiplicand element of the multiplicand operand M and a corresponding multiplier element of the multiplier operand R, said corresponding multiplicand element having the selected element size; in which: for each of said plurality of element sizes, the means for performing the configurable SIMD is configured to generate each result element of the result value using radix-N modified Booth multiplication of the corresponding multiplicand element and the corresponding multiplier element, where N=2.sup.P and P3.
18. A data processing method comprising: receiving a multiplication command specifying a selected element size from a plurality of element sizes supported by a configurable SIMD multiplication circuit; in response to the multiplication command, controlling the configurable SIMD multiplication circuit to perform multiplication on a multiplicand operand M and a multiplier operand R to generate a result value in which each of one or more result elements within the result value has a value corresponding to the product of a corresponding multiplicand element of the multiplicand operand M and a corresponding multiplier element of the multiplier operand R, said corresponding multiplicand element having the selected element size; in which: for each of said plurality of element sizes, the configurable SIMD multiplication circuit generates each result element of the result value using radix-N modified Booth multiplication of the corresponding multiplicand element and the corresponding multiplier element, where N=2.sup.P and P3.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
DESCRIPTION OF EXAMPLES
[0028] A configurable SIMD multiplication circuit is provided to perform multiplication on a multiplicand operand M and a multiplier operand R to generate a result value. Control circuitry is provided which, in response to a multiplication command specifying a selected element size from among a plurality of element sizes supported by the configurable SIMD multiplication circuit, controls the configurable SIMD multiplication circuit to generate the result value in which each of one or more result elements within the result value has a value corresponding to the product of a corresponding multiplicand element of the multiplicand operand M and a corresponding multiplier element of the multiplier operand R, where the corresponding multiplicand element has the selected element size. By using a configurable SIMD (single instruction, multiple data) multiplier, multiplications across a range of data element sizes can be supported by a single circuit. When smaller element sizes are needed, multiple independent multiplications can be performed on respective elements of the operands in response to the same instruction. By using a SIMD approach, this can reduce the total number of instructions needed to carry out a certain number of multiplications. In the technique discussed below, for each of the element sizes supported by the configurable SIMD multiplication circuit, each result element of the result value is generated using radix-N modified Booth multiplication of the corresponding multiplicand element and the corresponding multiplier element, where N=2.sup.P and P3. Most conventional approaches to Booth multiplication may use a lower radix Booth multiplication, such as radix-4 Booth multiplication. By using a higher radix form of Booth multiplication, e.g. radix-8 or higher, the number of partial products needed to be added to form the result can be reduced, so as to improve the performance of the multiplier. One might expect that higher-radix-modified Booth multiplication would be unsuitable for a configurable SIMD multiplication circuit which supports multiplication across a range of different data element sizes, because conventional approaches to digit selection involved in such higher radix Booth multiplication would not fit well with the positions of element boundaries in a SIMD scenario, so that it might be expected that extremely complex logic would be required to support the different element size of options in the context of the higher radix Booth operation. However, the inventors recognised that this complexity can be mitigated by an adjustment to the Booth digit selection process so that it becomes feasible to implement higher radix Booth multiplication in a configurable SIMD multiplication circuit in an efficient manner. Hence, by providing a configurable SIMD multiplication circuit which uses radix-8 modified Booth multiplication or higher, multiplication can be performed with higher performance across a range of different data element sizes in a SIMD environment.
[0029] The configurable SIMD multiplication circuit may support a range of element sizes where the minimum element size supported is E.sub.min bits. E.sub.min could be 8 or 16 bits for example. The value of P may be such that E.sub.min modulo P is non-zero. For example, if E.sub.min is a power of 2 number of bits as in typical SIMD multiplication circuits, then values of P which satisfy E.sub.min modulo P being non-zero may include 3, 5, 6, 7, etc. The approach discussed below can be particularly useful for cases when E.sub.min modulo P is non-zero since this may result in a varying phase relationship between the positions of Booth digits selected in the Booth multiplication process and the element boundaries for the SIMD multiplier if conventional digit selection schemes are used. The technique discussed below is able to handle cases when E.sub.min modulo P is non-zero without complex logic for handling digit selection at element boundaries.
[0030] In response to a multiplication command for which the selected element size is greater than E.sub.min bits, the configurable SIMD multiplication circuit may perform the radix-N modified Booth multiplication with a Booth encoding applied separately to each E.sub.min-bit portion of the multiplicand operand M. Hence, even though the multiplicand elements within the multiplicand operand M have a size greater than E.sub.min bits, the Booth encoding is still applied separately to each E.sub.min bit portion of the multiplicand operand. This approach works because a multiplication MR can be performed with one of the values being multiplied being partitioned into separate sections which are multiplied separately with the other value and then added (e.g. in decimal notation, a multiplication 154236902 may be equivalent to adding 154000902+236902, without changing the result). Hence, by effectively partitioning the larger elements of the multiplicand operand into E.sub.min-bit portions and separately Booth encoding each E.sub.min-bit portion, while this could result in some additional partial products, this may simplify the circuit implementation because the positions of Booth digits generated in the Booth encoding of a given E.sub.min-bit portion of the multiplicand operand M can be the same regardless of which particular element size is currently selected as the selected element size. This reduces the circuit area required for the processor circuit logic. This approach may be counter intuitive since partitioning larger values in this way may for some operations result in a greater number of partial products being required to be added, which could be seen by a skilled person to reduce performance or increase circuit area in the adder used to add the partial products. However, the inventors recognised that the benefit of reduced circuit area and logic complexity in generating the Booth encoded partial products may outweigh the relatively small increase in area in the adder caused by the increased number of partial products required for some element sizes. Hence, this allows a more efficient implementation of the configurable SIMD multiplication circuit which uses higher radix modified Booth multiplication.
[0031] The configurable SIMD multiplication circuit may include a number of components. The multiplication circuit may include Booth digit selection circuitry to select a given number A of Booth digits BD.sub.0 to BD.sub.A1. Each Booth digit is selected based on a respective bit portion of the multiplicand operand M. The SIMD multiplication circuit also includes partial product generating circuitry which generates a number of partial products for each result element of the result value. Each partial product comprises a multiple of the corresponding multiplier element, with that particular multiple being selected based on a respective one of the Booth digits for which the bit portion used to select the Booth digit is within the corresponding multiplicand element. An adder is provided to generate each result element of the result value by adding the plurality of partial products generated for that result element by the partial product generation circuitry. The Booth digit selection performed by the Booth digit selection circuitry may be based on a Booth encoding which is applied separately to each E.sub.min-bit portion of the multiplicand operand M regardless of the selected element size, as discussed above.
[0032] The bit portions of the multiplicand operand M used by the Booth digit selection circuitry to select the Booth digits may be at the same bit positions within the multiplicand operand M regardless of which of the element sizes has been selected as the selected element size to be used for the current multiplication operation being performed. Hence, there is no need to vary, based on the selected element size, the positions of the bits from which each Booth digits are selected. This makes logic implementation much simpler.
[0033] The partial product generation circuitry may include a multiple selector which selects, for each partial product, the multiple of the corresponding multiplier element which is to be used for generating the partial product, based on a respective one of the Booth digits. A mapping between bit values of a given one of the Booth digits and which multiple of the corresponding multiplier element is selected by the partial product generation circuitry based on the given Booth digit may be independent of which of the element sizes is used as the selected element size for the current operation being performed. That is, since the bit positions at which the Booth digits are selected from the multiplicand operand are selected so that the positions are the same for each E.sub.min bit portion regardless which data element size is the selected element size, this means that the multiple selector does not need to consider different versions of each multiple depending on the phase relationship between the selected Booth digits and the positions of element boundaries. This means that multiplexers of a smaller size can be used since the number of different options available for selecting as the multiple of the multiplier element to be used for generating the partial product can be fewer. This may be particularly useful for higher radix Booth multiplication schemes since in radix-8 Booth multiplication, for example, the number of different options for the multiple increases (4 to +4) compared to the options for radix-4 Booth multiplication (2 to +2). Hence, by implementing the Booth digit selection so that it is not necessary to select from more than one option for each one of the (4 to +4) multiples, this avoids a significant increase in the complexity of the multiplexer circuit logic and reduces processing delay through the multiplexer. A mapping between bit values of the respective one of the Booth digits and which particular multiple of the multiplier element is selected by the partial product generation circuitry may be independent of a relative position between the bit portion of the multiplicand operand used to select the Booth digits and an element boundary between respective elements of the selected element size within the multiplicand operand M. The Booth digit selection circuitry may select the Booth digits BD.sub.0 to BD.sub.A1 such that: Booth digit BD.sub.0 is selected based on a least significant bit portion of the multiplicand operand M; Booth digit BD.sub.A1 is selected based on a most significant bit portion of the multiplicand operand M; Booth digit BD.sub.i, where 1iA1, is selected based on a bit portion of the multiplicand operand M having a most significant bit SH bit positions more significant than a most significant bit of the bit portion used to select Booth digit BD.sub.i1; and SH has a different value for at least two values of i less than A1. In comparative Booth digit selection schemes, the offset SH between the bit positions at which neighbouring Booth digits are selected from the multiplicand operand M may be constant for all Booth digits other than the most significant Booth digit BD.sub.A1. This can provide the minimum number of partial products required to calculate the product of M*R when the selected element size is the maximum data element size. However, in the technique discussed below, SH may have a different value for at least two values of i less than A1, there are some Booth digits which are offset by a smaller amount compared to other Booth digits, for at least one position within the multiplicand operand M which is not at the most significant end of the operand. By using a variable offset between the positions at which the Booth digits are selected from the multiplicand operand M, this can allow the Booth digit selection pattern to be the same for each E.sub.min-bit portion of the multiplicand operand so that the Booth digit selection does not need to vary depending on the selected element size.
[0034] The Booth digit selection circuitry may select the Booth digits with SH being less than P for at least one value of i less than A1. Hence, whereas conventional digit selection schemes would have SH equal to P for each Booth digit BD.sub.1 to BD.sub.A2 (other than the most significant Booth digit BD.sub.A1), with the approach discussed below the offset between adjacent Booth digits is less than P for at least one of the intermediate Booth digits BD.sub.1 to BD.sub.A2. Again this helps reduce the complexity of handling the different element sizes of the configurable SIMD multiplier.
[0035] In one approach the Booth digit selection circuitry may select the Booth digits so that SH has one of two values: P and (E.sub.min modulo P), for each value of i less than A1. This approach can help to reduce the number of partial products required compared to the case if other values were used for SH, hence improving performance.
[0036] In one example, the multiplicand operand M comprises T bits, where T=q*E.sub.min and E.sub.min is a minimum element size supported by the configurable SIMD multiplication circuit; and the Booth digit selection circuitry is configured to select Booth digit BD.sub.k, where (A/q)kA1, based on a bit portion of the multiplicand operand M having a most significant bit E.sub.min bit positions more significant than a most significant bit of the bit portion used to select Booth digit BD.sub.kA/q. Hence, the offset between the positions at which Booth digits BD.sub.k and Booth digits BD.sub.kA/q are selected may be E.sub.min bit positions, where E.sub.min is the minimum element size supported by the multiplication circuit. This approach means that the pattern of Booth digit selection will be consistent for each E.sub.min-bit portion of the multiplicand operand, making the circuit logic much simpler.
[0037] The multiplicand operand M may comprise T bits, where T=qE.sub.min. The given number A of Booth digits selected may include q partitions of Booth digits where each partition of Booth digits is selected based on bit portions in a corresponding sub-portion of the size E.sub.min within the multiplicand operand. Again, by partitioning the multiplicand operand M into smaller E.sub.min bit portions even when the selected element size is greater than E.sub.min, this means that the partitions of the Booth digits may map to the sizes of the element sizes which could be selected for the SIMD multiplier to give a more consistent mapping and hence reduce the complexity of logic. The minimum number of Booth digits which can be generated may be
This can be achieved if SH is selected with one of two values (P and E.sub.min modulo P) as discussed above. However, some implementations could choose to generate additional Booth digits by introducing additional partitions (similar to the partitions at E.sub.min-bit boundaries as discussed below) without changing the correctness of the product result, and so in some cases A may be greater than or equal to
[0038] Hence, the Booth digit selection circuitry may generate q partitions of Booth digits, each partition corresponding to an E.sub.min bit portion of the multiplicand operand. The Booth digit selection circuitry may make an adjustment to the values of the least significant Booth digit of each partition and a most significant Booth digit of each partition, to account for the element size boundaries. However, this adjustment can be independent of which element size is selected as the selected element size for the current operation, because the positions of the Booth digits have been selected so that the partitions are being used regardless of the data element size so as to reduce the complexity. Hence, for a least significant Booth digit of each partition, the Booth digit selection circuitry may set a least significant bit of the least significant Booth digit to 0, regardless of which of the element sizes is the selected element size. Remaining bits of the least significant Booth digit of the partition may be based on a least significant bit portion of the corresponding sub-portion of size E.sub.min within the multiplicand operand M.
[0039] On the other hand, for a most significant Booth digit of each partition, the Booth digit selection circuitry may generate the most significant Booth digit based on a sign extension or zero extension of a most significant bit portion of the corresponding sub-portion of the multiplicand operand M, and a bit position of sign-extended or zero-extended bits within the most significant Booth digit is the same regardless of which of the elements is the selected element size. A sign extension is used at partition boundaries which correspond to an element boundary, while a zero extension is used at partition boundaries other than element boundaries. Hence, for the most significant Booth digit of the most significant partition within a given data element, the most significant bit portion of the corresponding sub-portion of multiplicand operand M is sign extended. For signed multiplications, the sign extension comprises extension with one or more bit values of the same value (0 or 1) as the most significant bit. For unsigned multiplications, the sign extension comprises extension with one or more bit values of 0. For the most significant Booth digit of a partition other than the most significant partition within a given data element, the most significant bit portion of the corresponding sub-portion of multiplicand operand M is zero extended (i.e. extended with bit values equal to 0 regardless of the value of the most significant bit).
[0040] Hence, these adjustments at the least and most significant ends of the partition account for the fact that the total number of bits of the Booth digits required for a given partition according to the higher radix modified Booth multiplication scheme may require more bits than are required in the E.sub.min-bit portion of the multiplicand operand, and so a zero is included at the lower end and a sign extension or zero extension at the upper end to account for this. However, by using the Booth digit selection scheme discussed above, the selection of which bit positions are filled with zeroes or sign extension bits can be the same regardless of the current element size, which greatly simplifies the logic.
[0041] While the present technique can be used for any higher radix modified Booth multiplication operation, particularly where E.sub.min modulo P is non-zero, it can be particularly useful when N=8 (i.e. P=3). As the radix used for Booth multiplication increases, the number of different multiples of the multiplier which have to be generated for selection based on the Booth digits increases, and the complexity of the multiple generation and selection logic may begin to outweigh the benefits of increased performance achieved by reducing the number of partial products which have to be added in higher radix operations. When N=8, the balance between circuit complexity in the partial product generation and performance in adding the partial products may be improved, as this may provide a good tradeoff between the number of multiples required for selection and the number of partial products required to be added. Hence, while the technique discussed below could be used for higher radix operations, such as radix-32 modified Booth multiplication or higher, in practice it may be particularly beneficial for a configurable SIMD multiplication circuitry which uses radix-8 modified Booth multiplication.
[0042] As discussed above, the configurable SIMD multiplication circuit may support a range of element sizes, where the selected element size refers to the size of the multiplicand elements of the multiplicand operand M which are multiplied with corresponding multiplier elements of the multiplier operand R. The multiplier elements could have the same size as the corresponding multiplicand elements, so that the multiplier elements also have the selected element size. However, it is also possible to design multiplication circuits which support multiplication of corresponding elements of different sizes, and in this case the multiplier elements could have a larger size or smaller size than the selected element size for the multiplicand element. For example, the SIMD multiplier could in some examples support multiplication of 16-bit multiplicand elements with 8-bit multiplier elements. Nevertheless, most common multiplication operations may have the same size for the elements of the multiplicand operand M and the multiplicand operand R respectively, and so some circuit implementations could be implemented so that the sizes of the elements in both operands are always the same.
[0043] Similarly, the relationship between the size of the result element in the result value and the selected element size could vary. The result elements could have the same size as the selected element size used for the multiplicand element, or could have a larger size. In general, a multiplication of two data values will produce a result with a larger number of bits. In some processing operations, the full product result may be required (e.g. multiplication of two X-bit values may generate a 2X-bit result). In some cases, to retain the full multiplication result, but preserve the same element size in the result value compared to the input operands, half the multiplication result in each lane of the SIMD multiplication could be written to one register and the other half of the result could be written to another register. Also, some SIMD multiplications could also support an operation where only half of the data elements of the multiplicand and multiplier operands M and R are considered as valid, and the result of multiplying those valid elements may be spread across two adjacent lanes of the result value (e.g. within a result vector, lane 0 could be set to the lower half of the result of multiplying the elements in lane 0 of the multiplicand and multiplier operands while lane 1 could be set to the upper half of the product of the element in lane 0 of the multiplicand and multiplier operands, with the input elements in lane 1 of the multiplicand and multiplier operands being ignored for this multiplication). Alternatively, other implementations or other instructions could support simply generating a result which only provides the lower half or the upper half of the multiplication result and discards the other half so that the result elements can still fit within a vector where the elements have the same size as in the input operands. Hence, it will be appreciated that there are a number of ways in which multiplication instructions can be implemented in a SIMD multiplier. Some SIMD circuits may support two or more of the different options discussed above. The Booth digit selection scheme can be applied regardless of which of these options is taken.
[0044]
[0045] The execute stage 16 includes a number of processing units, for executing different classes of processing operation. For example the execution units may include a scalar arithmetic/logic unit (ALU) 20 for performing arithmetic or logical operations on scalar operands read from a scalar register file 21; a floating point unit 22 for performing operations on floating-point values, a branch unit 24 for evaluating the outcome of branch operations and adjusting the program counter which represents the current point of execution accordingly; and a load/store unit 28 for performing load/store operations to access data in a memory system 8, 30, 32, 34. In this example the memory system include a level one data cache 30, the level one instruction cache 8, a shared level two cache 32 and main system memory 34. It will be appreciated that this is just one example of a possible memory hierarchy and other arrangements of caches can be provided. The specific types of processing unit 20 to 28 shown in the execute stage 16 are just one example, and other implementations may have a different set of processing units or could include multiple instances of the same type of processing unit so that multiple micro-operations of the same type can be handled in parallel. It will be appreciated that
[0046] One example of an operation which may be supported by the ALU 20 is a multiplication operation. In some systems, a dedicated execution unit called a multiply-accumulate (MAC) unit may be provided to handle multiplications since multiply-accumulate operations (where two operands are multiplied and the result is added to an accumulator value) may be a common operation in digital signal processing algorithms for example. The performance achieved on a multiplication operation can be an important factor in the overall performance achieved from processing some workloads.
[0047]
TABLE-US-00001 M[0] (32 bits) M[1] (16 bits) M[0] (16 bits) M[3] (8 bits) M[2] (8 bits) M[1] (8 bits) M[0] (8 bits)
[0048] The command 56 may specify registers 14 identifying a multiplicand operand M and a multiplier operand R. The selected element size defines the element size of the elements in the multiplicand operand M. The elements of the multiplier operand R could be the same size as the corresponding elements of the multiplicand operand M, or could be a different size. The ratio between the size of the multiplicand elements in operand M and the multiplier operands in operand R may be fixed (hardwired), or could be variable based on a second element size parameter specified with the commands 56. In the examples below, the size of the multiplier elements is assumed to be the same as the size of the multiplicand elements, but this is not essential.
[0049] In response to the command 56 the configurable SIMD multiplier generates a result value 60 which includes a number of results elements where each result element within the result value has a value corresponding to the product of the corresponding multiplicand element of the multiplicand operand M and the corresponding multiplier element of the multiplier operand R. E.g. with 4 8-bit elements in each of the multiplicand M and multiplier R, each multiplication of a corresponding pair of elements of M and R may produce a 16-bit result. Different approaches are available for handling the larger result, for example:
[0050] Example 1: only include lower half of result in each result element, and discard upper half.
TABLE-US-00002 Multiplicand M[3] M[2] M[1] M[0] Multiplier R[3] R[2] R[1] R[0] Result M[3]*R[3] M[2]*R[2] M[1]*R[1] M[0]*R[0] (lower 8 (lower 8 (lower 8 (lower 8 bits) bits) bits) bits)
[0051] Example 2: only include upper half of result in each result element, and discard lower half.
TABLE-US-00003 Multiplicand M[3] M[2] M[1] M[0] Multiplier R[3] R[2] R[1] R[0] Result M[3]*R[3] M[2]*R[2] M[1]*R[1] M[0]*R[0] (upper 8 (upper 8 (upper 8 (upper 8 bits) bits) bits) bits)
[0052] Example 3 specify upper and lower halves of result in different result registers:
TABLE-US-00004 Multiplicand M[3] M[2] M[1] M[0] Multiplier R[3] R[2] R[1] R[0] Result M[3]*R[3] M[2]*R[2] M[1]*R[1] M[0]*R[0] register 1 (upper 8 (upper 8 (upper 8 (upper 8 bits) bits) bits) bits) Result M[3]*R[3] M[2]*R[2] M[1]*R[1] M[0]*R[0] register 2 (lower 8 (lower 8 (lower 8 (lower 8 bits) bits) bits) bits)
[0053] Example 4: only consider half the input elements and spread result over adjacent lanes:
TABLE-US-00005 Multiplicand M[3] M[2] M[1] M[0] Multiplier R[3] R[2] R[1] R[0] Result M[2]*R[2] M[0]*R[0] (full 16 bits) (full 16 bits)
It will be appreciated that predication could be applied in some lanes of the SIMD operands, so that the multiplications in some lanes could be masked. The result values in mask lanes could be set to a fixed value such as 0, or could take the value which was previously stored in the register used to store the result value.
[0054] The examples above show possible operations for a selected element size of 8 bits, but the same configuration SIMD multiplier circuit 52 also supports operations at other element sizes, e.g. 16 bits:
Example 5
[0055]
TABLE-US-00006 Multiplicand M[1] (16 bits) M[0] (16 bits) Multiplier R[1] R[0] Result M[1]*R[1] M[0]*R[0] (lower or upper 16 bits) (lower or upper 16 bits)
Similar examples to Examples 3 and 4 could also be provided for the 16-bit operation.
[0056] Some circuit implementations of the configurable SIMD multiplier 52 could process all of the elements in parallel. Other implementations may support an instruction set architecture which supports wider SIMD vector length, but implement the instructions on hardware supporting a smaller maximum vector length. In this case, the instruction may be mapped to several micro-operations which each process a subset of the input elements to generate a subset of the result elements.
[0057] The configurable SIMD multiplier performs a multiplication according to a radix-N modified Booth multiplication command, where N=2.sup.P and P3. In particular, for the examples below radix-8 modified Booth multiplication (also known as Booth-3 multiplication) is used. To support the modified Booth multiplication the configurable SIMD multiplier 52 includes a multiple generator 62 for generating a range of multiple values based on the multiplier operand R. When the multiplier operand R includes only a single data element, the multiple generator 62 generates each multiple as a respective multiple of the multiplier operand R as a whole (e.g. multiples extending from 4*R to +4*R for radix-8 modified Booth multiplication). If the selected element size is not the maximum element size supported, then multiples of each individual data element within the multiplier operand R are generated by the multiple generator 62 (e.g. with the 16-bit example shown above, multiples 4*R[1] to +4*R[1], and multiples 4*R[0] to +4*R[0]).
[0058] The configurable SIMD multiplier 52 also includes Booth digit selection circuitry 64 to select, based on respective bit portions of the multiplicand operand M, a number of Booth digits (BD) to be used for generating partial products. The Booth digit selection will be discussed in more detail below. The Booth digit selection may be independent of which size is indicated as the selected element size 58.
[0059] The multiplier 52 also includes partial product generation circuitry 66 which generates a number of partial products to be added by an adder 68 to generate the result value 60. For each element of the result, the partial product generation circuitry generates a set of partial products to be added to form that result element. For each partial product, one of the multiples of the corresponding multiplier element of operand R is selected, from among the multiples generated by the multiple generator 62. Which particular multiple of the relevant multiplier element R[i] is selected depends on the bit values of a corresponding Booth digit BD generated by the Booth digit selection circuitry 64 based on the multiplicand operand M.
[0060] The partial product generation circuitry 66 supplies the generated partial products to the adder 68, which adds the partial products to produce the result value 60. The adder 68 may comprise a summation network 67 (e.g. a 3:2 carry save adder tree) for summing the partial products, and a carry propagate adder 69 for adding the sum and carry terms produced by the summation network 67. Each of the multiple generator 62, the partial product generation circuitry 66 and the adder 68 perform a configurable operation, based on the selected element size 58 under control of the control circuitry 54.
[0061] Modified Booth multiplication is based on the principle that, within the multiplicand, a string of consecutive binary Is can effectively be replaced with a +1 at the upper end of the string and a 1 at the lower end of the string, which can help to reduce many of the partial products to zero, which can make processor logic implementation more straightforward. This is analogous to 999 in decimal being equivalent to 10001. Hence, if considering a multiplication of 999*R, the schoolbook long multiplication approach would carry out a series of additions of partial products 900*R+90*R+9*R, with the Booth approach this could be reduced to 1000*R1*R.
[0062] Appendix A illustrates an example of radix-2 modified Booth multiplication. As shown in
[0063] As shown in Appendix B and
[0064] As shown in Appendix C and
[0065] Again, in Appendix C the same example multiplication 5647 is shown and each partial product is selected as one of the multiples of R generated by the multiple generation circuitry 62 as selected based on the bit pattern of the corresponding Booth digits selected by the Booth digit selection circuitry 64. The adder 68 adds the partial products generated by the partial product generation circuitry 66, this time with a shift of 3 bits between respective partial products (to align the bits of corresponding significance within the result). The result value again matches the result produced in Appendices A and B. It can be seen that by using radix-8 modified Booth multiplication, the result can be generated with only three partial products as opposed to four partial products in the example of Appendix B.
[0066] Appendix D shows a corresponding example for radix-8 modified Booth multiplication when applied to a 16-bit16-bit multiplication. As shown in
[0067] However, as shown in
[0068] However, a problem with this approach is that the phase relationship between the position of the Booth digits and the element boundaries varies from element to element when the selected element size is smaller than the maximum size. For example, when 16-bit elements are used then the Booth digit BD.sub.5 covering the upper bits of element 0 only includes two of the bits (M.sub.15M.sub.14) of element 0, while the most significant Booth digit BD.sub.10 in element 1 would comprise 3 bits (M.sub.31M.sub.29) of element 1. Similarly, when 8-bit element sizes are used then the most significant Booth digit BD.sub.2 of element 0 would include three bits of element 0, the most significant Booth digit BD.sub.5 in element 1 would include two bits of that element and the most significant Booth digit BD.sub.8 in element 2 would include one bit of element 2.
[0069] Therefore, if the Booth digits are selected according to the scheme shown in
[0070] This problem can be addressed by changing the approach to Booth digit selection.
[0071] As shown in
[0072] 8-bit elements: BD.sub.2, BD.sub.5, BD.sub.8, BD.sub.11 all sign extended;
[0073] 16-bit elements: BD.sub.2, BD.sub.8 zero extended and BD.sub.5, BD.sub.11 sign extended;
[0074] 32-bit elements: BD.sub.2, BD.sub.5, BD.sub.8 zero extended and BD.sub.11 sign extended.
The zeroing at the lower end of each partition is the same for each sub-portion, regardless of the selected element size and regardless of the relative position of that sub-portion within the overall multiplicand operand M. For the upper end of each partition, while whether a zero extension or sign extension is used depends on the selected element size (and whether the multiplication is an unsigned or signed operation), the bit position of the bits to be filled with the sign or zero extension is the same regardless of the selected element size (in contrast, with the approach shown in
[0075] The approach shown in
[0076] This approach is counter intuitive since at the larger element sizes, selecting the Booth digits in the way shown in
TABLE-US-00007 (A) (B) (C) no. of partial no. of partial products no. of partial products (C) (C) Element products (Radix-8 with digit (Radix-8 with modified compared to compared to size (Radix-4) selection of FIG. 3) digit selection of FIG. 4) (A) (B) 8 bits 5 3 3 40% 0% 16 bits 9 6 6 33% 0% 32 bits 17 11 12 29% +9% 64 bits 33 22 24 27% +9%
As shown in the table, while at element sizes of 32 or 64 bits, using the modified Booth encoding scheme (C) does increase the number of partial products required at 32-bit or 64-bit multiplications compared to scheme (B), this is still a significant decrease in the number of partial products required for Booth-2 multiplication, providing a 29% or 27% reduction in the number of partial products and hence an improvement in performance. Hence, even though the scheme shown in
[0077] As shown in Appendix E when the modified Booth digit selection approach is used for carrying out the same 16b*16b multiplication 572*399 as shown in Appendix D, this gives the same result as Appendix D. Note that the relative offset between the positions at which the partial products are added varies from partial product to partial product (shifts of 2 or 3 between partial products), which matches the variable offsets SH between the positions at which the Boot digits were extracted from the multiplicand operand M.
[0078] The multiple generator 62 supports a configurable element size and so generates the various multiples of R based on each individual data element within the multiplier operand R. The +R multiples are simply equal to the input value of each element of the multiplier operand R. The R multiple may be generated by generating the 2's complement of the input value of the multiplier R in each element of the multiplier operand. This may involve inverting all the bits of the multiplier operand and adding 1 at the least significant bit of each data element within the multiplier. The 2R and 4R multiples can be generated by a shifter which shifts the R multiples left by one or two bit positions. This can be done with a configurable shifter which can be selectively partitioned at the element boundaries so that the most significant bit on the right hand side of an element boundary does not get shifted into the least significant bit on the left hand side of the element boundary (which is part of a different data element). Any known configurable shifted design for SIMD shifts can be used for this.
[0079] The generation of the plus or minus 3R multiples may be more complex and requires an addition of the 2R and R values within each independent lane of the multiplier operand R. To avoid a full carry propagate adder (CPA) the addition is broken into small segments. Increasing the segment width is shown to reduce delay and area. A segment width equivalent to the minimal element width can be used.
[0080]
[0081] Hence, the multiples +4R[i] to 4R[i] are generated for each respective multiplier element R[i] of the multiplier operand R. For each partial product that corresponds to a given result element of the result value, one of the multiples generated from the corresponding element of the multiplier operand R is selected by the partial product generation circuitry 66, based on a Booth digit extracts from a corresponding element of the multiplicand operand M. The partial products are then supplied to the adder 68.
[0082]
[0083] When the selected element size is smaller than the maximum size supported, then some of the input bits to the adder may be set to zero to indicate that these parts of the adder do not comprise any valid data. For example where the data element size is half the maximum size, then the portions of the adder marked 72 in
[0084] It will be appreciated that this is just one possible design for the adder and other examples could also be used. For example, in some systems to reduce circuit complexity, instead of adding all of the partial products substantially in parallel as shown in the example of
[0085]
[0086] Meanwhile, at step 104 the Booth digit selection circuitry 64 extracts a number of Booth digits BD according to a radix-N modified Booth encoding scheme. The Booth digit selection is independent of which element size has been selected as the selected element size. The Booth digit selection uses the scheme shown in
[0087] Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.
Appendix aRadix-2 Modified Booth Multiplication
[0088] M*R (M=multiplicand, R=multiplier)
Multiple selection table:
TABLE-US-00008 Selected Booth digit Multiple of R selected M.sub.i, M.sub.i1 as partial product 00 0 01 +1 10 1 11 0
Example M=56, R=47
[0089] M=00111000(0)
[0090] R=00101111
[0091] R=11010001
BD.sub.0=M.sub.0, M.sub.1=0(0)->PP.sub.0=0
BD.sub.1=M.sub.1, M.sub.0=00->PP.sub.1=0
BD.sub.2=M.sub.2, M.sub.1=00->PP.sub.2=0
BD.sub.3=M.sub.3, M.sub.2=10->PP.sub.3=R=11010001
BD.sub.4=M.sub.4, M.sub.3=11->PP.sub.4=0
BD.sub.5=M.sub.5, M.sub.4=11->PP.sub.5=0
BD.sub.6=M.sub.6, M.sub.5=01->PP.sub.6=+R=00101111
BD.sub.7=M.sub.7, M.sub.6=00->PP.sub.7=0
BD.sub.8=SE, M.sub.7=00->PP.sub.8=0
Add partial products (shifted by 1 each time):
9 partial products for 8*8 bit multiplication
Booth-2 not used in practice as would need same number of partial products as schoolbook long multiplication approach.
Appendix BRadix-4 Modified Booth Multiplication (Booth2)
[0092] M*R (M=multiplicand, R=multiplier)
Multiple selection table:
TABLE-US-00009 Selected Booth digit Multiple of R selected M.sub.i+1, M.sub.i, M.sub.i1 as Partial Product 000 0 001 +1 010 +1 011 +2 100 2 101 1 110 1 111 0
Example M=56, R=47
[0093] M=00111000
[0094] +2R=01011110
[0095] +R=00101111
[0096] R=11010001
[0097] 2R=10100010
BD.sub.0=M.sub.1,M.sub.0,M.sub.1=00(0)->PP.sub.0=0
BD.sub.1=M.sub.3,M.sub.2,M.sub.1=100->PP.sub.1=2R=10100010
BD.sub.2=M.sub.5,M.sub.4,M.sub.3=111->PP.sub.2=0
BD.sub.3=M.sub.7,M.sub.6,M.sub.5=001->PP.sub.3=+R=00101111
BD.sub.4=SE, SE, M.sub.7=000->PP.sub.4=0
Add partial products (shifted by 2 each time):
5 partial products
Appendix CRadix-8 Modified Booth Multiplication (Booth3)8b*8b Multiplication
[0098] M*R (M=multiplicand, R=multiplier)
Multiple selection table:
TABLE-US-00010 Selected Booth digit Multiple of R selected M.sub.i+2: M.sub.i1 as Partial product 0000 0 0001 +1 0010 +1 0011 +2 0100 +2 0101 +3 0110 +3 0111 +4 1000 4 1001 3 1010 3 1011 2 1100 2 1101 1 1110 1 1111 0
Example M=56, R=47
[0099] M=00111000
[0100] +4R=10111100
[0101] +3R=01110011
[0102] +2R=01011110
[0103] +R=00101111
[0104] R=11010001
[0105] 2R=10100010
[0106] 3R=(1) 01110011
[0107] 4R=(1) 01000100
BD.sub.0=M.sub.2:M=000(0)->PP.sub.0=0
BD.sub.1=M.sub.5:M.sub.2=1110->PP.sub.1=R=11010001
BD.sub.2=M.sub.8:M.sub.5=(0)001 (bit 8 sign extended from bit 7)->PP.sub.2=+R=00101111
Add partial products (shift by 3 each time):
3 partial products
Appendix DRadix-8 Modified Booth Multiplication (Booth3)16b*16b Multiplication, Conventional Digit Selection
[0108] M*R (M=multiplicand, R=multiplier)
Multiple selection table same as Appendix C.
Example M=572, R=399
[0109] M=00000010 00111100
[0110] +4R=00000110 00111100
[0111] +3R=00000100 10101101
[0112] +2R=00000011 00011110
[0113] +R=00000001 10001111
[0114] R=11111110 01110001
[0115] 2R=11111100 11100010
[0116] 3R=11111011 01010011
[0117] 4R=11111001 11000100
BD.sub.0=M.sub.2:M.sub.1=100(0)->PP.sub.0=4R=11111001 11000100
BD.sub.1=M.sub.5:M.sub.2=1111->PP.sub.1=0
BD.sub.2=M.sub.8:M.sub.5=0001->PP.sub.2=+R=00000001 10001111
BD.sub.3=M.sub.11:M.sub.8=0010->PP.sub.3=+R=00000001 10001111
BD.sub.4=M.sub.14:M.sub.11=0000->PP.sub.4=0
BD.sub.5=M.sub.17:M.sub.14=(00)00->PP.sub.5=0
Add partial products (shift by 3 each time):
6 partial products.
Appendix ERadix-8 Modified Booth Multiplication (Booth3)16b*16b Multiplication, Modified Digit Selection
[0118] M*R (M=multiplicand, R=multiplier)
Multiple selection table same as Appendix C.
Example M=572, R=399
[0119] M=00000010 00111100
[0120] +4R=00000110 00111100
[0121] +3R=00000100 10101101
[0122] +2R=00000011 00011110
[0123] +R=00000001 10001111
[0124] R=11111110 01110001
[0125] 2R=11111100 11100010
[0126] 3R=11111011 01010011
[0127] 4R=11111001 11000100
BD.sub.0=M.sub.2:M.sub.1=100(0)->PP.sub.0=4R=11111001 11000100
BD.sub.1=M.sub.5:M.sub.2=1111->PP.sub.1=0
BD.sub.2=S:M.sub.7:M.sub.5=(0)001 (zero extend for bit 8)->PP.sub.2=+R=00000001 10001111
BD.sub.3=M.sub.10:M.sub.8:0=010(0) (0 at least significant bit)->PP.sub.3=+2R=00000011 00011110
BD.sub.4=M.sub.13:M.sub.10=0000->PP.sub.4=0
BD.sub.5=M.sub.16:M.sub.13=(0)000 (sign extend for bit 16)->PP.sub.5=0
Add partial products (shift by 3 or 2 each time):
6 partial products. Same result as in Appendix D. Modified digit selection means position of Booth digits is independent of data element size and less multiplexing circuitry needed to select the multiple of R based on the Booth digit.