COMPUTER SYSTEM USING A PLURALITY OF SINGLE INSTRUCTION MULTIPLE DATA (SIMD) ENGINES FOR EFFICIENT MATRIX OPERATIONS
20210406030 · 2021-12-30
Inventors
Cpc classification
G06F9/3887
PHYSICS
G06F17/16
PHYSICS
G06F9/3012
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
Abstract
A computer system including a plurality of SIMD engines and a corresponding plurality of output register sets. Operand A register file stores one or more Operand A values, each including a plurality of operand words. Operand B register file stores one or more Operand B values, each including a plurality of operand words. Operand A distribution circuit receives an Operand A value from the Operand A register file, and selectively routes one or more of the operand words of the received Operand A value to create a plurality of input Operand A values, which are selectively routed to the SIMD engines. Operand B distribution circuit receives one or more Operand B values from the Operand B register file, and selectively routes one or more of the operand words of the Operand B value(s) to create a plurality of input Operand B values, which are selectively routed to the SIMD engines.
Claims
1. A method of performing matrix multiplication of a first matrix and a second matrix using a computer system including a plurality (N) of single instruction multiple data (SIMD) engines and a plurality (N) of corresponding output registers, the method comprising: identifying a plurality of non-zero entries included in the first matrix, wherein each of the non-zero entries has a corresponding column address and a corresponding row address within the first matrix; for each non-zero entry of the identified non-zero entries, identifying one of the SIMD engines and a corresponding one of the output registers to process the non-zero entry in response to the corresponding row address of the non-zero entry; sorting the non-zero entries based on the identified SIMD engines and corresponding output registers, thereby creating a plurality of first operand values, wherein each of the first operand values includes a plurality of the non-zero entries, each having a different identified SIMD engine and corresponding output register; and routing the first operand values to the SIMD engines to perform multiply operations, wherein the routing causes each of the non-zero entries included in the first operand values to be provided to the identified SIMD engines.
2. The method of claim 1, further comprising: for each non-zero entry of the identified non-zero entries, identifying a row of entries within the second matrix in response to the corresponding column address of the non-zero entry; and routing the identified rows of entries to the SIMD engines to perform multiply operations, wherein the each of the SIMD engines multiples a non-zero entry with its identified row of entries.
3. The method of claim 21, further comprising: for each non-zero entry of the identified non-zero entries, identifying a row within the corresponding one of the output registers in response to the corresponding row address of the non-zero entry; and performing accumulate operations by accessing the identified rows of the output registers.
4. The method of claim 1, wherein each row of the first matrix represents a weight vector in a machine learning system, and each column of the second matrix represents an activation vector in the machine learning system.
5. The method of claim 1, further comprising ignoring any zero entries in the first matrix.
6. The method of claim 1, further comprising assigning a unique entry in the output registers to each dot product of the matrix multiplication.
7. The method of claim 1, wherein the first operand values are routed to the SIMD engines such that each of the SIMD engines receives multiple copies of one of the non-zero entries.
8. The method of claim 3, further comprising independently addressing each of the identified rows of the output registers.
9. The method of claim 1, wherein ⅛ or fewer of the entries of the first matrix are non-zero entries.
10. The method of claim 1, wherein identifying one of the SIMD engines and a corresponding one of the output registers to process the non-zero entry in response to the corresponding row address of the non-zero entry comprises: dividing the row address of the non-zero entry within the first matrix by the number N, and then using the remainder of this dividing operation to identify the one of the SIMD engines and the corresponding one of the output registers.
11. The method of claim 3, wherein identifying a row within the corresponding one of the output registers in response to the corresponding row address of the non-zero entry comprises: dividing the row address of the non-zero value within the first matrix by the number N and ignoring any remainder.
12. The method of claim 2, wherein identifying a row of entries within the second matrix in response to the corresponding column address of the non-zero entry comprises: identifying a row of entries in the second matrix having a row address equal to the corresponding column address of the non-zero entry.
13. A computer system comprising: a plurality (N) of single instruction multiple data (SIMD) engines; and a plurality (N) of corresponding output registers, the computer system configured to perform matrix multiplication of a first matrix and a second matrix by performing the steps of: identifying a plurality of non-zero entries included in the first matrix, wherein each of the non-zero entries has a corresponding column address and a corresponding row address within the first matrix; for each non-zero entry of the identified non-zero entries, identifying one of the SIMD engines and a corresponding one of the output registers to process the non-zero entry in response to the corresponding row address of the non-zero entry; sorting the non-zero entries based on the identified SIMD engines and corresponding output registers, thereby creating a plurality of first operand values, wherein each of the first operand values includes a plurality of the non-zero entries, each having a different identified SIMD engine and corresponding output register; and routing the first operand values to the SIMD engines to perform multiply operations, wherein the routing causes each of the non-zero entries included in the first operand values to be provided to the identified SIMD engines.
14. The computer system of claim 13, further configured to perform the steps of: for each non-zero entry of the identified non-zero entries, identifying a row of entries within the second matrix in response to the corresponding column address of the non-zero entry; and routing the identified rows of entries to the SIMD engines to perform multiply operations, wherein the each of the SIMD engines multiples a non-zero entry with its identified row of entries.
15. The computer system of claim 13, further configured to perform the steps of: for each non-zero entry of the identified non-zero entries, identifying a row within the corresponding one of the output registers in response to the corresponding row address of the non-zero entry; and performing accumulate operations by accessing the identified rows of the output registers.
16. The computer system of claim 13, wherein each row of the first matrix represents a weight vector in a machine learning system, and each column of the second matrix represents an activation vector in the machine learning system.
17. The computer system of claim 13, further configured to perform the step of ignoring any zero entries in the first matrix.
18. The computer system of claim 13, further configured to perform the step of assigning a unique entry in the output registers to each dot product of the matrix multiplication.
19. The computer system of claim 13, further configured to perform the step of routing the first operand values to the SIMD engines such that each of the SIMD engines receives multiple copies of one of the non-zero entries.
20. The computer system of claim 15, further configured to perform the step of independently addressing each of the identified rows of the output registers.
21. The computer system of claim 13, wherein ⅛ or fewer of the entries of the first matrix are non-zero entries.
22. The computer system of claim 13, wherein identifying one of the SIMD engines and a corresponding one of the output registers to process the non-zero entry in response to the corresponding row address of the non-zero entry comprises: dividing the row address of the non-zero entry within the first matrix by the number N, and then using the remainder of this dividing operation to identify the one of the SIMD engines and the corresponding one of the output registers.
23. The computer system of claim 15, wherein identifying a row within the corresponding one of the output registers in response to the corresponding row address of the non-zero entry comprises: dividing the row address of the non-zero value within the first matrix by the number N and ignoring any remainder.
24. The computer system of claim 14, wherein identifying a row of entries within the second matrix in response to the corresponding column address of the non-zero entry comprises: identifying a row of entries in the second matrix having a row address equal to the corresponding column address of the non-zero entry.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0032]
[0033]
[0034]
[0035]
[0036]
[0037]
[0038]
[0039]
[0040]
[0041]
[0042]
[0043]
[0044]
[0045]
[0046]
[0047]
[0048]
[0049]
[0050]
[0051]
[0052]
[0053]
[0054]
[0055]
[0056]
DETAILED DESCRIPTION
[0057] The following subsections describe various efficient SIMD engine architectures. Specifically, ways to operate multiple SIMD engines in parallel are proposed, and manners for supplying the SIMD engines with inputs are described. While the following description uses examples that implement 128-bit wide input operands and 4 SIMD engines, it is noted that the described examples can be extended to smaller or larger input operand widths and/or fewer or more SIMD engines.
[0058]
[0059] In general, control logic 430 controls writing operand values (e.g., matrix entries) into Operand A register file 411 and Operand B register file 412. More specifically, state machine and scheduler 431 causes operand packaging circuit 433 to retrieve matrix entries that are stored in system memory 440. Operand packaging circuit 433 packages these matrix entries to form operand values in accordance with the operation to be performed. In one embodiment, the various operations are defined by entries in the control registers 432. As described in more detail below, some operations (e.g., sparse matrix multiplication) require that matrix entries having zero values are omitted from the operand values provided to the operand buffer 410. State machine and scheduler 431 controls the writing of operand values provided by the operand packaging circuit 433 to the Operand A register file 411 and the Operand B register file 412. State machine and scheduler 431 also controls the reading of operand values from Operand A register file 411 and Operand B register file 412, wherein these read values are provided to Operand A distribution circuit 416 and Operand B distribution circuit 417 within input distribution block 415.
[0060] In general, state machine and scheduler 431 provides addresses to input distribution block 415, wherein these addresses control the manner in which the Operand A distribution circuit 416 routes the Operand A values received from Operand A register file 411 to SIMD block 401, and also control the manner in which the Operand B distribution circuit 417 routes the Operand B values received from Operand B register file 412 are routed to SIMD block 401. As described in more detail below, Operand B distribution circuit 417 may include buffers to store multiple Operand B values, as well as shift logic that controls an amount of shift to be applied to the Operand B values received from Operand B register file 412.
[0061] State machine and scheduler 431 also provides addresses used to access memory banks included within the output circuit 420. These addresses include read addresses, which enable accumulation values stored in the memory banks to be routed to the SIMD block 401 for multiply-accumulate operations, as well as write addresses, which enable updated accumulation values provided by SIMD block 401 to be written back to the memory banks within output circuit 420.
[0062] Control registers 432 store values that control the manner in which the state machine and scheduler 431 generates the various addresses for different modes of operation (which are described in more detail below). The operation of the various elements of computer system 400 is described in more detail below for various modes (i.e., architectures).
Architectures for Operand A
[0063] Three architectures, which are described in more detail below, are proposed for input Operand A. In these following examples, the SIMD block 401 includes four SIMD engines operating in parallel, wherein each of the four SIMD engines receives a corresponding input Operand A having a width of 128 bits. A single entry from the Operand A register file 411 (which is included in operand buffer 410) is 128 bits. This entry is hereinafter referred to as a register file word. In the described embodiments, each of the four SIMD engines within SIMD block 401 is identical to the SIMD engine 300 of
Architecture 1A
[0064] In a first architecture for providing Operand A to the SIMD block 401 (Architecture 1A), each of the four SIMD engines (SIMD.sub.0, SIMD.sub.1, SIMD.sub.2, SIMD.sub.3) included in the SIMD block 401 receives a full register file word (which includes four 32-bit word values w, x, y and z) as the input Operand A.
[0065]
[0066]
Architecture 2A
[0067] In a second architecture for providing Operand A to the SIMD block 401 (Architecture 2A), each of the four SIMD engines (SIMD.sub.0, SIMD.sub.1, SIMD.sub.2, SIMD.sub.3) included in the SIMD block 401 receives a single input word from the operand A register file 411, wherein this single input word is repeated a number of times to match the input width of Operand A. Input distribution block 415 selects the single input word by specifying the index of the single input word to be broadcast within the operand A register file 411.
[0068]
[0069]
[0070]
[0071]
Architecture 3A
[0072] In a third architecture for providing the input Operand A to the SIMD block 401 (Architecture 3A), each of the four SIMD engines (SIMD.sub.0, SIMD.sub.1, SIMD.sub.2, SIMD.sub.3) included in the SIMD block 401 receives a single input word from the operand A register file 411, wherein this single input word is repeated a number of times to match the input width of Operand A. However, different SIMD engines are provided with different input words. In one embodiment, the input words are assigned to the SIMD engines in a round-robin manner. Input distribution block 415 selects the single input word for each SIMD by specifying the index of each input word to be provided to each SIMD.
[0073]
[0074]
[0075] Note that in the 16-bit input mode represented by
[0076]
[0077] More details regarding the routing of input operand A in accordance with Architectures 1A, 2A and 3A are provided below in connection with
[0078] Note that the preceding descriptions of Architectures 1A, 2A and 3A implement 32-bit input and 16-bit input modes. However, these embodiments are provided for illustration purpose only. The ideas are general and can be extended in a straightforward manner to other input modes (e.g., 8-bit input mode). Moreover, although the Architectures 1A, 2A and 3A have been described in connection with embodiments that include 4 SIMD engines and a 128-bit register file word, other numbers of SIMD engines and register file word widths can be used in other embodiments in a straightforward manner.
[0079] In actual hardware implementation, multiple architectures can be implemented together by sharing hardware resources. The hardware can be programmed to operate different architectures as modes that can be chosen by some register settings. For example, control registers 432 (
[0080] For example, in Architecture 2A, the index of the single value to be broadcast needs to be provided to the Operand A distribution circuit 416. The index can have different interpretations depending on whether the data is 8-bit, 16-bit or 32-bit wide (which could be specified by a control register setting).
[0081] Similarly, in Architecture 3A, the index of the value to be broadcast to SIMD.sub.0 needs to be provided to the Operand A distribution circuit 416. From this index, the indices for the values to be broadcast to the other SIMD engines (SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3) can be inferred by the hardware by incrementing. Or all the four indices can be provided to the Operand A distribution circuit 416.
[0082] The data stored in the buffers of the Operand A distribution circuit 416 can be reused over multiple cycles so that the register file words do not need to be read every cycle from the Operand A register file 411. Separate control logic can supply a flag specifying which cycles need to load the data from the Operand A register file 411. Additionally, the Operand A distribution circuit 416 can contain multiple buffers to hold the register file word data with control logic specifying the buffer indices to use for writing and reading. In one embodiment, the Operand A distribution circuit 416 contains two buffers: one for writing and one for reading, which are used in a ping-pong manner. The state machine and scheduler 431 automatically manages the read and write indices. This scheme is generally known as double buffering. In such cases, no additional control logic is needed to specify buffer indices for read and write.
Architectures for Operand B
[0083] In accordance other embodiments, multiple architectures are used to provide the input Operand B to the SIMD block 401. As described below, Operand B distribution circuit 417 can be configured in four different architectures (Architecture 1B, Architecture 2B, Architecture 3B and Architecture 4B) to provide the input Operand B to SIMD block 401.
Architecture 1B
[0084] In a first architecture for providing Operand B to the SIMD block 401 (Architecture 1B), each of the four SIMD engines (SIMD.sub.0, SIMD.sub.1, SIMD.sub.2, SIMD.sub.3) included in the SIMD block 401 receives a full register file word (which includes four 32-bit word values a, b, c and d) as the input Operand B. Note that Architecture 1B for providing Operand B to the SIMD block 401 is similar to Architecture 1A for providing Operand A to the SIMD block 401.
[0085]
[0086]
Architecture 2B
[0087] In the architectures considered so far (for Operand A as well as for Operand B), all the SIMD engines use data from a single register file word at a given time. This can make the architectures rigid in terms of the type of operations they can support. In accordance with one embodiment, different SIMD engines are provided with different register file words from the operand register files.
[0088] One method to achieve this would be to allow multiple reads to the register file simultaneously. While this is possible, the hardware complexity can be prohibitive.
[0089] In one embodiment, multiple entries (not necessarily distinct) can be read simultaneously from Operand B register file 412. The most general way to implement this is to use a multi-read-port memory to implement this register file 412. A memory with four read ports can be used to simultaneously read four entries from the Operand B register file 412. However, such a memory configuration has a high hardware complexity (occupies a relatively large area and consumes a relatively high power). Thus, preferred embodiments of the present invention include low complexity methods and structures for supplying the different SIMD engines with (possibly) different input Operand B values. While these preferred embodiments may not provide as much generality as the broad (multiple read port) method, they are efficient for the purposes of the algorithms to be implemented.
[0090] In accordance with one embodiment, a second architecture (Architecture 2B) for providing input Operand B to the SIMD engines is provided, wherein a small number of entries from the Operand B register file 412 are buffered in the Operand B distribution circuit 417 and then distributed to the SIMD engines of SIMD block 401. Intuitively, this can be thought of as an approach that gives some flexibility for each SIMD by allowing them to address any entry from a small number of entries. This keeps hardware complexity small.
[0091] The main characteristics of the second architecture (Architecture 2B) for providing the input Operand B to the SIMD block 401 can be defined as follows. The Operand B distribution circuit 417 includes a plurality of Operand B buffers to hold values read from the Operand B register file 412. Each of these Operand B buffers can hold one full register file word. Each SIMD can receive the register file word stored in any one of the Operand B buffers. A buffer select mechanism is used to specify which of the Operand B buffers is coupled to each of the SIMD engines. The Operand B buffers are filled one at a time from the Operand B register file 412. When a new register file word needs to be loaded into the Operand B buffers from Operand B register file 412, one of the previous Operand B buffers is overwritten. There can be multiple schemes to determine which Operand B buffer needs to be overwritten. One simple scheme is that the Operand B buffer with oldest data is overwritten (i.e., the Operand B buffers are used in a round-robin fashion). In another scheme, control logic 430 can specify which Operand B buffer needs to be overwritten.
[0092] It is not necessary to load the data from the Operand B register file 412 into the operand B buffers during every cycle. Separate control logic 430 can specify a flag for every cycle to indicate if new data needs to be read from the Operand B register file 412 into the Operand B buffers of Operand B distribution circuit 417. In the actual hardware implementation, each Operand B buffer may use a double buffering scheme so that read and write operations to an Operand B buffer do not occur in the same cycle.
[0093]
[0094] Operand B buffer select logic 1601 (which may be included in the state machine and scheduler 431 of control logic 430) is used to determine the manner in which the contents of Operand B buffers B0, B1, B2 and B3 are provided to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3. In one embodiment, Operand B buffer select logic 1601 includes four buffer select entries bs0, bs1, bs2 and bs3, which store values that specify which of the Operand B buffers B0-B3 provide their contents to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3, respectively. In the illustrated example, the four entries bs0, bs1, bs2 and bs3 specify operand B buffers B0, B0, B1 and B2, respectively, indicating that the contents of operand B buffer BC (i.e., [a, b, c, d]) are provided to SIMD.sub.0 and SIMD.sub.1, the contents of operand B buffer B1 (i.e., [e, f, g, h]) are provided to SIMD.sub.2, and the contents of operand B buffer B2 (i.e., [i, j, k, l]) are provided to SIMD.sub.3. In subsequent cycles, the buffer selection may change by changing the buffer select entries bs0, bs1, bs2 and bs3. It is noted that if the number of operand buffers is reduced to 1, then Architecture 2B would be equivalent to Architecture 1B.
Architecture 3B
[0095] Another approach to effectively allow for multiple reads from the operand B register file 412 is to implement the operand B register file 412 using a plurality of register files, each of which allows a single read operation to be performed at a time. As noted before, having one large memory with 4 read ports can be more expensive than four smaller memories with one read port each. However, it is worth noting that the larger memory with 4 read ports offers more flexibility in terms of the data that can be read. When four smaller memories with single read port are used, four entries can be read at a given time, but each of the entries has to belong to a different memory. This is not the case with a 4 read-port memory that allows any 4 entries to be read simultaneously.
[0096] The main characteristics of the third architecture (Architecture 3B) for providing the input Operand B to the SIMD block 401 can be defined as follows. There is more than one Register File for Operand B. In one specific case, the number of Register Files for Operand B is equal to the number of SIMD engines included in SIMD block 401. Thus, if there are four SIMD engines, then there will be four corresponding Operand B register files. However, other cases are possible and it is easy to extend the architecture to those cases.
[0097] The multiple Operand B Register Files can be read simultaneously. In a simple case, each SIMD receives its input Operand B directly from one of the Operand B register files. If the number of SIMD engines is equal to the number of operand B register files, then each of the SIMD engines can receive an input Operand B from a corresponding one of the Operand B register files.
[0098] In the general case, the Operand B distribution circuit 417 can contain operand buffers (similar to Architecture 2B) to hold the data read from the Operand B register files. This can allow multiple cycles to use same data. Also, the Operand B register files need not be read every cycle due to reuse of the buffered data. A load flag can specify the cycles in which data needs to be read from the Operand B register files to the Operand B distribution circuit 417. A separate block can also specify the address of the buffer to load for every SIMD, as described above in connection with Architecture 2B.
[0099]
[0100] Operand B buffer select logic 1701 (which may be included in the state machine and scheduler 431 of control logic 430) is used to determine the manner in which the contents of Operand B buffers BM.sub.0, BM.sub.1, BM.sub.2 and BM.sub.3 are provided to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3. In one embodiment, Operand B buffer select logic 1701 includes four buffer select entries bms0, bms1, bms2 and bms3, which store values that specify which of the Operand B buffers BM.sub.0-BM.sub.3 provide their contents to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3, respectively. In the illustrated example, the four entries bms0, bms1, bms2 and bms3 specify operand B buffers BM.sub.0, BM.sub.1, BM.sub.2 and BM.sub.3, respectively, indicating that the contents of operand B buffer BM.sub.0 (i.e., [a0, b0, c0, d0]) are provided to SIMD.sub.0, the contents of operand B buffer BM.sub.1 (i.e., [e0, f0, g0, h0]) are provided to SIMD.sub.1, the contents of operand B buffer BM.sub.2 (i.e., [i0, j0, k0, l0]) are provided to SIMD.sub.2, and the contents of operand B buffer BM.sub.3 (i.e., [m0, n0, o0, p0]) are provided to SIMD.sub.3. In subsequent cycles, the buffer selection may change by changing the buffer memory select entries bms0, bms1, bms2 and bms3.
[0101]
[0102] Operand B buffer select logic 1801 (which may be included in the state machine and scheduler 431 of control logic 430) is used to determine the manner in which the contents of Operand B buffers B01-B02, B11-B12, B21-B22 and B31-B32 are provided to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3. In one embodiment, Operand B buffer select logic 1801 includes four buffer select entries bs01, bs11, bs21 and bs31, which store values that specify which of the Operand B buffers B01-B02, B11-B12, B21-B22 and B31-B32 provide their contents to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3, respectively. In the illustrated example, the four entries bs01, bs11, bs21 and bs31 specify operand B buffers B02, B12, B22 and B32, respectively, indicating that the contents of Operand B buffer B02 (i.e., [a0, b0, c0, d0]) are provided to SIMD.sub.0, the contents of Operand B buffer B12 (i.e., [e0, f0, g0, h0]) are provided to SIMD.sub.1, the contents of Operand B buffer B22 (i.e., [i0, j0, k0, 10]) are provided to SIMD.sub.2, and the contents of Operand B buffer B32 (i.e., [m0, n0, o0, p0) are provided to SIMD.sub.3. In one embodiment, Operand B distribution circuit 417 includes switching/demultiplexing circuitry that performs the above-described routing in response to the buffer select entries bs01, bs11, bs21 and bs31. Note that Operand B buffer select logic 1801 can select any of the operand buffers B01-B02, B11-B12, B21-B22 and B31-B32 to provide input Operand B to any of the SIMD engines. For example, buffer select entry bs01 may store a value (B31) that causes the contents of Operand B buffer B31 (i.e., [m1, n1, o1, p1] to be routed to SIMD.sub.0. In subsequent cycles, the buffer selection may change by changing the buffer select entries bs01, bs11, bs21 and bs31. Note that in other embodiments, different numbers of Operand B buffers can be included in Operand B distribution circuit 417.
Architecture 4B
[0103] In a fourth architecture for providing Operand B to the SIMD block 401 (Architecture 4B), an architecture similar to Architecture 3B is provided, with the added feature that each Operand B register file allows reading two entries at a time and choosing one register file word worth of data by applying some shifting operations. Control logic 340 specifies the addresses of two rows to be read from each Operand B register file, as well as the amount of shift to be applied to the entries read from these two rows. This functionality is typically realized in hardware by implementing each Operand B register file memory as two banks of memory. This allows reading two entries at the same time. The two register file words are then fed into a shifting logic module that receives an amount of shift as an input parameter and outputs one register file word worth of data. The addresses for the two banks and the amount of shift are supplied by state machine and scheduler 431.
[0104]
[0105] Memory bank 1912.sub.10 stores register file words [e0, f0, g0, h0], [e2, f2, g2, h2] and [e4, f4, g4, h4] and memory bank 1912.sub.11 stores register file words [e1, f1, g1, h1], [e3, f3, g3, h3] and [e5, f5, g5, h5].
[0106] Memory bank 1912.sub.20, stores register file words [i0, j0, k0, l0], [i2, j2, k2, l2] and [i4, j4, k4, 14] and memory bank 1912.sub.21 stores register file words [i1, j1, k1, l1], [i3, j3, k3, l3] and [i5, j5, k5, l5].
[0107] Memory bank 1912.sub.30, stores register file words [m0, n0, o0, p0], [m2, n2, o2, p2] and [m4, n4, o4, p4] and memory bank 1912.sub.31 stores register file words [m1, n1, o1, p1], [m3, n3, o3, p3] and [m5, n5, o5, p5].
[0108] Register file words read from the memory bank pairs 1912.sub.00-1912.sub.01, 1912.sub.10-1912.sub.11, 1912.sub.20-1912.sub.21 and 1912.sub.30-1912.sub.31 are provided to shift logic circuit 1901 in Operand B distribution circuit 417. Outputs of shift logic circuit 1901 are provided to Operand B buffers B0, B1, B2 and B3 in Operand B distribution circuit 417.
[0109] Control logic 430 (and more specifically, state machine and scheduler 431) controls the register file words read from memory banks 1912.sub.00-1912.sub.01, 1912.sub.10-1912.sub.11, 1912.sub.20-1912.sub.21 and 1912.sub.30-1912.sub.31. In general, control logic 430 causes register file words to be simultaneously read from the memory banks 1912.sub.00-1912.sub.01, 1912.sub.10-1912.sub.11, 1912.sub.20-1912.sub.21 and 1912.sub.30-1912.sub.31. The addresses provided to each of the memory bank pairs may selected such that two different consecutive register file words are read from each of the memory banks, thereby providing the register file words necessary to perform a shifting operation. For example, register file words [a0, b0, c0, d0] and [a1, b1, c1, d1] may be simultaneously read from memory banks 1912.sub.00 and 1912.sub.01, respectively; register file words [e0, f0, g0, h0] and [e1, f1, g1, h1] may be simultaneously read from memory banks 1912.sub.10 and 1912.sub.11, respectively; register file words [i0, j0, k0, l0] and [i1, j1, k1, l1] may be simultaneously read from memory banks 1912.sub.20 and 1912.sub.21, respectively; and register file words [m0, n0, o0, p0] and [m1, n1, o1, p1] may be simultaneously read from memory banks 1912.sub.30 and 1912.sub.31, respectively. The shift logic circuit 1901 receives the eight register file words provided by Operand B register files 1912.sub.0-1912.sub.3.
[0110] Control logic 340 also controls the amount of shift introduced by shift logic circuit 1901. In general, Table 1 below defines the values provided by shift logic circuit 1901 to operand buffers B0-B3 in the present example, for various shift values. Note that each shift value introduces an additional 32-bit shift to the received pairs of register file words.
TABLE-US-00001 TABLE 1 Shift B0 B1 B2 B3 0 [a0 b0 c0 d0] [e0 f0 g0 h0] [i0 j0 k0 l0] [m0 n0 o0 p0] 1 [b0 c0 d0 a1] [f0 g0 h0 e1] [j0 k0 l0 i1] [n0 o0 p0 m1] 2 [c0 d0 a1 b1] [g0 h0 e1 f1] [k0 l0 i1 j1] [o0 p0 m1 n1] 3 [d0 a1 b1 c1] [h0 e1 f1 g1] [10 i1 j1 k1] [p0 m1 n1 o1] 4 [a1 b1 c1 d1] [e1 f1 g1 h1] [i1 j1 k1 l1] [m1 n1 o1 p1]
[0111] The contents of operand B buffers B0, B1, B2 and B3 are routed to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3, respectively, as input operand B. In the foregoing manner, shifting may be efficiently performed within the register file words stored by Operand B register files 1912.sub.0-1912.sub.3.
[0112] Again, in actual hardware implementation, multiple architectures can be implemented together by sharing hardware resources. The hardware can be programmed to operate different architectures as modes that can be chosen by a register.
[0113] More details regarding the routing of input Operand B in accordance with Architectures 1B, 2B, 3B and 4B are provided below in connection with
[0114] Although the preceding descriptions of Architectures 1B, 2B, 3B and 4B implement 32-bit input and 16-bit input modes, it is understood that these architectures can easily be modified to implement input modes of other widths (e.g., 8-bit input mode). Moreover, although the Architectures 1B, 2B, 3B and 4B have been described in connection with embodiments that include 4 SIMD engines and a 128-bit register file word, other numbers of SIMD engines and register file word widths can be used in other embodiments in a straightforward manner.
[0115] In actual hardware implementation, multiple architectures can be implemented together by sharing hardware resources. The hardware can be programmed to operate different architectures as modes that can be chosen by some register settings. For example, control registers 432 (
Output Circuitry
[0116] Output circuit 420 (
[0117]
[0118] As described in more detail below, the row addresses of the output registers associated with each SIMD can be thought of as input signals to the SIMD engines. The row address is the index of the row within the SIMD (referred to as relative index within the SIMD).
[0119]
A Unified Architecture
[0120]
[0121] The operand block 2210 is responsible for holding the operand data. The data from operand register files 2211 and 2212 is transferred to the input distribution block 2220 based on control signals that are explained below. In one embodiment of the unified architecture of system 2200, Operand A register file 2211 includes one register file, and Operand B register files 2212 include four register files. Further, the four register files for Operand B allow two simultaneous reads (i.e., the memory is split into two banks). While the general system can contain arbitrary number of register files for each of the operands, the optimal hardware should be designed such that it uses the least number of register files but supports all required operations.
[0122] In the illustrated, three control signals, OP_A_RF_SRC_ADDR_SEL, OP_A_RF_DEST_ADDR_SEL and OP_A_RF_LOAD_FLAG, are used to control the operation of Operand A register files 2211. Similarly, three control signals, OP_B_RF_SRC_ADDR_SEL, OP_B_RF_DEST_ADDR_SEL and OP_B_RF_LOAD_FLAG, are used to control the operation of Operand B register files 2212.
[0123] The OP_A_RF_LOAD_FLAG and OP_B_RF_LOAD_FLAG signals specify if data needs to be transferred from the Operand A register files 2211 and the Operand B register files 2212, respectively, to the Operand A IDB buffers 2221 and Operand B IDB buffers 2222, respectively. If the OP_A_RF_LOAD_FLAG signal has a value of 1, then the two associated control signals (OP_A_RF_SRC_ADDR_SEL and OP_A_RF_DEST_ADDR_SEL) specify the source and destination addresses for the Operand A data. Similarly, if the OP_B_RF_LOAD_FLAG signal has a value of 1, then the two associated control signals (OP_B_RF_SRC_ADDR_SEL and OP_B_RF_DEST_ADDR_SEL) specify the source and destination addresses for the Operand B data. Note that not all embodiments will require destination addresses for the Operand A and Operand B data (i.e., if there is only one possible destination for the Operand A or Operand B data). If the OP_A_RF_LOAD_FLAG signal or the OP_B_RF_LOAD_FLAG signal has a value of 0, no data is read or transferred from the corresponding Operand A register files 2211 or the Operand B register files 2212. The OP_A_RF_LOAD_FLAG signal and the OP_B_RF_LOAD_FLAG signal can be generated by state machine and scheduler 431 of control logic 430.
[0124] The OP_A_RF_SRC_ADDR_SEL signal specifies the row address in the Operand A register file(s) 2211 to be read. In the modes of operation described above, the OP_A_RF_SRC_ADDR_SEL signal will include just one address, which specifies Operand A register file to be read. The OP_B_RF_SRC_ADDR_SEL signal specifies the row address(es) in the Operand B register file(s) 2212 to be read. Depending on the mode of the operation, the OP_B_RF_SRC_ADDR_SEL signal can be just one address (Architecture 1B or 2B) or four addresses (Architecture 3B) or 8 addresses (Architecture 4B). The hardware has appropriate modes to handle the different cases, wherein these modes are specified by control registers 432 of control logic 430. The above-described source addresses can be generated in hardware by state machine and scheduler 431 of control logic 430.
[0125] The OP_A_RF_DEST_ADDR_SEL signal specifies the destination address in the Operand A IDB buffers 2221, into which the data read from the Operand A register files 2211 is transferred. Similarly, the OP_B_RF_DEST_ADDR_SEL signal specifies the destination addresses in the Operand B IDB buffers 2222 (or shift logic 2224) to which the data read from the Operand B register files 2212 is transferred. Again depending on the mode of operation, these addresses can be a single address or multiple addresses. The addresses can be generated in hardware by state machine and scheduler 431 of control logic 430.
[0126] Note that multiple SIMD engines share the operand register files and control logic. This results in higher compute density i.e., more computation capacity per unit silicon area. Sharing of operand register files and control logic also saves power and SRAM bandwidth. The savings in SRAM bandwidth come from the fact that only two operand register files need to be written into to support multiple SIMD engines.
[0127] The input distribution block 2220 includes buffers 2221-2222 to hold the Operand data and logic blocks 2223-2224 to manipulate the data. As illustrated by
[0128] Instead of feeding SIMD engines directly from operand register files, the input distribution block 2220 acts as a small cache from which the SIMD engines are fed the operands. Input distribution block 2220 allows multiple SIMD engines to run in parallel using a single control circuit. As described above, the data from Operand A register file 2211 can be manipulated so that it provides distinct data to multiple SIMD engines.
[0129] The OP_A_IDB_ADDR_SEL signal specifies the address of the Operand A IDB buffer to be used for each SIMD. In all the architectures discussed, we have used only one register file word data for Operand A, but in the general case buffers 2221 can hold multiple register file words and each of the SIMD engines can possibly choose a different register file word from the buffers 2221. Typically there is only one Operand A buffer in the optimal architecture. However, this single buffer is actually implemented using a double buffer so that read and write do not happen to the same buffer in one cycle. The hardware manages this double buffer in a transparent way. Hence, this signal is internally managed by hardware in most cases.
[0130] The OP_A_IDB_DATA_SEL signal, which controls the Operand A IDB logic 2223, specifies the data that needs to be transferred to each SIMD. For example, in Architecture 2A, a single value is effectively replicated and broadcast to all SIMD engines. This signal specifies the index of the value that needs to be replicated. Similarly, in Architecture 3A, four consecutive values are taken from a register file word and each one of them is effectively replicated and sent to one SIMD. In this case, the OP_A_IDB_DATA_SEL signal specifies the index of the value that needs to be replicated for SIMD.sub.0. For the other SIMD engines (SIMD.sub.1-SIMD.sub.3), the index values are incremental. For Architecture 1A, since the full register word stored in buffer 2221 is sent to all of the SIMD engines, the OP_A_IDB_DATA_SEL signal is not needed.
[0131] The OP_B_IDB_SHIFT_SEL signal, which controls the Operand B IDB shift logic 2224, is used to control the manner in which register file words received from Operand B register files 2212 are shifted (i.e., when two register file words from the same register file for Operand B are read. Note that the OP_B_IDB_SHIFT_SEL signal (and the Operand B IDB shift logic 2224) is only required when the system 2200 is implementing Architecture 4B. In this case, the OP_B_IDB_SHIFT_SEL signal specifies how the two register file words need to be manipulated to produce one register file word (in the manner described above).
[0132] Convolution operations typically involve data shifts. Locating the Operand B shift logic 2224 between the Operand B register files 2212 and the SIMD engines advantageously reduces hardware overhead by allowing data to be read from the Operand B register files 2212 multiple times, with different shifts applied to the data each time. If the shift logic 2224 is not implemented in this manner, the shifted data would need to be written to Operand B register file 2212, and therefore could not be reused as many times as in the proposed architecture.
[0133] The OP_B_IDB_ADDR_SEL value specifies the addresses of the Operand B IDB buffers that will provide their contents as inputs for each of the SIMD engines. This signal was illustrated for Architecture 2B in
[0134] The use of multiple Operand B buffers 2222 in the input distribution block 2220 allows different SIMD engines to potentially get different Operand B data at a given cycle. Using four Operand B buffers 2222 (i.e., the same as the number of SIMD engines) allows four simultaneous reads, so that each SIMD receives different data. This is much less expensive (from a hardware perspective), than implementing the Operand B register file 2212 with a four port memory (which would also allow four simultaneous read operations to supply SIMD.sub.0-SIMD.sub.3). Providing Operand B buffers 2222 to buffer a small number of register words from the Operand B register files 2212 effectively provides a small cache that can be accessed by any of SIMD.sub.0-SIMD.sub.3. This presents a good compromise between hardware complexity and the required flexibility for some classes of algorithms.
[0135] The SIMD block 2230 includes one or more SIMD engines which perform the actual computations on the data provided by the input distribution block 2220. Because SIMD engines can support different type of operations, the operation to be performed should be provided as an input. Thus, the SIMD_OPERATION_SEL value is used to specify the operations to be performed by the SIMD engines. Theoretically, different SIMD engines can perform different operations, but in general, the same operation select value SIMD_OPERATION_SEL is used to drive all the SIMD engines.
[0136] The result of the computations performed by the SIMD engines need to be written to output register files 2241-2244 within output block 2240. Also, for operations like accumulation, previously accumulated values need to be read from the output register files 2241-2244 (and provided to the SIMD engines). Generally, the accumulated values are written back into the same location as the previously accumulated values. However, for the sake of generality, two control values OUTPUT_RF_ADDR_SEL_0 and OUTPUT_RF_ADDR_SEL_1 are provided to output block 2240, thereby allowing the read and write addresses of each of the output register files 2241-2244 to be specified separately. In one embodiment, the control value OUTPUT_RF_ADDR_SEL_0 specifies the write addresses to each of the output register files 2241-2244, and the control value OUTPUT_RF_ADDR_SEL_1 specifies the read addresses to each of the output register files 2241-2244. An illustration of specifying the output addresses was given using
[0137] Note that including multiple output registers in each of the output register files 2241-2244 advantageously provides flexibility with regard to the type of operations that can be performed by the described system architecture. Some examples of this flexibility are described in more detail below.
[0138] Various examples for operating a computer architecture in accordance with a particular embodiment of the present invention will now be described.
[0139]
[0140]
[0141] Matrix I and Matrix J are stored in system memory 440 (
[0142] Matrix J is stored in an Operand B memory block 442 that includes 16 rows, each row including four activation values. For example, the first row of Operand B memory block 442 includes activation values [d.sub.0, c.sub.0, b.sub.0, a.sub.0] included in the first row of Matrix J. The remaining rows (Row 1-Row 15) of Matrix J are stored in consecutive rows (Row 1-Row 15) of Operand B memory block 442.
[0143] The multiplication of Matrix I and Matrix J is performed as follows.
[0144] State machine and scheduler 431 (
[0145] Each of the SIMD engines (SIMD.sub.0-SIMD.sub.3) multiplies the corresponding entries of Operand A and Operand B (e.g., SIMD.sub.0 performs (a.sub.0×w.sub.0,0), (b.sub.0×w.sub.0,0), (c.sub.0×w.sub.0,0) and (d.sub.0×w.sub.0,0)) to generate corresponding products.
[0146]
[0147] State machine and scheduler 431 controls addressing of the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 in parallel. During the initial calculation (described above and illustrated in
[0148] During the initial calculation, each of SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 performs accumulation operations, wherein the zero values retrieved from the output register sets 2000.sub.0-2000.sub.3 are added to the products calculated by SIMD.sub.0-SIMD.sub.3. The accumulated values are then written back to Row 0 of the corresponding output register sets 2000.sub.0-2000.sub.3.
[0149] For example, the zero values from the entries (w.sub.0,i d.sub.i), (w.sub.0,i c.sub.i), (w.sub.0,i b.sub.i) and (w.sub.0,i a.sub.i) of Row 0 of output register set 2000.sub.0 are provided to SIMD.sub.0. SIMD.sub.0 then adds the calculated products (w.sub.0,0×d.sub.0), (w.sub.0,0×c.sub.0), (w.sub.0,0×b.sub.0) and (w.sub.0,0×a.sub.0) to these retrieved zero values to create accumulated values. SIMD.sub.0 then writes these accumulated values back to the entries (w.sub.0,i d.sub.i), (w.sub.0,i c.sub.i), (w.sub.0,i b.sub.i) and (w.sub.0,i a.sub.i) of Row 0 of output register set 2000.sub.0. Similar operations are performed by SIMD.sub.1-SIMD.sub.3.
[0150] State machine and scheduler 431 then increments address used to access Operand A memory block 441, causing the next row of values (i.e., w.sub.4,0, w.sub.5,0, w.sub.6,0, and w.sub.7,0) to be retrieved and stored in Operand A register file 411. Operand A distribution circuit 416 routes these received values in the same manner described above in connection with
[0151] Each of the SIMD engines (SIMD.sub.0-SIMD.sub.3) multiplies the corresponding entries of Operand A and Operand B (e.g., SIMD.sub.0 performs (a.sub.0×w.sub.4,0), (b.sub.0×w.sub.4,0), (c.sub.0×w.sub.4,0) and (d.sub.0×w.sub.4,0)) thereby providing corresponding products.
[0152] During this second calculation, state machine and scheduler 431 increments the row address of each of the output register sets 2000.sub.0-2000.sub.3, thereby addressing Row 1 within each of these output register sets. As a result, the zero values stored in Row 1 of the output register sets 2000.sub.0-2000.sub.3 are provided to SIMD.sub.0-SIMD.sub.3.
[0153] During the second calculation, each of SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 performs accumulation operations, wherein the zero values retrieved from the output register sets 2000.sub.0-2000.sub.3 are added to the products calculated by the SIMD engines. The accumulated values are then written back to the output register sets 2000.sub.0-2000.sub.3.
[0154] For example, the zero values from the entries (w.sub.4,i d.sub.i), (w.sub.4,i c.sub.i), (w.sub.4,i b.sub.i) and (w.sub.4,i a.sub.i) of Row 1 of output register set 2000.sub.0 are provided to SIMD.sub.0. SIMD.sub.0 then adds the calculated products (w.sub.4,0×d.sub.0), (w.sub.4,0×c.sub.0), (w.sub.4,0×b.sub.0) and (w.sub.4,0×a.sub.0) to these retrieved zero values to create accumulated values. SIMD.sub.0 then writes these accumulated values back to the entries (w.sub.4,i d.sub.i), (w.sub.4,i c.sub.i), (w.sub.4,i b.sub.i) and (w.sub.4,i a.sub.i) of Row 1 of output register set 2000.sub.0. Similar operations are performed by SIMD.sub.1-SIMD.sub.3.
[0155] The above-described process is repeated until Operand A distribution circuit 416 sequentially routes all (64) of the weight values w.sub.0,0 to w.sub.63,0 from the first column (Col 0) of Matrix I to SIMD.sub.0-SIMD.sub.3 as Operand A values in the manner described above.
[0156] After the weight values from the first column (Col 0) of Matrix I have been used to perform multiply-accumulate operations (e.g., after products associated with values w.sub.0,0 to w.sub.63,0 have been calculated), state machine and scheduler 431 resets the addresses of output register sets 2000.sub.0-2000.sub.3 to Row 0. In addition, state machine and scheduler 431 increments the address used to access Operand A memory block 441, such that the values (w.sub.0,1, w.sub.i,1, w.sub.2,1, w.sub.3,1) are retrieved and stored in Operand A register file 411. Operand A distribution circuit 416 routes these values (w.sub.0,1, w.sub.1,1, w.sub.2,1, w.sub.3,1) to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3, respectively (in the same manner described above in connection with
[0157] State machine and scheduler 431 also increments the address used to access Operand B memory block 442 by one, such that values (a.sub.i, b.sub.i, c.sub.i, d.sub.i) from Row 1 of Operand B memory block 442 are retrieved and stored in Operand B register file 412. Operand B distribution circuit 417 routes these values (a.sub.i, b.sub.i, c.sub.i, d.sub.i) to each of SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 (in the same manner that values (a.sub.0, b.sub.0, c.sub.0, d.sub.0) were previously routed to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 in
[0158] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 perform multiply-accumulate operations on the received values, and the results are stored in Row 0 of the output registers 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively, in the manner described above.
[0159] State machine and scheduler 431 then increments address used to access Operand A memory block 441, causing the next row of values (i.e., w.sub.4,0, w.sub.5,0, w.sub.6,0, and w.sub.7,0) to be retrieved and stored in Operand A register file 411. Operand A distribution circuit 416 routes these received values in the same manner described above in connection with
[0160] The above-described process is repeated until Operand A distribution circuit 416 sequentially routes all (64) of the weight values w.sub.0,1 to w.sub.63,1 from the second column (Col 1) of Matrix I to SIMD.sub.0-SIMD.sub.3 as Operand A values (while Operand B (a.sub.1, b.sub.1, c.sub.1, d.sub.1) remains unchanged).
[0161] The above-described process is then repeated, such that multiply-accumulate operations are performed for each of the columns (Col 0 to Col 15) of Matrix I and each of the rows (Row 0 to Row 15) of Matrix J. At the end of this process, the output register sets 2000.sub.0-2000.sub.3 store the dot product of each row of matrix I with each column of matrix J. For example, the entry (w.sub.0,i a.sub.i) of output register set 2000.sub.0 stores the dot product of Row 0 of matrix I and Col 0 of matrix J, and the entry (w.sub.29,i d.sub.i) of output register set 2000.sub.1 stores the dot product of Row 29 of matrix I and Col 3 of matrix J.
[0162] Advantageously, the present invention provides an efficient structure for multiplying matrix I and matrix J. SIMD.sub.0-SIMD.sub.3 provide a high degree of processing parallelism (e.g., sixteen parallel multiply-accumulate operations at a time), which advantageously reduces the time required to perform the matrix multiplication. In addition, the control circuitry required to implement the matrix multiplication is advantageously simple. Address inputs to Operand A memory block 441 and output register sets 2000.sub.0-2000.sub.3 are simply incremented after each multiply-accumulate operation, and the address input to Operand B memory block 442 is simply incremented after every 16 multiply-accumulate operations. Input distribution block 415 advantageously maintains the same configuration during the entire matrix multiplication.
[0163] A matrix that contains a large number of zero value entries is referred to as a ‘sparse’ matrix. For example, a matrix that includes ⅞ zero value entries or more may be referred to as a sparse matrix. Multiplication involving a sparse matrix may involve a large number of unnecessary operations. In the example provided above, assume that ⅞ of the entries of Matrix I include zero values. In this case, only 512 (16×16×16×(⅛)) multiply-accumulate operations are required to multiply matrix I and matrix J. However, all 4096 operations (16×16×16) described above would be performed by the method described above in connection with
[0164] Assume that Matrix I is a sparse matrix, wherein only one eighth of the entries of Matrix I have non-zero values. As described above, processing is sequentially performed for each column of Matrix I (e.g., column 0 of Matrix I is initially processed, followed by column 1 of Matrix I, etc.). Thus, the processing of the first column of Matrix I will be described, with the understanding that the remaining columns of Matrix I are processed in the same manner.
[0165] In a first example, it is assumed that only the following eight entries (of the 64 total entries) of Column 0 of Matrix I have non-zero values: w.sub.3,0, w.sub.5,0, w.sub.8,0, w.sub.10,0, w.sub.11,0, w.sub.24,0, w.sub.58,0, and w.sub.61,0.
[0166] Initially, operand packaging logic 433 identifies the row addresses of the non-zero values within Matrix I. Thus, in the present example, operand packing logic 433 determines that the non-zero values w.sub.3,0, w.sub.5,0, w.sub.8,0, w.sub.10,0, w.sub.11,0, w.sub.24,0, w.sub.58,0, and w.sub.61,0 are located in rows 3, 5, 8, 10, 11, 24, 58 and 61, respectively, of Matrix I. Using this row address information, operand packing logic 433 determines which of the output register sets 2000.sub.0-2000.sub.3 are used to store the dot products associated with the identified non-zero values. In general, this determination is made by dividing the row address of the non-zero value within Operand Matrix I by ‘4’, and then using the remainder (R) of this division operation to identify the output register set (wherein the remainder (R) identifies output register set 2000.sub.R).
[0167] Operand packing logic 433 also determines the row within the output register set where the dot product is stored. In general, this determination is made by dividing the row address of the non-zero value within Matrix I by ‘4’, and ignoring the remainder (R).
[0168] In the present example, non-zero values w.sub.8,0 and w.sub.24,0, are located in rows 8 and 24 of Matrix I. Dividing these row numbers by 4 result in remainders of ‘0’, thereby indicating that the dot products of non-zero values w.sub.8,0 and w.sub.24,0 are located in output register set 2000.sub.0. Moreover, because 8/4=2 and 24/4=6, the dot products of non-zero values w.sub.8,0 and w.sub.24,0 are located in Row 2 and Row 6, respectively, of output register set 2000.sub.0. This result is confirmed by
[0169] In the present example, non-zero values w.sub.5,0 and w.sub.61,0, are located in rows 5 and 61 of matrix I. Dividing these row numbers by 4 result in remainders of ‘1’, thereby indicating that the dot products of non-zero values w.sub.5,0 and w.sub.61,0 are located in output register set 2000.sub.1. Moreover, because 5/4=1 (remainder 1) and 61/4=15 (remainder 1), the dot products of non-zero values w.sub.5,0 and w.sub.61,0 are located in Row 1 and Row 15, respectively, of output register set 2000.sub.1. This result is confirmed by
[0170] In the present example, non-zero values w.sub.10,0 and w.sub.58,0, are located in rows 10 and 58 of matrix I. Dividing these row numbers by 4 results in remainders of ‘2’, thereby indicating that the dot products of non-zero values w.sub.10,0 and w.sub.58,0 are located in output register set 2000.sub.2. Moreover, because 10/4=2 (remainder 2) and 58/4=14 (remainder 2), the dot products of non-zero values w.sub.10,0 and w.sub.58,0 are located in Row 2 and Row 14, respectively, of output register set 2000.sub.2. This result is confirmed by
[0171] In the present example, non-zero values w.sub.3,0 and w.sub.11,0, are located in rows 3 and 11 of matrix I. Dividing these row numbers by 4 result in remainders of ‘3’, thereby indicating that the dot products of non-zero values w.sub.3,0 and w.sub.11,0 are located in output register set 2000.sub.3. Moreover, because 3/4=0 (remainder 3) and 11/4=2 (remainder 3), the dot products of non-zero values w.sub.3,0 and w.sub.11,0 are located in Row 0 and Row 2, respectively, of output register set 2000.sub.3. This result is confirmed by
[0172] Upon making the determinations specified above, operand packing logic 433 sorts (packs) the non-zero values w.sub.3,0, w.sub.5,0, w.sub.8,0, w.sub.10,0, w.sub.11,0, w.sub.24,0, w.sub.58,0, and w.sub.61,0 of Column 0 of matrix I into Operand A memory block 441 as follows. See new statement later in document.
[0173] The first non-zero values to have dot products stored in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 are stored in the first row (Row 0) of Operand A memory block 441. Thus, in the present example, non-zero values w.sub.8,0, w.sub.5,0, w.sub.10,0 and w.sub.3,0, which have dot products in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively, are stored in Row 0 of Operand A memory block 441.
[0174] The next non-zero values to have dot products stored in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 are stored in the second row (Row 1) of Operand A memory block 441. Thus, in the present example, non-zero values w.sub.24,0, w.sub.61,0, w.sub.58,0, and w.sub.11,0, which have dot products in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively, are stored in Row 1 of Operand A memory block 441. The above-described sorting/packing of the non-zero values of Column 0 of matrix I into the Operand A memory block 441 is illustrated in
[0175] Initially, the state machine and scheduler 431 causes the first rows of Operand A memory block 441 and Operand B memory block 442 to be retrieved and loaded into Operand A register file 411 and Operand B register file 412, respectively, as illustrated by
[0176] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 multiply the received Operands A and B in the manner described above. State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 that are associated with the non-zero values w.sub.8,0, w.sub.5,0, w.sub.10,0 and w.sub.3,0. That is, state machine and scheduler 431 addresses Row 2, Row 1, Row 2 and Row 0 within output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. As described above, all rows of the output register sets store initially store ‘0’ values.
[0177] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 then write the accumulated values to the addressed rows (Row 2, Row 1, Row 2 and Row 0, respectively) of the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively.
[0178] As illustrated by
[0179] Thus, Operand A register file 411 stores the non-zero weight values w.sub.24,0, w.sub.61,0, w.sub.58,0 and w.sub.11,0 of Matrix I, and Operand B register file 412 stores the activation values d.sub.0, c.sub.0, b.sub.0 and a.sub.0 of Matrix J. State machine and scheduler 431 causes Operand A distribution circuit 416 to route the non-zero values w.sub.24,0, w.sub.61,0, w.sub.58,0 and w.sub.11,0, from Operand A register file 411 to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3, respectively, as Operand A. At the same time, state machine and scheduler 431 continues to cause Operand B distribution circuit 417 to route the values d.sub.0, c.sub.0, b.sub.0 and a.sub.0 to each of the SIMD engines as Operand B. These values d.sub.0, c.sub.0, b.sub.0 and a.sub.0 are routed from Row 0 of the Operand B memory block 442 (i.e., Row 0 of Matrix J) because each of the Operand A values w.sub.24,0, w.sub.61,0, w.sub.58,0 and w.sub.11,0 are from Column 0 of Matrix I.
[0180] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 multiply the received Operands A and B in the manner described above. State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 that are associated with the non-zero values w.sub.24,0, w.sub.61,0, w.sub.58,0 and w.sub.11,0. That is, state machine and scheduler 431 addresses Row 6, Row 15, Row 14 and Row 2 within output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively.
[0181] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 then write the accumulated values to the addressed rows (Row 6, Row 15, Row 14 and Row 2, respectively) of the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively.
[0182] The above-described operations complete the processing of the first column (Col 0) of Matrix I. The same processing steps are performed for each of the remaining 15 columns of Matrix I, thereby completing the multiplication of ‘sparse’ Matrix I and Matrix J. Note that in these processing steps, non-zero values in Column 1 of Matrix I are multiplied by values in Row 1 of Matrix J (i.e., a.sub.i, b.sub.i, c.sub.i and d.sub.i), non-zero values in Column 2 of Matrix I are multiplied by values in Row 2 of Matrix J (i.e., a.sub.2, b.sub.2, c2 and d.sub.2), etc.). Advantageously, the SIMD engines are kept busy (i.e., perform multiply-accumulate operations for non-zero matrix values), while minimizing the number of multiply-accumulate operations required to perform the multiplication of ‘sparse’ Matrix I and Matrix J. In the foregoing manner, the computer architecture performs multiplication of a sparse matrix in a highly efficient (and fast) manner.
[0183] The sparse matrix multiplication example described above includes non-zero values of Matrix I that are evenly distributed among the output register sets 2000.sub.0-2000.sub.3 (e.g., eight non-zero values in Column 0 of Matrix I are distributed such that each of the output register sets 2000.sub.0-2000.sub.3 is associated with two non-zero values). However, in other examples, the distribution of the non-zero values may not be as uniform. Another embodiment of the present invention handles a non-uniform distribution of non-zero values in a manner described in more detail below.
[0184] Suppose that the first sixteen non-zero entries in the first three columns of Matrix I are entries w.sub.2,0 w.sub.12,0, w.sub.32,0, w.sub.38,0, w.sub.43,0, w.sub.56,0 (in Col. 0 of Matrix I), w.sub.7,1, w.sub.14,1 w.sub.21,1, w.sub.25,1, w.sub.37,1, w.sub.43,1 (in Col. 1 of Matrix I), w.sub.8,2, w.sub.10,2, w.sub.23,2 and w.sub.51,2 (in Col. 2 of Matrix I).
[0185] Operand packing logic 433 identifies the row addresses of the non-zero values within Matrix I (e.g., non-zero entry w.sub.2,0 is located in row 2 of Matrix I). Using this row address information, operand packing logic 433 determines which of the output register sets 2000.sub.0-2000.sub.3 are used to store the dot products associated with the identified non-zero values in the manner described above. Operand packing logic 433 also determines the row within the output register set where the dot product is stored, in the manner described above.
[0186] Thus, in the present example, operand packing logic 433 determines that the dot products associated with non-zero entries w.sub.12,0, w.sub.32,0, w.sub.56,0 and w.sub.8,2 are mapped to rows 3, 8, 14 and 2, respectively, of output register set 2000.sub.0; the dot products associated with non-zero entries w.sub.45,0, w.sub.21,0, w.sub.25,0 and w.sub.37,2 are mapped to rows 11, 5, 6 and 9, respectively, of output register set 2000.sub.1; the dot products associated with non-zero entries w.sub.2,0, w.sub.38,0, w.sub.14,1 and w.sub.10,2 are mapped to rows 0, 9, 3 and 2, respectively, of output register set 2000.sub.2; and the dot products associated with non-zero entries w.sub.7,1, w.sub.43,1, w.sub.23,2 and w.sub.51,2 are mapped to rows 7, 10, 5 and 12, respectively, of output register set 2000.sub.3.
[0187] Note that three non-zero entries (w.sub.12,0, w.sub.32,0 and w.sub.56,0) of column 0 of Matrix I are mapped to output register set 2000.sub.0, one non-zero entry (w.sub.45,0) of column 0 of Matrix I is mapped to output register set 2000.sub.1, two non-zero entries (w.sub.2,0 and w.sub.38,0) of column 0 of Matrix I are mapped to output register set 2000.sub.2, and no non-zero entry of column 0 of Matrix I is mapped to output register set 2000.sub.3.
[0188] No non-zero entries of column 1 of Matrix I are mapped to output register set 2000.sub.0, three non-zero entries (w.sub.21,1, w.sub.25,1, w.sub.37,1) of column 1 of Matrix I are mapped to output register set 2000.sub.1, one non-zero entry (w.sub.14,1) of column 1 of Matrix I is mapped to output register set 2000.sub.2, and two non-zero entries (w.sub.7,1 and w.sub.43,1) of column 0 of Matrix I is mapped to output register set 2000.sub.3.
[0189] One non-zero entry (w.sub.8,2) of column 2 of Matrix I is mapped to output register set 2000.sub.0, no non-zero entries of column 2 of Matrix I are mapped to output register set 2000.sub.1, one non-zero entry (w.sub.10,2) of column 2 of Matrix I is mapped to output register set 2000.sub.2, and two non-zero entries (w.sub.23,2 and w.sub.51,2) of column 2 of Matrix I is mapped to output register set 2000.sub.3.
[0190] Upon making the determinations specified above, operand packing logic 433 sorts (packs) the non-zero values of columns 0, 1 and 2 of Matrix I into Operand A memory block 441 as follows. The first non-zero values to have dot products stored in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 are stored in the first row (Row 0) of Operand A memory block 441. Thus, in the present example, non-zero values w.sub.12,0, w.sub.45,0, w.sub.2,0 and w.sub.7,1, which have dot products in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively, are stored in Row 0 of Operand A memory block 441.
[0191] The next non-zero values to have dot products stored in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 are stored in the second row (Row 1) of Operand A memory block 441. Thus, in the present example, non-zero values w.sub.32,0, w.sub.21,1, w.sub.38,0, and w.sub.23,2, which have dot products in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively, are stored in Row 1 of Operand A memory block 441.
[0192] The next non-zero values to have dot products stored in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 are stored in the third row (Row 2) of Operand A memory block 441. Thus, in the present example, non-zero values w.sub.8,2, w.sub.37,1, w.sub.10,2, and w.sub.43,1, which have dot products in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively, are stored in Row 2 of Operand A memory block 441.
[0193] The above-described sorting/packing of the non-zero values of Columns 0, 1 and 2 of Matrix I into the Operand A memory block 441 is illustrated in
[0194] Initially, the state machine and scheduler 431 causes Row 0 of Operand A memory block 441 to be retrieved and loaded into Operand A register file 411, and then transferred to Operand A distribution circuit 416. Thus, Operand A register file 411 stores the non-zero weight values w.sub.12,0, w.sub.43,0, w.sub.2,0 and w.sub.7,1 of Matrix I. Note that in an alternate embodiment, these non-zero weight values w.sub.12,0, w.sub.45,0, w.sub.2,0 and w.sub.7,1 are stored in a buffer within Operand A distribution circuit 416.
[0195] State machine and scheduler 431 also causes Row 0 and Row 1 of Operand B memory block 442 to be retrieved and loaded into Operand B register file 412, and then transferred into Operand B buffers BC and B1, respectively, within Operand B distribution circuit 417. Thus, Operand B register file 412 and Operand B buffer BC store the activation values d.sub.0, c.sub.0, b.sub.0 and a.sub.0 of Matrix J, and Operand B register file 412 and Operand B buffer B1 store the activation values d.sub.1, c.sub.1, b.sub.1 and a.sub.1 of Matrix J. This condition is shown in
[0196] Note that state machine and scheduler 431 retrieves the activation values from Row 0 and Row 1 of Operand B memory block 432 because these two activation values are required to calculate the required dot products associated with the retrieved weight values included in Operand A (which were taken from the first two columns of Matrix I). Also note that Operand B register file 412 can be loaded in series or parallel from Operand B memory block 442, and that the buffers B0-B3 of Operand B distribution circuit 417 can be loaded in series (
[0197] State machine and scheduler 431 causes Operand A distribution circuit 416 to route the non-zero values w.sub.12,0, w.sub.43,0, w.sub.2,0, and w.sub.7,1 from Operand A register file 411 to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3, respectively, as Operand A. At the same time, state machine and scheduler 431 causes Operand B distribution circuit 417 to route the values d.sub.0, c.sub.0, b.sub.0 and a.sub.0 to each of SIMD.sub.0, SIMD.sub.1 and SIMD.sub.2 as Operand B, and also causes Operand B distribution circuit 417 to route the values d.sub.i, c.sub.i, b.sub.i and a.sub.i to SIMD.sub.3. In the embodiment illustrated by
[0198] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 multiply the received Operands A and B in the manner described above. State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 that are associated with the non-zero values w.sub.12,0, w.sub.43,0, w.sub.2,0 and w.sub.7,1. That is, state machine and scheduler 431 addresses Row 4, Row 11, Row 0 and Row 1 within output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. As described above, all rows of the output register sets store initially store ‘0’ values.
[0199] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 then write the accumulated values to the addressed rows (Row 4, Row 11, Row 0 and Row 1, respectively) of the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. In the embodiment illustrated by
[0200] Note that the multiply-accumulate operations implemented in
[0201] As illustrated by
[0202] State machine and scheduler 431 also causes Row 2 of Operand B memory block 442 to be retrieved and loaded into Operand B register file 412, and then transferred into Operand B buffer B2 within Operand B distribution circuit 417. Thus, Operand B register file 412 and Operand B buffer BC store the activation values d.sub.0, c.sub.0, b.sub.0 and a.sub.0 of Matrix J, Operand B register file 412 and Operand B buffer B1 store the activation values d.sub.1, c.sub.1, b.sub.1 and a.sub.1, and Operand B register file 412 and Operand B buffer B2 store the activation values d.sub.2, c.sub.2, b.sub.2 and a.sub.2.
[0203] Note that state machine and scheduler 431 retrieves the activation values from Rows 0, 1 and 2 of Operand B memory block 432 because these three activation values are required to calculate the required dot products associated with the retrieved weight values included in Operand A (which were taken from the first three columns of Matrix I).
[0204] State machine and scheduler 431 causes Operand A distribution circuit 416 to route the non-zero values w.sub.32,0, w.sub.21,1, w.sub.38,0 and w.sub.23,2 from Operand A register file 411 to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3, respectively, as Operand A. At the same time, state machine and scheduler 431 causes Operand B distribution circuit 417 to route the values d.sub.0, c.sub.0, b.sub.0 and a.sub.0 to each of SIMD.sub.0 and SIMD.sub.2 as Operand B, causes Operand B distribution circuit 417 to route the values d.sub.i, c.sub.i, b.sub.i and a.sub.i to SIMD.sub.1, and causes Operand B distribution circuit 417 to route the values d.sub.2, c.sub.2, b.sub.2 and a.sub.2 to SIMD.sub.3.
[0205] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 multiply the received Operands A and B in the manner described above. State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 that are associated with the non-zero values w.sub.32,0, w.sub.21,1, w.sub.38,0 and w.sub.23,2. That is, state machine and scheduler 431 addresses Row 8, Row 5, Row 9 and Row 5 within output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. As described above, all rows of the output register sets store initially store ‘0’ values.
[0206] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 then write the accumulated values to the addressed rows (Row 8, Row 5, Row 9 and Row 5, respectively) of the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively.
[0207] Note that the multiply-accumulate operations implemented in
[0208] As illustrated by
[0209] The activation values already stored in Operand B buffers B1-B2 of Operand B distribution are used in multiply-accumulate operations associated with the non-zero weight values w.sub.8,2, w.sub.37,1, w.sub.10,2 and w.sub.43,1 of Matrix I.
[0210] State machine and scheduler 431 causes Operand A distribution circuit 416 to route the non-zero values w.sub.8,2, w.sub.37,1, w.sub.10,2 and w.sub.43,1 from Operand A register file 411 to SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3, respectively, as Operand A. At the same time, state machine and scheduler 431 causes Operand B distribution circuit 417 to route the values d.sub.2, c2, b.sub.2 and a2 to each of SIMD.sub.0 and SIMD.sub.2 as Operand B, and causes Operand B distribution circuit 417 to route the values d.sub.i, c.sub.i, b.sub.i and a.sub.i to SIMD.sub.1 and SIMD.sub.3.
[0211] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 multiply the received Operands A and B in the manner described above. State machine and scheduler 431 independently addresses the previously determined rows in output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3 that are associated with the non-zero values w.sub.8,2, w.sub.37,1, w.sub.10,2 and w.sub.43,1. That is, state machine and scheduler 431 addresses Row 2, Row 9, Row 2 and Row 10 within output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. As described above, all rows of the output register sets store initially store ‘0’ values.
[0212] SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 perform accumulate operations, wherein the calculated products are added to the zero values retrieved from the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively. SIMD.sub.0, SIMD.sub.1, SIMD.sub.2 and SIMD.sub.3 then write the accumulated values to the addressed rows (Row 2, Row 9, Row 2 and Row 10, respectively) of the output register sets 2000.sub.0, 2000.sub.1, 2000.sub.2 and 2000.sub.3, respectively.
[0213] Note that the multiply-accumulate operations implemented in
[0214] Although the processing of only three columns of sparse Matrix I are described in the example of
[0215] Although operand packing logic 433 is shown as being a part of control logic 430 in the embodiments described above, it is understood that in an alternate embodiment, the functionality of operand packing logic 433 can be implemented external to system 400. In such an alternate embodiment, software can be used to identify the non-zero values of Matrix I (because the weight values for a network, as represented by the entries of Matrix I, are known), determine the output registers (and output register row addresses) associated with these non-zero values, identify the addresses of the values of the Matrix J required to perform the multiply-accumulate operations with the non-zero values of Matrix I, and determine the manner in which the non-zero values of Matrix I should be packed within the Operand A register file 411. Methods for performing these determinations are described in detail above. The packed Operand A values can then be loaded directly into Operand A register file 411 (and/or system memory 440). The addresses required to load and access Operand B register file 412 and the addresses required to access the output registers 2000.sub.0-2000.sub.3 can be loaded into state machine and scheduler 431. State machine and scheduler 431 then simply retrieves the non-zero values from memory and supplies the required address signals during runtime, without any extra hardware complexity. In this manner, this alternate embodiment advantageously reduces the hardware requirements of system 400.
[0216] Although the invention has been described in connection with several embodiments, it is understood that this invention is not limited to the embodiments disclosed, but is capable of various modifications, which would be apparent to a person skilled in the art. Accordingly, the present invention is limited only by the following claims.