Programmable Spatial Array for Matrix Decomposition
20230297538 · 2023-09-21
Inventors
Cpc classification
International classification
Abstract
Programmable spatial array processing circuitry may be programmable to perform multiple different types of matrix decompositions. The programmable spatial array processing circuitry may include an array of processing elements. When programmed with a first instructions, the array performs a first type of matrix decomposition. When programmed with second instructions, the array performs a second type of matrix decomposition. Individual processing elements of the programmable spatial array processing circuitry may avoid having individual instruction memories. Instead, there may be an instruction memory that provides a portion of the first instructions or a portion of the second instructions sequentially to one processing element of a row of processing elements to sequentially propagate to other processing elements of the row of processing elements.
Claims
1. Circuitry comprising: programmable spatial array processing circuitry comprising: a processing element array having an array of processing elements, wherein the processing element array: when programmed with a first instructions, performs a first type of matrix decomposition; and when programmed with second instructions, performs a second type of matrix decomposition; and instruction memory that provides a portion of the first instructions or a portion of the second instructions sequentially to one processing element of a row of processing elements to sequentially propagate to other processing elements of the row of processing elements.
2. The circuitry of claim 1, wherein the array of processing elements has a triangular arrangement.
3. The circuitry of claim 1, wherein the array of processing elements comprises at least three different types of processing elements.
4. The circuitry of claim 3, wherein the at least three different types of processing elements comprise: a first diagonal processing element type that only receives data from above it in the array and only outputs data to a first side in the array; a second internal processing element type that only receives data from above it in the array and from a second side of it in the array and only outputs data to the first side of it in the array and down from it in the array; and a third vector processing element type that that only receives data from above it in the array and to from the second side of it in the array and outputs results of the processing element array.
5. The circuitry of claim 3, wherein the instruction memory comprises: a first instruction memory that stores instructions for all of the processing elements of a first type of the at least three different types of processing elements; a second instruction memory that stores instructions for all of the processing elements of a second type of the at least three different types of processing elements; and a third instruction memory that stores instructions for all of the processing elements of a third type of the at least three different types of processing elements.
6. The circuitry of claim 1, wherein the array of processing elements comprise: a first processing element having a first architecture that includes a plurality of issue slots; and a second processing element having a second architecture different from the first architecture that includes an arithmetic logic unit (ALU) configurable to perform multiply-accumulate operations.
7. The circuitry of claim 1, wherein the one processing element of the row of processing elements and the other processing elements of the row of processing elements do not contain instruction memory to hold more than one instruction at a time.
8. The circuitry of claim 1, wherein the first type of matrix decomposition comprises at least one of Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, and Gram-Schmidt QR decomposition, and wherein the second type of matrix decomposition comprises a different one of Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, and Gram-Schmidt QR decomposition.
9. The circuitry of claim 1, comprising host processing circuitry configurable to perform a task that involves performing the first type of matrix decomposition and cause the processing element array to be programmed with the first instructions.
10. The circuitry of claim 9, comprising a plurality of antennas, wherein the host processing circuitry and the programmable spatial array processing circuitry are configured to perform the first type of matrix decomposition or the second type of matrix decomposition to carry out a multiple-input multiple-output (MIMO) computation.
11. An article of manufacture comprising one or more tangible, non-transitory, machine readable media comprising instructions that, when executed by processing circuitry, cause the processing circuitry to: instruct a triangular spatial array of processing elements to perform a first matrix decomposition; and instruct the triangular spatial array of processing elements to perform a second matrix decomposition.
12. The article of manufacture of claim 11, wherein the instructions cause the processing circuitry to instruct the triangular spatial array of processing elements to perform the first matrix decomposition on a batch of matrices.
13. The article of manufacture of claim 12, wherein the instructions cause the processing circuitry to: instruct a first buffer to send the batch of matrices to the triangular spatial array of processing elements to be processed at least partly in parallel; and instruct a delay alignment buffer to realign resulting matrices output by the triangular spatial array of processing elements.
14. The article of manufacture of claim 11, wherein the instructions cause the processing circuitry to perform Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, or Gram-Schmidt QR decomposition as the first matrix decomposition and perform a different one of Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, or Gram-Schmidt QR decomposition as the second matrix decomposition.
15. The article of manufacture of claim 11, wherein the instructions cause the processing circuitry to send an instruction having a data structure that includes an indication of a function to be performed and a time-to-live value, wherein the time-to-live value causes that instruction to be ignored once that instruction has propagated through processing elements of the processing element array more than indicated by the time-to-live value.
16. An integrated circuit comprising: a buffer that outputs a batch of matrices; and a processing element array comprising a plurality of processing elements that sequentially propagate instructions to adjacent processing elements rather than store all the instructions in respective instruction memories, wherein the processing element array is programmable to: perform a first matrix decomposition on the batch of matrices when provided a first set of the instructions; and perform a second matrix decomposition on the batch of matrices when provided a second set of the instructions.
17. The integrated circuit of claim 16, wherein the processing element array performs the first matrix decomposition or the second matrix decomposition on at least two of the batch of matrices at least partly in parallel.
18. The integrated circuit of claim 16, wherein the processing element array comprises a processing element having a plurality of issue slots and a plurality of register files, wherein the at least two of the plurality of issue slots are programmable to operate in parallel to perform part of the first matrix decomposition or the second matrix decomposition.
19. The integrated circuit of claim 16, wherein the processing element array outputs results from the first matrix decomposition or the second matrix decomposition staggered in time such that results corresponding to a first matrix of the batch of matrices overlap in time with results corresponding to a second matrix of the batch of matrices.
20. The integrated circuit of claim 19, comprising a delay alignment buffer that aligns the results corresponding to the first matrix of the batch of matrices in time and aligns the results corresponding to the second matrix of the batch of matrices in time.
21. A method comprising: using a triangular spatial array of processing elements to perform a first type of matrix decomposition; and using the same triangular spatial array of processing elements to perform a second type of matrix decomposition.
22. The method of claim 21, wherein using the triangular spatial array of processing elements to perform the first type of matrix decomposition comprises: using a first buffer to send a batch of matrices to the triangular spatial array of processing elements to be processed at least partly in parallel; and using a second buffer to realign resulting matrices output by the triangular spatial array of processing elements.
23. (canceled)
24. (canceled)
25. (canceled)
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] Various aspects of this disclosure may be better understood upon reading the following detailed description and upon reference to the drawings in which:
[0007]
[0008]
[0009]
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018]
[0019]
[0020]
[0021]
[0022]
[0023]
[0024]
[0025]
[0026]
[0027]
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
[0034]
[0035]
[0036]
[0037]
[0038]
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
[0039] One or more specific embodiments will be described below. In an effort to provide a concise description of these embodiments, not all features of an actual implementation are described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
[0040] When introducing elements of various embodiments of the present disclosure, the articles “a,” “an,” and “the” are intended to mean that there are one or more of the elements. The terms “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. Additionally, it should be understood that references to “some embodiments,” “embodiments,” “one embodiment,” or “an embodiment” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features. Furthermore, the phrase A “based on” B is intended to mean that A is at least partially based on B. Moreover, the term “or” is intended to be inclusive (e.g., logical OR) and not exclusive (e.g., logical XOR). In other words, the phrase A “or” B is intended to mean A, B, or both A and B. Moreover, this disclosure describes various data structures, such as instructions for an instruction set architecture. These are described as having certain domains (e.g., fields) and corresponding numbers of bits. However, it should be understood that these domains and sizes in bits are meant as examples and are not intended to be exclusive. Indeed, the data structures (e.g., instructions) of this disclosure may take any suitable form.
[0041] An integrated circuit, such as an application specific integrated circuit (ASIC) or a programmable logic device (PLD) like a field programmable gate array (FPGA), may be part of an electronic device that perform wireless communications, machine learning, or many other tasks. These tasks may involve performing matrix decompositions. Indeed, matrix decomposition is widely used in wireless communication, machine learning, and other areas. For instance, multiple-input multiple-output (MIMO) wireless communication in 5G wireless systems, multivariate linear regressions in machine learning, systems of linear equations, matrix inversions and determinant calculations, and many others involve performing matrix decompositions. Different types of matrix decompositions include LU decomposition, QR decomposition, and Cholesky decomposition.
[0042] In contrast to single-purpose architectures that may support only one type of matrix decomposition, this disclosure provides a programmable spatial array processor that can be programmed to compute a variety of different types of matrix decompositions. The programmable spatial array processor has a two-dimensional upper triangular Processing Element (PE) array which acts as a high throughput engine. Every PE executes under instructions that provide programmability to support different modes.
[0043] As noted above, matrix decompositions are more complicated than matrix multiplication. The latter may generally use multiplication and addition operations and may have little or no data dependency among operations. Matrix decompositions, on the other hand, may have many data dependencies. This may cause one operation to have to wait for the result of another operation to be ready, which makes it difficult to handle data in parallel. Moreover, matrix decomposition usually has arithmetic operations other than multiplication, such as division and square root.
[0044] The programmable spatial array processor of this disclosure may use a control scheme that can mitigate the challenges of the data dependency of the various PEs in solving matrix decompositions. To solve this problem, an Instruction Share and Propagation (ISP) scheme may control all PEs efficiently. Instructions may be shared by certain PEs and propagated through them. This may substantially reduce the size or complexity of the instruction memory. Indeed, instructions may flow through the array in a systolic-like way, just like the data flow. All non-diagonal PEs may share the same instructions. This may (a) reduce instruction memory from N.sup.2/2 to 2 and (b) allow instructions to transfer between adjacent PEs so that a long control path may be avoided. Furthermore, the programmability of the programmable spatial array processor may enable a fast switch between two different types of matrix operation. The array of the programmable spatial array processor may simply be fed with new instructions for new matrix operation. Additional reset or reconfiguration time may be avoided, enabling transitions to computing different types of matrix decomposition to occur rapidly and seamlessly.
[0045] In addition to matrix decompositions, the programmable spatial array processor may also support widely used matrix operations like back substitution, matrix-vector multiplication, matrix multiplying by its transpose (A.sup.TA), and so on. The programmability even empowers it to perform customized functions. What is more, the programmable spatial array processor may have a triangular arrangement that, compared to a square array, may cut hardware resource usage nearly in half.
[0046] With this in mind,
[0047] Designers may implement their high-level designs using design software 14, such as a version of Intel® Quartus® Prime by INTEL CORPORATION. The design software 14 may use a compiler 16 to convert the high-level program into a lower-level description. The compiler 16 may provide machine-readable instructions representative of the high-level program to a host 18 and the integrated circuit device 12. The host 18 may include any suitable processing circuitry and may receive a host program 22 which may be implemented by the kernel programs 20. To implement the host program 22, the host 18 may communicate instructions from the host program 22 to the integrated circuit device 12 via a communications link 24, which may be, for example, direct memory access (DMA) communications or peripheral component interconnect express (PCIe) communications. While the techniques described above refer to the application of a high-level program, in some embodiments, a designer may use the design software 14 to generate and/or to specify a low-level program, such as the low-level hardware description languages described above. Further, in some embodiments, the system 10 may be implemented without a separate host program 22. Moreover, in some embodiments, the techniques described herein may be implemented in circuitry as hardened IP that is not programmed into a programmable logic device. Thus, embodiments described herein are intended to be illustrative and not limiting.
[0048] In some embodiments, the kernel programs 20 may enable configuration of a programmable spatial array processor 26 on the integrated circuit device 12. Indeed, the programmable spatial array processor 26 may represent a circuit design of the kernel program 20 that is configured onto the integrated circuit device 12 (e.g., formed in soft logic). In some embodiments, the programmable spatial array processor 26 may be partially or fully formed in hardened circuitry (e.g., application-specific circuitry of the integrated circuit 12 that is not configurable as programmable logic). The host 18 may use the communication link 24 to cause the programmable spatial array processor 26 to decompose matrices according to any suitable matrix decomposition type. For example, the programmable spatial array processor 26 may be used to perform matrix decomposition to detect or transmit a signal for multiple-input multiple-output (MIMO) communication via antennas 28.
[0049] The programmable spatial array processor 26 may be component included in a data processing system 40, as shown in
[0050] In one example, the data processing system 40 may be part of a data center that processes a variety of different requests. For instance, the data processing system 40 may receive a data processing request via the network interface 46 to perform encryption, decryption, machine learning, video processing, voice recognition, image recognition, data compression, database search ranking, bioinformatics, network security pattern identification, spatial navigation, digital signal processing, or some other specialized task. Some or all of the components of the data processing system 40 may be virtual machine components running on physical circuitry (e.g., managed by one or more hypervisors or virtual machine managers). Whether physical components or virtual machine components, the various components of the data processing system 40 may be located in the same location or different locations (e.g., on different boards, in different rooms, at different geographic locations). Indeed, the data processing system 40 may be accessible via a computing service provider (CSP) that may provide an interface to customers to use the data processing system 40 (e.g., to run programs and/or perform acceleration tasks) in a cloud computing environment.
High-Level Architecture of Programmable Spatial Array Processor
[0051]
[0052] The input data 68 may take any suitable form, including a matrix or vector format with throughput of one matrix row (column) per clock cycle. A block of the input data 68 may contain a batch of matrices to utilize the pipeline capability of PE array 76 and improve average throughput. Any suitable quantity of matrices or vectors may be used in a batch (e.g., 2, 3, 4, 5, 6, 7, 8, 16, 32, 64, 100, 128, 200, 256, 500, 512, 1000, 1024, or more or fewer). For instance, 32 consecutive matrices may form a batch, in this case the batch size is 32.
[0053] For example, as shown in
Processing Element (PE) Array
[0054] The core part of the programmable spatial array processor 26 is the two-dimensional processing element (PE) array 76. As shown in
[0055] The M PEs 112 mainly perform multiplication and accumulation (MAC) operations, and the M PEs 112 form the upper triangular part of a square N-by-N array, of which any suitable number N may define the array. The M PEs 112 may be considered an internal processing element type of the processing element array 76, since they are bounded to the left and right by the D PEs 110 and the V PEs 114. Multiplication and accumulation (MAC) operations are abundant in matrix operations. The V PEs 114 located at the rightmost column handle vector-related operations like matrix-vector multiplication. The V PEs 114 may have the same or a similar internal hardware structure as the M PEs 112. The main difference between the V PEs 114 and the M PEs 112 is that they run under different instructions (with different behaviors). The D PEs 110 may include more compute resources than the M PEs 112, since the diagonal elements may perform more complicated computations than non-diagonal elements in most matrix decomposition cases. As discussed further below, the D PEs 110 may include some MAC units and other math function (such as inverse square root) units, or may include units that perform certain specific operations.
[0056] The PE array 76 structure may achieve a relatively high operating clock frequency, since each PE 110, 112, or 114 may only connect with adjacent PEs 110, 112, or 114. This means that there may be no long routing path or that the routing paths between PEs 110, 112, and 114 may be sufficiently similar so as to have similar (e.g., equal) latencies. And this structure may relatively easily scale up to a large array size.
[0057]
[0058] Example architectures of the PEs 110, 112, and 114 will be described below. It should be appreciated that these are intended to be illustrative and not exhaustive. Indeed, the PEs 110, 112, and 114 may take any suitable form and have any suitable architectures.
[0059] Multiply-accumulate (M) PE 112 Architecture. One example architecture of an M PE 112 appears in
[0065] The ALU 164 may perform arithmetic operations such as add, multiply, multiply-add, multiply-accumulate, and so on. It may be implemented in complex form (named CMAC or CALU) to support complex number arithmetic that is widely used in wireless communication systems. The inputs of the ALU 164 can have multiple sources, such as input ports, the register file (RF) 172, or the data queue 174. The input and output interfaces shown in
Input:
[0066] in_instr: input instruction [0067] L_at: left data in (path of original data) [0068] in_data: input data (path of result data) [0069] U_dat: data from upper side [0070] U_val: validation of U_dat
Output:
[0071] out_instr: output instruction (propagates in_instr to next PE) [0072] R_dat: right data out (path of original data) [0073] out_dat: output data (path of result data) [0074] D_dat: data to downwards [0075] D_val: validation of D_dat
[0076] The data queue 174 is used to buffer upper input data, since the left input data may be delayed after upper input data. One way to handle this delay gap is to input the input data in a staggered way, as shown in
[0077] It can be observed that the data queue method shown in
[0078] Diagonal (D) PE 110 Architecture. Since the D PEs 110 may handle more complicated calculations than an M PE 112, the D PEs 110 may have more functional units. In an example, shown in
[0079] In the example architecture of the D PE 110 shown in
Other operations like square root and division can be calculated using Isqrt result
[0084] Multiple issues in a D PE 110 can work in a pipelined manner to achieve high throughput. Take Cholesky decomposition, for example. The process includes inverse square root (Isqrt) from the Isqrt slot 196 and multiplications in the issue slot 198, which use the result from the Isqrt slot 196. Using this pipeline scheme, the two issue slots 196 and 198 can work in parallel. An example is shown in
Instruction Share and Propagation
[0085] As previously discussed with respect to
[0086] Accordingly, a scheme referred to as Instruction Share and Propagation (ISP) may overcome some of the challenges mentioned above (e.g., avoiding such high fan-out and high memory utilization problems). The design of Instruction Share and Propagation (ISP) is made possible because the M PEs 112 and the D PEs 110 generally respectively execute the same or similar programs with only time offset and slight code differences. For instance, in a Cholesky decomposition procedure, every M PE 112 may execute the same first instruction but at different start times, and almost the same remaining instructions except that some of them may be ignored, as shown in
[0087] As shown in
[0088] As can be seen, the start time of instruction execution of each M PE 112 is different. As such, the delay of instruction arrival to each M PE 112 will be different and varies among functions. For example, the instruction delay between two adjacent M PEs 112 in one row may be 1 or 2 cycles (or more, as desired). The instruction delay between two adjacent rows of M PEs 112 could be many more cycles. As shown in
[0089] There may also be a Time to Live (TTL) domain in each instruction indicating whether this instruction should be executed, as shown in
[0090]
[0091]
[0092]
[0093] Thus, input data with length of N in the form of one row or column of a matrix may be fed into N FIFOs 310, and data read from the N FIFOs 310 may be sent to the PE array 76 as one matrix row or column. The write and read control blocks 314, 316, and 318 are used to generate FIFO access signals (e.g., 330, 332, 334, 336, 338, and 340). Some specific data like an identity matrix can also be generated by the read control block 318. The memory 320 may store the FIFO access patterns of each operation (e.g., each type of matrix decomposition). The memory 320 may store read patterns. Table 1 provides one example of a read pattern.
TABLE-US-00001 TABLE 1 Information of program A (length, loop time, etc.) Instruction 1 of program A Instruction 2 of program A ... Instruction N of program A Information of program B (length, loop time, etc.) Instruction 1 of program B ...
[0094] Table 2 illustrates one example instruction structure for the instructions of Table 1.
TABLE-US-00002 TABLE 2 Mode (M) data (M) valid (M) mode (V) data (V) valid (V) 2 bits 16 bits 16 bits 2 bits 1 bits 1 bits 01: read data Available only when Indicate the validation 01: read data Only when Indicate the from FIFOs mode = ‘10’: of each data from FIFO mode = ‘10’: validation of 10: constant Each bit represents an 10: constant ‘0’: number 0 each data data element of a row. data ‘1’: number 1 ‘0’: number 0 ‘1’: number 1
[0095]
[0096] A read control block 360 is used to make sure that the output of all FIFOs 356 are aligned. For example, the write control block 358 may receive instructions from delay buffer write instruction memory 362 indicating access patterns for the current application and from the instruction decoder 350 indicating, for example, matrix size (size_matrix), batch size (size_batch), and the function that was performed (Function). The write control block 358 may generate write enable (wr_en) signals to write into the FIFOs 356 and a start read (start_rd) signal for the read control block 360. The write control block 358 may trigger some or all of these signals upon receipt of a start of packet (sop) signal corresponding to the input data 352. The read control block 360 may use the start_rd signal from the write control block 358 and instructions from a delay buffer read instruction memory 364 indicating access patterns for the current application. The read control block 360 may also use the instruction decoder 350 indicating, for example, matrix size (size_matrix), batch size (size_batch), and the function that was performed (Function). Monitor circuitry 366 may provide error signals.
[0097] Example instructions that may be stored in the delay buffer write instruction memory 362 are shown below in Table 3. One such instruction can serve for a write process for one batch of matrices.
TABLE-US-00003 TABLE 3 Length of all data Delay of PE array Delay of rows (one batch) 10 bits 8 bits 12 bits Latency between the Output delay between Length of all the data of a first input and first 2 adjacent rows of PE batch in one row of PE output of PE array array. array.
[0098] Instructions in the delay buffer read instruction memory 364 may be organized as shown below in Table 4.
TABLE-US-00004 TABLE 4 Information of program A (length, loop time, etc.) Instruction 1 of program A Instruction 2 of program A ... Instruction N of program A Information of program B (length, loop time, etc.) Instruction 1 of program B ...
[0099] The instructions may be described as shown below in Table 5.
TABLE-US-00005 TABLE 5 valid_M valid_V pattern_V 1 bit 1 bit 4 bits Indicates that Indicates that Available only when valid_V = ‘1’. it is a matrix it is a vector. Indicates how many data in vector row or column. V are valid.
[0100] In this way, the delay buffer 82 may use the N FIFOs 356 to buffer both matrices and vectors. The input data 352 from the PE array 76 arrive in a staggered pattern, which is different from that of the main buffer 70. The write control block 358 is responsible for writing the data into alignment addresses of all the FIFOs 356. The read control block 360 causes data to be read from the FIFOs 356 and sent to the output port as output data 354 or looped back to the main buffer 70 as the loop-back intermediate data 90.
Instruction Set Architecture (ISA) of PE Array
[0101] ISA for an M PE 112. The behavior of each M PE 112 is controlled by the instruction it receives. An instruction contains the arithmetic operation to be performed and the routing selection of each signal.
TABLE-US-00006 TABLE 6 ctrl func mul1 mul2 add3 2 bits 2 bits 3 bits 3 bits 2 bits Mode of this Conjugate for 4 bits Source Source Source instruction. complex number Type of of 1.sup.st of 2.sup.nd of 3.sup.rd “NOP” means case (′ for arithmetic operand operand operand no operation. conjugate). operation. of ALU. of ALU. of ALU. 00 NOP 00 A B 0000 NOP 000 0 00 0 00 L_dat 01 Configure 01 A B′ 0001 A*B 001 U_dat 01 U_dat 01 U_dat 10 Arithmetic 10 A′ B 001x NOP 010 RF 10 RF 10 RF 11 A′ B′ 0100 A + B 011 latch 11 Latch 11 latch 0101 A − B 100 L_dat 100 L_dat 0110 B − A 101 in_dat 0111 NOP 1000 A*B + C 1001 A*B − C 1010 C − A*B 1011 NOP 1100 Mul & Acc 1101 Mul & Acc start 1110 Mul & Acc end 1111 NOP
TABLE-US-00007 TABLE 7 rdaddr dest wraddr sft latch 5 bits 3 bits 5 bits 5 bits 3 bits Read Destination Write Bit shift length Write value of selected address of ALU address upon MAC operand of ALU into of RF output. of RF output latch register. 000 NULL 000 No latch 001 D_dat 001 Latch mul1 010 R_dat 010 Latch mul2 011 out_dat 100 Latch add3 100 RF 011 Latch . . . . . . mul1&mul2
TABLE-US-00008 TABLE 8 TTLD TTLR muxD muxR muxO dlyO muxRF 5 bits 5 bits 2 bits 2 bits 2 bits 1 bit 2 bits Time to live Time to live Multiplex Multiplex Multiplex Output Multiplexer (TTL) (TTL) of D_dat of R_dat of out_dat delay of RF write vertical. horizontal. 00 0 00 L_dat 00 Input 0 0 00 L_dat TTLD = TTLR = 01 (ena_D = 0) 01 0 01 0 1 idxM d 01 (wren = 0) TTLD-1 TTLR-1 10 L_dat 10 U_dat (index of 1x L_dat when when 11 U dat 11 RF M PE, U_dat instruction instruction RF equals to crosses goes the rows. horizontally. column number in array)
[0102] Assembly language for an M PE 112. To display instructions in a more readable way, the instructions may be visualized in an assembly-like language: Assembly for M PE (ASMMPE). This is an assembly language designed for matrix decomposition using the PE array 76.
TABLE-US-00009 TABLE 9 ctrl func mul1 mul2 add3 raddr dest wraddr 10 001000 100 001 10 00101 011 00000 latch muxD muxR muxO Odly muxRF TTLD TTLR 000 10 00 10 0 0 00100 00100
[0103] Below, Table 10 provides various keywords that may be used by the ASMMPE language.
TABLE-US-00010 TABLE 10 operation add: addition (func) sub: subtraction cadd: addition of complex numbers csub: subtraction of complex numbers mul: multiplication cmul: multiplication of complex numbers cjmul: multiplication of A and B conjugate jcmul: multiplication of A conjugate and B jcmuladd: A′*B + C ncjmulsub: C − A*B′ cmulacc: multiply-accumlate Source L_dat: left input U_dat: upper input RFxx: address xx of RF latchxx: the xx latch in_dat: input data Destination D_dat: down output R_dat: right output out_dat: output data RF: address of RF Bit shift sft = xx: right shift xx bits Latch latch = 1: hold value of source 1 to latch register latch = 2: hold value of source 2 to latch latch = 3: hold value of source 3 to latch Source of Default: D_dat = 0 D_dat muxD = L: D_dat = L_dat muxD = U: D_dat = U_dat Source of Default: R_dat = L_dat R_dat muxR = 0: R_dat = 0 muxR = U: R_dat = U_dat Source of Default: out_dat = in_dat out_dat muxO = 0: out_dat = 0 Delay of Oldy = 0: output delay = 0 out_dat Oldy = 1: output delay = idxM (index of M PE, equals to the column number of the M PE in the PE array) Source of Default: wren = 0 RF write RFxx = L: write L_dat to RF with address xx RFxx = U: write U_dat to RF with address xx TTLD & TTL = x1, x2: TTL.sup.D = x1, TTL.sup.R = x2. Default value of TTLR TTL = max
[0104] ISA for a D PE 110. The behavior of each D PE 110 is also controlled by the instruction it receives. An instruction for the D PEs 110 includes five sub-instructions, where each sub-instruction belongs to one issue slot. As may be appreciated, when the D PEs 110 include more or fewer issue slots, there may be correspondingly more or fewer sub-instructions. As mentioned above, multiple issue slots work simultaneously to achieve a pipeline effect. Each issue slot runs just under its corresponding sub-instruction, and time offsets among multiple issue slots may also be specified by the program. Table 11, Table 12, Table 13, and Table 14 show an example instruction structure of the four kinds of issue slot discussed above with reference to
TABLE-US-00011 TABLE 11 ctrl write address 2 bits 2 bits 6 bits Mode of this RF bank Write address instruction. selection of RF 00 NOP 00 RF1 01 Configure 01 RF2 10 Arithmetic 10 RF3
TABLE-US-00012 TABLE 12 ctrl Rd addr Wr addr 2 bits source dest 6 bits 6 bits sft Mode 1 bit 1 bit 2 bits Read 2 bits Write 5 bits of this Source Destination RF bank address RF bank address Bit shift instruction. of input of output selection of RF selection of RF of result 00 NOP 0 RF 0 RF 00 RF1 00 RF1 01 Configure 1 U_dat 1 R_dat 01 RF2 01 RF2 10 Arithmetic 10 RF3 10 RF3
TABLE-US-00013 TABLE 13 Rd addr of source 1&2 6 + 6 Wr addr ctrl Source1 Source2 dest bits 6 bits 2 bits 2 bit 2 bit 2 bit 2 + 2 bits Read 2 bits Write Mode of this Source of Source of Destination RF bank address RF bank address instruction. input 1 input 2 of output selection of RF selection of RF 00 NOP 00 RF 00 RF 00 RF 00 RF1 00 RF1 01 Configure 01 U_dat 01 U_dat 01 R_dat 01 RF2 01 RF2 10 Arithmetic 10 latch 10 latch 10 acc 10 RF3 10 RF3 latch sft 1 bit addend 5 bits Hold value of source 1 bit Bit shift of result into latch register Whether to add ACC 0 Source 1 0 0 1 Source 2 1 acc
TABLE-US-00014 TABLE 14 ctrl source dest Rd addr 2 bits 1 bit 1 bit 2 bits 6 bits Mode of this Source of Destination RF bank Read address instruction. input of output selection of RF 00 NOP 0 RF 0 R_dat 00 RF1 01 Configure 1 U_dat 1 out_dat 01 RF2 10 Arithmetic 10 RF3
[0105] Each issue slot instruction may also include a time-to-live (TTL) domain to indicate whether that instruction should be executed or ignored (e.g., as NOP). For example, the TTL of an instruction for a D PE 110 may have a data structure as described in Table 15.
TABLE-US-00015 TABLE 15 TTL of D PE IS 1 IS 2 IS 3 IS 4 IS 5 5 bits 5 bits 5 bits 5 bits 5 bits
Processes of Matrix Decomposition on the Programmable Spatial Array Processor
[0106] The programmable spatial array processor 26 may be programmable to perform a wide variety of types of matrix decompositions. This section will describe the following types of matrix decompositions: [0107] Cholesky decomposition [0108] LU decomposition [0109] Cholesky based MMSE [0110] Givens-Rotation QR based MMSE [0111] Gram-Schmidt QR decomposition
[0112] The D PE 110 and M PE 112 may have a dataflow as illustrated in
[0113] Cholesky decomposition. Cholesky decomposition aims to find a lower triangular matrix L that satisfies L*L′=A. A is given as a positive definite Hermitian matrix:
A=L.Math.L.sup.H
[0114] The procedure of Cholesky decomposition is (R=A):
TABLE-US-00016 for (k=1 to N) L.sub.k:N,k = R.sub.k:N,k/sqrt(R.sub.k,k) L.sub.1:k−1,k = 0 for (j=k+1 to N) R.sub.j:N,j = R.sub.j:N,j − L.sub.j:N,k * L.sub.j,k′ end end
[0115]
TABLE-US-00017 TABLE 16 Input Isqrt MAC MAC Output (IS1) (IS 2) (IS3) (IS4) (IS5) mv RF1_1, U_dat isqrt RF2_1, U_dat mv RF1_2, U_dat mv RF1_3, U_dat mv RF1_4, U_dat mv RF1_5, U_dat isqrt RF2_2, U_dat mv RF1_6, U_dat mv RF1_7, U_dat mv RF1_8, U_dat . . . . . . . . . cmul RF3_1, RF1_1, RF2_1 | latch = 2 cmul R_dat, RF1_2, latch cmul R_dat, RF1_3, latch . . .
TABLE-US-00018 TABLE 17 ncjmulsub D_dat, L_dat, L_dat, U_dat | latch=2, RF1=L ncjmulsub D_dat, L_dat, latch, U_dat | TTL=max,2 ncjmulsub D_dat, L_dat, latch, U_dat | TTL=max,1 NOP ...
[0116] LU decomposition. The programmable spatial array processor 26 can also be used to perform LU decomposition. LU (lower-upper) decomposition factors matrix A as production of a lower triangluar matrix L and an upper triangular matrix U:
A=L*U
[0117] Example Matlab code of LU decomposition is shown below:
TABLE-US-00019 L = eye(N); U = R_ori; for k=1:N L(k+1:N, k) = U(k+1:N, k) / U(k,k); for j=k+1:N U(j, k+1:N) = U(j, k+1:N) − L(j, k).*U(k, k+1:N); end U(k, 1:k-1) = 0; end
[0118]
TABLE-US-00020 TABLE 18 Input Isqrt MAC MAC Output (IS1) (IS 2) (IS3) (IS4) (IS5) mv RF1_1, U_dat isqrt RF2_1, U_dat mv RF1_2, U_dat mv RF1_3, U_dat mv RF1_4, U_dat mv RF1_5, U_dat isqrt RF2_2, U_dat mv RF1_6, U_dat mv RF1_7, U_dat mv RF1_8, U_dat . . . . . . cmul RF2_1, RF2_1, RF2_1 cmul RF2_2, RF2_2, RF2_2 cadd RF3_1, RF1_1, 0 cmul R_dat, RF2_1, RF1_2 cmul R_dat, RF2_1, RF1_3 . . . cmul R_dat, RF2_1, RF1_4 . . .
TABLE-US-00021 TABLE 19 nop 0, 0, U_dat | latch=2, RF1=U ncmulsub D_dat, L_dat, latch, U_dat | TTL=3,max ncmulsub D_dat, L_dat, latch, U_dat | TTL=2,max ncmulsub D_dat, L_dat, latch, U_dat | TTL=1,max ...
[0119] Cholesky-based minimum mean square error (MMSE). The programmable spatial array processor 26 can also be used to perform Cholesky-based MMSE. An example procedure for performing Cholesky-based MMSE is provided below:
Description of input signals: [0120] MIMO channel coefficients H: N×N complex matrix, [0121] Noise power σ.sup.2: real scalar, [0122] Received signal Y: N×1 complex vector
The final result is:
x=(H.sup.HH+σ.sup.2I).sup.−1H.sup.HY
[0123] To implement it on the PE array 76, the procedure is divided into 4 stages: [0124] Pre-filtering
A=H.sup.H.Math.H,
R=A+σ.sup.2.Math.I
Z=H.sup.H.Math.Y. [0125] Cholesky decomposition
R=L.Math.L.sup.H [0126] Back substitution & V*Z
V=L.sup.−1
VZ=V*Z [0127] V.sup.H*(VZ)
x=V.sup.H*(VZ)
[0128] Reviewing each stage of Cholesky-based MMSE, pre-filtering may take place in the PE array 76 as illustrated by
TABLE-US-00022 TABLE 20 Input Isqrt MAC MAC Output (IS1) (IS 2) (IS3) (IS4) (IS5) cjmul acc, U_dat, U_dat jmv R_dat, U_dat cjmuladd acc, U_dat, U_dat, acc jmv R_dat, U_dat cjmuladd acc, U_dat, U_dat, acc jmv R_dat, U_dat cjmuladd acc, U_dat, U_dat, acc jmv R_dat, U_dat cmuladd out_dat, U_dat, 1, acc . . . . . .
TABLE-US-00023 TABLE 21 cmulacc acc, L_dat, U_dat | D=U cmulacc acc, L_dat, U_dat, acc | D=U cmulacc acc, L_dat, U_dat, acc | D=U cmulacc O_dat, L_dat, U_dat, acc | D=U, Odly=idxM ...
[0129] After pre-filtering, the second stage of Cholesky-based MMSE is Cholesky decomposition. This may take place in the same way described above. After Cholesky decomposition, Cholesky-based MMSE continues with back substitution and V*Z. Back substitution is used to solve V=L.sup.−1 in which L is an upper triangular matrix. V*Z is a matrix (V) vector (Z) multiplication:
V=L.sup.−1
V*Z
[0130] The procedure of back substitution may be described as:
TABLE-US-00024 for (i=1 to N) V.sub.i,i = 1/L.sub.i,i end for (i=2 to N) for (j=1 to i-1) V.sub.i,j = −V.sub.i,i * (Σ.sub.m=1.sup.i-1 L.sub.i,m * V.sub.m,j) end end
[0131]
TABLE-US-00025 TABLE 22 Input Isqrt MAC MAC Output (IS1) (IS 2) (IS3) (IS4) (IS5) cmul R_dat, U_dat, RF3_1 cmul R_dat, U_dat, RF3_1 cmul R_dat, U_dat, RF3_1 cmul R_dat, U_dat, RF3_1 . . .
TABLE-US-00026 TABLE 23 ncmulsub D_dat, L_dat, RF1, U_dat ncmulsub D_dat, L_dat, RF1, U_dat ncmulsub D_dat, L_dat, RF1, U_dat ncmulsub D_dat, L_dat, RF1, U_dat ...
[0132] The fourth stage of Cholesky-based MMSE is to calculate VH*(VZ):
V.sup.H*(VZ)
[0133]
TABLE-US-00027 TABLE 24 Input Isqrt MAC MAC Output (IS1) (IS 2) (IS3) (IS4) (IS5) mv R_dat, U_dat mv R_dat, U_dat . . .
TABLE-US-00028 TABLE 25 nop | R=U nop | D=U | TTL=3,max nop | D=U | TTL=2,max nop | D=U | TTL=1,max ...
[0134] Givens-Rotation QR based MMSE. Givens Rotation based QR decomposition (GR-QRD) uses a series of Givens rotation operations to eliminate the entries of a lower triangular part of R. One Givens rotation can zero the lower element of a 2×1 vector:
[0135] The α and β may be calculated as:
α=a/√{square root over (a.sup.2+b.sup.2)}
β=b/√{square root over (a.sup.2+b.sup.2)}
[0136] The entire procedure of QR decomposition may be described as:
Rotate the 1st and 2nd row of A to zero A(2,1).
Q*.sub.1A=R.sub.1
Then rotate the 1st and 3rd row of A to zero A(3,1).
Q*.sub.2Q*.sub.1A=R.sub.2
The last step is to rotate the N-1th and Nth row of A to zero A(N,N-1).
Q*.sub.m . . . Q*.sub.2Q*.sub.1A=R
Finally we get
[0137]
A=QR (Q=Q.sub.1Q.sub.2 . . . Q.sub.m)
[0138] The MATLAB code of above procedure is:
TABLE-US-00029 Q = eye(n); R = A; for k=1:1%n-1 for j=k+1:n alp = R(k,k) / sqrt(R(k,k)*R(k,k)′ + R(j,k)*R(j,k)′); bet = R(j,k) / sqrt(R(k,k)*R(k,k)′ + R(j,k)*R(j,k)′); R([k,j],:) = [alp′, bet′; -bet, alp] * R([k,j],:); Q(:, [k,j]) = Q(:, [k,j]) * [alp, -bet′; bet, alp′]; end end
[0139] In many cases, there is no need to obtain the Q matrix explicitly. For instance, the QRD based MMSE may include the following: [0140] Calculate H.sup.HH+σ.sup.2I and H.sup.HY. [0141] Perform QRD on H.sup.HH+σ.sup.2I, get R. [0142] Perform back substitution to get R.sup.−1. [0143] Get x=(QR).sup.−1H.sup.HY=R.sup.−1(Q.sup.HH.sup.HY)
[0144] One question about QRD is how to get Q.sup.H if there is no explicit Q calculation. The answer is that when Givens rotation is performed on H.sup.HH+σ.sup.2I, it should also be performed on H.sup.HY simultaneously as below equation shows:
[QR, H.sup.HY]—Givens Rotations—>[R, Q.sup.HH.sup.HY]
[0145]
[0146] To increase the utilization rate of MAC resources and data throughput, GR-QRD may be performed using an interleaved batch mode.
TABLE-US-00030 TABLE 26 Input Isqrt MAC MAC Output (IS1) (IS 2) (IS3) (IS4) (IS5) mv RF1_1_B1, U_dat cjmul RF3_1_B1, U_dat, U_dat mv RF1_1_B2, U_dat cjmul RF3_1_B2, U_dat, U_dat mv RF1_2_B1, U_dat cjmuladd RF3_2_B1, U_dat, U_dat, RF3_1_B1 mv RF1_2_B2, U_dat cjmuladd RF3_2_B2, U_dat, U_dat, RF3_1_B2 . . . . . . isqrt RF2_2_B1, RF3_2_B1 isqrt RF2_2_B2, RF3_2_B2 isqrt RF2_3_B1, cmul RF1_1_B1, RF3_3_B1 RF2_2_B1, RF3_2_B1 cmul R_dat, RF2_2_B1, RF1_1_B1 (cos) cmul R_dat, RF2_2_B1, RF1_2_B1 (sin) cmul R_dat, RF2_2_B1, RF1_1_B1 (cos) isqrt RF2_3_B2, cmul RF1_1_B2, RF3_3_B2 RF2_2_B2, RF3_2_B2 . . . cmul R_dat, RF2_2_B2, RF1_1_B2 (cos) cmul R_dat, RF2_2_B2, RF1_2_B2 (sin) cmul R_dat, RF2_2_B2, RF1_1_B2 (cos) cmul RF1_1_B1, RF2_3_B1, RF3_3_B1 cmul R_dat, RF2_3_B1, RF1_1_B1 (cos) cmul R_dat, RF2_3_B1, RF1_3_B1 (sin) cmul R_dat, RF2_3_B1, RF1_1_B1 (cos) cmul RF1_1_B2, RF2_3_B2, RF3_3_B2 cmul R_dat, RF2_3_B2, RF1_1_B2 (cos) cmul R_dat, RF2_3_B2, RF1_3_B2 (sin) cmul R_dat, RF2_3_B2, RF1_1_B2 (cos) . . .
TABLE-US-00031 TABLE 27 nop | RF1_B1=U nop nop nop nop | RF1_B2=U nop nop nop jcmulacc acc, L_dat, RF1_B1 ; c* .a1 jcmulacc RF1_B1, L_dat, U_dat, acc | latch=1,2 ; a1=c* .a1+s* .ak cmulacc acc, L_dat, latch2 ; c.ak ncmulacc D_dat, latch1, RF1_B1, acc ; b=c.ak−s.a1 jcmulacc acc, L_dat, RF1_B2 jcmulacc RF1_B2, L_dat, U_dat, acc | latch=1,2 cmulacc acc, L_dat, latch2 ncmulacc D_dat, latch1, RF1_B2, acc ...
[0147] Gram-Schmidt QR decomposition. GS (Gram-Schmidt) QR decomposition is a canonical and widely used matrix decomposition algorithm. The procedure is shown below:
[0148]
TABLE-US-00032 TABLE 28 Input Isqrt MAC MAC Output (IS1) (IS 2) (IS3) (IS4) (IS5) mv RF1_1, U_dat cjmulacc acc, U_dat, U_dat mv RF1_2, U_dat cjmulacc acc, U_dat, U_dat mv RF1_3, U_dat cjmulacc acc, U_dat, U_dat mv RF1_4, U_dat cjmulacc RF2_1, U_dat, U_dat isqrt RF2_1, RF2_1 mv RF1_5, U_dat cjmulacc acc, U_dat, U_dat mv RF1_6, U_dat cjmulacc acc, U_dat, U_dat cmul RF3_1 &R_dat, RF2_1, RF1_1 mv RF1_7, U_dat cjmulacc acc, U_dat, U_dat cmul RF3_2 &R_dat, RF2_1, RF1_2 mv RF1_8, U_dat cjmulacc RF2_2, U_dat, U_dat cmul RF3_3 &R_dat, RF2_1, RF1_3 cmul RF3_4 & R_dat, RF2_1, RF1_4 . . . isqrt RF2_2, RF2_2 mv R_dat, RF3_1 mv R_dat, RF3_2 mv R_dat, RF3_3 mv R_dat, RF3_4 . . . . . . . . . . . .
TABLE-US-00033 TABLE 29 ; calculate <q, a> cmulacc acc, L_dat, U_dat | RF1=U_dat cmulacc acc, L_dat, U_dat | RF2=U_dat cmulacc acc, L_dat, U_dat | RF3=U_dat cmulacc RF5, L_dat, U_dat | RF4=U_dat NOP ... NOP ; calculate a - <q,a>q nop, 0, 0, RF5 | latch=2 ncmulsub D_dat, L_dat, latch, RF1 ncmulsub D_dat, L_dat, latch, RF2 ncmulsub D_dat, L_dat, latch, RF3 ncmulsub D_dat, L_dat, latch, RF4 ...
[0149] Average throughput estimation. Tables 30 and 31 provide a rough estimate of the average throughput of the matrix decomposition examples discussed above. The parameters are defined as: matrix size of N×N, the size of one batch (number of matrices in one batch) is LenB, the gap between two consecutive batches is LenG clock cycles, and the delay of multiply-accumulate operations in each M PE 112 is DAcc.
TABLE-US-00034 TABLE 30 Decom- Givens- Gram- position Cholesky LU Rotation QR Schmidt QR Throughput (cycles per matrix)
TABLE-US-00035 TABLE 31 Givens- Gram- Decomposition Cholesky LU Rotation QR Schmidt QR Throughput N + 1 N + 1 4N + 1 2N + 12 (cycles per matrix)
EXAMPLE EMBODIMENTS
[0150] Various example embodiments, representing a non-limiting set of embodiments that may follow from this disclosure, are provided below.
[0151] EXAMPLE EMBODIMENT 1. A system comprising: [0152] programmable spatial array processing circuitry comprising: [0153] a processing element array having an array of processing elements, wherein the processing element array: [0154] when programmed with a first instructions, performs a first type of matrix decomposition; and [0155] when programmed with second instructions, performs a second type of matrix decomposition; and [0156] instruction memory that provides a portion of the first instructions or a portion of the second instructions sequentially to one processing element of a row of processing elements to sequentially propagate to other processing elements of the row of processing elements.
[0157] EXAMPLE EMBODIMENT 2. The system of example embodiment 1, wherein the array of processing elements has a triangular arrangement.
[0158] EXAMPLE EMBODIMENT 3. The system of example embodiment 1, wherein the array of processing elements comprise at least three different types of processing elements.
[0159] EXAMPLE EMBODIMENT 4. The system of example embodiment 3, wherein the at least three different types of processing elements comprise: [0160] a first diagonal processing element type that only receives data from above it in the array and only outputs data to a first side in the array; [0161] a second internal processing element type that only receives data from above it in the array and from a second side of it in the array and only outputs data to the first side of it in the array and down from it in the array; and [0162] a third vector processing element type that that only receives data from above it in the array and to from the second side of it in the array and outputs results of the processing element array.
[0163] EXAMPLE EMBODIMENT 5. The system of example embodiment 3, wherein the instruction memory comprises: [0164] a first instruction memory that stores instructions for all of the processing elements of a first type of the at least three different types of processing elements; [0165] a second instruction memory that stores instructions for all of the processing elements of a second type of the at least three different types of processing elements; and [0166] a third instruction memory that stores instructions for all of the processing elements of a third type of the at least three different types of processing elements.
[0167] EXAMPLE EMBODIMENT 6. The system of example embodiment 1, wherein the array of processing elements comprise: [0168] a first processing element having a first architecture that includes a plurality of issue slots; and [0169] a second processing element having a second architecture different from the first architecture that includes an arithmetic logic unit (ALU) configurable to perform multiply-accumulate operations.
[0170] EXAMPLE EMBODIMENT 7. The system of any of example embodiments 1-6, wherein the one processing element of the row of processing elements and the other processing elements of the row of processing elements do not contain [0171] instruction memory to hold more than one instruction at a time.
[0172] EXAMPLE EMBODIMENT 8. The system of any of example embodiments 1-6, wherein the first type of matrix decomposition comprises at least one of Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, and Gram-Schmidt QR decomposition, and wherein the second type of matrix decomposition comprises a different one of Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, and Gram-Schmidt QR decomposition.
[0173] EXAMPLE EMBODIMENT 9. The system of any of example embodiments 1-6, comprising host processing circuitry configurable to perform a task that involves performing the first type of matrix decomposition and cause the processing element array to be programmed with the first instructions.
[0174] EXAMPLE EMBODIMENT 10. The system of example embodiment 9, comprising a plurality of antennas, wherein the host processing circuitry and the programmable spatial array processing circuitry are configured to perform the first type of matrix decomposition or the second type of matrix decomposition to carry out a multiple-input multiple-output (MIMO) computation.
[0175] EXAMPLE EMBODIMENT 11. An article of manufacture comprising one or more tangible, non-transitory, machine readable media comprising instructions that, when executed by processing circuitry, cause the processing circuitry to: [0176] instruct a triangular spatial array of processing elements to perform a first matrix decomposition; and [0177] instruct the triangular spatial array of processing elements to perform a second matrix decomposition.
[0178] EXAMPLE EMBODIMENT 12. The article of manufacture of example embodiment 11, wherein the instructions cause the processing circuitry to instruct the triangular spatial array of processing elements to perform the first matrix decomposition on a batch of matrices.
[0179] EXAMPLE EMBODIMENT 13. The article of manufacture of example embodiment 12, wherein the instructions cause the processing circuitry to: [0180] instruct a first buffer to send the batch of matrices to the triangular spatial array of processing elements to be processed at least partly in parallel; and [0181] instruct a delay alignment buffer to realign resulting matrices output by the triangular spatial array of processing elements.
[0182] EXAMPLE EMBODIMENT 14. The article of manufacture of example embodiment 11, wherein the instructions cause the processing circuitry to perform Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, or Gram-Schmidt QR decomposition as the first matrix decomposition and perform a different one of Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, or Gram-Schmidt QR decomposition as the second matrix decomposition.
[0183] EXAMPLE EMBODIMENT 15. The article of manufacture of any of example embodiments 11-14, wherein the instructions cause the processing circuitry to send an instruction having a data structure that includes an indication of a function to be performed and a time-to-live value, wherein the time-to-live value causes that instruction to be ignored once that instruction has propagated through processing elements of the processing element array more than indicated by the time-to-live value.
[0184] EXAMPLE EMBODIMENT 16. An integrated circuit comprising programmable spatial array processing circuitry comprising: [0185] a buffer that outputs a batch of matrices; [0186] a processing element array comprising a plurality of processing elements that sequentially propagate instructions to adjacent processing elements rather than store all of the instructions in respective instruction memories, wherein the processing element array is programmable to: [0187] perform a first matrix decomposition on the batch of matrices when provided a first set of the instructions; and [0188] perform a second matrix decomposition on the batch of matrices when provided a second set of the instructions.
[0189] EXAMPLE EMBODIMENT 17. The integrated circuit of example embodiment 16, wherein the processing element array performs the first matrix decomposition or the second matrix decomposition on at least two of the batch of matrices at least partly in parallel.
[0190] EXAMPLE EMBODIMENT 18. The integrated circuit of example embodiments 16 or 17, wherein the processing element array comprises a processing element having a plurality of issue slots and a plurality of register files, wherein the at least two of the plurality of issue slots are programmable to operate in parallel to perform part of the first matrix decomposition or the second matrix decomposition.
[0191] EXAMPLE EMBODIMENT 19. The integrated circuit of example embodiments 16 or 17, wherein the processing element array outputs results from the first matrix decomposition or the second matrix decomposition staggered in time such that results corresponding to a first matrix of the batch of matrices overlap in time with results corresponding to a second matrix of the batch of matrices.
[0192] EXAMPLE EMBODIMENT 20. The integrated circuit of example embodiment 19, comprising a delay alignment buffer that aligns the results corresponding to the first matrix of the batch of matrices in time and aligns the results corresponding to the second matrix of the batch of matrices in time.
[0193] EXAMPLE EMBODIMENT 21. A method comprising:
[0194] using a triangular spatial array of processing elements to perform a first type of matrix decomposition; and
[0195] using the same triangular spatial array of processing elements to perform a second type of matrix decomposition.
[0196] EXAMPLE EMBODIMENT 22. The method of example embodiment 21, wherein using the triangular spatial array of processing elements to perform the first type of matrix decomposition comprises: [0197] using a first buffer to send the batch of matrices to the triangular spatial array of processing elements to be processed at least partly in parallel; and [0198] using a delay alignment buffer to realign resulting matrices output by the triangular spatial array of processing elements.
[0199] EXAMPLE EMBODIMENT 23. The method of example embodiment 21 or 22, wherein the first type of matrix decomposition comprises Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, or Gram-Schmidt QR decomposition and the second type of matrix decomposition comprises a different one of Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-Rotation QR based MMSE, or Gram-Schmidt QR decomposition.
[0200] EXAMPLE EMBODIMENT 24. A system comprising: [0201] a triangular processing element array having an array of processing elements, wherein the triangular processing element array comprises: [0202] means for performing a first type of matrix decomposition at a first time; and [0203] means for performing a second type of matrix decomposition at a second time.
[0204] EXAMPLE EMBODIMENT 25. The system of example embodiment 24, comprising: [0205] means for programming the triangular processing element array, wherein the processing elements do not include separate instruction memories that store more than one instruction.
[0206] While the embodiments set forth in the present disclosure may be susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and have been described in detail herein. However, it should be understood that the disclosure is not intended to be limited to the particular forms disclosed. The disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the following appended claims. Moreover, the techniques presented and claimed herein are referenced and applied to material objects and concrete examples of a practical nature that demonstrably improve the present technical field and, as such, are not abstract, intangible or purely theoretical. Further, if any claims appended to the end of this specification contain one or more elements designated as “means for [perform]ing [a function] . . . ” or “step for [perform]ing [a function] . . . ”, it is intended that such elements are to be interpreted under 35 U.S.C. 112(f). However, for any claims containing elements designated in any other manner, it is intended that such elements are not to be interpreted under 35 U.S.C. 112(f).