PARTITIONING SPARSE MATRICES BASED ON SPARSE MATRIX REPRESENTATIONS FOR CROSSBAR-BASED ARCHITECTURES
20200159810 · 2020-05-21
Inventors
- Chinmay Ghosh (Bangalore, IN)
- Soumitra Chatterjee (Bangalore, IN)
- Mashood Abdulla Kodavanji (Bangalore, IN)
- Mohan Parthasarathy (Bangalore, IN)
CPC classification
G06F17/16
PHYSICS
G06F9/30036
PHYSICS
G06F9/5077
PHYSICS
International classification
G06F17/16
PHYSICS
G06F9/30
PHYSICS
G06F9/50
PHYSICS
Abstract
Example implementations relate to a domain specific programming language (DSL) compiler for large scale sparse matrices. A method can comprise partitioning a sparse matrix into a plurality of submatrices based on a sparse matrix representation and inputting each one of the submatrices into a respective one of a plurality of matrix-vector multiplication units (MVMUs) of a crossbar-based architecture.
Claims
1. A method, comprising: partitioning a sparse matrix into a plurality of submatrices based on a sparse matrix representation; and inputting each one of the submatrices into a respective one of a plurality of matrix-vector multiplication units (MVMUs) of a crossbar-based architecture.
2. The method of claim 1, wherein inputting each one of the submatrices includes inputting each one of the submatrices into a respective one of a plurality of MVMUs of a dot product engine (DPE).
3. The method of claim 1, wherein partitioning the sparse matrix includes partitioning the sparse matrix using a domain specific programming language (DSL) compiler.
4. The method of claim 1, further comprising partitioning the sparse matrix into the plurality of submatrices based on dimensions of the MVMUs.
5. The method of claim 1, further comprising performing a matrix-vector multiplication (MVM) operation on the plurality of submatrices, in parallel, using the MVMUs.
6. The method of claim 1, wherein partitioning the sparse matrix includes, for each row of the sparse matrix: sorting non-zero column positions of the sparse matrix representation in increasing order; and rearranging non-zero elements of the sparse matrix according to the indices.
7. The method of claim 1, wherein partitioning the sparse matrix includes: traversing a first pointer associated with a first dimension of the sparse matrix; and iterating through non-zero elements of the sparse matrix according to a second pointer associated with a second dimension of the sparse matrix.
8. The method of claim 1, wherein the sparse matrix representation is a compressed sparse row (CSR) representation.
9. The method of claim 1, wherein the sparse matrix representation is a compressed sparse column (CSC) representation.
10. A non-transitory processor readable medium, comprising machine executable instructions that, when executed by a processor, cause the processor to: populate a plurality of submatrices of a sparse matrix with non-zero elements of the sparse matrix according to a compressed sparse row (CSR) representation of the sparse matrix, wherein dimensions of the submatrices are equal to dimensions of a plurality of crossbars; and input each one of the submatrices into a respective one of the crossbars.
11. The non-transitory processor readable medium of claim 10, further comprising machine executable instructions that, when executed by the processor, cause the processor to: traverse a row pointer of the CSR representation of the sparse matrix according to a height of the crossbars; traverse, from the row pointer, a column pointer of the CSR representation of the sparse matrix according to a width of the crossbars; and populate each row of each respective one of the submatrices with non-zero elements of the sparse matrix at column indices according to the traversal of the column pointer and the row pointer.
12. The non-transitory processor readable medium of claim 10, further comprising machine executable instructions that, when executed by the processor, cause the processor to: for each respective one of the submatrices: populate a subvector with values of the respective submatrix; and input the subvector and the respective submatrix into the respective one of the crossbars; and perform matrix-vector multiplication (MVM) operations, in parallel, on the subvectors and an input matrix using the crossbars.
13. The non-transitory processor readable medium of claim 10, further comprising machine executable instructions that, when executed by the processor, cause the processor to initialize the plurality of submatrices with zeros prior to populating the plurality of submatrices with non-zero elements of the sparse matrix according to the CSR representation of the sparse matrix.
14. A system, comprising: a dot product engine (DPE) compiler to: recognize at least one sparse matrix representation in a domain specific programming language (DSL); and partition a sparse matrix into a plurality of submatrices based on the at least one sparse matrix representation; and a DPE to: receive the plurality of submatrices; and perform matrix-vector multiplication (MVM) operations, in parallel, directly on the submatrices using tiles of the DPE.
15. The system of claim 14, wherein the at least one sparse matrix representation includes a set of three arrays described in the DSL.
16. The system of claim 15, wherein the set of three arrays includes: a first array representing a row pointer of the sparse matrix representation; a second array representing a column pointer of the sparse matrix representation; and a third array including values of the sparse matrix representation.
17. The system of claim 14, wherein the DPE is to sum results of the MVM operations according to row indices of the sparse matrix to generate a result vector.
18. The system of claim 14, wherein: the DPE compiler is to: generate metadata for each respective one of the submatrices indicating to which column indices of the sparse matrix each respective one of the submatrices correspond; and identify a subvector of each respective one of the submatrices based on the metadata; and the DPE is to perform MVM operations, in parallel, on the subvector of each respective one of the submatrices and an input matrix using crossbars of the DPE.
19. The system of claim 18, wherein the DPE compiler is to generate the metadata concurrently with partitioning the sparse matrix.
20. The system of claim 18, wherein the DPE compiler is to identify the subvector using the metadata as an index into an input vector.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0011] A DPE is an example of a crossbar-based architecture. A DPE is a high-density, power-efficient accelerator that utilizes the current accumulation feature of a memristor crossbar. A DPE, together with a fast conversion algorithm, can accelerate MVM operations in robust applications that do not require high computing accuracy, such as neural networks. This approach to performing MVM operations in the analog domain can be orders of magnitude more efficient than digital application-specific integrated circuit (ASIC) approaches, especially at increased crossbar array sizes.
[0012] A software development environment can be used to develop neural network models, targeting the DPE architecture, that take advantage of the parallel crossbars of the DPE architecture for performing MVM operations. The software development environment can use a domain specific programming language (DSL) and include a compiler that compiles a program written in a DSL into a DPE binary format and a loader that transfers data and instructions to the DPE and includes supporting libraries.
[0013] Sparse matrices and methods for using sparse matrices efficiently can be critical to the performance of many applications. As a result, sparse MVM operations can be of importance in computational science. Sparse MVM operations can represent a significant cost of iterative methods for solving large-scale linear systems, eigenvalue problems, and convolutional neural networks. Examples of sparse matrices include link matrices for links from one website to another and term occurrence matrices for words in an article against all known words in English.
[0014] Large scale sparse matrices can be a challenge in computations, such as MVM operations, because of the large memory and computational resource requirements of the computations. To address this challenge, sparse matrix representations, such as a CSR representation, that store only the non-zero elements of the sparse matrix can be used to reduce the consumption of memory and computational resources. However, previous approaches to sparse matrix representations do not support partitioning a sparse matrix without first rebuilding the complete sparse matrix in memory.
[0015] The disclosure enables a DPE DSL compiler to recognize sparse matrix representations, including but not limited to CSR, coordinate list (COO), compressed sparse column (CSC), ELLPACK (ELL), diagonal (DIA), and hybrid (HYB) ELL+COO. The disclosure includes partitioning a sparse matrix into denser (more non-zero elements than zero elements) submatrices suitable for crossbars of a DPE without expanding the CSR notation back into the complete sparse matrix, which can improve use of host memory and reduce data transfer to memory of a DPE. Because only submatrices with non-zero valued elements are considered, use of crossbar resources can be optimized, thereby enabling scaling to large-scale sparse matrices.
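For concreteness, the CSR representation recognized by the compiler can be sketched as follows. This is a minimal illustration, not taken from the disclosure: the `to_csr` helper and the matrix values are hypothetical, with the non-zero pattern chosen to match the example sparse matrix 460 discussed later in the description.

```python
# Minimal sketch (illustrative, not from the disclosure) of building the
# three CSR arrays. row_ptr[i] is the offset into col_ptr/values where
# row i's non-zero elements begin.

def to_csr(dense):
    """Convert a dense 2-D list into (rowPtr, columnPtr, values) CSR arrays."""
    row_ptr, col_ptr, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_ptr.append(j)
                values.append(v)
        row_ptr.append(len(values))
    return row_ptr, col_ptr, values

matrix = [
    [0, 0, 0, 0, 0],
    [7, 0, 0, 0, 0],   # non-zero in row 1 at column 0
    [0, 0, 3, 0, 0],   # non-zero in row 2 at column 2
    [0, 1, 0, 2, 4],   # non-zeros in row 3 at columns 1, 3, and 4
    [5, 0, 6, 0, 0],   # non-zeros in row 4 at columns 0 and 2
]
row_ptr, col_ptr, values = to_csr(matrix)
```

Only the seven non-zero elements are stored: `row_ptr` becomes `[0, 0, 1, 2, 5, 7]`, `col_ptr` becomes `[0, 2, 1, 3, 4, 0, 2]`, and `values` becomes `[7, 3, 1, 2, 4, 5, 6]`.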
[0018] The DPE programming language 224 can be a DSL that is defined by a set of data structures and application program interfaces (APIs). A non-limiting example of a DSL is a programming language based on C++ that is standardized by the International Organization for Standardization (ISO C++). The data structures and APIs can be building blocks of neural network algorithms implemented on a DPE, such as the DPE 100 described in association with
[0019] Operations to be performed on tensors as described by the DSL are captured by the computation graph 228. Each individual operation can be represented by one of the subgraphs 232. The computation graph 228 can be compiled into the DPE binary executable 234. The DPE binary executable 234 can be transferred for execution on a DPE, for example, by a loader component.
[0023] The sparse matrix 460 includes a non-zero value in row 1 at column 0. Accordingly, as shown in
[0024] The sparse matrix 460 includes a non-zero value in row 2 at column 2. Accordingly, as shown in
[0025] The sparse matrix 460 includes non-zero values in row 3 at columns 1, 3, and 4. Accordingly, as shown in
[0026] The sparse matrix 460 includes non-zero values in row 4 at columns 0 and 2. Accordingly, as shown in
[0027] A DPE DSL compiler, for example the DPE compiler frontend 226 and backend 230 described in association with
    template <typename T>
    CSRMatrix(std::vector<uint32_t> rowPtr,
              std::vector<uint32_t> columnPtr,
              std::vector<T> values);
[0028] The size of a matrix on which a crossbar, such as the crossbars 106 described in association with
[0029] In some examples, a row pointer of a CSR representation can be iterated through based on a dimension of the crossbars (MVMU_WIDTH) to partition a sparse matrix into submatrices. Non-zero elements pointed to by a column pointer of the CSR representation for each row obtained from the row pointer can be placed into a submatrix. The respective column indices of the non-zero elements can be added to a vector representing metadata of the submatrix. If a column index is already in the vector because the corresponding column has multiple non-zero elements, then the column index is not added again. Once a submatrix is filled with a quantity of columns equal to MVMU_WIDTH, the procedure described above can be repeated for subsequent submatrices. The metadata is used to select the respective elements of the input vector to be multiplied with the submatrix, forming a subvector of dimension MVMU_WIDTH. Thus, a subvector of each respective one of the submatrices can be identified based on the metadata, with the metadata entries serving as indices into the input vector. The metadata can be generated concurrently with partitioning the sparse matrix. The submatrix and the subvector together form the input to a crossbar for an MVM operation. MVM operations can be performed, in parallel, on the subvector of each respective one of the submatrices and an input matrix using crossbars (e.g., of the DPE). The outputs from the multiple crossbars can be summed according to the row indices of the original sparse matrix to form a result vector.
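The partitioning described above can be sketched as follows. This is a hedged, simplified variant (the name `partition_band` and all other identifiers are illustrative): rather than inserting columns incrementally, it collects the distinct non-zero column indices of one band of rows up front and chunks them by the crossbar width, which yields the same dense submatrices and metadata vectors for a single band.

```python
# Hedged sketch of partitioning one band of rows into dense submatrices.
# Simplification relative to the procedure above: distinct non-zero column
# indices are sorted up front instead of being inserted incrementally.

def partition_band(row_ptr, col_ptr, values, n_rows, width):
    """Pack the non-zero columns of a band of n_rows rows into dense blocks.

    Returns a list of (submatrix, metadata) pairs, where metadata[p] is the
    original column index stored in packed column p of the submatrix.
    """
    # distinct non-zero column indices of the band, in increasing order
    cols = sorted({col_ptr[k] for k in range(row_ptr[0], row_ptr[n_rows])})
    blocks = []
    for start in range(0, len(cols), width):
        meta = cols[start:start + width]            # metadata for this block
        sub = [[0] * width for _ in range(n_rows)]  # zero-initialized
        for i in range(n_rows):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                if col_ptr[k] in meta:
                    sub[i][meta.index(col_ptr[k])] = values[k]
        blocks.append((sub, meta))
    return blocks

# CSR arrays of a hypothetical 5x5 sparse matrix, crossbar width 3
blocks = partition_band([0, 0, 1, 2, 5, 7], [0, 2, 1, 3, 4, 0, 2],
                        [7, 3, 1, 2, 4, 5, 6], n_rows=5, width=3)
```

For this example, only two dense blocks (with metadata vectors `[0, 1, 2]` and `[3, 4]`) would be produced and loaded into crossbars, rather than the full 5x5 matrix.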
[0030] Although
[0033] Each of the submatrices 572 has a corresponding one of the subvectors 574. The subvectors 574-1, 574-2, . . . 574-k are collectively referred to as the subvectors 574. Each of the subvectors 574 includes metadata for a corresponding one of the submatrices 572. The subvector 574-1 includes the column indices of the sparse matrix 570 that are included in the submatrix 572-1: column indices 1, 3, and 6. Similarly, the subvector 574-2 includes the column indices of the sparse matrix 570 that are included in the submatrix 572-2: column indices 8, 10, and 11. The subvector 574-k includes the column indices of the sparse matrix 570 that are included in the submatrix 572-k: column indices 3, 5, and 11.
[0034] An example method consistent with the present disclosure can include sorting, for each row of a sparse matrix, column indices of the sparse matrix that include non-zero elements in increasing order and rearranging the non-zero elements according to their respective column indices. For each MVMU_WIDTH quantity of rows of the sparse matrix, the column indices of the column pointer can be iterated through to find the lowest column index. Iterating through the column pointer can include obtaining the respective first column index.
[0035] For each MVMU_WIDTH quantity of rows of the sparse matrix, a tuple including a row index, a column index, and the value of each non-zero element can be generated and inserted into a list. The list of tuples can be sorted by the column indices. The lowest column index can be obtained. If the corresponding column of the submatrix already has a non-zero element of the sparse matrix, then the corresponding metadata has already been set and the non-zero element can be added to the submatrix. Otherwise, the column index can be added to the metadata and the non-zero element can be added to the next column of the submatrix. The next non-zero element for the same row can be obtained, a tuple can be formed, and the tuple can be added to the sorted list using an insertion sort, for example. This process can continue until the MVMU_WIDTH quantity of columns has been added to the submatrix. The submatrices can be initialized with all zeroes such that the non-zero values added to the submatrix replace zero values of the submatrix.
[0036] Once the MVMU_WIDTH quantity of columns has been added to the submatrix, a new submatrix can be formed. This can continue for the next set of the MVMU_WIDTH quantity of rows until all the rows of the sparse matrix are processed (the end of the row pointer is reached).
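The tuple-based procedure of the preceding paragraphs can be sketched more literally as follows. This is a hypothetical illustration (all identifiers are invented for it): a binary heap stands in for the insertion-sorted tuple list, preserving the increasing-column processing order that the text describes, with one pending (column, row, value) tuple per row that is refilled from the same row after each removal.

```python
import heapq

# Literal-minded sketch of the tuple procedure above. The heap stands in
# for the insertion-sorted tuple list: tuples pop in increasing column
# order, the popped value is placed into the current submatrix, and the
# next non-zero element of the same row is inserted in its place.

def partition_band_tuples(row_ptr, col_ptr, values, n_rows, width):
    nxt = list(row_ptr[:n_rows])    # next unconsumed CSR offset per row
    heap = []
    for i in range(n_rows):
        if nxt[i] < row_ptr[i + 1]:
            heapq.heappush(heap, (col_ptr[nxt[i]], i, values[nxt[i]]))
            nxt[i] += 1
    blocks = []
    sub = [[0] * width for _ in range(n_rows)]   # zero-initialized submatrix
    meta = []                                    # metadata: column indices
    while heap:
        j, i, v = heapq.heappop(heap)
        if j not in meta:
            if len(meta) == width:               # submatrix full: start anew
                blocks.append((sub, meta))
                sub = [[0] * width for _ in range(n_rows)]
                meta = []
            meta.append(j)
        sub[i][meta.index(j)] = v
        if nxt[i] < row_ptr[i + 1]:              # refill from the same row
            heapq.heappush(heap, (col_ptr[nxt[i]], i, values[nxt[i]]))
            nxt[i] += 1
    if meta:
        blocks.append((sub, meta))
    return blocks

blocks = partition_band_tuples([0, 0, 1, 2, 5, 7], [0, 2, 1, 3, 4, 0, 2],
                               [7, 3, 1, 2, 4, 5, 6], n_rows=5, width=3)
```

Because the heap always yields the lowest pending column index, same-column elements from different rows land in the same packed column, and a full submatrix is closed before the next new column starts a fresh one.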
[0038] As indicated by the horizontal arrows 665, in each iteration an MVMU_WIDTH quantity of rows is traversed. Row index I_1 of the row pointer 662 points to column index J_1 of the column pointer 664. Value V_J1 is the value of the non-zero element at row index I_1 and column index J_1, value V_J2 is the value of the non-zero element at row index I_1 and column index J_2, value V_J3 is the value of the non-zero element at row index I_1 and column index J_3, and so on. The first non-zero element for row I_1 forms a tuple consisting of I_1, J_1, and V_J1, which can be inserted into the list of non-zero tuples 668.
[0039] Row index I_2 of the row pointer 662 points to column index K_1 of the column pointer 664. Value V_K1 is the value of the non-zero element at row index I_2 and column index K_1, value V_K2 is the value of the non-zero element at row index I_2 and column index K_2, and value V_K3 is the value of the non-zero element at row index I_2 and column index K_3. The first non-zero element for row I_2 forms a tuple consisting of I_2, K_1, and V_K1, which can be inserted into the list of non-zero tuples 668. The process described above can continue for an MVMU_WIDTH quantity of rows.
[0040] The list of tuples 668 can be sorted in increasing order of the column value of each tuple. Each element can be removed from the head of the list of tuples 668 and its value inserted into the columns of the submatrix in increasing order. The column position of each value in the input matrix, indicated by the second value in the tuple, can be added to the submatrix metadata (e.g., to the subvector 574-1). If a value for the same column has already been added, this step can be skipped. A new non-zero entry for the same row is determined and added at the appropriate position in the already sorted tuple list 668-1. This process continues until an MVMU_WIDTH quantity of columns has been added to the submatrix. Further non-zero elements can be added to a new submatrix. This process continues until all the elements for the MVMU_WIDTH quantity of rows are processed.
[0041] Although
[0044] The processor 880 can be a central processing unit (CPU), a microprocessor, and/or other hardware device suitable for retrieval and execution of instructions stored in the machine-readable storage medium 882. In the particular example shown in
[0045] The machine-readable storage medium 882 can be any electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, the machine-readable storage medium 882 can be, for example, Random Access Memory (RAM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. The executable instructions can be installed on the system 881 illustrated in
[0046] The instructions 884, when executed by a processor such as the processor 880, can cause the system 881 to populate a plurality of submatrices of a sparse matrix with non-zero elements of the sparse matrix according to a CSR representation of the sparse matrix. Dimensions of the submatrices can be equal to dimensions of a plurality of crossbars.
[0047] Instructions 886, when executed by a processor such as the processor 880, can cause the system 881 to input each one of the submatrices into a respective one of the crossbars.
[0048] Although not specifically illustrated in
[0049] Although not specifically illustrated in
[0051] At 994, the method can include inputting each one of the submatrices into a respective one of a plurality of MVMUs of a crossbar-based architecture. In some examples, each one of the submatrices can be input into a respective one of a plurality of MVMUs of a DPE. In some examples, partitioning the sparse matrix can be performed using a DSL compiler.
[0052] Although not illustrated in
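The way the partitioned (submatrix, metadata) pairs could be consumed can be sketched end to end as follows. This is a hedged illustration with invented names: the metadata entries index into the input vector to gather a dense subvector, a plain dense matrix-vector multiply stands in for one crossbar MVMU per block, and the partial results are summed row-wise into the result vector.

```python
# Hedged sketch (illustrative names) of performing the MVM on the
# partitioned blocks: gather a subvector via the metadata, multiply each
# block as if on its own crossbar, and sum the partial results per row.

def gather_subvector(metadata, x, width):
    """Use the metadata column indices as indices into input vector x."""
    sv = [0] * width
    for packed, j in enumerate(metadata):
        sv[packed] = x[j]
    return sv

def sparse_mvm(blocks, x, n_rows, width):
    result = [0] * n_rows
    for sub, meta in blocks:          # in hardware: one MVMU per block
        sv = gather_subvector(meta, x, width)
        for i in range(n_rows):       # dense MVM emulating the crossbar
            result[i] += sum(sub[i][p] * sv[p] for p in range(width))
    return result

# Two hand-built blocks for the 2x3 matrix [[2, 0, 3], [0, 4, 0]], width 2:
blocks = [([[2, 0], [0, 4]], [0, 1]), ([[3, 0], [0, 0]], [2])]
result = sparse_mvm(blocks, [1, 2, 3], n_rows=2, width=2)
```

For the input vector `[1, 2, 3]` this reproduces the dense product: `result` is `[11, 8]`.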
[0053] In the foregoing detailed description of the present disclosure, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration how examples of the disclosure may be practiced. These examples are described in sufficient detail to enable those of ordinary skill in the art to practice the examples of this disclosure, and it is to be understood that other examples may be utilized and that process, electrical, and/or structural changes may be made without departing from the scope of the present disclosure.
[0054] The figures herein follow a numbering convention in which the first digit corresponds to the drawing figure number and the remaining digits identify an element or component in the drawing. Similar elements or components between different figures may be identified by the use of similar digits. Elements shown in the various figures herein can be added, exchanged, and/or eliminated so as to provide a plurality of additional examples of the present disclosure. In addition, the proportion and the relative scale of the elements provided in the figures are intended to illustrate the examples of the present disclosure and should not be taken in a limiting sense.