MATRIX COMPUTING METHOD, CHIP, AND RELATED DEVICE
20240338174 · 2024-10-10
Inventors
CPC classification
G06F9/3836
PHYSICS
G06F17/16
PHYSICS
International classification
G06F5/16
PHYSICS
G06F17/16
PHYSICS
Abstract
This application provides a matrix computing method, a chip, and a related device. The chip includes a first buffer configured to buffer a first vector and a second buffer configured to buffer a second vector. A scheduling module generates a selection signal based on a bitmap of the first vector. The selection signal may cause the processing element to obtain, from the first buffer, a group of non-zero elements in the first vector, and to obtain, from the second buffer, a group of elements in the second vector. An operation between the first vector and the second vector is performed based on the group of non-zero elements in the first vector and the group of elements in the second vector. In this application, an element whose value is 0 in one vector may be excluded from computing, to reduce a computing amount.
Claims
1. A chip, comprising: a first buffer, configured to buffer a first vector; a second buffer, configured to buffer a second vector; a first scheduler, configured to generate a first selection signal based on a bitmap of the first vector, wherein the first selection signal causes a first processor to obtain, from the first buffer, a first group of non-zero elements in the first vector, causes the first processor to obtain, from the second buffer, a second group of elements in the second vector, and the bitmap of the first vector indicates a non-zero element in the first vector; and the first processor, configured to implement an operation between the first vector and the second vector based on the first group of non-zero elements and the second group of elements.
2. The chip according to claim 1, wherein the chip further comprises a first multiplexer and a second multiplexer, wherein the first multiplexer is configured to obtain, from the first buffer, the first group of non-zero elements in the first vector based on the first selection signal, and input the first group of non-zero elements into the first processor; and the second multiplexer is configured to obtain, from the second buffer, the second group of elements in the second vector based on the first selection signal, and input the second group of elements into the first processor.
3. The chip according to claim 2, wherein the first multiplexer comprises K multi-path multiplexers, the second multiplexer comprises K multi-path multiplexers, the first buffer and the second buffer both comprise W rows and K columns of data units, and each data unit is configured to buffer one element; each multi-path multiplexer in the first multiplexer is connected to a plurality of data units in the first buffer, and each data unit is connected to at least one multi-path multiplexer; and a connection relationship between an i-th multi-path multiplexer in the second multiplexer and a data unit in the second buffer is the same as a connection relationship between an i-th multi-path multiplexer in the first multiplexer and a data unit in the first buffer.
4. The chip according to claim 3, wherein the first scheduler is specifically configured to: determine, based on the bitmap of the first vector, that an element stored in a k-th data unit in data units connected to a j-th multi-path multiplexer in the first multiplexer is a non-zero element; and the first scheduler generates a selection signal of the j-th multi-path multiplexer, and sends the selection signal of the j-th multi-path multiplexer to the j-th multi-path multiplexer in the first multiplexer and a j-th multi-path multiplexer in the second multiplexer, wherein the first selection signal comprises the selection signal of the j-th multi-path multiplexer in the first multiplexer.
5. The chip according to claim 1, wherein the first scheduler is further configured to generate an erasing signal after the operation between the first vector and the second vector is implemented, wherein the erasing signal indicates the first buffer and the second buffer to erase currently buffered data.
6. The chip according to claim 4, wherein the first vector belongs to any one of rows in a first matrix, and the second vector belongs to any one of columns in a second matrix.
7. The chip according to claim 6, wherein the chip further comprises a third buffer and a second processor, wherein the third buffer is configured to buffer a third vector, wherein the third vector belongs to a column in the second matrix other than a column in which the second vector is located, and the first selection signal is further used to cause the second processor to obtain, from the third buffer, a third group of elements in the third vector; and the second processor is configured to implement an operation between the first vector and the third vector based on the first group of non-zero elements and the third group of elements.
8. The chip according to claim 7, wherein the chip further comprises a third multiplexer, configured to obtain, from the third buffer, the third group of elements in the third vector based on the first selection signal, and input the third group of elements into the second processor.
9. The chip according to claim 6, wherein the chip further comprises a fourth buffer, a second scheduler and a third processor, wherein the fourth buffer is configured to buffer a fourth vector, wherein the fourth vector belongs to a row in the first matrix other than a row in which the first vector is located; the second scheduler is configured to generate a second selection signal based on a bitmap of the fourth vector, wherein the second selection signal is used to cause the third processor to obtain, from the fourth buffer, a fourth group of non-zero elements in the fourth vector, cause the third processor to obtain, from the second buffer, a fifth group of elements in the second vector, and the bitmap of the fourth vector indicates a non-zero element in the fourth vector; and the third processor is configured to implement an operation between the fourth vector and the second vector based on the fourth group of non-zero elements and the fifth group of elements.
10. The chip according to claim 9, wherein the chip further comprises a fourth multiplexer and a fifth multiplexer, wherein the fourth multiplexer is configured to obtain, from the fourth buffer, the fourth group of non-zero elements in the fourth vector based on the second selection signal, and input the fourth group of non-zero elements into the third processor; and the fifth multiplexer is configured to obtain, from the second buffer, the fifth group of elements in the second vector based on the second selection signal, and input the fifth group of elements into the third processor.
11. A method for computing a matrix on a chip, wherein the method comprises: buffering, by the chip, a first vector and a second vector, wherein the first vector is buffered in a first buffer, and the second vector is buffered in a second buffer; generating, by a first scheduler of the chip, a first selection signal based on a bitmap of the first vector, wherein the first selection signal causes a first processor to obtain, from the first buffer, a first group of non-zero elements in the first vector, causes the first processor to obtain, from the second buffer, a second group of elements in the second vector, and the bitmap of the first vector indicates a non-zero element in the first vector; and implementing, by the first processor of the chip, an operation between the first vector and the second vector based on the first group of non-zero elements and the second group of elements.
12. The method according to claim 11, wherein the chip further comprises a first multiplexer and a second multiplexer, and the method further comprises: obtaining, by the first multiplexer of the chip and from the first buffer, the first group of non-zero elements in the first vector based on the first selection signal, and inputting the first group of non-zero elements into the first processor; and obtaining, by the second multiplexer of the chip and from the second buffer, the second group of elements in the second vector based on the first selection signal, and inputting the second group of elements into the first processor.
13. The method according to claim 12, wherein the first multiplexer comprises K multi-path multiplexers, the second multiplexer comprises K multi-path multiplexers, the first buffer and the second buffer both comprise W rows and K columns of data units, and each data unit is configured to buffer one element; each multi-path multiplexer in the first multiplexer is connected to a plurality of data units in the first buffer, and each data unit is connected to at least one multi-path multiplexer; and a connection relationship between an i-th multi-path multiplexer in the second multiplexer and a data unit in the second buffer is the same as a connection relationship between an i-th multi-path multiplexer in the first multiplexer and a data unit in the first buffer.
14. The method according to claim 13, wherein the generating, by a first scheduler of the chip, a first selection signal based on a bitmap of the first vector comprises: determining, by the first scheduler of the chip based on the bitmap of the first vector, that an element stored in a k-th data unit in data units connected to a j-th multi-path multiplexer in the first multiplexer is a non-zero element; generating a selection signal of the j-th multi-path multiplexer; and sending the selection signal of the j-th multi-path multiplexer to the j-th multi-path multiplexer in the first multiplexer and a j-th multi-path multiplexer in the second multiplexer, wherein the first selection signal comprises the selection signal of the j-th multi-path multiplexer in the first multiplexer.
15. The method according to claim 11, wherein after the implementing, by the first processor of the chip, an operation between the first vector and the second vector based on the first group of non-zero elements and the second group of elements, the method further comprises: generating, by the first scheduler of the chip, an erasing signal, wherein the erasing signal indicates the first buffer and the second buffer to erase currently buffered data.
16. The method according to claim 14, wherein the first vector belongs to any one of rows in a first matrix, and the second vector belongs to any one of columns in a second matrix.
17. The method according to claim 16, wherein the chip further comprises a third buffer and a second processor, and the method further comprises: buffering, by the chip, a third vector, wherein the third vector is buffered in the third buffer, the third vector belongs to a column in the second matrix other than a column in which the second vector is located, and the first selection signal is further used to cause the second processor to obtain, from the third buffer, a third group of elements in the third vector; and implementing, by the second processor of the chip, an operation between the first vector and the third vector based on the first group of non-zero elements and the third group of elements.
18. The method according to claim 17, wherein the chip further comprises a third multiplexer, and the method further comprises: obtaining, by the third multiplexer of the chip and from the third buffer, the third group of elements in the third vector based on the first selection signal, and inputting the third group of elements into the second processor.
19. The method according to claim 16, wherein the chip further comprises a fourth buffer, a second scheduler and a third processor, and the method further comprises: buffering, by the chip, a fourth vector, wherein the fourth vector is buffered in the fourth buffer, and the fourth vector belongs to a row in the first matrix other than a row in which the first vector is located; generating, by the second scheduler of the chip, a second selection signal based on a bitmap of the fourth vector, wherein the second selection signal is used to cause the third processor to obtain, from the fourth buffer, a fourth group of non-zero elements in the fourth vector, cause the third processor to obtain, from the second buffer, a fifth group of elements in the second vector, and the bitmap of the fourth vector indicates a non-zero element in the fourth vector; and implementing, by the third processor of the chip, an operation between the fourth vector and the second vector based on the fourth group of non-zero elements and the fifth group of elements.
20. The method according to claim 19, wherein the chip further comprises a fourth multiplexer and a fifth multiplexer, and the method further comprises: obtaining, by the fourth multiplexer of the chip and from the fourth buffer, the fourth group of non-zero elements in the fourth vector based on the second selection signal, and inputting the fourth group of non-zero elements into the third processor; and obtaining, by the fifth multiplexer of the chip and from the second buffer, the fifth group of elements in the second vector based on the second selection signal, and inputting the fifth group of elements into the third processor.
Description
BRIEF DESCRIPTION OF DRAWINGS
[0062] To describe technical solutions in embodiments of this application more clearly, the following briefly describes the accompanying drawings for describing embodiments. It is clear that the accompanying drawings in the following descriptions show merely some embodiments of this application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
DESCRIPTION OF EMBODIMENTS
[0077] The following describes technical solutions in embodiments of this application with reference to accompanying drawings in embodiments of this application. It is clear that the described embodiments are merely a part of but not all of embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on embodiments of this application without creative efforts shall fall within the protection scope of this application.
[0078] Currently, in scenarios involving matrix multiplication operations, a pruning technology is usually used to perform sparsification processing on a matrix, so as to reduce a computing amount and improve computing efficiency. For example, after AI training is completed, structured pruning is performed on a weight matrix obtained through training to complete sparsification of the weight matrix, and the pruned weight matrix is then used for inference during AI inference. However, the foregoing method is applicable only to scenarios in which the matrix participating in computing is already determined before the pruning operation is performed, such as an AI inference scenario, and is not applicable to scenarios in which the sparsity degree of a matrix dynamically changes. For example, during AI training, a weight matrix, a gradient matrix, an activation matrix, or the like dynamically changes, and the sparsity degree of the matrix also dynamically changes. If matrix sparsification needs to be implemented in the AI training process, large time costs are required, and the acceleration brought by pruning-based sparsification is offset. Therefore, how to accelerate matrix computing when the sparsity degree of the matrix dynamically changes is an urgent technical problem to be resolved.
[0079] This application provides a chip, to implement an operation on a matrix whose sparsity degree dynamically changes.
[0080] When an operation between a first vector and a second vector needs to be performed, the first buffer 120 is configured to buffer all or a part of elements in the first vector, and the second buffer 140 is configured to buffer all or a part of elements in the second vector. The scheduling module 160 is configured to generate a selection signal based on data in the first buffer 120, and send the selection signal to the first multiplexer 130 and the second multiplexer 150. The selection signal is used to cause the first multiplexer 130 to obtain a group of non-zero elements from the connected first buffer 120, and cause the second multiplexer to obtain a group of elements from the connected second buffer 140. The first multiplexer 130 and the second multiplexer 150 input a group of data respectively obtained from corresponding buffers into the processing element 110, to cause the processing element 110 to perform the operation between the first vector and the second vector.
[0081] The following describes in detail, with reference to the accompanying drawings, each part included in the chip 100.
[0085] It should be noted that the arrangement in which each multi-path multiplexer is connected to eight data units, as shown in the accompanying drawing, is merely an example.
[0086] In this embodiment of this application, each multi-path multiplexer reads a non-zero element from a plurality of connected data units each time. When determining a data unit in which data is to be read by a multi-path multiplexer, the scheduling module 160 determines, based on priorities and starting from a data unit with a highest priority, whether an element in a data unit with a priority of 1 is 0. If the element in the data unit with the priority of 1 is not 0, the scheduling module 160 generates a selection signal of the data unit with the priority of 1, so that the multi-path multiplexer reads the element in the data unit with the priority of 1; or if the element in the data unit with the priority of 1 is 0, the scheduling module 160 further determines whether an element in a data unit with a priority of 2 is 0. If the element in the data unit with the priority of 2 is not 0, the scheduling module 160 generates a selection signal of the data unit with the priority of 2, so that the multi-path multiplexer reads the element in the data unit with the priority of 2; or if the element in the data unit with the priority of 2 is 0, the scheduling module 160 further determines whether an element in a data unit with a priority of 3 is 0. By analogy, the scheduling module 160 finds, in a priority sequence, a data unit whose stored element is not 0, and then generates a selection signal corresponding to the data unit, so that the multi-path multiplexer reads the element in the data unit and sends the element to the processing element 110, to cause the processing element 110 to perform a dot product operation.
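The priority scan described above can be sketched in Python. This is a behavioral model for illustration only, not the patented circuit; the function name, the eight-unit fan-in, and the return convention are assumptions.

```python
# Behavioral sketch of the scheduling module's priority scan: walk the data
# units connected to one multi-path multiplexer in priority order and select
# the first unit holding a non-zero element.

def select_data_unit(units):
    """Return the 0-based priority index of the first non-zero data unit,
    or None if every connected unit holds 0."""
    for idx, value in enumerate(units):  # idx 0 corresponds to priority 1
        if value != 0:
            return idx
    return None  # all connected units hold 0: the multiplexer outputs 0

# Units with priorities 1 and 2 hold 0; the priority-3 unit is selected.
print(select_data_unit([0, 0, 7, 0, 3, 0, 0, 0]))  # -> 2
```

With this convention, the fallback described at the end of paragraph [0087] — sending the element 0 when all connected units hold 0 — corresponds to the `None` case.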
[0087] One buffer is connected to K multi-path multiplexers. The scheduling module 160 needs to generate, in one periodicity, K selection signals DS_j for the K multi-path multiplexers corresponding to one buffer, where j=1, 2, . . . , K, and j is a positive integer; each multi-path multiplexer corresponds to one selection signal. In other words, in each periodicity, each multi-path multiplexer needs to read one element from the buffer based on the selection signal of the scheduling module 160 and send the element to the processing element 110. In one periodicity, the multiplexer can obtain K elements from the connected buffer by using the K multi-path multiplexers. It should be noted that, if the elements in the plurality of data units connected to one multi-path multiplexer are all 0, the multi-path multiplexer sends the element 0 to the processing element 110.
[0088] In this embodiment of this application, the scheduling module 160 determines, based on a bitmap corresponding to the data stored in the buffer, whether the element in each data unit is 0.
[0089] When the scheduling module 160 needs to generate a selection signal of a multi-path multiplexer, the scheduling module 160 first determines, whether a value of a bit corresponding to a data unit with a priority of 1 in data units connected to the multi-path multiplexer in the bitmap is 0. If the value of the bit corresponding to the data unit with the priority of 1 in the bitmap is 1, it indicates that an element in the data unit with the priority of 1 is not 0. The scheduling module 160 generates a selection signal corresponding to the data unit with the priority of 1, and sends the selection signal to the multi-path multiplexer. If the value of the bit corresponding to the data unit with the priority of 1 in the bitmap is 0, it indicates that an element in the data unit with the priority of 1 is 0. The scheduling module 160 further determines whether a value of a bit corresponding to a data unit with a priority of 2 in the data units connected to the multi-path multiplexer in the bitmap is 0. If the value of the bit corresponding to the data unit with the priority of 2 in the bitmap is 1, it indicates that an element in the data unit with the priority of 2 is not 0. The scheduling module generates a selection signal corresponding to the data unit with the priority of 2, and sends the selection signal to the multi-path multiplexer. If the value of the bit corresponding to the data unit with the priority of 2 in the bitmap is 0, it indicates that the element in the data unit with the priority of 2 is 0. The scheduling module 160 further determines whether a value of a bit corresponding to a data unit with a priority of 3 in the data units connected to the multi-path multiplexer in the bitmap is 0. By analogy, details are not described herein again.
[0090] It should be noted that, after the scheduling module 160 controls a multi-path multiplexer to obtain a non-zero element from a data unit, the scheduling module 160 needs to set the bit corresponding to the data unit in the bitmap to 0, to prevent the element in the data unit from being repeatedly read, and to avoid a computing error that would occur if a non-zero element in a data unit with a lower priority were skipped and excluded from the operation.
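The read-then-clear behavior of paragraph [0090] can be modeled as follows; a sketch under the assumption that the bitmap is a mutable list of 0/1 bits indexed like the data units, with all names illustrative.

```python
def select_and_clear(bitmap, connected):
    """Pick the highest-priority connected data unit whose bitmap bit is 1,
    clear that bit so the same element cannot be read twice, and return its
    index. `connected` lists the data-unit indices in priority order."""
    for idx in connected:
        if bitmap[idx] == 1:
            bitmap[idx] = 0  # cleared after the read, per paragraph [0090]
            return idx
    return None  # all connected bits are 0

bitmap = [0, 1, 1, 0]
assert select_and_clear(bitmap, [0, 1, 2, 3]) == 1  # priority-2 unit chosen
assert bitmap == [0, 0, 1, 0]                       # its bit is now cleared
assert select_and_clear(bitmap, [0, 1, 2, 3]) == 2  # next read moves on
```

Clearing the bit is what lets later periodicities reach the lower-priority non-zero units that were skipped in earlier ones.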
[0093] When the dot product operation is performed by using the chip 100, the chip 100 loads the elements included in the vector C into data units in the first buffer 120, and loads the elements included in the vector D into data units in the second buffer 140. When data in the vector C is loaded into the first buffer 120 and data in the vector D is loaded into the second buffer 140, data in a 1st column to a K-th column in the vector C is sequentially stored in K data units in a 1st row in the first buffer 120, and data in a 1st row to a K-th row in the vector D is sequentially stored in K data units in a 1st row in the second buffer 140. Data in a (K+1)-th column to a 2K-th column in the vector C is sequentially stored in K data units in a 2nd row in the first buffer 120, and data in a (K+1)-th row to a 2K-th row in the vector D is sequentially stored in K data units in a 2nd row in the second buffer 140. The rest may be deduced by analogy until data in a ((W−1)K+1)-th column to a WK-th column in the vector C is sequentially stored in K data units in a W-th row in the first buffer 120, and data in a ((W−1)K+1)-th row to a WK-th row in the vector D is sequentially stored in K data units in a W-th row in the second buffer 140.
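The row-major loading pattern above amounts to reshaping a length-W*K vector into W rows of K data units. A minimal sketch, modeling the buffer as a nested list:

```python
def load_buffer(vec, W, K):
    """Store a length-W*K vector row by row: elements 1..K go to row 1,
    elements K+1..2K to row 2, and so on, as described in paragraph [0093]."""
    assert len(vec) == W * K
    return [vec[r * K:(r + 1) * K] for r in range(W)]

buf = load_buffer([1, 2, 3, 4, 5, 6, 7, 8], W=2, K=4)
print(buf)  # -> [[1, 2, 3, 4], [5, 6, 7, 8]]
```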
[0094] After W*K pieces of data in the vector C are stored in the first buffer 120, the first buffer 120 generates a corresponding bitmap based on the value of the element stored in each data unit. If the value of an element in a data unit is not 0, the corresponding bit of the data unit in the bitmap is set to 1; or if the value of an element in a data unit is 0, the corresponding bit of the data unit in the bitmap is set to 0.
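The bitmap rule in paragraph [0094] — bit 1 for a non-zero element, bit 0 otherwise — is one comparison per data unit. An illustrative sketch:

```python
def make_bitmap(buffer_rows):
    """One bit per data unit: 1 where the buffered element is non-zero,
    0 where it is zero (the rule of paragraph [0094])."""
    return [[1 if v != 0 else 0 for v in row] for row in buffer_rows]

print(make_bitmap([[0, 3, 0], [5, 0, 1]]))  # -> [[0, 1, 0], [1, 0, 1]]
```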
[0095] After generating the bitmap, the first buffer 120 sends the bitmap to the scheduling module 160. After receiving the bitmap, the scheduling module 160 needs to first generate a selection signal DS_1 of the 1st multi-path multiplexer in the first multiplexer 130. Specifically, the scheduling module 160 first determines, based on priorities of a plurality of data units connected to the 1st multi-path multiplexer in the first multiplexer 130, whether a value of a bit corresponding to a data unit with a priority of 1 in the data units connected to the 1st multi-path multiplexer in the first multiplexer 130 in the bitmap is 0. If the value of the bit corresponding to the data unit with the priority of 1 in the bitmap is not 0, the scheduling module 160 generates a selection signal 000, and sends the selection signal 000 to the 1st multi-path multiplexer in the first multiplexer 130 and the 1st multi-path multiplexer in the second multiplexer 150. The selection signal 000 is used to cause the 1st multi-path multiplexer in the first multiplexer 130 to read the element in the data unit with the priority of 1 and send the element to the processing element 110, and cause the 1st multi-path multiplexer in the second multiplexer 150 to read the element in the data unit with the priority of 1 and send the element to the processing element 110.
[0096] If the scheduling module 160 determines that the value of the bit corresponding to the data unit with the priority of 1 in the data units connected to the 1st multi-path multiplexer in the first multiplexer 130 in the bitmap is 0, the scheduling module 160 further determines whether a value of a bit corresponding to a data unit with a priority of 2 in the data units connected to the 1st multi-path multiplexer in the first multiplexer 130 in the bitmap is 0. If the value of the bit corresponding to the data unit with the priority of 2 in the bitmap is not 0, the scheduling module 160 generates a selection signal 001, and sends the selection signal 001 to the 1st multi-path multiplexer in the first multiplexer 130 and the 1st multi-path multiplexer in the second multiplexer 150. The selection signal 001 is used to cause the 1st multi-path multiplexer in the first multiplexer 130 to read the element in the data unit with the priority of 2 and send the element to the processing element 110, and cause the 1st multi-path multiplexer in the second multiplexer 150 to read the element in the data unit with the priority of 2 and send the element to the processing element 110.
[0097] If the scheduling module 160 determines that the value of the bit corresponding to the data unit with the priority of 2 in the data units connected to the 1st multi-path multiplexer in the first multiplexer 130 in the bitmap is 0, the scheduling module 160 further determines whether a value of a bit corresponding to a data unit with a priority of 3 in the data units connected to the 1st multi-path multiplexer in the first multiplexer 130 in the bitmap is 0. The rest may be deduced by analogy until the 1st multi-path multiplexer in the first multiplexer 130 reads one piece of data c_1 from the first buffer 120 and sends the data c_1 to the processing element 110, and the 1st multi-path multiplexer in the second multiplexer 150 reads one piece of data d_1 from the second buffer 140 and sends the data d_1 to the processing element 110, to cause the processing element 110 to perform an operation of c_1*d_1. It should be noted that, if non-zero data is present in the elements buffered by the plurality of data units connected to the 1st multi-path multiplexer in the first multiplexer 130, the value of c_1 is not 0, and d_1 may be 0 or may not be 0. If all data buffered by the plurality of data units connected to the 1st multi-path multiplexer in the first multiplexer 130 is 0, the value of c_1 is 0, and d_1 may be 0 or may not be 0.
[0098] For a 2nd to a K-th multi-path multiplexers in the first multiplexer 130, the scheduling module 160 sequentially generates corresponding selection signals DS_2 to DS_K through a same method, so that each multi-path multiplexer in the first multiplexer 130 and the second multiplexer 150 outputs one piece of data to the processing element 110. In a 1st periodicity, each multi-path multiplexer in the first multiplexer 130 and the second multiplexer 150 outputs one piece of data to the processing element 110, so that the processing element 110 completes K times of product operations and K−1 times of addition operations. The K times of product operations and the K−1 times of addition operations are: e_1=c_1*d_1+c_2*d_2+ . . . +c_t*d_t+ . . . +c_K*d_K, where c_t represents data output by a t-th multi-path multiplexer in the first multiplexer 130, d_t represents data output by a t-th multi-path multiplexer in the second multiplexer 150, and t is a positive integer greater than 0 and less than or equal to K.
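The per-periodicity accumulation of paragraph [0098] — K products and K−1 additions over the K element pairs delivered by the multiplexers — can be expressed directly (an illustrative sketch):

```python
def cycle_result(c_vals, d_vals):
    """e = c_1*d_1 + c_2*d_2 + ... + c_K*d_K: K products, K-1 additions,
    as performed by the processing element in one periodicity."""
    assert len(c_vals) == len(d_vals)
    return sum(c * d for c, d in zip(c_vals, d_vals))

print(cycle_result([1, 0, 3], [4, 5, 6]))  # -> 22
```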
[0099] It should be noted that, after the scheduling module 160 generates a selection signal and sends the selection signal to the first multiplexer 130, and the first multiplexer 130 reads one piece of data from a data unit, the scheduling module 160 sets the bit corresponding to the data unit in the bitmap to 0.
[0100] In a 2nd periodicity and each subsequent periodicity, the scheduling module 160 continues to perform the operations performed in the 1st periodicity, so that each multi-path multiplexer in the first multiplexer 130 and the second multiplexer 150 outputs one piece of data to the processing element 110, and the processing element 110 completes K times of product operations and K−1 times of addition operations, until the values of all bits in the bitmap are 0. Finally, the values obtained after the processing element 110 completes the K times of product operations and K−1 times of addition operations in each periodicity are added, and a dot product of the vector C and the vector D is obtained.
[0101] Because one buffer includes W*K data units and a multiplexer may read data in K data units to participate in computing in one periodicity, the dot product operation between the vector C and the vector D can be completed within a maximum of W periodicities. If the vector C has a certain sparsity degree, in other words, if some elements whose values are 0 are present in the vector C, even if the sparsity degree of the vector C that is input to the buffer changes each time, an element whose value is 0 in the vector C can be excluded from computing by using the chip provided above, to reduce a computing amount and improve computing efficiency without reducing computing precision.
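Putting the pieces together, a behavioral model of the whole dot-product flow can check the two properties stated above: the result matches a dense dot product, and at most W periodicities are needed. The column-per-multiplexer wiring and row-order priority below are assumptions made for the sketch; the actual connection pattern between multiplexers and data units is implementation-specific.

```python
def sparse_dot(c, d, W, K):
    """Behavioral model of chip 100: skip zero elements of c and stop when
    the bitmap is all zero. Assumes the j-th multi-path multiplexer is wired
    to the j-th column of the buffer, with row order as the priority order."""
    assert len(c) == len(d) == W * K
    cbuf = [c[r * K:(r + 1) * K] for r in range(W)]   # load first buffer
    dbuf = [d[r * K:(r + 1) * K] for r in range(W)]   # load second buffer
    bitmap = [[1 if v != 0 else 0 for v in row] for row in cbuf]
    total, cycles = 0, 0
    while any(any(row) for row in bitmap):            # until all bits are 0
        cycles += 1
        for j in range(K):                            # one signal per mux
            for r in range(W):                        # priority scan
                if bitmap[r][j]:
                    bitmap[r][j] = 0                  # clear after the read
                    total += cbuf[r][j] * dbuf[r][j]
                    break
    return total, cycles

c = [0, 2, 0, 0, 1, 0, 3, 0, 0, 0, 0, 4, 5, 0, 0, 0]  # sparse vector C
d = list(range(1, 17))                                 # vector D
total, cycles = sparse_dot(c, d, W=4, K=4)
print(total, cycles)  # -> 143 2 (the dense dot product, in 2 periodicities)
```

With only five non-zero elements in C, the model finishes in two periodicities instead of the four a dense scan of the W rows would take.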
[0102] It should be noted that, after the scheduling module 160 determines that all values in the bitmap are 0, the scheduling module 160 generates an erasing signal, and sends the erasing signal to the first buffer 120 and the second buffer 140, so that the first buffer 120 and the second buffer 140 erase currently buffered data, to help buffer a next batch of data.
[0103] It should be understood that the vector C may be a part of a vector X, and the vector D may be a part of a vector Y. For example, the vector X is a 1*Z vector, and the vector Y is a Z*1 vector, where Z is greater than W*K. Because one buffer in the chip 100 can store only W*K pieces of data each time, the vector X and the vector Y are segmented, and a maximum of W*K elements are stored in the buffer of the chip 100 for computing each time. The vector C may be a row vector with one row and W*K columns, the vector D may be any column in a matrix with W*K rows and T columns, and an operation result of the vector C and the matrix is a vector with one row and T columns. In a process of performing an operation between the vector C and the matrix, one column in the matrix is buffered in the second buffer 140 each time, to obtain one element in the operation result.
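Segmenting long vectors into W*K-element chunks, as described in paragraph [0103], can be sketched as follows. A dense per-segment dot product (skipping zeros of x) stands in for the buffered hardware flow; the tail segment is simply shorter, which does not change the result.

```python
def segmented_dot(x, y, W, K):
    """Split x and y into W*K-element segments (each segment fits one load
    of the buffers), process one segment at a time, and accumulate."""
    assert len(x) == len(y)
    step = W * K
    total = 0
    for s in range(0, len(x), step):
        cx, cy = x[s:s + step], y[s:s + step]
        total += sum(a * b for a, b in zip(cx, cy) if a != 0)  # skip zeros of x
    return total

x = [1, 0, 2, 0, 0, 3, 0, 0, 0, 4]   # Z = 10 > W*K = 4, so three segments
y = [2] * 10
print(segmented_dot(x, y, W=2, K=2))  # -> 20
```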
[0104]
[0105] In this embodiment of this application, the chip 200 can implement a multiplication operation between a vector and a matrix. That the chip 200 shown in
[0106]
[0107] After W*K pieces of data in the vector C are stored in the buffer B.sub.0, the buffer B.sub.0 generates a corresponding bitmap. For a method for generating the corresponding bitmap by the buffer B.sub.0, refer to the method for generating the bitmap by the first buffer 120. Details are not described herein again. After generating the bitmap, the buffer B.sub.0 sends the bitmap to the scheduling module 210. After receiving the bitmap, the scheduling module 210 first generates a selection signal DS.sub.1 of a 1.sup.st multi-path multiplexer in the multiplexer M.sub.0. For a method for generating the selection signal DS.sub.1 by the scheduling module 210, refer to the method for generating the selection signal DS.sub.1 by the scheduling module 160. Details are not described herein again.
[0108] In this embodiment of this application, after generating the selection signal DS.sub.1, the scheduling module 210 sends the selection signal DS.sub.1 to 1.sup.st multi-path multiplexers in the multiplexers M.sub.0 to M.sub.N. The 1.sup.st multi-path multiplexer in the multiplexer M.sub.0 reads one piece of data based on the selection signal DS.sub.1, and sends the data to the processing elements PE.sub.0 to PE.sub.N. 1.sup.st multi-path multiplexers in multiplexers M.sub.1 to M.sub.N each read one piece of data based on the selection signal DS.sub.1, and send the data to processing elements connected to the multiplexers.
[0109] For the 2.sup.nd to the K.sup.th multi-path multiplexers in the multiplexers M.sub.0 to M.sub.N, the scheduling module 210 sequentially generates corresponding selection signals DS.sub.2 to DS.sub.K by using the same method, so that each multi-path multiplexer in the multiplexers M.sub.0 to M.sub.N outputs one piece of data to the processing elements PE.sub.0 to PE.sub.N. After the scheduling module 210 sequentially generates the selection signals DS.sub.1 to DS.sub.K in one periodicity, each processing element obtains K pairs of data, and completes K times of product operations and K−1 times of addition operations.
[0110] It should be noted that, after the scheduling module 210 generates a selection signal and sends the selection signal to the multiplexer M.sub.0, and the multiplexer M.sub.0 reads one piece of data from a data unit, the scheduling module 210 sets a corresponding position of the data unit in the bitmap to 0.
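The scheduling steps in [0107] to [0110] can be sketched as a bitmap scan. In this hypothetical software model (the function name `generate_selection_signals` and the list-of-indices encoding of the selection signals are assumptions), each selection signal DS.sub.j is the index of the next set bit in the bitmap, and the scheduler sets the bit to 0 once the corresponding data unit has been read, as described in [0110].

```python
def generate_selection_signals(bitmap, k):
    """Yield up to k selection signals (data-unit indices) for one periodicity.

    Each signal selects the next data unit whose bitmap bit is 1; the bit is
    then cleared so the element is not selected again in later periodicities.
    """
    signals = []
    for i, bit in enumerate(bitmap):
        if bit:
            signals.append(i)   # DS_j selects data unit i
            bitmap[i] = 0       # consumed: corresponding position set to 0
            if len(signals) == k:
                break
    return signals

bitmap = [0, 1, 0, 1, 1, 0]
print(generate_selection_signals(bitmap, k=2))  # [1, 3]
print(generate_selection_signals(bitmap, k=2))  # [4] in the next periodicity
print(bitmap)                                   # all bits 0: scan complete
```

Once the bitmap is all 0, no further selection signals are produced, which corresponds to the termination condition in [0111].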
[0111] In a 2.sup.nd periodicity and each subsequent periodicity, the scheduling module 210 continues to perform the operations performed in the first periodicity, so that each multi-path multiplexer in the multiplexers M.sub.0 to M.sub.N outputs one piece of data to the processing elements PE.sub.0 to PE.sub.N, and the processing elements PE.sub.0 to PE.sub.N complete K times of product operations and K−1 times of addition operations, until values of all bits in the bitmap are all 0. For any processing element PE.sub.h, values obtained after K times of product operations and K−1 times of addition operations are completed in each periodicity are added, that is, a dot product of the vector C and a vector D.sub.h is obtained. h is a positive integer greater than or equal to 1 and less than or equal to N.
[0112] A multiplication operation result of the vector C and the matrix B is a 1*N vector H. A value output by the processing element PE.sub.h after the dot product operation between the vector C and the vector D.sub.h is completed is a value of an h.sup.th element in the vector H.
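The vector-matrix flow in [0108] to [0112] can be sketched as follows. In this illustrative Python model (the name `sparse_vec_mat` and the nested-list matrix layout are assumptions), one shared set of selection signals, derived from the bitmap of C, drives all multiplexers, so every processing element PE.sub.h applies the same non-zero positions of C to its own column D.sub.h of the matrix B.

```python
def sparse_vec_mat(c, b):
    """Multiply a 1*Z vector C by a Z*N matrix B, skipping zeros of C."""
    n = len(b[0])
    h_out = [0] * n                                # result vector H (1*N)
    nz = [i for i, ci in enumerate(c) if ci != 0]  # shared selection signals
    for i in nz:                                   # broadcast DS to every PE
        for h in range(n):
            h_out[h] += c[i] * b[i][h]             # PE_h accumulates c_i * D_h[i]
    return h_out

c = [2, 0, 3]
b = [[1, 4],
     [9, 9],   # this row is never read: c[1] == 0
     [5, 6]]
print(sparse_vec_mat(c, b))  # [2*1 + 3*5, 2*4 + 3*6] = [17, 26]
```

Note that the row of B aligned with the zero element of C is never touched, mirroring how the chip excludes zero elements of C from every column's dot product at once.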
[0113] It should be understood that the vector C may be a part of a vector X, or may be a part or all of elements in a row in a matrix E. The matrix B may be a part of a matrix F. For example, the vector X is a 1*Z vector, and the matrix F is a Z*N matrix, where Z is greater than W*K. Because one buffer in the chip 200 can store only W*K pieces of data each time, the vector X and the matrix F are segmented, W*K elements in the vector X are stored in the buffer B.sub.0 of the chip 200 each time, and W*K elements in an h.sup.th column in the matrix F are stored in a buffer B.sub.h of the chip 200. In other words, W*K elements in a 1.sup.st column to an N.sup.th column in the matrix F are distributed and stored in the buffers B.sub.1 to B.sub.N of the chip 200.
[0114]
[0115] In this embodiment of this application, the chip 300 can implement a multiplication operation between matrices. That the chip 300 shown in
[0116] The chip 200 shown in
[0117] It should be noted that, after obtaining elements in the matrix A, each buffer in buffers B.sub.10, B.sub.20, . . . , B.sub.g0, . . . , B.sub.M0 in the chip 300 generates a bitmap corresponding to data buffered by each buffer, and sends the bitmap to a scheduling module connected to each buffer. For example, B.sub.10 generates a bitmap 1 and sends the bitmap 1 to the scheduling module S.sub.1, B.sub.20 generates a bitmap 2 and sends the bitmap 2 to the scheduling module S.sub.2, B.sub.g0 generates a bitmap g and sends the bitmap g to the scheduling module S.sub.g, and so on.
[0118] It should be understood that the matrix A may be a part of a matrix G, and the matrix B may be a part of a matrix F. For example, the matrix A is an M*Z matrix, and the matrix F is a Z*N matrix, where Z is greater than W*K. Because one buffer in the chip 300 can store only W*K pieces of data each time, the matrix G and the matrix F are segmented, W*K elements in a g.sup.th row in the matrix G are stored in the buffer B.sub.g0 of the chip 300 each time, and W*K elements in an h.sup.th column in the matrix F are stored in a buffer B.sub.1h of the chip 300.
[0119] By using the foregoing chip 300, in a process of performing the multiplication operation between matrices, an element whose value is 0 in the matrix may be excluded from computing, to reduce the computing amount and improve computing efficiency without reducing computing precision.
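The matrix-matrix flow of the chip 300 can be sketched as follows. In this illustrative Python model (the name `sparse_mat_mat` is an assumption), each row g of the matrix A has its own bitmap, corresponding to scheduling module S.sub.g, so rows with different sparsity degrees are scheduled independently while sharing the columns of the matrix B.

```python
def sparse_mat_mat(a, b):
    """Multiply an M*Z matrix A by a Z*N matrix B, skipping zeros of A."""
    m, n = len(a), len(b[0])
    out = [[0] * n for _ in range(m)]
    for g, row in enumerate(a):
        bitmap = [1 if x != 0 else 0 for x in row]  # bitmap g for scheduler S_g
        for i, bit in enumerate(bitmap):
            if bit:                                  # zero elements of row g skipped
                for h in range(n):
                    out[g][h] += row[i] * b[i][h]    # PE for (g, h) accumulates
    return out

a = [[1, 0],
     [0, 2]]
b = [[3, 4],
     [5, 6]]
print(sparse_mat_mat(a, b))  # [[3, 4], [10, 12]]
```

Each row contributes only as many product operations as it has non-zero elements, which is the per-row computing-amount reduction [0119] refers to.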
[0120] With reference to
[0121] S121: A chip buffers a first vector and a second vector.
[0122] The chip may be the chip 100 in
[0123] It should be noted that the first vector may be the vector C in the embodiment shown in
[0124] S122: A first scheduling module of the chip generates a first selection signal based on a bitmap of the first vector.
[0125] The bitmap of the first vector indicates a non-zero element in the first vector. For the bitmap of the vector, refer to the related descriptions corresponding to
[0126] In this embodiment of this application, the chip 100 further includes a first multiplexer and a second multiplexer. The first processing element obtains, from the first buffer, the first group of non-zero elements in the first vector by using the first multiplexer, and obtains, from the second buffer, the second group of elements in the second vector by using the second multiplexer. The first scheduling module may be the scheduling module 160 in
[0127] S123: The first processing element of the chip implements an operation between the first vector and the second vector based on the first group of non-zero elements and the second group of elements.
[0128] The first processing element may be the processing element 110 in
[0129] In this embodiment of this application, the chip may further include a third buffer and a second processing element. The third buffer is configured to buffer a third vector. The first selection signal may further cause the second processing element to obtain, from the third buffer, a third group of elements in the third vector, and cause the second processing element to obtain, from the first buffer, the first group of non-zero elements in the first vector. The second processing element implements an operation between the first vector and the third vector based on the first group of non-zero elements in the first vector and the third group of elements in the third vector. The third vector belongs to a column in the second matrix other than a column in which the second vector is located.
[0130] In a possible implementation, the first buffer may be the buffer B.sub.0 in
[0131] In this embodiment of this application, in addition to the first buffer, the second buffer, the first scheduling module, the first processing element, the third buffer, and the second processing element, the chip may further include a fourth buffer, a second scheduling module, and a third processing element. The fourth buffer is configured to buffer a fourth vector, where the fourth vector belongs to a row in the first matrix other than a row in which the first vector is located. The second scheduling module generates a second selection signal based on a bitmap of the fourth vector. The second selection signal may cause the third processing element to obtain, from the fourth buffer, a fourth group of non-zero elements in the fourth vector, and cause the third processing element to obtain, from the second buffer, a fifth group of elements in the second vector. The third processing element can implement an operation between the fourth vector and the second vector based on the fourth group of non-zero elements in the fourth vector and the fifth group of elements in the second vector. The bitmap of the fourth vector indicates a non-zero element in the fourth vector.
[0132] In a possible implementation, the first buffer may be a buffer B.sub.10 in
[0133] For brief description, the foregoing method embodiments are all described as a combination of a series of actions. However, a person skilled in the art should understand that the present invention is not limited to the described action sequence. In addition, a person skilled in the art should also understand that all embodiments described in this specification are preferred embodiments, and the related actions are not necessarily mandatory to the present invention.
[0134] Another appropriate step combination that a person skilled in the art can think of based on the content described above also falls within the protection scope of the present invention. In addition, a person skilled in the art should also understand that all embodiments described in this specification are preferred embodiments, and the related actions are not necessarily mandatory to the present invention.
[0135] The foregoing describes in detail the chips and the methods for performing matrix computing based on the chip provided in embodiments of this application with reference to
[0136] In a possible implementation, after the first processing unit 133 implements the operation between the first vector and the second vector based on the first group of non-zero elements and the second group of elements, the first scheduling unit 132 generates an erasing signal, where the erasing signal indicates the first buffer and the second buffer to erase currently buffered data.
[0137] In a possible implementation, the first vector belongs to a part or all of elements in any row in a first matrix, and the second vector belongs to a part or all of elements in any column in a second matrix.
[0138] In a possible implementation, the matrix computing apparatus further includes a second processing unit 134, and the first selection signal may further cause the second processing unit 134 to obtain a third group of elements in a third vector, and cause the second processing unit 134 to obtain the first group of non-zero elements in the first vector. The second processing unit 134 is configured to implement an operation between the first vector and the third vector based on the first group of non-zero elements and the third group of elements. The third vector belongs to a column in the second matrix other than a column in which the second vector is located. Specifically, when the matrix computing apparatus includes the second processing unit 134, the matrix computing apparatus 131 may be the chip 200 shown in FIG. 8 or
[0139] In a possible implementation, the matrix computing apparatus further includes a second scheduling unit 135 and a third processing unit 136. The second scheduling unit 135 generates a second selection signal based on a bitmap of a fourth vector. The second selection signal is used to cause the third processing unit 136 to obtain a fourth group of non-zero elements in the fourth vector, and cause the third processing unit 136 to obtain a fifth group of elements in the second vector. The bitmap of the fourth vector indicates a non-zero element in the fourth vector. The third processing unit 136 is configured to implement an operation between the fourth vector and the second vector based on the fourth group of non-zero elements and the fifth group of elements.
[0140] Specifically, when the matrix computing apparatus further includes the second processing unit 134, the second scheduling unit 135, and the third processing unit 136, the matrix computing apparatus 131 may be the chip 300 shown in
[0141]
[0142] The chip 143 may be any one of the chip 100, the chip 200, or the chip 300, and can assist the computing device 141 in implementing various functions implemented by the chip 100, the chip 200, or the chip 300.
[0143] The chip 143 can implement, under scheduling of the processor 142, the operations in embodiments corresponding to
[0144] The memory 144 may be a nonvolatile memory, for example, a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The memory 144 may alternatively be a volatile memory. The volatile memory may be a random access memory (RAM), and is used as an external cache. By way of example but not limitation, many forms of RAMs may be used, for example, a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM), and a direct rambus dynamic random access memory (DR RAM).
[0145] The memory 144 may be configured to store program code and data, for example, buffer the foregoing vector or matrix, so that the chip 143 invokes the program code stored in the memory 144 to perform the operation steps in embodiments corresponding to
[0146] The communication interface 145 may be a wired interface (for example, an Ethernet interface), an internal interface (for example, a peripheral component interconnect express (PCIE) bus interface), or a wireless interface (for example, a cellular network interface or a wireless local area network interface). The communication interface 145 is configured to communicate with another computing device or module.
[0147] The bus 146 is a peripheral component interconnect express (PCIE) bus, an extended industry standard architecture (EISA) bus, a unified bus (U bus, or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX) bus, or the like. The bus 146 includes an out-of-band bus, a high-speed bus, and the like. For clear description, various buses are marked as the bus 146 in the figure.
[0148] It should be noted that
[0149] An embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium stores computer instructions. When the computer instructions are run on a computing device, the computing device is caused to perform the operations in embodiments corresponding to
[0150] An embodiment of this application provides a computer program product including a computer program or instructions. When the computer program or the instructions are run on a computing device, the computing device is caused to perform the operations in embodiments corresponding to
[0151] All or a part of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used to implement embodiments, all or a part of the foregoing embodiments may be implemented in a form of a computer program product. The computer program product includes at least one computer instruction. When the computer instruction is loaded or executed on a computer, the procedures or functions according to embodiments of this application are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instruction may be stored in a computer-readable storage medium, or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instruction may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by the computer, or a data storage node, such as a server or a data center that integrates at least one usable medium. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk drive, or a magnetic tape), an optical medium (for example, a high density digital video disc (DVD)), or a semiconductor medium.
[0152] The foregoing descriptions are merely specific embodiments of the present invention, but are not intended to limit the protection scope of the present invention. Any modification or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.