COMPUTING SYSTEM AND METHOD FOR POWER-SAVING COMPUTE-IN-MEMORY DESIGN

20250130950 · 2025-04-24

    Inventors

    Cpc classification

    International classification

    Abstract

    A computing system with power-saving compute-in-memory (CIM) design that minimizes the computation energy of the matrix-matrix multiplication is shown. A processor control unit loads A blocks divided from a matrix A.sub.MK from a second-level (L2) memory to a first-level (L1) memory, and loads B blocks divided from a matrix B.sub.KN from the L2 memory to a CIM memory. The A blocks buffered in the L1 memory are programmed to a register file to be entered into the CIM memory. The CIM memory performs multiply-and-accumulate (MAC) calculations on the A blocks and the B blocks to generate C blocks which form a matrix C.sub.MN (=A.sub.MK×B.sub.KN). Based on the size of A.sub.MK and B.sub.KN, an A block buffering capability of the L1 memory, and a B block buffering capability of the CIM memory, the reuse scheme is properly selected to reuse the buffered A blocks and B blocks.

    Claims

    1. A computing system with a compute-in-memory (CIM) design, comprising: a CIM processor, including a processor control unit, a CIM memory with CIM capability, and a register file; and a two-level memory system coupled to the CIM processor, wherein the two-level memory system includes a first-level (L1) memory and a second-level (L2) memory; wherein: the processor control unit loads A blocks divided from a matrix A.sub.MK from the L2 memory to the L1 memory, and loads B blocks divided from a matrix B.sub.KN from the L2 memory to the CIM memory; the processor control unit further programs the A blocks buffered in the L1 memory to the register file to be entered into the CIM memory, wherein the CIM memory performs multiply-and-accumulate (MAC) calculations on the A blocks and the B blocks to generate C blocks which form a matrix C.sub.MN that is A.sub.MK×B.sub.KN; and based on the size of the matrix A.sub.MK, the size of the matrix B.sub.KN, an A block buffering capability, μ, of the L1 memory, and a B block buffering capability, L, of the CIM memory, the CIM processor selects a reuse scheme to reuse the A blocks buffered in the L1 memory and the B blocks buffered in the CIM memory.

    2. The computing system as claimed in claim 1, wherein: the number of A blocks divided from the matrix A.sub.MK is M.sub.b×K.sub.b, wherein M.sub.b=M/m, and K.sub.b=K/k; the number of B blocks divided from the matrix B.sub.KN is K.sub.b×N.sub.b, wherein N.sub.b=N/n; and M.sub.b, K.sub.b, and N.sub.b representing the size of the matrix A.sub.MK and the matrix B.sub.KN are used in selecting the reuse scheme.

    3. The computing system as claimed in claim 2, wherein: m=n=m.sub.c, and k=αm.sub.c, where α is greater than 0.

    4. The computing system as claimed in claim 3, wherein: in response to a situation wherein M.sub.b>1 and N.sub.b>1, the CIM processor judges K.sub.b to select the reuse scheme based on a threshold function T(μ), wherein the threshold function T(μ) is a function of the A block buffering capability, μ, of the L1 memory and s.sub.s, the sparsity of the matrix B.sub.KN, where 1≤μ≤K.sub.b, and 0≤s.sub.s<1.

    5. The computing system as claimed in claim 4, wherein: T(μ)=μ(1-s.sub.s), or T(μ)=T(μ, M.sub.b, N.sub.b)=μ(1-s.sub.s)·M.sub.b(N.sub.b-1)/(N.sub.b(M.sub.b-1)), which further depends on M.sub.b and N.sub.b.

    6. The computing system as claimed in claim 5, wherein: when min(L.sub.max, K.sub.b)<T(μ), the CIM processor sets the B block buffering capability, L, of the CIM memory to min(2, K.sub.b), and selects a first reuse scheme first_reuse_A, where L.sub.max is an upper limit to which the CIM memory buffers the B blocks and, according to the first reuse scheme first_reuse_A, the A blocks buffered in the L1 memory are reused even if the CIM memory has updated the B blocks buffered therein.

    7. The computing system as claimed in claim 5, wherein: when min(L.sub.max, K.sub.b)≥T(μ), the CIM processor sets the B block buffering capability, L, of the CIM memory to min(L.sub.max, K.sub.b), and selects a second reuse scheme first_reuse_B, where L.sub.max is an upper limit to which the CIM memory buffers the B blocks and, according to the second reuse scheme first_reuse_B, the B blocks buffered in the CIM memory are reused even if the L1 memory has updated the A blocks buffered therein.

    8. The computing system as claimed in claim 3, wherein: in response to a situation wherein N.sub.b>M.sub.b=1, the CIM processor sets the B block buffering capability, L, of the CIM memory to L=min(L.sub.max, K.sub.b), and selects a first reuse scheme first_reuse_A, by which the A blocks buffered in the L1 memory are reused even if the CIM memory has updated the B blocks buffered therein.

    9. The computing system as claimed in claim 3, wherein: in response to a situation wherein M.sub.b>N.sub.b=1, the CIM processor sets the B block buffering capability, L, of the CIM memory to L=min(L.sub.max, K.sub.b), and selects a second reuse scheme first_reuse_B, by which the B blocks buffered in the CIM memory are reused even if the L1 memory has updated the A blocks buffered therein.

    10. The computing system as claimed in claim 3, wherein: in response to a situation wherein M.sub.b=N.sub.b=1, the CIM processor sets the B block buffering capability, L, of the CIM memory to L=min(L.sub.max, K.sub.b), without changing the previously selected reuse scheme.

    11. A method for power saving of a compute-in-memory (CIM) design, comprising: loading A blocks divided from a matrix A.sub.MK from a second-level (L2) memory to a first-level (L1) memory; loading B blocks divided from a matrix B.sub.KN from the L2 memory to a CIM memory; programming the A blocks buffered in the L1 memory to a register file to be entered into the CIM memory, wherein the CIM memory performs multiply-and-accumulate (MAC) calculations on the A blocks and the B blocks to generate C blocks which form a matrix C.sub.MN that is A.sub.MK×B.sub.KN; and based on the size of the matrix A.sub.MK, the size of the matrix B.sub.KN, an A block buffering capability, μ, of the L1 memory, and a B block buffering capability, L, of the CIM memory, selecting a reuse scheme to reuse the A blocks buffered in the L1 memory and the B blocks buffered in the CIM memory.

    12. The method as claimed in claim 11, wherein: the number of A blocks divided from the matrix A.sub.MK is M.sub.b×K.sub.b, wherein M.sub.b=M/m, and K.sub.b=K/k; the number of B blocks divided from the matrix B.sub.KN is K.sub.b×N.sub.b, wherein N.sub.b=N/n; and M.sub.b, K.sub.b, and N.sub.b representing the size of the matrix A.sub.MK and the matrix B.sub.KN are used in selecting the reuse scheme.

    13. The method as claimed in claim 12, wherein: m=n=m.sub.c, and k=αm.sub.c, where α is greater than 0.

    14. The method as claimed in claim 13, further comprising: in response to a situation wherein M.sub.b>1 and N.sub.b>1, judging K.sub.b to select the reuse scheme based on a threshold function T(μ), wherein the threshold function T(μ) is a function of the A block buffering capability, μ, of the L1 memory, and s.sub.s, the sparsity of the matrix B.sub.KN, where 1≤μ≤K.sub.b, and 0≤s.sub.s<1.

    15. The method as claimed in claim 14, wherein: T(μ)=μ(1-s.sub.s), or T(μ)=T(μ, M.sub.b, N.sub.b)=μ(1-s.sub.s)·M.sub.b(N.sub.b-1)/(N.sub.b(M.sub.b-1)), which further depends on M.sub.b and N.sub.b.

    16. The method as claimed in claim 15, further comprising: when min(L.sub.max, K.sub.b)<T(μ), setting the B block buffering capability, L, of the CIM memory to min(2, K.sub.b), and selecting a first reuse scheme first_reuse_A, where L.sub.max is an upper limit to which the CIM memory buffers the B blocks and, according to the first reuse scheme first_reuse_A, the A blocks buffered in the L1 memory are reused even if the CIM memory has updated the B blocks buffered therein.

    17. The method as claimed in claim 15, further comprising: when min(L.sub.max, K.sub.b)≥T(μ), setting the B block buffering capability, L, of the CIM memory to min(L.sub.max, K.sub.b), and selecting a second reuse scheme first_reuse_B, where L.sub.max is an upper limit to which the CIM memory buffers the B blocks and, according to the second reuse scheme first_reuse_B, the B blocks buffered in the CIM memory are reused even if the L1 memory has updated the A blocks buffered therein.

    18. The method as claimed in claim 13, further comprising: in response to a situation wherein N.sub.b>M.sub.b=1, setting the B block buffering capability, L, of the CIM memory to L=min(L.sub.max, K.sub.b), and selecting a first reuse scheme first_reuse_A, by which the A blocks buffered in the L1 memory are reused even if the CIM memory has updated the B blocks buffered therein.

    19. The method as claimed in claim 13, further comprising: in response to a situation wherein M.sub.b>N.sub.b=1, setting the B block buffering capability, L, of the CIM memory to L=min(L.sub.max, K.sub.b), and selecting a second reuse scheme first_reuse_B, by which the B blocks buffered in the CIM memory are reused even if the L1 memory has updated the A blocks buffered therein.

    20. The method as claimed in claim 13, further comprising: in response to a situation wherein M.sub.b=N.sub.b=1, setting the B block buffering capability, L, of the CIM memory to L=min(L.sub.max, K.sub.b), without changing the previously selected reuse scheme.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0016] The present invention can be more fully understood by reading the subsequent detailed description and examples with references made to the accompanying drawings, wherein:

    [0017] FIG. 1 depicts the concept of the CAKE algorithm;

    [0018] FIG. 2A depicts the concept of a first reuse scheme first_reuse_A;

    [0019] FIG. 2B depicts the concept of a second reuse scheme first_reuse_B;

    [0020] FIG. 3 depicts a computing system 300 in accordance with an exemplary embodiment of the present invention; and

    [0021] FIG. 4 is a flow diagram about a method for reuse scheme selection in accordance with an exemplary embodiment of the present invention.

    DETAILED DESCRIPTION OF THE INVENTION

    [0022] The following description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims.

    [0023] FIG. 3 depicts a computing system 300 in accordance with an exemplary embodiment of the present invention. The computing system 300 may be any computing device, including an edge device (e.g., a smart phone or any computing device near the network's edge). The computing system 300 includes a compute-in-memory (CIM) processor 302 that has a processor control unit 304, a CIM memory 306, and a register file 308 coupled to the CIM memory 306. The computing system 300 further includes a two-level memory system (e.g., a cache system) that includes a first-level memory (a.k.a. L1 memory) 310, a second-level memory (a.k.a. L2 memory) 312, an L1 memory controller 314, and an L2 memory controller 316. The L2 memory controller 316 controls the L2 memory 312 and communicates with the L1 memory controller 314. The L1 memory controller 314 controls the L1 memory 310 and further communicates with the CIM processor 302. The computing system 300 further includes a computer readable medium 318 that stores a program implementing the novel CAKE algorithm introduced in this disclosure. The program instructions are read and cached in an instruction cache (not shown in the figure) to be executed by the processor control unit 304, which loads data into the CIM memory 306 to complete a matrix-matrix multiplication.

    [0024] The matrix-matrix multiplication is C.sub.MN=A.sub.MK×B.sub.KN. The matrix-matrix multiplication (MM) space (referring to the middle cube presented in FIG. 1) representing the matrix-matrix multiplication C.sub.MN=A.sub.MK×B.sub.KN is partitioned into MM space units. Each MM space unit has dimensions m×k×n (referring to the left picture of FIG. 1). The matrix A.sub.MK is divided into M.sub.b×K.sub.b blocks (2D blocks, named A blocks), where M.sub.b is M/m, and K.sub.b is K/k. The matrix B.sub.KN is divided into K.sub.b×N.sub.b blocks (2D blocks, named B blocks), where N.sub.b is N/n. The matrix C.sub.MN is formed by M.sub.b×N.sub.b blocks (2D blocks, named C blocks). Each C block (C.sub.b) is obtained from the multiply-and-accumulate (MAC) calculation of its corresponding A blocks (A.sub.b) and B blocks (B.sub.b).
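    The block partition described above can be sketched in ordinary Python. This is an illustrative sketch only, not the patented implementation: the names `matmul`, `block`, and `blocked_matmul` and the list-of-lists matrix representation are assumptions. It shows how each C block accumulates as C.sub.b+=A.sub.b×B.sub.b over the K.sub.b corresponding A/B block pairs.

```python
def matmul(X, Y):
    """Plain dense matrix product of two lists-of-lists."""
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][t] * Y[t][j] for t in range(inner)) for j in range(cols)]
            for i in range(rows)]

def block(X, bi, bj, h, w):
    """Extract the (bi, bj) block of height h and width w from matrix X."""
    return [row[bj * w:(bj + 1) * w] for row in X[bi * h:(bi + 1) * h]]

def blocked_matmul(A, B, m, k, n):
    """Compute C = A x B one C block at a time: C_b += A_b x B_b over K_b steps."""
    M, K, N = len(A), len(B), len(B[0])
    Mb, Kb, Nb = M // m, K // k, N // n
    C = [[0] * N for _ in range(M)]
    for i in range(Mb):
        for j in range(Nb):
            for t in range(Kb):            # accumulate over the K_b A/B block pairs
                Ab = block(A, i, t, m, k)
                Bb = block(B, t, j, k, n)
                Cb = matmul(Ab, Bb)
                for r in range(m):         # C_b += A_b x B_b
                    for c in range(n):
                        C[i * m + r][j * n + c] += Cb[r][c]
    return C
```

The blocked result is identical to the direct product whenever m, k, and n divide M, K, and N, which is the case the partition assumes.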

    [0025] The A blocks and B blocks are stored in the L2 memory 312. The matrix B.sub.KN may have a sparsity s.sub.s (0≤s.sub.s<1) and can be compressed in case its entries are static (e.g., the weights of a neural network layer are static for inference), so that the B blocks may be stored in a compressed format in the L2 memory 312. The sparsity s.sub.s of the matrix B.sub.KN is set to zero to indicate that compression does not apply to the B blocks, even though the matrix B.sub.KN has zero-valued entries. The L2 memory controller 316 reads the A blocks and B blocks (perhaps compressed) from the L2 memory 312, decompresses the B blocks if necessary (i.e., the matrix B.sub.KN has zero-valued entries and compression applies to the B blocks, so that s.sub.s is greater than 0), and passes the A blocks and the decompressed B blocks (decompression is not performed on the B blocks if s.sub.s=0) to the L1 memory controller 314. The L1 memory controller 314 loads the A blocks (A.sub.b) to the L1 memory 310, and loads the B blocks (B.sub.b) to the CIM memory 306. The A block buffering capability of the L1 memory 310 is μ, where 1≤μ≤K.sub.b. It means that the number of A blocks buffered in the L1 memory 310 can be up to μ. The B block buffering capability of the CIM memory 306 is L, where 1≤L≤K.sub.b. It means that the number of B blocks buffered in the CIM memory 306 can be up to L. For the MAC calculation that generates one C block C.sub.b, the A blocks read from the L1 memory 310 are programmed into the register file 308 to be entered into the CIM memory 306. In the CIM memory 306, the received A blocks are multiplied by the buffered B blocks (small-sized matrix-matrix multiplication) and the products are accumulated to form one C block C.sub.b (where C.sub.b+=A.sub.b×B.sub.b, referring to the right picture in FIG. 1). Through the register file 308, the C block C.sub.b generated by the CIM memory 306 is temporarily stored in the L1 memory 310.
In the background mode of the computing system 300, the C blocks are stored back to the L2 memory 312 through the L1 memory controller 314 and the L2 memory controller 316.

    [0026] In such a computing architecture, the power consumption comes from: reads of the two-level memory (the L1 and L2 memories 310 and 312); multiply-and-accumulate (MAC) computing in the CIM memory 306; and accesses of the register file 308. In this disclosure, the focus is on fully utilizing the A block buffer of the L1 memory 310 and the B block buffer of the CIM memory 306 to suppress the power consumption of reading the L1 memory 310 and the L2 memory 312.

    [0027] The first reuse scheme first_reuse_A (referring to FIG. 2A) is implemented by the A block buffer provided by the L1 memory 310. The second reuse scheme first_reuse_B (referring to FIG. 2B) is implemented by the B block buffer provided by the CIM memory 306. According to the first reuse scheme first_reuse_A, the A blocks buffered in the L1 memory 310 are reused even if the CIM memory 306 has updated the B blocks buffered therein. According to the second reuse scheme first_reuse_B, the B blocks buffered in the CIM memory 306 are reused even if the L1 memory 310 has updated the A blocks buffered therein. By executing the program stored in the computer readable medium 318, an improved CAKE algorithm is applied to select a proper reuse scheme from the first reuse scheme first_reuse_A and the second reuse scheme first_reuse_B, and the power consumption due to reading the two-level memory is considerably suppressed.

    [0028] Different from the conventional technology that selects the reuse scheme based only on the size of the matrix A.sub.MK and the matrix B.sub.KN, the size of the A block buffer provided by the L1 memory 310 (μ) and the size of the B block buffer provided by the CIM memory 306 (L) are also considered in the proposed reuse scheme selection technology.

    [0029] In an exemplary embodiment, the size of an MM space unit (m×k×n) has the following characteristics: m=n=m.sub.c, and k=αm.sub.c (α>0). A variable β denotes the energy consumption of reading the L2 memory 312 (with a constant scaling λ.sub.1). A variable γ denotes the energy consumption of reading the L2 memory 312 and the L1 memory 310 (with the constant scalings λ.sub.1 and λ.sub.2, respectively). The sparsity of the matrix B.sub.KN is s.sub.s (0≤s.sub.s<1). When the first reuse scheme first_reuse_A is adopted, the energy consumption of reading the L1 and L2 memories 310 and 312 may be represented as: (β+γ-γs.sub.s)M.sub.bK.sub.bN.sub.b+γM.sub.b(1-N.sub.b)+γ(1-s.sub.s)L(1-M.sub.b). When the second reuse scheme first_reuse_B is adopted, the energy consumption of reading the L1 and L2 memories 310 and 312 may be represented as: (β+γ-γs.sub.s)M.sub.bK.sub.bN.sub.b+γ(1-s.sub.s)LN.sub.b(1-M.sub.b)+γ(1-N.sub.b). By substituting β=cγ (c>1) and neglecting the common factor γ in the energy consumption comparison, the energy consumption index values of the two different reuse schemes are:

    [00002] f.sub.A(M.sub.b, K.sub.b, N.sub.b)=(c+1-s.sub.s)M.sub.bK.sub.bN.sub.b+M.sub.b(1-N.sub.b)+(1-s.sub.s)L(1-M.sub.b)
    f.sub.B(M.sub.b, K.sub.b, N.sub.b)=(c+1-s.sub.s)M.sub.bK.sub.bN.sub.b+(1-s.sub.s)LN.sub.b(1-M.sub.b)+(1-N.sub.b)

    [0030] If either of M.sub.b and N.sub.b is 1, f.sub.A(M.sub.b, K.sub.b, N.sub.b)=f.sub.B(M.sub.b, K.sub.b, N.sub.b), regardless of the value of L (the B block buffering capability provided by the CIM memory 306). If M.sub.b>1 and N.sub.b>1, the comparison between f.sub.A(M.sub.b, K.sub.b, N.sub.b) and f.sub.B(M.sub.b, K.sub.b, N.sub.b) depends on the magnitudes of M.sub.b, N.sub.b, and L.
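    The two energy consumption index values are plain polynomials and can be evaluated directly. The sketch below is illustrative (the function names and the flat parameter list, with c, s_s, and L passed explicitly, are assumptions); it transcribes f.sub.A and f.sub.B and lets one confirm numerically that the indices coincide whenever M.sub.b or N.sub.b is 1.

```python
def f_A(Mb, Kb, Nb, c, s_s, L):
    # Energy index of first_reuse_A: A blocks stay buffered while B blocks update.
    return (c + 1 - s_s) * Mb * Kb * Nb + Mb * (1 - Nb) + (1 - s_s) * L * (1 - Mb)

def f_B(Mb, Kb, Nb, c, s_s, L):
    # Energy index of first_reuse_B: B blocks stay buffered while A blocks update.
    return (c + 1 - s_s) * Mb * Kb * Nb + (1 - s_s) * L * Nb * (1 - Mb) + (1 - Nb)
```

For M.sub.b=1 the L-dependent terms of both indices vanish or coincide, and for N.sub.b=1 the remaining terms match, which is exactly the degenerate case noted above.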

    [0031] In an exemplary embodiment, a threshold function T(μ) is proposed:

    [00003] T(μ)=μ(1-s.sub.s)

    [0032] The threshold function T(μ), as a function of μ (the A block buffering capability of the L1 memory 310) and s.sub.s (0≤s.sub.s<1, the sparsity of the matrix B.sub.KN), is applied to judge the value of K.sub.b when M.sub.b>1 and N.sub.b>1. In another exemplary embodiment, the threshold function is:

    [00004] T(μ)=T(μ, M.sub.b, N.sub.b)=μ(1-s.sub.s)·M.sub.b(N.sub.b-1)/(N.sub.b(M.sub.b-1))

    which further depends on M.sub.b and N.sub.b.
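    Both threshold variants are simple closed forms. A minimal sketch (the names `T_simple` and `T_refined` are hypothetical; the second form is only meaningful when M.sub.b>1 and N.sub.b>1, since M.sub.b-1 appears in the denominator):

```python
def T_simple(mu, s_s):
    """First variant: T depends only on mu and the sparsity of B."""
    return mu * (1 - s_s)

def T_refined(mu, s_s, Mb, Nb):
    """Second variant: further scaled by M_b(N_b - 1) / (N_b(M_b - 1))."""
    return mu * (1 - s_s) * Mb * (Nb - 1) / (Nb * (Mb - 1))
```

Note that for M.sub.b=N.sub.b the scaling factor of the refined variant is at least 1, so the refined threshold never falls below the simple one in the square case.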

    [0033] When K.sub.b≥T(μ), the B block buffering capability L provided by the CIM memory 306 is set to K.sub.b, and the second reuse scheme first_reuse_B is selected to reuse the B blocks buffered in the CIM memory 306 even if the L1 memory 310 has updated the A blocks buffered therein. According to the second reuse scheme first_reuse_B, every K.sub.b B blocks (e.g., a column of B blocks buffered in the CIM memory 306) are reused until the column of B blocks has been processed with all M.sub.b rows of A blocks for matrix-matrix multiplication.

    [0034] When K.sub.b<T(μ), the B block buffering capability L provided by the CIM memory 306 is set to 2, and the first reuse scheme first_reuse_A is selected to reuse the A blocks buffered in the L1 memory 310 even if the CIM memory 306 has updated the B blocks buffered therein. According to the first reuse scheme first_reuse_A, every μ A blocks buffered in the L1 memory 310 are reused until they have been processed with all the related B blocks for matrix-matrix multiplication.

    [0035] In the simpler situations, the threshold function T(μ) is not required. When M.sub.b>N.sub.b=1, the B block buffering capability L provided by the CIM memory 306 is set to a non-zero integer greater than or equal to 2, and the second reuse scheme first_reuse_B may be selected to reuse the B blocks buffered in the CIM memory 306. Although the first reuse scheme first_reuse_A results in the same energy consumption index value as the second reuse scheme first_reuse_B, in some exemplary embodiments only the second reuse scheme first_reuse_B is selected in this situation.

    [0036] When N.sub.b>M.sub.b=1, the B block buffering capability L provided by the CIM memory 306 is set to a non-zero integer greater than or equal to 2, and the first reuse scheme first_reuse_A may be selected to reuse the A blocks buffered in the L1 memory 310. Although the second reuse scheme first_reuse_B results in the same energy consumption index value as the first reuse scheme first_reuse_A, in some exemplary embodiments only the first reuse scheme first_reuse_A is selected in this situation.

    [0037] When M.sub.b=N.sub.b=1, the B block buffering capability L provided by the CIM memory 306 is set to a non-zero integer greater than or equal to 2. The first reuse scheme first_reuse_A and the second reuse scheme first_reuse_B result in the same power consumption. The reuse scheme can be kept at the previous setting.

    [0038] Table 1 shows the aforementioned reuse strategy.

    TABLE-US-00001 TABLE 1

    Values of M.sub.b, N.sub.b | Condition | Energy comparison | L setting | Reuse scheme
    M.sub.b = N.sub.b = 1 | Not applied | f.sub.A = f.sub.B | non-zero integer ≥ 2 | first_reuse_A or first_reuse_B
    M.sub.b > N.sub.b = 1 | Not applied | f.sub.A = f.sub.B | non-zero integer ≥ 2 | first_reuse_B
    N.sub.b > M.sub.b = 1 | Not applied | f.sub.A = f.sub.B | non-zero integer ≥ 2 | first_reuse_A
    N.sub.b > 1, M.sub.b > 1 | K.sub.b ≥ T(μ) | f.sub.A ≥ f.sub.B | L = K.sub.b | first_reuse_B
    N.sub.b > 1, M.sub.b > 1 | K.sub.b < T(μ) | f.sub.A < f.sub.B | L = 2 | first_reuse_A

    [0039] If the reuse scheme is selected based only on the matrix size, as taught in the conventional technology, without considering the A block buffering capability, μ, of the L1 memory 310 and the B block buffering capability, L, of the CIM memory 306, the selected reuse scheme may not be the best reuse scheme. In some exemplary embodiments, the hardware limitation, L.sub.max, of the B block buffer provided by the CIM memory 306 is taken into consideration. The upper limit of B blocks buffered in the CIM memory 306 is L.sub.max.

    [0040] When min(L.sub.max, K.sub.b)≥T(μ), the B block buffering capability L provided by the CIM memory 306 is set to min(L.sub.max, K.sub.b), and the second reuse scheme first_reuse_B is selected to reuse the B blocks buffered in the CIM memory 306. When min(L.sub.max, K.sub.b)<T(μ), the B block buffering capability L provided by the CIM memory 306 is set to min(2, K.sub.b), and the first reuse scheme first_reuse_A is selected to reuse the A blocks buffered in the L1 memory 310.

    [0041] Table 2 shows the reuse strategy considering the hardware limitation.

    TABLE-US-00002 TABLE 2

    Values of M.sub.b, N.sub.b | Condition | Energy comparison | L setting | Reuse scheme
    M.sub.b = N.sub.b = 1 | Not applied | f.sub.A = f.sub.B | L = min(L.sub.max, K.sub.b) | first_reuse_A or first_reuse_B
    M.sub.b > N.sub.b = 1 | Not applied | f.sub.A = f.sub.B | L = min(L.sub.max, K.sub.b) | first_reuse_B
    N.sub.b > M.sub.b = 1 | Not applied | f.sub.A = f.sub.B | L = min(L.sub.max, K.sub.b) | first_reuse_A
    N.sub.b > 1, M.sub.b > 1 | min(L.sub.max, K.sub.b) ≥ T(μ) | f.sub.A ≥ f.sub.B | L = min(L.sub.max, K.sub.b) | first_reuse_B
    N.sub.b > 1, M.sub.b > 1 | min(L.sub.max, K.sub.b) < T(μ) | f.sub.A < f.sub.B | L = min(2, K.sub.b) | first_reuse_A

    [0042] In Table 2, when either of M.sub.b and N.sub.b is 1, the B block buffering capability L provided by the CIM memory 306 is set to min(L.sub.max, K.sub.b). For example, the B block buffering capability L provided by the CIM memory 306 may be set to a non-zero integer greater than or equal to 2.

    [0043] FIG. 4 is a flow diagram about a method for reuse scheme selection in accordance with an exemplary embodiment of the present invention.

    [0044] In step S402, the matrix A.sub.MK is divided into blocks A.sub.b, and the matrix B.sub.KN is divided into blocks B.sub.b. Each block A.sub.b is a small matrix of dimensions m×k, which is also named an A block. The total number of the A blocks A.sub.b is M.sub.b×K.sub.b, where M.sub.b is M/m, and K.sub.b is K/k. Each block B.sub.b is a small matrix of dimensions k×n, which is also named a B block. The total number of the B blocks B.sub.b is K.sub.b×N.sub.b, where N.sub.b is N/n. In particular, m=n=m.sub.c, and k=αm.sub.c (α>0). The A blocks A.sub.b and the B blocks B.sub.b are stored into the L2 memory 312.

    [0045] In step S404, the values of M.sub.b and N.sub.b are checked. It is determined whether M.sub.b and N.sub.b are both greater than 1. If yes, step S406 is performed to compare min(L.sub.max, K.sub.b) with the threshold function T(μ), where L.sub.max is the hardware limitation of the B block buffer of the CIM memory 306, and the threshold function T(μ) is:

    [00005] T(μ)=μ(1-s.sub.s) or T(μ)=T(μ, M.sub.b, N.sub.b)=μ(1-s.sub.s)·M.sub.b(N.sub.b-1)/(N.sub.b(M.sub.b-1))

    where μ is the A block buffering capability provided by the L1 memory 310, and s.sub.s is the sparsity of the matrix B.sub.KN (0≤s.sub.s<1).

    [0046] If min(L.sub.max, K.sub.b)≥T(μ), step S408 is performed to set the B block buffering capability L to min(L.sub.max, K.sub.b). The number of B blocks buffered in the CIM memory 306 at the same time is up to L. In step S410, the second reuse scheme first_reuse_B is selected. The B blocks buffered in the CIM memory 306 are reused until the CIM memory 306 completes the multiply-and-accumulate calculations between the buffered B blocks and all the related A blocks. The L1 memory 310 is continuously updated with the required A blocks while the B blocks buffered in the CIM memory 306 are reused to complete the calculation of the target C block.

    [0047] If step S406 determines that min(L.sub.max, K.sub.b)<T(μ), step S412 is performed to set the B block buffering capability L to min(2, K.sub.b). In step S414, the first reuse scheme first_reuse_A is selected. The A blocks buffered in the L1 memory 310 are reused by the CIM memory 306 until the multiply-and-accumulate calculations between the reused A blocks and all the related B blocks are finished. The CIM memory 306 is continuously updated with the required B blocks while the A blocks buffered in the L1 memory 310 are reused to complete the calculation of the target C block.

    [0048] If either of M.sub.b and N.sub.b is not greater than 1, step S416 is performed to further check which of M.sub.b and N.sub.b is 1. If M.sub.b=N.sub.b=1, step S418 is performed, by which the B block buffering capability L provided by the CIM memory 306 is set to min(L.sub.max, K.sub.b) (i.e., a non-zero integer greater than or equal to 2). There is no need to change the reuse scheme. The reuse scheme may be the first reuse scheme first_reuse_A or the second reuse scheme first_reuse_B.

    [0049] If M.sub.b>N.sub.b=1, step S420 is performed, by which the B block buffering capability L provided by the CIM memory 306 is set to min(L.sub.max, K.sub.b) (i.e., a non-zero integer greater than or equal to 2). In step S422, the second reuse scheme first_reuse_B is selected. The B blocks buffered in the CIM memory 306 are reused until the CIM memory 306 completes the multiply-and-accumulate calculations between the buffered B blocks and all the related A blocks.

    [0050] If N.sub.b>M.sub.b=1, step S424 is performed, by which the B block buffering capability L provided by the CIM memory 306 is set to min(L.sub.max, K.sub.b) (i.e., a non-zero integer greater than or equal to 2). In step S426, the first reuse scheme first_reuse_A is selected. The A blocks buffered in the L1 memory 310 are reused by the CIM memory 306 until the multiply-and-accumulate calculations between the reused A blocks and all the related B blocks are finished.
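    The flow of steps S402 through S426 can be condensed into a small selection routine. The sketch below is illustrative, not the patented implementation: the function name, the argument order, and the use of the simple threshold variant T(μ)=μ(1-s.sub.s) are assumptions. It returns the L setting and the chosen reuse scheme per Table 2, keeping the previously selected scheme (`prev`) when M.sub.b=N.sub.b=1.

```python
def select_reuse_scheme(Mb, Nb, Kb, L_max, mu, s_s, prev="first_reuse_A"):
    """Select the B block buffering capability L and the reuse scheme."""
    T = mu * (1 - s_s)                      # simple threshold variant
    if Mb > 1 and Nb > 1:                   # step S406: compare against T(mu)
        if min(L_max, Kb) >= T:
            return min(L_max, Kb), "first_reuse_B"   # steps S408/S410
        return min(2, Kb), "first_reuse_A"           # steps S412/S414
    if Mb == Nb == 1:                       # step S418: keep the previous scheme
        return min(L_max, Kb), prev
    if Mb > Nb:                             # M_b > N_b = 1: steps S420/S422
        return min(L_max, Kb), "first_reuse_B"
    return min(L_max, Kb), "first_reuse_A"  # N_b > M_b = 1: steps S424/S426
```

For example, with M.sub.b=2, N.sub.b=3, K.sub.b=8, L.sub.max=4, μ=2, and s.sub.s=0, min(L.sub.max, K.sub.b)=4≥T(μ)=2, so L is set to 4 and first_reuse_B is selected.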

    [0051] Based on the aforementioned concept, a method for power saving of a compute-in-memory (CIM) design is proposed in accordance with an exemplary embodiment of the present invention. The method includes loading A blocks divided from a matrix A.sub.MK from the L2 memory 312 to the L1 memory 310. The method includes loading B blocks divided from a matrix B.sub.KN from the L2 memory 312 to the CIM memory 306. The method includes programming the A blocks buffered in the L1 memory 310 to the register file 308 to be entered into the CIM memory 306. The CIM memory 306 performs multiply-and-accumulate (MAC) calculations on the A blocks and the B blocks to generate C blocks which form a matrix C.sub.MN that is A.sub.MK×B.sub.KN. Based on the size of the matrix A.sub.MK, the size of the matrix B.sub.KN, the A block buffering capability, μ, of the L1 memory 310, and the B block buffering capability, L, of the CIM memory 306, the method includes selecting a reuse scheme to reuse the A blocks buffered in the L1 memory 310 and the B blocks buffered in the CIM memory 306. The computer readable medium 318 may be provided to configure the computing system 300 to perform the proposed method.

    [0052] There may be many variations of the CAKE algorithm with the proposed novel reuse strategy. Any reuse strategy that selects a reuse scheme based on not only the size of the matrices but also the buffering capability of the A blocks and the B blocks should be considered as being within the scope of the invention.

    [0053] While the invention has been described by way of example and in terms of the preferred embodiments, it should be understood that the invention is not limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements (as would be apparent to those skilled in the art). Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.