METHOD AND NON-TRANSITORY COMPUTER READABLE MEDIUM FOR COMPUTE-IN-MEMORY MACRO ARRANGEMENT, AND ELECTRONIC DEVICE APPLYING THE SAME
20220366216 · 2022-11-17
Assignee
Inventors
CPC classification
G11C7/1063
PHYSICS
G06F15/7821
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G11C5/025
PHYSICS
G11C7/1006
PHYSICS
G11C7/1012
PHYSICS
International classification
Abstract
A method and a non-transitory computer readable medium for CIM arrangement, and an electronic device applying the same are proposed. The method for CIM arrangement includes to obtain information of the number of CIM macros and information of the dimension of each of the CIM macros, to obtain information of the number of input channels and the number of output channels of a designated convolutional layer of a designated neural network, and to determine a CIM macro arrangement for arranging the CIM macros according to the number of the CIM macros, the dimension of each of the CIM macros, the number of the input channels and the number of the output channels of the designated convolutional layer of the designated neural network, for applying convolution operation to the input channels to generate the output channels.
Claims
1. A method for compute-in-memory (CIM) macro arrangement comprising: obtaining information of the number of a plurality of CIM macros and information of a dimension of each of the CIM macros; obtaining information of the number of a plurality of input channels and the number of a plurality of output channels of a designated convolutional layer of a designated neural network; and determining a CIM macro arrangement for arranging the CIM macros according to the number of the CIM macros, the dimension of each of the CIM macros, the number of the input channels and the number of the output channels of the designated convolutional layer of the designated neural network, for applying convolution operation to the input channels to generate the output channels.
2. The method according to claim 1, wherein the step of determining the CIM macro arrangement according to the number of the CIM macros, the dimensions of each of the CIM macros, and the number of the input channels and the number of the output channels of the designated convolutional layer of the designated neural network comprises: determining the CIM macro arrangement capable of performing a convolution of a plurality of filters and the input channels according to latency, energy consumption, and utilization.
3. The method according to claim 2, wherein the determined CIM macro arrangement provides a summation of a vertical dimension of the CIM macros adapted for performing the convolution of the filters and the input channels of the designated convolution layer by a minimum number of times for batch loading the input channels.
4. The method according to claim 2, wherein the determined CIM macro arrangement provides a summation of a horizontal dimension of the CIM macros adapted for performing the convolution of the filters and the input channels of the designated convolution layer by a minimum number of times for batch loading the filters.
5. The method according to claim 2, wherein the latency is associated with at least one of a DRAM latency, a latency for loading weights into the CIM macros, and a processing time of the CIM macros, wherein the energy consumption is associated with energy cost for accessing at least one memory including an on-chip SRAM which is in a same chip as the CIM macros and a DRAM outside the chip, and wherein the utilization is a ratio of used part of the CIM macros to all of the CIM macros.
6. An electronic apparatus comprising: a plurality of compute-in-memory (CIM) macros, wherein the CIM macros are arranged in a predetermined CIM macro arrangement based on the number of the CIM macros, the dimensions of each of the CIM macros, and the number of a plurality of input channels and the number of a plurality of output channels of a designated convolutional layer of a designated neural network; and a processing circuit, configured to: load weights in the arranged CIM macros; and input a plurality of input channels of one input feature map into the arranged CIM macros with the loaded weights for a convolutional operation to generate an output activation of one of a plurality of output feature maps.
7. The electronic apparatus according to claim 6, wherein the processing circuit loads the weights of a plurality of filters in the arranged CIM macros based on the predetermined CIM macro arrangement, the number of the filters, height and width of each kernel of a plurality of kernels of each of the filters and the number of the kernels in each filter, wherein each of the kernels of each filter is respectively applied to a corresponding one of the input channels of the designated convolutional layer of the designated neural network.
8. The electronic apparatus according to claim 6, wherein the processing circuit loads each of the filters into the arranged CIM macros column-wise.
9. The electronic apparatus according to claim 6, wherein the processing circuit determines whether to batch-load the weights of the plurality of filters in the arranged CIM macros based on the height and width of each kernel and a summation of a horizontal dimension of the arranged CIM macros.
10. A non-transitory computer readable medium storing a program causing a computer to: obtaining information of the number of a plurality of CIM macros and information of a dimension of each of the CIM macros; obtaining information of the number of a plurality of input channels and the number of a plurality of output channels of a designated convolutional layer of a designated neural network; and determining a CIM macro arrangement for arranging the CIM macros according to the number of the CIM macros, the dimension of each of the CIM macros, the number of the input channels and the number of the output channels of the designated convolutional layer of the designated neural network, for applying convolution operation to the input channels to generate the output channels.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the disclosure and, together with the description, serve to explain the principles of the disclosure.
[0010]
[0011]
[0012]
[0013]
[0014]
[0015]
[0016]
[0017]
[0018] To make the above features and advantages of the application more comprehensible, several embodiments accompanied with drawings are described in detail as follows.
DESCRIPTION OF THE EMBODIMENTS
[0019] A common form of deep neural network (DNN) is the convolutional neural network (CNN), which is composed of multiple convolutional layers. In such networks, each convolutional layer takes input activation data and generates a higher-level abstraction of the input data, called a feature map, which preserves essential yet unique information. Each of the convolutional layers in CNNs is primarily composed of high-dimensional convolutions. For example,
[0020] Referring to
[0021] To solve the prominent issue, some embodiments of the disclosure will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the application are shown. Indeed, various embodiments of the disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout.
[0022]
[0023] Referring to
[0024] In the present exemplary embodiment, the CIM macro arrangement capable of performing a convolution of multiple filters and the input channels is determined according to latency, energy consumption, and utilization. The latency is associated with at least one of a DRAM latency, a latency for loading weights into the CIM macros, and a processing time of the CIM macros. Herein, the weights mean the parameters of the filters, and the number of parameters of the filters equals FX×FY×IC×OC. Energy is a factor representing the energy cost for computing a convolutional layer by using a type of CIM macro arrangement, and the energy consumption is associated with the energy cost for accessing at least one memory including an on-chip SRAM which is in the same chip as the CIM macros and a DRAM outside the chip. The utilization is the ratio of the used part of the CIM macros to all of the CIM macros. For example, a ratio of DRAM:SRAM:CIM = 200:6:1 means that, for accessing the same amount of data, accessing the SRAM costs 6 times the energy of accessing the CIM macros.
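The cost model above can be sketched as a simple weighted sum, assuming the relative access-energy ratio DRAM:SRAM:CIM = 200:6:1 given in the text; the function name, the access counts, and the energy units are illustrative assumptions, not values from the patent.

```python
# Relative per-access energy costs, from the ratio DRAM:SRAM:CIM = 200:6:1.
DRAM_COST, SRAM_COST, CIM_COST = 200, 6, 1

def arrangement_energy(dram_accesses, sram_accesses, cim_accesses):
    """Total energy for one convolutional layer under one arrangement."""
    return (dram_accesses * DRAM_COST
            + sram_accesses * SRAM_COST
            + cim_accesses * CIM_COST)

# An arrangement that avoids an SRAM accumulation buffer spends less
# energy than one that must read/write partial sums (counts illustrative).
vertical = arrangement_energy(dram_accesses=1000, sram_accesses=0, cim_accesses=512)
horizontal = arrangement_energy(dram_accesses=1000, sram_accesses=256, cim_accesses=512)
assert vertical < horizontal
```

A real offline tool would combine such an energy term with the latency and utilization factors described above before ranking candidate arrangements.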
[0025] In one scenario, the determined CIM macro arrangement may provide a summation of the vertical dimension of all the CIM macros adapted for performing the convolution of the filters and the input channels of the designated convolution layer by a minimum number of times for batch loading the input channels. In another scenario, the determined CIM macro arrangement may provide a summation of the horizontal dimension of all the CIM macros adapted for performing the convolution of the filters and the input channels of the designated convolution layer by a minimum number of times for batch loading the filters.
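The two scenarios above amount to minimizing the number of batch loads against the summed vertical or horizontal dimension of the macros. The following sketch computes both counts under the column-wise mapping described later in the text; the function name and tiling model are illustrative assumptions.

```python
import math

def batch_load_counts(num_macros_v, num_macros_h, rows, cols, fx, fy, ic, oc):
    """Batch-load counts for a (num_macros_v x num_macros_h) macro tiling.

    Assumes each filter occupies fx*fy*ic rows of one column and each
    column stores one filter (an illustrative model, not the patent's
    exact algorithm).
    """
    total_rows = num_macros_v * rows   # summation of the vertical dimension
    total_cols = num_macros_h * cols   # summation of the horizontal dimension
    input_loads = math.ceil(fx * fy * ic / total_rows)   # input-channel batches
    filter_loads = math.ceil(oc / total_cols)            # filter batches
    return input_loads, filter_loads

# 1x1 kernels, 512 input channels, 128 filters, two 256x64 macros:
# vertical stacking loads the input channels in a single batch.
assert batch_load_counts(2, 1, 256, 64, 1, 1, 512, 128) == (1, 2)
assert batch_load_counts(1, 2, 256, 64, 1, 1, 512, 128) == (2, 1)
```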
[0026] For intuitive explanation of how to effectively use multiple CIM macros to maximize computation performance,
[0027] Referring to
[0028] For better comprehension,
[0029] Referring to
[0030] Referring to
[0031] For the column of the filter F0,
Output[OX=0][OY=0][OC=0] = Σ_{IC=1}^{512} F0(IC) × Input(OX=0, OY=0, IC), and
[0032] For the column of the filter F1,
Output[OX=0][OY=0][OC=1] = Σ_{IC=1}^{512} F1(IC) × Input(OX=0, OY=0, IC).
[0033] The convolution operation for the remaining 64 filters F64, F65, . . . , F127 would be similar.
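Each per-column output above is a dot product over the 512 input channels. The sketch below mimics one CIM column, which multiplies its stored filter weights by the input activations and sums the 512 products into a single output activation; the data values are illustrative only.

```python
IC = 512
inputs = list(range(IC))     # Input(OX=0, OY=0, IC), illustrative values
f0 = [1] * IC                # weights of filter F0, illustrative
f1 = [2] * IC                # weights of filter F1, illustrative

# One CIM column: 512 multiplication results summed to one activation.
out_oc0 = sum(f0[ic] * inputs[ic] for ic in range(IC))
out_oc1 = sum(f1[ic] * inputs[ic] for ic in range(IC))

assert out_oc0 == IC * (IC - 1) // 2   # 0 + 1 + ... + 511 = 130816
assert out_oc1 == 2 * out_oc0
```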
[0034] In the same case of using two CIM macros, each CIM macro having 256 rows and 64 columns, horizontally-arranged CIM macros can also be used for computing the convolution. In such a case, the first-half input channels 1-256 may be input to each column of the total 128 columns (which respectively store 128 filters in advance) of the two horizontally-arranged CIM macros, and the 256 multiplication results of each column are summed by the CIM macro into an output value. However, such an output value cannot serve as a complete convolution output since the second-half input channels 257-512 have not yet been calculated. These output values (incomplete convolution outputs) have to be stored in an accumulation buffer (either SRAM or DFFs). Only after the convolution operation for the second-half input channels 257-512 is also completed are the two parts of incomplete convolution outputs added to generate the 128 complete convolution outputs. In such a case, more energy is spent on accessing the accumulation buffer, so this arrangement is less efficient than using the two vertically-arranged CIM macros.
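The partial-sum accumulation described above can be sketched as follows, with the 512 input channels split into two halves of 256; the weight values and buffer representation are illustrative assumptions.

```python
ROWS, IC, OC = 256, 512, 128
inputs = [ic + 1 for ic in range(IC)]       # illustrative input activations
weights = [[1] * IC for _ in range(OC)]     # 128 filters, one per column

accum_buffer = [0] * OC                     # SRAM/DFF accumulation buffer
for start in (0, ROWS):                     # first half, then second half
    for col in range(OC):
        # Each column sums 256 products; this is an incomplete output.
        partial = sum(weights[col][ic] * inputs[ic]
                      for ic in range(start, start + ROWS))
        accum_buffer[col] += partial        # extra buffer accesses cost energy

# After both halves, the buffer holds the 128 complete convolution outputs.
assert accum_buffer[0] == sum(inputs)       # 1 + 2 + ... + 512 = 131328
```

The extra reads and writes of `accum_buffer` are exactly the accumulation-buffer accesses that make this horizontal arrangement less energy-efficient here than the vertical one.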
[0035] Next, assume that the number of input channels is 128 and the number of output channels is 512. Since each macro has 256 rows (which is greater than 128), it is not necessary to arrange two CIM macros vertically. A single CIM macro would be able to complete the convolution operation for all 128 input channels (i.e. the utilization of a single CIM macro is only 50%). In this case, an efficient CIM macro arrangement for computing the convolution may be a horizontal CIM arrangement as illustrated in
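A minimal sketch of the decision discussed in the last two examples: stack macros vertically only when the input channels overflow one macro's rows, otherwise spread them horizontally to cover more output channels. The rule and names are illustrative assumptions, not the patent's exact algorithm.

```python
def choose_arrangement(num_macros, rows, cols, ic, oc):
    """Pick a direction and how many macros to tile in it (illustrative)."""
    if ic > rows:
        # e.g. IC=512 > 256 rows: macros stacked vertically
        return ("vertical", min(num_macros, -(-ic // rows)))
    # e.g. IC=128 <= 256 rows: macros side by side for more output channels
    return ("horizontal", min(num_macros, -(-oc // cols)))

# The two worked examples from the text, with two 256x64 macros:
assert choose_arrangement(2, 256, 64, 512, 128) == ("vertical", 2)
assert choose_arrangement(2, 256, 64, 128, 512) == ("horizontal", 2)
```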
[0036] Referring to
[0037] Different products may apply different CNN architectures for data processing. For example, a surveillance system may apply a CNN architecture A for data processing, while a surgical instrument may apply a CNN architecture B for data processing. Based on the configuration (i.e. OX, OY, IC, OC, FX, FY, . . . etc.) of the convolutional layers of the CNN architecture a product selects, a proper CIM macro arrangement for the product can be predetermined by an offline tool.
[0038] Once the CIM macro arrangement for the product is determined offline,
[0039] Referring to
[0040] In practical application,
[0041] Referring to
[0042] In an example, the weights of the filters may be loaded into the CIM macros first, and then the input channels (the input feature maps) may be input to the CIM macros for the convolutional operation. In another example, the input channels may be loaded into the CIM macros first, and then the weights may be input to the CIM macros for the convolutional operation.
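The weights-first flow described above can be sketched as a two-step interface; the class name and methods are illustrative assumptions, as the patent does not define this API.

```python
class CimMacroArray:
    """Illustrative model of arranged CIM macros (not the patent's API)."""

    def __init__(self):
        self.weights = None

    def load_weights(self, weights):
        """Step 1: load filter weights, one filter per column."""
        self.weights = weights

    def convolve(self, input_channels):
        """Step 2: stream input channels through the loaded weights."""
        assert self.weights is not None, "weights must be loaded first"
        return [sum(w * x for w, x in zip(col, input_channels))
                for col in self.weights]

macros = CimMacroArray()
macros.load_weights([[1, 0], [0, 1]])   # two 1x1x2 filters, illustrative
assert macros.convolve([3, 4]) == [3, 4]
```

The second example in the paragraph simply swaps the two steps: the input activations are held in the macros and the weights are streamed through instead.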
[0043] In the present exemplary embodiment, the processing circuit 810 loads the weights of multiple filters in the arranged CIM macros based on the predetermined CIM macro arrangement, the number of the filters, height and width of each kernel of each of the filters and the number of the kernels in each filter, where each of the kernels of each filter is respectively applied to a corresponding one of the input channels of the designated convolutional layer of the designated neural network.
[0044] In one exemplary embodiment, the processing circuit 810 loads each of the filters into the arranged CIM macros column-wise. The processing circuit 810 may determine whether to batch-load the weights of the filters into the arranged CIM macros based on the height and width of each kernel and a summation of a horizontal dimension of the arranged CIM macros.
[0045] The disclosure also provides a non-transitory computer readable recording medium, which records a computer program to be loaded into a computer system to execute the steps of the proposed method. The computer program is composed of multiple program instructions. Once the program instructions are loaded into the computer system and executed by it, the steps of the proposed method are accomplished.
[0046] In view of the aforementioned descriptions, the proposed technique makes it possible to effectively use multiple CIM macros with an optimum configuration to maximize computation performance.
[0047] No element, act, or instruction used in the detailed description of disclosed embodiments of the present application should be construed as absolutely critical or essential to the present disclosure unless explicitly described as such. Also, as used herein, each of the indefinite articles “a” and “an” could include more than one item. If only one item is intended, the terms “a single” or similar languages would be used. Furthermore, the terms “any of” followed by a listing of a plurality of items and/or a plurality of categories of items, as used herein, are intended to include “any of”, “any combination of”, “any multiple of”, and/or “any combination of multiples of” the items and/or the categories of items, individually or in conjunction with other items and/or other categories of items. Further, as used herein, the term “set” is intended to include any number of items, including zero. Further, as used herein, the term “number” is intended to include any number, including zero.
[0048] It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the disclosed embodiments without departing from the scope or spirit of the disclosure. In view of the foregoing, it is intended that the disclosure cover modifications and variations of this disclosure provided they fall within the scope of the following claims and their equivalents.