Method and apparatus for performing convolution operation on folded feature data
11500958 · 2022-11-15
CPC classification
G06F7/501
PHYSICS
International classification
G06F17/15
PHYSICS
G06F7/501
PHYSICS
Abstract
Disclosed are a method and an apparatus for performing convolution operation on folded feature data. The method comprises: reading the folded feature data provided to a convolution layer and an original convolution kernel from a dynamic random access memory (DRAM); pre-processing the folded feature data and the original convolution kernel; storing the pre-processed folded feature data into a static random-access memory (SRAM); folding the pre-processed original convolution kernel in at least one dimension of width or height according to a folding manner of the folded feature data to generate one or more folded convolution kernels corresponding to the original convolution kernel; storing the one or more folded convolution kernels in the SRAM; and reading the pre-processed folded feature data and the one or more folded convolution kernels from the SRAM into a calculation unit for convolving the pre-processed folded feature data with the one or more folded convolution kernels.
Claims
1. A method for performing a convolution operation on folded feature data, comprising: reading the folded feature data provided to a convolution layer and an original convolution kernel from a dynamic random access memory (DRAM); pre-processing the folded feature data and the original convolution kernel; storing the pre-processed folded feature data into a static random-access memory (SRAM); folding the pre-processed original convolution kernel in at least one dimension of width or height according to a folding manner of the folded feature data to generate one or more folded convolution kernels corresponding to the original convolution kernel; storing the one or more folded convolution kernels in the SRAM; and reading the pre-processed folded feature data and the one or more folded convolution kernels from the SRAM into a calculation unit for convolving the pre-processed folded feature data with the one or more folded convolution kernels, wherein the folded feature data corresponds to a first feature data that is not folded in a first dimension, data of all C.sub.x channels in the (i.sub.fx×N.sub.x+j.sub.fx)th slice of the first feature data in the first dimension correspond to data of consecutive C.sub.x channels starting from the (j.sub.fx×C.sub.x)th channel in the (i.sub.fx)th slice of the folded feature data in the first dimension, where the first dimension is one of width or height, i.sub.fx is an integer greater than or equal to 0, N.sub.x is an integer greater than 1, j.sub.fx is an integer greater than or equal to 0 and less than N.sub.x, and C.sub.x is an integer greater than 0, and wherein when the original convolution kernel has a first stride in the first dimension that is equal to N.sub.x, each of the folded convolution kernels has a stride 1 in the first dimension.
2. The method of claim 1 wherein the SRAM includes a plurality of SRAM cells, at least every two pixels of the pre-processed folded feature data are stored in a same SRAM cell, and at least every two pixels of each folded convolution kernel are stored in a same SRAM cell.
3. The method of claim 1 wherein the calculation unit comprises a plurality of multipliers and a plurality of adders.
4. The method of claim 1 wherein the pre-processing comprises: determining a first padding quantity P.sub.1 for padding at a starting boundary of the first feature data in the first dimension according to a padding manner specified by the convolution layer, the first padding quantity P.sub.1 being greater than or equal to 0; padding ┌P.sub.1/N.sub.x┐ zero slices at a starting boundary of the folded feature data in the first dimension where “┌ ┐” represents an upward rounding operation; and padding ┌P.sub.1/N.sub.x┐×N.sub.x−P.sub.1 zero slices at a starting boundary of the original convolution kernel in the first dimension.
5. The method of claim 1 wherein the convolving comprises: in a case where the original convolution kernel has a first stride in the first dimension that is not equal to N.sub.x, moving all the folded convolution kernels corresponding to the original convolution kernel by the first stride in the first dimension after convolving a same portion of the pre-processed folded feature data with all the folded convolution kernels corresponding to the original convolution kernel.
6. The method of claim 5 wherein convolving a same portion of the pre-processed folded feature data with all the folded convolution kernels comprises: simultaneously calculating, by a plurality of multipliers, products of plural pixels in the pre-processed folded feature data each with a corresponding pixel of plural folded convolution kernels.
7. The method of claim 1 wherein the convolving comprises: in a case where the original convolution kernel has a first stride in the first dimension that is not equal to N.sub.x, convolving the entire pre-processed folded feature data with each of the folded convolution kernels corresponding to the original convolution kernel, respectively, where each of the folded convolution kernels has a stride in the first dimension that is equal to the first stride.
8. The method of claim 1 wherein folding the pre-processed original convolution kernel in at least one dimension comprises: padding k.sub.x×S.sub.x zero slices at a starting boundary of the pre-processed original convolution kernel in the first dimension to generate E.sub.x first transformed convolution kernels, respectively, where S.sub.x is a first stride of the original convolution kernel in the first dimension, E.sub.x is greater than or equal to 1 and depends on N.sub.x and S.sub.x, and k.sub.x is an integer greater than or equal to 0 and less than E.sub.x; and folding each first transformed convolution kernel in the first dimension by splicing every N.sub.x consecutive slices of the first transformed convolution kernel in the first dimension together in a depth dimension to generate a first folded convolution kernel corresponding to the first transformed convolution kernel.
9. The method of claim 8 wherein data of all C.sub.x channels in the (i.sub.kx×N.sub.x+j.sub.kx)th slice of each first transformed convolution kernel in the first dimension correspond to data of consecutive C.sub.x channels starting from the (j.sub.kx×C.sub.x)th channel in the (i.sub.kx)th slice of the corresponding first folded convolution kernel in the first dimension, respectively, where i.sub.kx is an integer greater than or equal to 0, and j.sub.kx is an integer greater than or equal to 0 and less than N.sub.x.
10. The method of claim 8 wherein the first feature data corresponds to a second feature data that is not folded in a second dimension, data of all C.sub.y channels in the (i.sub.fy×N.sub.y+j.sub.fy)th slice of the second feature data in the second dimension correspond to data of consecutive C.sub.y channels starting from the (j.sub.fy×C.sub.y)th channel in the (i.sub.fy)th slice of the first feature data in the second dimension, where the second dimension is the other one of width or height, i.sub.fy is an integer greater than or equal to 0, N.sub.y is an integer greater than 1, j.sub.fy is an integer greater than or equal to 0 and less than N.sub.y, and C.sub.y is an integer greater than 0.
11. The method of claim 10 wherein the pre-processing further comprises: determining a second padding quantity P.sub.2 for padding at a starting boundary of the second feature data in the second dimension according to the padding manner specified by the convolution layer, the second padding quantity P.sub.2 being greater than or equal to 0; padding ┌P.sub.2/N.sub.y┐ zero slices at a starting boundary of the folded feature data in the second dimension where “┌ ┐” represents an upward rounding operation; and padding ┌P.sub.2/N.sub.y┐×N.sub.y−P.sub.2 zero slices at a starting boundary of the original convolution kernel in the second dimension.
12. The method of claim 10 wherein folding the pre-processed original convolution kernel in at least one dimension further comprises: padding k.sub.y×S.sub.y zero slices at a starting boundary of each first folded convolution kernel in the second dimension to generate E.sub.y second transformed convolution kernels for each first folded convolution kernel, where S.sub.y is a second stride of the original convolution kernel in the second dimension, E.sub.y is greater than or equal to 1 and depends on N.sub.y and S.sub.y, and k.sub.y is an integer greater than or equal to 0 and less than E.sub.y; and folding each second transformed convolution kernel in the second dimension by splicing every N.sub.y consecutive slices of the second transformed convolution kernel in the second dimension together in the depth dimension to generate a second folded convolution kernel corresponding to the second transformed convolution kernel.
13. The method of claim 12 wherein data of all C.sub.y channels in the (i.sub.ky×N.sub.y+j.sub.ky)th slice of each second transformed convolution kernel in the second dimension correspond to data of consecutive C.sub.y channels starting from the (j.sub.ky×C.sub.y)th channel in the (i.sub.ky)th slice of the corresponding second folded convolution kernel in the second dimension, respectively, where i.sub.ky is an integer greater than or equal to 0, and j.sub.ky is an integer greater than or equal to 0 and less than N.sub.y.
14. The method of claim 10 wherein the convolving further comprises: in a case where the original convolution kernel has a second stride in the second dimension that is not equal to N.sub.y, moving all the folded convolution kernels corresponding to the original convolution kernel by the second stride in the second dimension after convolving a same portion of the pre-processed folded feature data with all the folded convolution kernels corresponding to the original convolution kernel.
15. The method of claim 10 wherein the convolving further comprises: in a case where the original convolution kernel has a second stride in the second dimension that is not equal to N.sub.y, convolving the entire pre-processed folded feature data with each of the folded convolution kernels corresponding to the original convolution kernel, respectively, where each of the folded convolution kernels has a stride in the second dimension that is equal to the second stride.
16. The method of claim 10 wherein when the original convolution kernel has a second stride in the second dimension that is equal to N.sub.y, each folded convolution kernel has a stride 1 in the second dimension.
17. An apparatus for performing a convolution operation on folded feature data comprising: one or more processors configured to execute instructions, execution of the instructions causing the one or more processors to perform the following steps: reading the folded feature data provided to a convolution layer and an original convolution kernel from a dynamic random access memory (DRAM); pre-processing the folded feature data and the original convolution kernel; storing the pre-processed folded feature data into a static random-access memory (SRAM); folding the pre-processed original convolution kernel in at least one dimension of width or height according to a folding manner of the folded feature data to generate one or more folded convolution kernels corresponding to the original convolution kernel; storing the one or more folded convolution kernels in the SRAM; and reading the pre-processed folded feature data and the one or more folded convolution kernels from the SRAM into a calculation unit for convolving the pre-processed folded feature data with the one or more folded convolution kernels, wherein the folded feature data corresponds to a first feature data that is not folded in a first dimension, data of all C.sub.x channels in the (i.sub.fx×N.sub.x+j.sub.fx)th slice of the first feature data in the first dimension correspond to data of consecutive C.sub.x channels starting from the (j.sub.fx×C.sub.x)th channel in the (i.sub.fx)th slice of the folded feature data in the first dimension, where the first dimension is one of width or height, i.sub.fx is an integer greater than or equal to 0, N.sub.x is an integer greater than 1, j.sub.fx is an integer greater than or equal to 0 and less than N.sub.x, and C.sub.x is an integer greater than 0, and wherein when the original convolution kernel has a first stride in the first dimension that is equal to N.sub.x, each of the folded convolution kernels has a stride 1 in the first dimension.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(13) Feature data provided to a convolutional neural network may be regarded as a data cube that has a plurality of dimensions (i.e. different channels) such as width, height, depth, and the like. Each single data in the feature data may correspond to one point in the data cube. Each convolution kernel including weight parameters for the convolution operation in the convolutional neural network may also be regarded as a data cube.
(14) Usually, the term “slice” is used when describing a data cube. When the three dimensions of the data cube are considered to correspond to the dimensions represented by the X-axis, Y-axis, and Z-axis in a three-dimensional Cartesian coordinate system, respectively, a slice of the data cube in a first dimension corresponding to the dimension represented by the X-axis may represent a result obtained by sampling the data in the data cube using a plane orthogonal to the X-axis, which is a data rectangle in a two-dimensional plane represented by the Y-axis and the Z-axis. Formulaically, if the data cube is regarded as a set of points Cube={(x, y, z)|x∈[0,W), y∈[0,H), z∈[0,D)} where W, H, and D are integers greater than 0, a slice of the data cube in the first dimension corresponding to the dimension represented by the X-axis may be represented as Slice.sub.i={(y, z)|x=i, y∈[0, H), z∈[0, D)}, where i∈[0, W). A slice in which all data are zero (or equivalent to zero) may be referred to as a zero slice.
(15) In addition, the term “pixel” is also usually used to describe the data cube. A pixel of the data cube includes the points in the data cube that have the same width (X) and height (Y) coordinates, and may be represented as Pixel.sub.ij={(z)|x=i, y=j, z∈[0, D)}, where i∈[0, W) and j∈[0, H). As seen, a slice may include a plurality of pixels.
(16) Herein, for the convenience of description, the term “slice” is also used when describing data of a certain dimension in the feature data or the convolution kernel, for example, a slice in the width dimension (also referred to as a “width slice” for short), a slice in the height dimension (also referred to as a “height slice” for short), etc.
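For illustration only (not part of the claimed method), the slice and pixel notions above can be sketched with NumPy, indexing a cube of shape (W, H, D) along the X-axis:

```python
import numpy as np

# A data cube with W=4, H=3, D=2, indexed (x, y, z) as in the text.
cube = np.arange(4 * 3 * 2).reshape(4, 3, 2)

slice_i = cube[1, :, :]    # Slice_1: a (H, D) data rectangle orthogonal to the X-axis
pixel_ij = cube[1, 2, :]   # Pixel_12: the D values along the depth direction

assert slice_i.shape == (3, 2)   # a slice is a Y-Z rectangle of H pixels
assert pixel_ij.shape == (2,)    # a pixel is a depth vector
```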
(17) Herein, when padding or appending of one or more zero slices in a first dimension (e.g., a width dimension) of a data cube A is mentioned, it may mean that the dimension value (e.g., width) of the first dimension of the data cube A is increased by adding one or more zero slices at a certain boundary of the data cube A in the first dimension (e.g., a left or right side in the width dimension), each of the added one or more zero slices having the same dimension values (e.g., a height value and a depth value) as the original data cube A in the other two dimensions (e.g., the height and depth dimensions).
(18) Herein, when padding or appending of one or more zero slices in a first dimension and a second dimension (e.g., a width dimension and a height dimension) of a data cube A is mentioned, it may mean that a dimension value (e.g., width) of the first dimension of the data cube A is increased by adding one or more zero slices at a certain boundary of the data cube A in the first dimension (e.g., a left or right side in the width dimension), each of the added one or more zero slices having the same dimension values (e.g., a height value and a depth value) as the original data cube A in the other two dimensions (e.g., the height and depth dimensions), and then a dimension value (e.g., height) of the second dimension of the data cube A′ resulting from increasing the first dimension value (e.g., width) of the data cube A is increased by adding one or more zero slices at a certain boundary of the data cube A′ in the second dimension (e.g., an upper or lower side in the height dimension), each of the added one or more zero slices having the same dimension values (e.g., a width value and a depth value) as the data cube A′ in the other two dimensions (e.g., the width and depth dimensions).
(19) Herein, when it is mentioned that respective slices of the data cube A are aligned to each other in depth, it may mean that zero (or a value equivalent to zero) is added in the depth direction to the slices (width slices or height slices) of the data cube A that do not have a desired depth value, such that the respective slices of the data cube A after zero-adding have the desired depth value.
(20) Herein, when it is mentioned that padding is performed on the data cube A in a first dimension and/or a second dimension, a number of the padded zero slices may be zero or one or more unless otherwise specified.
(21) The convolutional neural network is usually operation intensive, and it is desirable to perform operations in the convolutional neural network efficiently by using hardware such as a general purpose central processing unit (CPU), a graphics processing unit (GPU), or a dedicated accelerator. In order to improve operation efficiency and/or simplify hardware design, for example, a multi-channel memory may be designed to provide data to adders and/or multipliers that perform a convolution operation, or an operating unit (e.g. a multiplier circuit for performing a convolution operation) may be designed to support multi-channel (e.g. 32 channels) operation.
(22) In another aspect, feature data provided to an input layer of the convolutional neural network may usually have a small number of channels (usually 3 channels or just 1 channel), and feature data input to a convolutional layer near the beginning of the feedforward inference path of the convolutional neural network may also have a small number of channels, causing low resource utilization of the memory and/or operator that supports multiple channels at least at a certain stage of the entire feedforward inference process of the convolutional neural network. Therefore, conventional feature data may be folded in width and/or height to improve, for example, resource utilization of the memory that supports multiple channels.
(23) However, for a convolutional neural network architecture that has already been designed, a convolution operation cannot be performed directly on folded feature data using the weight parameters of the corresponding convolutional layer. Accordingly, the folded feature data has to be unfolded into conventional unfolded feature data first, and the unfolded feature data is then provided to the corresponding convolutional layer, where the convolution operation is performed on it using the weight parameters of that layer. This eliminates the advantages produced by folding the feature data, again wasting hardware resources such as cache memories and/or multipliers and causing many additional ineffective operations.
(24) Therefore, it is desirable to be able to perform a convolution operation directly on folded feature data with a convolutional neural network architecture that has already been designed.
(25) Folded feature data FD′ may be obtained by folding every N.sub.x slices (N.sub.x is an integer greater than 1) of original feature data FD in a width or height dimension D1 together in a depth dimension, and data of all C.sub.x channels of the (i.sub.fx×N.sub.x+j.sub.fx)th slice in the original feature data FD in the dimension D1 correspond to data of consecutive C.sub.x channels starting from the (j.sub.fx×C.sub.x)th channel of the (i.sub.fx)th slice in the folded feature data FD′ in the dimension D1, where i.sub.fx is an integer greater than or equal to 0, j.sub.fx is an integer greater than or equal to 0 and less than N.sub.x; and C.sub.x is an integer greater than 0.
(26) In addition, folded feature data FD″ may be obtained by folding the original feature data FD in both width and height dimensions. For example, the folded feature data FD″ may be obtained by further folding every N.sub.y slices (N.sub.y is an integer greater than 1) of the aforementioned folded feature data FD′ in the other dimension D2 of the width and height dimensions together in the depth dimension, and data of all C.sub.y channels of the (i.sub.fy×N.sub.y+j.sub.fy)th slice in the folded feature data FD′ in the dimension D2 correspond to data of consecutive C.sub.y channels starting from the (j.sub.fy×C.sub.y)th channel of the (i.sub.fy)th slice in the folded feature data FD″ in the dimension D2, where i.sub.fy is an integer greater than or equal to 0, j.sub.fy is an integer greater than or equal to 0 and less than N.sub.y, and C.sub.y is an integer greater than 0.
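The channel mapping defined in paragraphs (25) and (26) can be illustrated for one dimension with a small NumPy sketch (the array layout and helper name are illustrative assumptions; the disclosure itself does not prescribe a data layout):

```python
import numpy as np

def fold_width(fd, nx):
    # fd: feature data of shape (H, W, C); splice every nx consecutive
    # width slices together in the depth (channel) dimension.
    h, w, c = fd.shape
    pad = (-w) % nx   # zero slices appended so that W is a multiple of nx
    fd = np.pad(fd, ((0, 0), (0, pad), (0, 0)))
    return fd.reshape(h, (w + pad) // nx, nx * c)

fd = np.arange(2 * 6 * 3).reshape(2, 6, 3)   # H=2, W=6, C=3
fdp = fold_width(fd, 2)                      # folded: H=2, W'=3, C'=6

# Slice (i_fx*N_x + j_fx) of FD corresponds to channels
# [j_fx*C_x, (j_fx+1)*C_x) of slice i_fx of FD', as stated above.
assert np.array_equal(fd[:, 2 * 1 + 1, :], fdp[:, 1, 1 * 3:2 * 3])
```

Folding FD′ again along the other dimension with N.sub.y yields FD″ in the same fashion.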
(33) For the folded feature data, at least the folding manner (including the splicing number N.sub.x associated with folding in the dimension D1 and/or the splicing number N.sub.y associated with folding in the dimension D2) used to generate the folded feature data may be known in advance. In addition, the number of zero value data for aligning respective slices of the folded feature data in the depth dimension may also be known in advance.
(34) In addition, if it is clear in the context, the small cubes in the feature data (and the convolution kernel described below) may not be shown, and instead a plane may be used to represent a slice. For example, if three dimensions of width, height, and depth correspond to X-axis, Y-axis, and Z-axis in the three-dimensional Cartesian coordinate system, respectively, a plane perpendicular to the X-axis may be used to represent a width slice of the feature data (or the convolution kernel described below).
(36) The exemplary method 300 may comprise the following steps:
(37) Step S305, pre-processing the folded feature data and an original convolution kernel of a convolution layer;
(38) Step S310, folding the pre-processed original convolution kernel to generate one or more folded convolution kernels corresponding to the original convolution kernel; and
(39) Step S315, performing a convolution operation on the pre-processed folded feature data using the generated one or more folded convolution kernels.
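The three steps can be sketched end-to-end for a minimal one-dimensional, single-channel case with S.sub.x=N.sub.x=2 (all sizes are illustrative; a real implementation operates on three-dimensional cubes). Folding the kernel in the same manner as the data turns a stride-2 convolution into an equivalent stride-1 convolution over the folded data:

```python
import numpy as np

nx = 2
fd = np.arange(8.0)          # original feature data: W=8, one channel
k = np.array([1.0, -1.0])    # original kernel: KVx=2, stride Sx=2

# conventional strided convolution on the unfolded data
direct = np.array([fd[i * nx:i * nx + 2] @ k for i in range(4)])

# folding: every nx consecutive width positions become channels
fd_folded = fd.reshape(-1, nx)   # shape (4, 2)
# Step S310: fold the kernel in the same manner (here it becomes width 1)
k_folded = k.reshape(-1, nx)     # shape (1, 2)

# Step S315: stride-1 convolution of the folded data with the folded kernel
folded = np.array([fd_folded[i] @ k_folded[0] for i in range(4)])

assert np.allclose(direct, folded)   # same output, stride reduced to 1
```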
(40) In case of a conventional convolution operation where an original convolution kernel is used to convolve an original unfolded feature data provided to a convolution layer, the original convolution kernel slides on the original unfolded feature data with a stride S.sub.x (greater than or equal to 1) in the width dimension and a stride S.sub.y (greater than or equal to 1) in the height dimension, and convolves the portion of data in the original unfolded feature data corresponding to the sliding window of the kernel in order to obtain the desirable output feature data. Before performing the convolution operation, zero slices may be padded in a predetermined manner around the original unfolded feature data in the width and height dimensions (including at the starting and ending boundaries in the width dimension and at the starting and ending boundaries in the height dimension), and the number of zero slices padded around the original unfolded feature data may depend on the predetermined padding scheme and may be, for example, zero, one, or more. For a convolutional neural network that has been designed, the weight parameters used in each convolution layer (including the number of convolution kernels, the width, height, and depth of each kernel, and the values included in each kernel) and the padding scheme for the original unfolded feature data to be provided to the convolution layer are already known. Such configurations may be specified in advance by the designer of the convolutional neural network, or may be determined or adjusted by learning.
(41) When a folded feature data is provided to the convolution layer of the convolutional neural network, in order to ensure that a desirable correct result may still be obtained using the method according to an embodiment of the present disclosure, the folded feature data and the original convolution kernel may be pre-processed firstly in Step S305.
(42) In an embodiment, if the folded feature data received at the convolution layer is the folded feature data FD′ obtained by folding the original unfolded feature data FD in the width or height dimension D1 with the splicing number N.sub.x, a padding quantity P1 (P1≥0) for padding at the starting boundary of the original unfolded feature data FD in the dimension D1 may be determined according to the padding scheme specified by the convolution layer for the original unfolded feature data FD, and ┌P.sub.1/N.sub.x┐ zero slices may be padded at the starting boundary of the folded feature data FD′ in the dimension D1 where “┌ ┐” represents an upward rounding operation.
(43) For the ending boundary of the folded feature data FD′ in the dimension D1, a dimension value FV.sub.x′ of the folded feature data FD′ in the dimension D1 (e.g., a width value in a case that D1 is the width dimension), and a dimension value KV.sub.x and a stride S.sub.x of the original convolution kernel K of the weight parameters of the convolution layer in the dimension D1 may be determined firstly. If a result of calculating ((FV.sub.x′+┌P.sub.1/N.sub.x┐)×N.sub.x−KV.sub.x) is not an integral multiple of S.sub.x, P.sub.2′ zero slices may be padded at the ending boundary of the folded feature data FD′ in the dimension D1 so that a result of calculating ((FV.sub.x′+┌P.sub.1/N.sub.x┐)×N.sub.x−KV.sub.x+P.sub.2′) is an integral multiple of S.sub.x.
(44) For the ending boundary of the folded feature data FD′ in the dimension D1, an expected dimension value
(45) KV.sub.x′=┌((N.sub.x/(N.sub.x,S.sub.x)−1)×S.sub.x+KV.sub.x)/N.sub.x┐
of the folded convolution kernel in the dimension D1 may also be calculated firstly, where (N.sub.x,S.sub.x) represents the greatest common divisor of N.sub.x and S.sub.x. Then, if N.sub.x≠S.sub.x, the padding quantity P.sub.2′ at the ending boundary of the folded feature data FD′ in the dimension D1 may be determined such that the result value of (P.sub.2′+┌P.sub.1/N.sub.x┐+FV.sub.x′−KV.sub.x′) is an integral multiple of S.sub.x; otherwise, the padding quantity P.sub.2′ at the ending boundary of the folded feature data FD′ in the dimension D1 may be determined such that P.sub.2′<KV.sub.x′.
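Under the rule above, the ending-boundary padding quantity P.sub.2′ for the N.sub.x≠S.sub.x branch may be computed as in this sketch (the function and argument names are illustrative assumptions):

```python
import math

def d1_end_padding(fvx_folded, p1, nx, sx, kvx):
    # Expected folded-kernel width KVx' = ceil(((Nx/gcd(Nx,Sx) - 1)*Sx + KVx)/Nx)
    kvx_folded = math.ceil(((nx // math.gcd(nx, sx) - 1) * sx + kvx) / nx)
    lead = math.ceil(p1 / nx)   # zero slices already padded at the starting boundary
    # Smallest P2' making (P2' + ceil(P1/Nx) + FVx' - KVx') a multiple of Sx
    p2 = 0
    while (p2 + lead + fvx_folded - kvx_folded) % sx != 0:
        p2 += 1
    return p2

assert d1_end_padding(4, 0, 2, 3, 3) == 2   # Nx=2, Sx=3: KVx'=3, need (P2'+1) % 3 == 0
assert d1_end_padding(3, 0, 2, 1, 3) == 0   # Sx=1: every quantity is a multiple of 1
```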
(46) In addition, the starting and/or ending boundaries of the folded feature data FD′ in the other dimension D2 of width and height may be padded according to the padding scheme specified by the convolution layer to pad the original feature data FD in the dimension D2.
(47) In addition, ┌P.sub.1/N.sub.x┐×N.sub.x−P.sub.1 zero slices may be padded at the starting boundary of the original convolution kernel in the dimension D1.
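The starting-boundary rule in paragraphs (42) and (47) amounts to the following arithmetic (a sketch; names are illustrative):

```python
import math

def d1_start_padding(p1, nx):
    # P1 zero slices for the unfolded data become ceil(P1/Nx) folded zero
    # slices; the surplus ceil(P1/Nx)*Nx - P1 is compensated on the kernel.
    fd_pad = math.ceil(p1 / nx)
    kernel_pad = fd_pad * nx - p1
    return fd_pad, kernel_pad

assert d1_start_padding(0, 2) == (0, 0)
assert d1_start_padding(1, 2) == (1, 1)   # one folded slice covers 2 columns; shift kernel by 1
assert d1_start_padding(3, 2) == (2, 1)
```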
(48) In other embodiments, if the folded feature data received at the convolution layer is the folded feature data FD″ obtained by folding the original unfolded feature data FD in both width and height dimensions, for example, by folding the original unfolded feature data FD firstly in the width or height dimension D1 with the splicing number N.sub.x to obtain the folded feature data FD′ and then folding the folded feature data FD′ in the other dimension D2 of width and height with the splicing number N.sub.y, the folded feature data FD″ may be padded according to the same padding scheme described above with respect to padding at the starting and ending boundaries of the folded feature data FD′ in the dimension D1.
(49) Then, the padding quantity P.sub.2 (P.sub.2≥0) for padding at the starting boundary of the feature data FD in the dimension D2 may be determined according to the padding scheme specified by the convolution layer for the feature data FD, and ┌P.sub.2/N.sub.y┐ zero slices may be padded at the starting boundary of the folded feature data FD″ in the dimension D2.
(50) For the ending boundary of the folded feature data FD″ in the dimension D2, a dimension value FV.sub.y′ of the folded feature data FD″ in the dimension D2 (e.g., a height value in a case that D2 is the height dimension), and a dimension value KV.sub.y and a stride S.sub.y of the original convolution kernel K of the weight parameters of the convolution layer in the dimension D2 may be determined firstly. If a result of calculating ((FV.sub.y′+┌P.sub.2/N.sub.y┐)×N.sub.y−KV.sub.y) is not an integral multiple of S.sub.y, P.sub.3′ zero slices may be padded at the ending boundary of the folded feature data FD″ in the dimension D2 so that a result of calculating ((FV.sub.y′+┌P.sub.2/N.sub.y┐)×N.sub.y−KV.sub.y+P.sub.3′) is an integral multiple of S.sub.y.
(51) For the ending boundary of the folded feature data FD″ in the dimension D2, an expected dimension value
(52) KV.sub.y′=┌((N.sub.y/(N.sub.y,S.sub.y)−1)×S.sub.y+KV.sub.y)/N.sub.y┐
of the folded convolution kernel in the dimension D2 may also be calculated firstly, where (N.sub.y,S.sub.y) represents the greatest common divisor of N.sub.y and S.sub.y. Then, if N.sub.y≠S.sub.y, the padding quantity P.sub.3′ at the ending boundary of the folded feature data FD″ in the dimension D2 may be determined such that the result value of (P.sub.3′+┌P.sub.2/N.sub.y┐+FV.sub.y′−KV.sub.y′) is an integral multiple of S.sub.y; otherwise, the padding quantity P.sub.3′ at the ending boundary of the folded feature data FD″ in the dimension D2 may be determined such that P.sub.3′<KV.sub.y′.
(53) In addition, ┌P.sub.1/N.sub.x┐×N.sub.x−P.sub.1 zero slices may be padded at the starting boundary of the original convolution kernel K in the dimension D1, and ┌P.sub.2/N.sub.y┐×N.sub.y−P.sub.2 zero slices may be padded at the starting boundary of the original convolution kernel K in the dimension D2.
(55) Although only one original convolution kernel is shown in the example, the weight parameters of a convolution layer may include a plurality of original convolution kernels, in which case each original convolution kernel may be pre-processed and folded in the same manner as described herein.
(56) After the folded feature data and the original convolution kernel are pre-processed, the exemplary method 300 may proceed to Step S310 to fold the pre-processed original convolution kernel.
(57) In Step S310, the pre-processed convolution kernel K′ may be padded with k.sub.x×S.sub.x zero slices at the starting boundary in the dimension D1 to generate one or more transformed convolution kernels K′[k.sub.x] corresponding to the original convolution kernel K or the pre-processed convolution kernel K′, where S.sub.x is the stride of the original convolution kernel K in the dimension D1, and k.sub.x is an integer greater than or equal to 0. For example, three transformed convolution kernels corresponding to the original convolution kernel K may be generated by padding 0, S.sub.x, and 2×S.sub.x zero slices, respectively.
(58) A maximum value of k.sub.x may be set to limit the number of the transformed convolution kernels. For example, k.sub.x<E.sub.x may be set where E.sub.x may be determined as a value obtained by dividing the least common multiple of S.sub.x and N.sub.x by S.sub.x, or a value obtained by dividing N.sub.x by the greatest common divisor of N.sub.x and S.sub.x, or a value equal to N.sub.x when S.sub.x=1 or S.sub.x and N.sub.x are relatively prime. Thus, E.sub.x transformed convolution kernels K′[k.sub.x] corresponding to the original convolution kernel K or the pre-processed convolution kernel K′ may be generated.
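The formulations of E.sub.x given above agree, as a quick check shows (illustrative helper):

```python
import math

def num_transformed_kernels(nx, sx):
    # E_x = lcm(Sx, Nx) / Sx, equivalently Nx / gcd(Nx, Sx)
    ex = math.lcm(sx, nx) // sx
    assert ex == nx // math.gcd(nx, sx)
    return ex

assert num_transformed_kernels(2, 1) == 2   # Sx = 1: Ex = Nx
assert num_transformed_kernels(4, 2) == 2   # Ex = Nx / gcd(Nx, Sx)
assert num_transformed_kernels(3, 2) == 3   # Sx, Nx relatively prime: Ex = Nx
assert num_transformed_kernels(2, 2) == 1   # Sx = Nx: a single transformed kernel
```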
(59) Then, each transformed convolution kernel K′[k.sub.x] may be folded in the dimension D1 by splicing every N.sub.x slices consecutive in the dimension D1 together in the depth dimension to generate a corresponding folded convolution kernel K″[k.sub.x] such that data of all C.sub.x channels of the (i.sub.kx×N.sub.x+j.sub.kx)th slice in the folded convolution kernel K″[k.sub.x] in the dimension D1 correspond to data of consecutive C.sub.x channels starting from the (j.sub.kx×C.sub.x)th channel of the (i.sub.kx)th slice in the transformed convolution kernel K′[k.sub.x] in the dimension D1, respectively, where i.sub.kx is an integer greater than or equal to 0, and j.sub.kx is an integer greater than or equal to 0 and less than N.sub.x.
(60) The generated transformed convolution kernels K′[k.sub.x] may have different dimension values in the dimension D1 (e.g., width values in a case where D1 denotes width), or one or more transformed convolution kernels K′[k.sub.x] may have a dimension value in the dimension D1 that is not an integral multiple of N.sub.x, so that slices in the corresponding folded convolution kernels K″[k.sub.x] would not be aligned with each other in the depth dimension.
(61) In such a case, a desirable dimension value EV.sub.x of each transformed convolution kernel K′[k.sub.x] in the dimension D1 may be determined from E.sub.x, S.sub.x, N.sub.x and the dimension values V.sub.x of the pre-processed convolution kernel K′ in the dimension D1. For example, the desirable dimension value EV.sub.x of each transformed convolution kernel K′[k.sub.x] in the dimension D1 may be determined by an equation EV.sub.x=┌((E.sub.x−1)×S.sub.x+V.sub.x)/N.sub.x┐×N.sub.x. If the dimension value of the transformed convolution kernel K′[k.sub.x] in the dimension D1 is smaller than EV.sub.x, then the transformed convolution kernel K′[k.sub.x] may be adjusted by appending a zero slice(s) at the ending boundary of the transformed convolution kernel K′[k.sub.x] in the dimension D1 such that the dimension value of the adjusted transformed convolution kernel K′[k.sub.x] in the dimension D1 becomes EV.sub.x. Then, the adjusted transformed convolution kernel K′[k.sub.x] may be folded in the dimension D1 to generate the corresponding folded convolution kernel K″[k.sub.x].
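The alignment rule above can be sketched as follows (illustrative helper names, NumPy assumed, (H, W, C) layout with D1 being width):

```python
import math
import numpy as np

def aligned_width(e_x: int, s_x: int, v_x: int, n_x: int) -> int:
    """Desirable dimension value EV_x from the text:
    EV_x = ceil(((E_x - 1) * S_x + V_x) / N_x) * N_x."""
    return math.ceil(((e_x - 1) * s_x + v_x) / n_x) * n_x

def pad_to_aligned_width(kernel: np.ndarray, ev_x: int) -> np.ndarray:
    """Append zero slices at the ending boundary in width so the
    (H, W, C) transformed kernel reaches width ev_x before folding."""
    h, w, c = kernel.shape
    return np.pad(kernel, ((0, 0), (0, ev_x - w), (0, 0)))
```

For a 5-wide kernel with E.sub.x=2, S.sub.x=1 and N.sub.x=2, this yields EV.sub.x=6, so each transformed kernel is padded to width 6 and folds into 3 width slices.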
(62) Characteristics or capability of hardware (e.g., memory or operator supporting multiple channels) may also be utilized directly. For example, in a case where hardware is capable of aligning the channels, a channel that does not have any real data may be automatically treated by the hardware as having a zero value. In such a case, the channels of each slice in the folded convolution kernel may be automatically aligned by the hardware. For example, if the hardware supports 32 channels simultaneously, the number of channels of each folded convolution kernel may be automatically aligned to 32 channels by the hardware.
(63) In an embodiment, if the folded feature data received at the convolution layer is obtained by only folding the original unfolded feature data FD in the dimension D1, then in Step S310, the obtained folded convolution kernel K″[k.sub.x] may be used as the final folded convolution kernel.
(64) In another embodiment, if the folded feature data received at the convolution layer is the folded feature data FD″ obtained by folding the original unfolded feature data FD in the dimension D1 using the splicing number N.sub.x to generate the folded feature data FD′ and then folding the folded feature data FD′ in the dimension D2 using the splicing number N.sub.y, then in Step S310, each folded convolution kernel K″[k.sub.x] may be further folded in the dimension D2 using the splicing number N.sub.y. The process of folding the folded convolution kernel K″[k.sub.x] in the dimension D2 using N.sub.y is similar to the process of folding the pre-processed convolution kernel K′ in the dimension D1 using N.sub.x.
(65) For example, the folded convolution kernel K″[k.sub.x] may be padded with k.sub.y×S.sub.y zero slices at the starting boundary in the dimension D2 to generate one or more transformed convolution kernels K″[k.sub.x, k.sub.y] corresponding to the folded convolution kernel K″[k.sub.x], where S.sub.y is the stride of the original convolution kernel K in the dimension D2, and k.sub.y is an integer greater than or equal to 0. Also, a maximum value of k.sub.y may be set to limit the number of transformed convolution kernels K″[k.sub.x,k.sub.y]. For example, k.sub.y may be set to be less than E.sub.y (k.sub.y<E.sub.y) where E.sub.y may be determined as a value obtained by dividing the least common multiple of S.sub.y and N.sub.y by S.sub.y, or a value obtained by dividing N.sub.y by the greatest common divisor of N.sub.y and S.sub.y, or a value equal to N.sub.y in a case where S.sub.y=1 or S.sub.y and N.sub.y are relatively prime. Thus, E.sub.y transformed convolution kernels K″[k.sub.x,k.sub.y] corresponding to the folded convolution kernel K″[k.sub.x] may be generated, or E.sub.x×E.sub.y transformed convolution kernels K″[k.sub.x,k.sub.y] corresponding to the convolution kernel K or the adjusted convolution kernel K′ may be generated.
(66) Then, each transformed convolution kernel K″[k.sub.x,k.sub.y] may be folded in the dimension D2 by splicing every N.sub.y slices consecutive in the dimension D2 together in the depth dimension to generate a corresponding folded convolution kernel K′″[k.sub.x,k.sub.y] such that data of all C.sub.y channels of the (i.sub.ky×N.sub.y+j.sub.ky)th slice in the folded convolution kernel K′″[k.sub.x,k.sub.y] in the dimension D2 correspond to data of consecutive C.sub.y channels starting from the (j.sub.ky×C.sub.y)th channel of the (i.sub.ky)th slice in the transformed convolution kernel K″[k.sub.x,k.sub.y] in the dimension D2, respectively, where i.sub.ky is an integer greater than or equal to 0, and j.sub.ky is an integer greater than or equal to 0 and less than N.sub.y.
(67) A desirable dimension value EV.sub.y of each transformed convolution kernel K″[k.sub.x,k.sub.y] in the dimension D2 may be determined from E.sub.y, S.sub.y, N.sub.y and the dimension values V.sub.y of the pre-processed convolution kernel K′ in the dimension D2. For example, the desirable dimension value EV.sub.y of each transformed convolution kernel K″[k.sub.x,k.sub.y] in the dimension D2 may be determined by an equation EV.sub.y=┌((E.sub.y−1)×S.sub.y+V.sub.y)/N.sub.y┐×N.sub.y. If the dimension value of the transformed convolution kernel K″[k.sub.x,k.sub.y] in the dimension D2 is smaller than EV.sub.y, then the transformed convolution kernel K″[k.sub.x,k.sub.y] may be adjusted by appending a zero slice(s) at the ending boundary of the transformed convolution kernel K″[k.sub.x,k.sub.y] in the dimension D2 such that the dimension value of the adjusted transformed convolution kernel K″[k.sub.x,k.sub.y] in the dimension D2 becomes EV.sub.y. Then, the adjusted transformed convolution kernel K″[k.sub.x,k.sub.y] may be folded in the dimension D2 to generate the corresponding folded convolution kernel K′″[k.sub.x,k.sub.y].
(68) The obtained E.sub.x×E.sub.y folded convolution kernels K′″[k.sub.x, k.sub.y] may be used as the final folded convolution kernels.
(69)
(70) Then, the folded convolution kernels K″[0] and K″[1] each may be further folded in the height dimension. As shown in
(71) The exemplary method 300 may then proceed to Step S315 to perform a convolution operation on the pre-processed folded feature data obtained in Step S305 using the one or more folded convolution kernels obtained in Step S310.
(72) If the folded feature data received at the convolution layer is the folded feature data FD′ obtained by folding the original unfolded feature data FD in the dimension D1, then in Step S315, the convolution operation may be performed on the pre-processed folded feature data obtained in Step S305 using the E.sub.x folded convolution kernels K″[k.sub.x] obtained in Step S310. In such a case, if the original convolution kernel K has a stride S.sub.x in the dimension D1 that is equal to N.sub.x, then each folded convolution kernel K″[k.sub.x] has a stride 1 in the dimension D1; otherwise, the stride of each folded convolution kernel K″[k.sub.x] in the dimension D1 is equal to S.sub.x. In addition, each folded convolution kernel K″[k.sub.x] has a stride in the other dimension D2 of width and height that is equal to the stride S.sub.y of the original convolution kernel K in the dimension D2.
(73) If the folded feature data received at the convolution layer is the folded feature data FD″ obtained by folding the original unfolded feature data FD in the dimension D1 according to the splicing number N.sub.x to obtain the folded feature data FD′ and further folding the folded feature data FD′ in the dimension D2 according to the splicing number N.sub.y, then in Step S315, the convolution operation may be performed on the pre-processed folded feature data obtained in Step S305 using the E.sub.x×E.sub.y folded convolution kernels K′″[k.sub.x,k.sub.y] obtained in Step S310. In such a case, if the original convolution kernel K has a stride S.sub.x in the dimension D1 that is equal to N.sub.x, then each folded convolution kernel K′″[k.sub.x,k.sub.y] has a stride 1 in the dimension D1; otherwise, the stride of each folded convolution kernel K′″[k.sub.x, k.sub.y] in the dimension D1 is equal to S.sub.x. Further, if the original convolution kernel K has a stride S.sub.y in the dimension D2 that is equal to N.sub.y, then each folded convolution kernel K′″[k.sub.x,k.sub.y] has a stride 1 in the dimension D2; otherwise, the stride of each folded convolution kernel K′″[k.sub.x,k.sub.y] in the dimension D2 is equal to S.sub.y.
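The stride rule stated above for both dimensions reduces to one small function (an illustrative helper, not a claimed element):

```python
def folded_stride(s: int, n: int) -> int:
    """Stride of a folded convolution kernel in a folded dimension:
    1 when the original stride s equals the splicing number n of that
    dimension, otherwise the original stride s is kept unchanged."""
    return 1 if s == n else s
```

Applied per dimension: with S.sub.x=2 and N.sub.x=2 the folded kernels slide by 1 over the folded data, which is consistent with each folded slice covering N.sub.x original slices.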
(74) In an embodiment, in Step S315, all folded convolution kernels may be used to convolve a same portion of the folded feature data and then moved by a stride in the dimension D1 or D2 to convolve a next portion of the folded feature data, until all portions of the folded feature data have been convolved, generating a final output feature data.
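This traversal order can be sketched as a reference loop (a software analogy under assumed (H, W, C) layouts and a hypothetical function name; the hardware implementation is not limited to this form):

```python
import numpy as np

def convolve_with_folded_kernels(fd, kernels, stride_h, stride_w):
    """Apply every folded kernel to the same window of the folded
    feature data, then slide by the stride, as described in the text.

    fd: (H, W, C) pre-processed folded feature data.
    kernels: list of (kh, kw, C) folded kernels sharing one window size.
    Returns, per window position, one partial output value per kernel.
    """
    kh, kw, _ = kernels[0].shape
    h, w, _ = fd.shape
    out_rows = []
    for top in range(0, h - kh + 1, stride_h):
        row = []
        for left in range(0, w - kw + 1, stride_w):
            window = fd[top:top + kh, left:left + kw, :]
            # All folded kernels convolve this portion before moving on.
            row.append([float((window * k).sum()) for k in kernels])
        out_rows.append(row)
    return out_rows
```

The per-kernel partial results at each position are then interleaved to reconstruct the output feature data, as discussed below.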
(75)
(76) As shown in
(77) After convolving the first and second rows of the folded feature data FD′″, the four folded convolution kernels K′″[0,0], K′″[0,1], K′″[1,0], and K′″[1,1] move by a stride of 1 (i.e., the stride of the original convolution kernel K in height) in height to convolve the second and third rows of the folded feature data FD′″. The convolution operation on the second and third rows of the folded feature data FD′″ is similar to the convolution operation on the first and second rows of the folded feature data FD′″ using the four folded convolution kernels K′″[0,0], K′″[0,1], K′″[1,0] and K′″[1,1], and a repetitive description thereof will be omitted here.
(78) After convolving the folded feature data FD′″ with the four folded convolution kernels K′″[0,0], K′″[0,1], K′″[1,0] and K′″[1,1], a final output feature data FDO is obtained. The last row of the output feature data FDO, i.e., data (4,1), (4,2), (4,3), (4,4), (4,5) and (4,8), may be retained or discarded as needed. For example, if a three-row output feature data is desired by convolving the original unfolded feature data FD in
(79) In a case where the weight parameters of the convolution layer include a plurality of convolution kernels, the output feature data FDO in the example of
(80) In other embodiments, each folded convolution kernel may be used to convolve the entire folded feature data, respectively. In such a case, it is not necessary to modify the convolution instructions for the hardware. However, if one original convolution kernel corresponds to a plurality of folded convolution kernels, a partial result obtained by using each folded convolution kernel will be in multiple channels. Before the output feature data is provided to a next layer of the convolution neural network or regarded as the final output of the entire convolution neural network, the partial result in multiple channels may be re-organized or unfolded to obtain a complete output in one channel.
(81)
(82) In the exemplary method 300, the folded feature data provided to the convolution layer may be directly convolved without firstly unfolding the folded feature data into a conventional unfolded feature data, thereby improving channel utilization and operation efficiency, and reducing cache consumption.
(83) For example, assume that a processing unit (for example, an array of multipliers for convolution operation) is capable of processing 32 channels simultaneously and the convolution kernel of the weight parameters has a width 5 and a height 5. For a width folded image obtained by width-folding a 3-channel 720×1280 RGB image with N.sub.x=2, the convolution operation performed directly on the width folded image according to the method of the present disclosure involves an amount of calculations only 60% of that for the convolution operation performed on an unfolded image according to a conventional method, with a rate of effective calculations about 2 times of that in the conventional method, even without considering an extra amount of calculations for unfolding the width folded image. For a width height folded image obtained by folding the 3-channel 720×1280 RGB image in width and height with N.sub.x=2, N.sub.y=2, the convolution operation performed directly on the width height folded image according to the method of the present disclosure involves an amount of calculations only 36% of that for the convolution operation performed on an unfolded image according to the conventional method, with a rate of effective calculations about 4 times of that in the conventional method, without considering an extra amount of calculations for unfolding the width height folded image.
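The 60% and 36% figures above can be checked with a short arithmetic sketch (assumptions as stated in the text: the processing array always computes 32 channels per output position regardless of how many carry real data, a 5×5 kernel with stride 1, and splicing numbers N.sub.x=N.sub.y=2, so each folded kernel spans 3 slices in a folded dimension per the EV computation above):

```python
def macs_per_output(kernel_h, kernel_w, hw_channels=32):
    # The multiplier array always processes hw_channels channels at
    # once, so each output position costs kernel_h * kernel_w *
    # hw_channels multiply-accumulates, whether or not the extra
    # channels carry real data.
    return kernel_h * kernel_w * hw_channels

# Unfolded 5x5 kernel over the 720x1280 image.
unfolded = 720 * 1280 * macs_per_output(5, 5)

# Width folding with N_x = 2, S_x = 1: E_x = 2 folded kernels of
# size 5x3, applied over a feature map of half the width.
width_folded = 720 * (1280 // 2) * 2 * macs_per_output(5, 3)
print(width_folded / unfolded)  # -> 0.6

# Width-and-height folding with N_x = N_y = 2: 4 folded 3x3 kernels
# over a quarter-size feature map.
wh_folded = (720 // 2) * (1280 // 2) * 4 * macs_per_output(3, 3)
print(wh_folded / unfolded)  # -> 0.36
```

The effective-calculation rate improves correspondingly: a width-folded input occupies 6 of the 32 channels instead of 3, and a width-height-folded input occupies 12, roughly 2x and 4x the channel utilization of the conventional method.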
(84)
(85) As shown in
(86) The processor 910 may be connected to a memory 920 and an I/O interface 930 through a bus system and/or an interconnection mechanism in other forms (not shown).
(87) The memory 920 may include a computer readable writable storage medium in various forms, for example, a volatile memory and/or a non-volatile memory. Examples of the volatile memory may include but not be limited to a random access memory (RAM) and/or a cache, etc. Examples of the non-volatile memory may include but not be limited to a read only memory (ROM), a hard disk, a flash memory, etc. Examples of the readable writable storage medium may include but not be limited to an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device or any combination thereof. For example, when being used in combination with the neural network-dedicated processor, the memory 920 may be a RAM on a chip carrying the dedicated processor. The memory 920 may include program instructions for instructing the device 900 to perform the method of performing a convolution operation on folded feature data according to the embodiments of the present disclosure.
(88) The I/O interface 930 may serve to provide parameters or data to the processor 910 and output data processed by the processor 910.
(89) As shown in
(90) The pre-processing unit 1010 may be configured to pre-process the folded feature data provided to the convolution layer and the original convolution kernel. In an embodiment, the pre-processing unit 1010 may be configured to perform, for example, Step S305 in the exemplary method 300 as shown in
(91) The folding unit 1020 may be configured to fold the pre-processed original convolution kernel in at least one dimension of width or height according to the folding manner of the folded feature data to generate one or more folded convolution kernels corresponding to the original convolution kernel. In an embodiment, the folding unit 1020 may be configured to perform, for example, Step S310 in the exemplary method 300 as shown in
(92) The operating unit 1030 may be configured to perform a convolution operation on the pre-processed folded feature data using the generated one or more folded convolution kernels. In an embodiment, the operating unit 1030 may be configured to perform, for example, Step S315 of the exemplary method 300 shown in
(93) It should be understood that the apparatuses 900 and 1000 are shown in
(94)
(95) The host processor 1110 may be an ARM processor, a general-purpose Central Processing Unit (CPU), or any other type of processor or controller, and it can execute program instructions to control the operation of other components in the device 1100, such as the DRAM 1120 and the convolution engine 1130, as described below.
(96) The DRAM 1120 may be a DDR RAM or any other types of DRAMs, and it can temporarily store data read from a non-volatile storage such as a magnetic hard disk. For example, the above-mentioned folded feature data and original convolution kernel for a convolution layer in a convolution neural network or program instructions to be executed by the host processor 1110 may be temporarily stored in the DRAM 1120.
(97) The convolution engine 1130 may read the folded feature data and the original convolution kernel from the DRAM 1120 to perform a convolution operation directly on the folded feature data in accordance with any one of the methods disclosed above. The convolution engine 1130 may be formed as a chip, and its components and operations will be discussed below in detail.
(98) Referring to
(99) In an embodiment, pre-processing and storing of the folded feature data may be performed in one step. For example, while the folded feature data read from the DRAM 1120 are being written into the SRAM 1131, additional zero values may be inserted into a data stream of the folded feature data so that the folded feature data stored in the SRAM 1131 are pre-processed (padded as described above).
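A software analogy of this padding-on-write may look as follows (illustrative only: a zero-initialized NumPy buffer stands in for the SRAM, and writing the real data at an offset stands in for inserting zeros into the stream; the helper name is hypothetical):

```python
import numpy as np

def write_with_padding(dst_sram: np.ndarray, stream: np.ndarray,
                       pad_left: int, pad_top: int) -> None:
    """Write an (H, W, C) feature-data stream into a larger
    zero-initialized buffer at an offset, so the stored data come out
    padded without a separate pre-processing pass."""
    h, w, c = stream.shape
    dst_sram[pad_top:pad_top + h, pad_left:pad_left + w, :] = stream
```

In hardware the zeros would be interleaved into the write stream itself rather than pre-cleared, but the stored result is the same padded feature data.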
(100) Referring to
(101) In addition, the pre-processed original convolution kernel may be folded before or while being stored in the SRAM 1131. As described above with reference to
(102) Referring back to
(103) The calculation results from the calculation unit 1133 may be stored in an output buffer (SRAM) 1135. The input buffer 1131 and the output buffer 1135 each are equipped with a buffer crossbar switch 1132, 1134 to control data provided to or received from the calculation unit 1133. If necessary, the calculation results may also be moved from the output buffer 1135 to the DRAM 1120.
(104) Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise”, “comprising” and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including but not limited to”. The word “coupled”, as generally used herein, refers to two or more elements that may be either directly connected, or connected by way of one or more intermediate elements. Additionally, the words “herein”, “above”, “below”, and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of this application. Where the context permits, words in the above Description using the singular or plural number may also include the plural or singular number respectively. The word “or”, in reference to a list of two or more items, covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
(105) The above detailed description of embodiments of the invention is not intended to be exhaustive or to limit the invention to the precise form disclosed above. While specific embodiments of, and examples for, the invention are described above for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. For example, while processes or blocks are presented in a given order, alternative embodiments may perform routines having steps, or employ systems having blocks, in a different order, and some processes or blocks may be deleted, moved, added, subdivided, combined, and/or modified. Each of these processes or blocks may be implemented in a variety of different ways. Also, while processes or blocks are at times shown as being performed in series, these processes or blocks may instead be performed in parallel, or may be performed at different times.
(106) The teachings of the invention provided herein can be applied to other systems, not necessarily the system described above. The elements and acts of the various embodiments described above can be combined to provide further embodiments.
(107) While some embodiments of the inventions have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the disclosure. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the disclosure.