Methods and apparatus for using circular addressing in convolutional operation
10936487 · 2021-03-02
Inventors
- Delin Li (Beijing, CN)
- Zhenjiang Wang (Beijing, CN)
- Wenhui CAO (Beijing, CN)
- Kun Lin (Beijing, CN)
- Liang Chen (Beijing, CN)
- Jianjun Li (Beijing, CN)
- Chang Huang (Beijing, CN)
Abstract
A method and apparatus are disclosed to perform the circular addressing to emulate a virtually unlimited memory space despite the fixed capacity of a physical memory by readdressing the portion of the data that exceeds the pre-defined length of the circular addressing region to another pre-defined address in the circular addressing region. Data segments in a data sample can be loaded and computed with recalculated circular addresses for different applications.
Claims
1. A method for performing circular addressing, comprising:
a. defining a circular addressing region in a memory system, wherein the circular addressing region has an upper boundary with an address A_x and a lower boundary with an address A_y;
b. dividing a data sample into more than one consecutive data segments, wherein the more than one consecutive data segments include segments D_0 to D_n, wherein n is equal to or greater than 1;
c. providing an instruction set for execution, wherein the instruction set is configured to compute an address A_i, wherein the data segment D_i is stored in the circular addressing region starting from the address A_i, wherein the data segment D_i has a length of L comprising a first portion with a length of L_1 and a second portion with a length of L_2, wherein the first portion of the data segment D_i overlaps with the data segment D_(i-1), wherein i is greater than 0 and less than or equal to n;
d. executing the instruction set to compute the address A_i based on an address A_(i-1), wherein the data segment D_(i-1) is stored in the circular addressing region starting from the address A_(i-1); wherein the address A_i is computed as A_x + (A_(i-1) - A_x + L_2) % (A_y - A_x + 1), wherein % indicates a modulo operation that finds the remainder after division of (A_(i-1) - A_x + L_2) by (A_y - A_x + 1); and
e. storing the data segment D_i starting at the address A_i.
2. A method for performing circular addressing, comprising:
a. defining a circular addressing region in a memory system, wherein the circular addressing region has a width and a depth, wherein the width of the circular addressing region is divided into one or more slices including Slice_a to Slice_b and the available width of the circular addressing region is defined as (Slice_b - Slice_a + 1); wherein the depth of the circular addressing region has an upper boundary with an address of Offset_x and a lower boundary with an address of Offset_y, and the available depth of the circular addressing region is defined as (Offset_y - Offset_x + 1);
b. dividing a data sample into more than one consecutive data segments, wherein the more than one consecutive data segments include segments D_0 to D_n, wherein n is equal to or greater than 0;
c. providing an instruction set for execution, wherein the instruction set is configured to compute (Slice_i, Offset_j) for data segment D_i, wherein the data segment D_i is stored in Slice_i starting from the address Offset_j in the circular addressing region, wherein the data segment D_i has a length of L comprising a first portion with a length of L_1 and a second portion with a length of L_2, wherein the first portion of the data segment D_i overlaps with the data segment D_(i-1), wherein i is equal to or greater than 0 and less than or equal to n;
d. executing the instruction set to compute (Slice_i, Offset_j) for data segment D_i based on (Slice_(i-1), Offset_(j-1)), wherein data segment D_(i-1) is stored in Slice_(i-1) starting from the address Offset_(j-1) in the circular addressing region; wherein Offset_j is computed as Offset_x + (Offset_(j-1) - Offset_x + L) % (Offset_y - Offset_x + 1); wherein Slice_i is computed as Slice_a + (Slice_(i-1) - Slice_a + (Offset_(j-1) - Offset_x + L) / (Offset_y - Offset_x + 1)) % (Slice_b - Slice_a + 1); wherein % indicates a modulo operation that finds the remainder after division of one quantity by another; and
e. storing the data segment D_i in Slice_i in width, starting at Offset_j in depth.
3. The method of claim 2, further comprising transferring and multiplexing the data segment D_i.
4. The method of claim 2, further comprising performing a computation of the data segment D_i.
5. The method of claim 4, wherein the computation comprises a convolutional operation.
6. The method of claim 5, wherein the method is performed in a deep learning application.
7. The method of claim 2, wherein the memory system comprises a memory within an embedded device.
8. The method of claim 2, wherein the memory system comprises an SDRAM.
9. The method of claim 8, wherein the SDRAM comprises more than one circular addressing region.
10. The method of claim 2, wherein the instruction set is provided using one or more software modules.
11. The method of claim 2, wherein the method is performed using one or more hardware modules.
12. An apparatus for performing circular addressing, comprising:
a. an integrated circuit device, wherein the integrated circuit device comprises a data memory and a processor core;
i. wherein the processor core further comprises a circular address generator;
ii. wherein the circular address generator is configured to perform data memory addressing and access in said data memory;
iii. wherein said data memory includes one or more circular addressing regions having a width and a depth, wherein the width of one circular addressing region is divided into one or more slices including Slice_a to Slice_b and the available width of said circular addressing region is defined as (Slice_b - Slice_a + 1); wherein the depth of said circular addressing region has an upper boundary with an address of Offset_x and a lower boundary with an address of Offset_y, and the available depth of the circular addressing region is defined as (Offset_y - Offset_x + 1);
iv. wherein the circular address generator is configured to execute an instruction set to compute (Slice_i, Offset_j) for a data segment D_i with a length of L, wherein the data segment D_i is comprised in more than one consecutive data segments including segments D_0 to D_n, wherein n is equal to or greater than 0, and wherein i is equal to or greater than 0 and less than or equal to n; wherein the data segment D_i is stored in Slice_i starting from the address Offset_j in said circular addressing region;
v. wherein the circular address generator is configured to compute (Slice_i, Offset_j) for data segment D_i based on (Slice_(i-1), Offset_(j-1)), wherein data segment D_(i-1) is stored in Slice_(i-1) starting from the address Offset_(j-1) in the circular addressing region; wherein Offset_j is computed as Offset_x + (Offset_(j-1) - Offset_x + L) % (Offset_y - Offset_x + 1); wherein Slice_i is computed as Slice_a + (Slice_(i-1) - Slice_a + (Offset_(j-1) - Offset_x + L) / (Offset_y - Offset_x + 1)) % (Slice_b - Slice_a + 1); wherein % indicates a modulo operation that finds the remainder after division of one quantity by another; and
vi. wherein the processor core is further configured to store the data segment D_i in Slice_i in the circular addressing region, starting at Offset_j in depth.
13. The apparatus of claim 12, wherein said data memory comprises an SDRAM.
14. The apparatus of claim 12, wherein said integrated circuit device further comprises a register file, wherein the register file includes circular buffer program registers.
15. The apparatus of claim 12, wherein said processor core is further configured to transfer the data segment D_i for multiplexing.
16. The apparatus of claim 12, wherein said processor core is further configured to perform a computation of the data segment D_i.
17. The apparatus of claim 16, wherein the computation comprises a convolutional operation.
18. The apparatus of claim 16, wherein the apparatus further comprises another memory outside the integrated circuit device, wherein said memory is configured to store output from the computation of the data segment D_i.
19. The apparatus of claim 16, wherein the computation is performed in a deep learning application.
20. The apparatus of claim 12, wherein the apparatus is a mobile handheld device.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
(9) The present principles are directed to memory circular addressing and may be used in applications, such as applications using convolutional operations in machine learning, to reduce the memory access bandwidth by avoiding duplicative caching of overlapping boundary data in neighboring data segments. The following discusses various embodiments in the context of CNN-based applications. However, the present embodiments can be adapted to other machine learning models, and may also be applied to other types of applications on embedded devices, such as image filtering including de-noising, de-blocking, and de-blurring.
(10) An exemplary circular addressing method according to one embodiment of the invention is illustrated in
(12) At step 301, the instruction is Fill SRAM (0, 160 KB). Under this instruction, starting at address 0, the beginning of the SRAM, the first segment of 160 KB data is loaded into the SRAM, ranging from address 0 KB to address 160 KB. At step 302, the instruction is Compute SRAM (0, 160 KB): the system is instructed to compute 160 KB data starting at address 0 KB. At step 303, the instruction is Fill SRAM (160K, 128 KB), which means starting at address 160 KB and loading 128 KB data into the SRAM, thus caching the second segment of 160 KB data between address 128 KB and address 288 KB, including the 32 KB boundary data between address 128 KB and address 160 KB that overlaps with the first segment. At step 304, the instruction is Compute SRAM (128K, 160 KB), which means computing 160 KB data starting at address 128 KB, i.e., the second segment of data ranging from address 128 KB to address 288 KB. At step 305, the instruction is Fill SRAM (288K, 128 KB): the system is instructed to start at address 288 KB and load 128 KB data into the SRAM, thus caching the third segment of 160 KB data between address 256 KB and address 416 KB, including the 32 KB boundary data between address 256 KB and address 288 KB that overlaps with the second segment. At step 306, the instruction is Compute SRAM (256K, 160 KB), which means computing the 160 KB data starting at address 256 KB, i.e., the third segment of data ranging from address 256 KB to address 416 KB.
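The progression of steps 301–306 can be sketched as follows. This is an editorial illustration, not part of the patent: the names fill_and_compute, SEG, STEP, and OVERLAP are hypothetical, with the 160 KB segment length and 32 KB overlap taken from the example above.

```python
OVERLAP = 32 * 1024          # boundary data shared by neighboring segments
SEG = 160 * 1024             # full segment length computed per step
STEP = SEG - OVERLAP         # 128 KB of new data loaded per fill

def fill_and_compute(i):
    """Return (fill_start, fill_len, compute_start) for segment i,
    ignoring wrap-around (steps 301-306 stay below the SRAM size)."""
    compute_start = i * STEP
    if i == 0:
        # Step 301: the full first segment is loaded from address 0.
        return (0, SEG, 0)
    # Later fills skip the overlap already cached by the previous segment.
    return (compute_start + OVERLAP, STEP, compute_start)

# Step 303/304: fill at 160 KB for 128 KB, then compute 160 KB from 128 KB.
print(fill_and_compute(1))
```

Each fill loads only the 128 KB of new data, while each compute still covers a full 160 KB segment by re-using the cached 32 KB overlap.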
(13) At step 307, the instruction is Fill SRAM (416K, 128 KB). Under this instruction, starting at address 416 KB and loading 128 KB data into the SRAM would exceed the size of the SRAM (480 KB). Instead of loading the entire fourth segment, including the boundary data ranging from 384 KB to 416 KB, to a new address, circular addressing divides the fourth data segment into two portions: the first portion of 64 KB is loaded between address 416 KB and address 480 KB, and the second portion of 64 KB is loaded starting at address 0 KB. At step 308, the instruction is Compute SRAM (384K, 160 KB). Under this instruction, computing 160 KB data starting at address 384 KB can be performed by computing the boundary 32 KB data ranging from address 384 KB to 416 KB, the first portion of 64 KB ranging from address 416 KB to address 480 KB, and the second portion of 64 KB ranging from address 0 to address 64 KB. At step 309, the instruction is Fill SRAM (544K, 128 KB). Under this instruction, filling 128 KB data starting at address 544 KB using circular addressing can be performed by loading 128 KB data into the SRAM at the start address of 64 KB, continuous from the last address of the fourth segment, thus caching the fifth segment of 160 KB data between address 32 KB and address 192 KB, including the 32 KB boundary data between address 32 KB and address 64 KB that overlaps with the fourth segment. At step 310, the instruction is Compute SRAM (512K, 160 KB). Computing 160 KB data starting at address 512 KB using circular addressing under this instruction means computing 160 KB data starting at address 32 KB, i.e., the fifth segment of data ranging from address 32 KB to address 192 KB.
(14) The subsequent segments can be cached and computed in the same way: a segment is loaded into the SRAM immediately following the last segment when there is enough storage in the SRAM to cache the entire segment; when loading the segment would exceed the size of the SRAM, it is divided into two portions: one portion is cached in the remaining storage of the SRAM and the other is redirected to the beginning of the SRAM.
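The wrap-around rule described above can be sketched as a small helper; this is an editorial illustration (the name circular_fill and its return format are hypothetical), using the 480 KB SRAM size from the example:

```python
SRAM = 480 * 1024   # size of the circular addressing region in the example

def circular_fill(start, length, size=SRAM):
    """Split a fill of `length` bytes at virtual address `start` into at
    most two physical (addr, len) chunks that wrap at `size`."""
    start %= size                       # map the virtual address into the region
    first = min(length, size - start)   # bytes that fit before the end
    chunks = [(start, first)]
    if length > first:
        # Excess data is redirected to the beginning of the region.
        chunks.append((0, length - first))
    return chunks

# Step 307: Fill SRAM (416K, 128 KB) splits into 64 KB + 64 KB.
print(circular_fill(416 * 1024, 128 * 1024))
# Step 309: Fill SRAM (544K, 128 KB) lands whole at physical address 64 KB.
print(circular_fill(544 * 1024, 128 * 1024))
```

The two printed cases reproduce steps 307 and 309 of the walkthrough: the first fill is split at the 480 KB boundary, while the second wraps entirely to a single chunk starting at 64 KB.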
(15) Loading and computation of the next segment of data can occur after completion of processing of the last segment as shown in
(16) The circular addressing principle according to the aforementioned embodiments of the present invention is illustrated using a one-dimensional (1-D) memory for simplification. The circular addressing principle can also be applied to a two-dimensional (2-D) memory using hardware extension in another embodiment. As shown in
(17) Calculating the address for the next data segment to perform circular addressing using hardware modules according to an embodiment of the invention is further discussed below. Assume the range of circular addressing is predefined to include the memory space between offset_1 (503) and offset_2 (504) of slice_a (501) through slice_b (502). In the depth dimension, the start address is defined as offset_1 and the available depth for circular addressing is defined as (offset_2 - offset_1 + 1). In the width dimension, the start address is defined as slice_a and the available width for circular addressing is defined as (slice_b - slice_a + 1). Assuming the current address is (offset_i, slice_j) and the length of the data segment to be loaded is L, the new depth address can be calculated as offset_1 + (offset_i - offset_1 + L) % (offset_2 - offset_1 + 1). The new width address can be calculated as slice_a + (slice_j - slice_a + (offset_i - offset_1 + L) / (offset_2 - offset_1 + 1)) % (slice_b - slice_a + 1).
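The two update formulas above can be written as a small function. This is a minimal editorial sketch, assuming the subtractions that the text extraction dropped; the division inside the slice formula is taken to be integer division (the number of times the depth wraps), written here with Python's //:

```python
def next_address(slice_j, offset_i, L, slice_a, slice_b, offset_1, offset_2):
    """2-D circular address update: advance (slice_j, offset_i) by a
    segment of length L within the predefined region."""
    depth = offset_2 - offset_1 + 1     # available depth of the region
    width = slice_b - slice_a + 1       # available width of the region
    advanced = offset_i - offset_1 + L  # depth distance from the region start
    new_offset = offset_1 + advanced % depth
    # Each full wrap of the depth moves the address one slice to the right,
    # itself wrapping within slice_a..slice_b.
    new_slice = slice_a + (slice_j - slice_a + advanced // depth) % width
    return new_slice, new_offset

# Region: slices 0..3, offsets 0..99 (depth 100); advancing (0, 90) by 30
# wraps the depth once and moves to the next slice.
print(next_address(0, 90, 30, 0, 3, 0, 99))
```

With a depth of 100, advancing offset 90 by 30 yields offset 20 and carries one slice, matching the modulo and division terms of the formulas.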
(18) These calculations and additional circular addressing instructions can be provided through hardware modules including, for example, an address generator. A simplified diagram of an integrated circuit processor, which includes the address generator according to an embodiment of the present invention, is provided in
(19) After the data segments are cached in slice_a or slice_b using circular addressing as discussed above, the data segments can be transferred to and multiplexed in the MUX module (505) and then processed in the MAC array (506) for convolutional operations (or any other model-based multiplication and summation operations) as also shown in
(22) Assuming the convolutional kernel used in this embodiment has a size of 5×5 with stride 2 (kernels of other sizes can also be used), applying the convolutional kernel to the segments of the input feature would result in 2×640×32 (40 KB) of boundary overlapping data in computing the neighboring segments. Using circular addressing, only the 160 KB non-overlapping data of the next segment needs to be loaded following the overlapping 40 KB boundary data of the last segment. If loading the next segment exceeds the predefined SRAM range for circular addressing, e.g., defined by (slice_id, offset_id), circular addressing is applied to redirect the excess data in that segment to another location within the circular addressing range.
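The 40 KB figure above follows from the stated dimensions; as an editorial check (interpreting the factors as 2 bytes per element, a 640-element row width, and 32 boundary rows — an assumption, since the original product was garbled in extraction):

```python
# Hypothetical interpretation of the 2 x 640 x 32 product: bytes per
# element x row width x overlapping boundary rows between segments.
bytes_per_elem = 2
row_width = 640
boundary_rows = 32

overlap_bytes = bytes_per_elem * row_width * boundary_rows
print(overlap_bytes // 1024, "KB")  # the 40 KB overlap cited in the text
```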
(24) In another embodiment of the present invention, software modules, for example a compiler extension, can be used to provide additional circular addressing instructions in addition to those provided through hardware modules as discussed above. These additional software instructions enable caching segments of data continuously in a virtually unlimited memory space, despite the limited physical size of the SRAM, and computing the cached segments, such as performing convolutional operations.
(25) The software-based instructions to process each data segment start at step 701. At step 702, it is determined whether the size of the remaining available SRAM for caching the input data segment is equal to or greater than the size of the non-overlapping data of the input data segment. If so, the flow goes to step 703 to load the non-overlapping data of the input segment into the SRAM. At step 704, convolutional operations are performed on the cached non-overlapping and overlapping data of the segment. After the completion of the convolutional operations, the processing of this segment ends. On the other hand, if the size of the remaining available SRAM is determined at step 702 to be smaller than the size of the non-overlapping data of the input data segment, the instructions proceed to step 705 to calculate the size of the overlapping data. At step 706, a calculation is performed to determine the size of the remaining available SRAM before overwriting unused data. At step 707, non-overlapping data of the segment equal in size to the remaining available SRAM is loaded into the SRAM, and the excess non-overlapping data of the segment is redirected to another predefined address using circular addressing as discussed above. At step 708, convolutional operations are performed on the overlapping data, the portion of non-overlapping data fitting into the remaining SRAM, and the portion of non-overlapping data cached at the redirected address. At step 709, it is checked whether the entire data segment has been processed. If not, the flow returns to step 705 to cache and compute the remaining data in the data segment before proceeding to the end.
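The branch structure of steps 701–709 can be sketched as follows. This is an editorial sketch, not the patented flow itself: the function name process_segment and the load/compute callbacks are hypothetical stand-ins for the fill and compute instructions discussed earlier.

```python
def process_segment(sram_free, seg_new, overlap, load, compute):
    """Sketch of steps 701-709. `sram_free` is the remaining linear room
    in bytes, `seg_new` the non-overlapping bytes of the segment, and
    `overlap` the overlapping bytes already cached. `load`/`compute`
    are caller-supplied callbacks taking a byte count."""
    if sram_free >= seg_new:
        # Steps 703-704: the whole non-overlapping portion fits linearly.
        load(seg_new)
        compute(overlap + seg_new)
    else:
        # Steps 705-708: fill the remaining room, then redirect the excess
        # to a predefined address via circular addressing.
        load(sram_free)
        load(seg_new - sram_free)
        compute(overlap + seg_new)

calls = []
process_segment(64, 128, 32,
                lambda n: calls.append(("load", n)),
                lambda n: calls.append(("compute", n)))
print(calls)
```

With only 64 units of linear room for 128 units of new data, the sketch issues two loads (64 linear, 64 redirected) followed by one compute over the full 160-unit segment, mirroring the split-and-redirect behavior of step 707.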
(26) Each of the methods disclosed herein comprises one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another and/or combined into a single step without departing from the scope of the claims. In other words, unless a specific order of steps or actions is required for proper operation of the method that is being described, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
(27) It is to be understood that the claims are not limited to the precise configuration and components illustrated above. Various modifications, changes and variations may be made in the arrangement, operation and details of the systems, methods, and apparatus described herein without departing from the scope of the claims.