MEMORY DEVICE WITH PROGRAMMABLE CIRCUITRY
20220100941 · 2022-03-31
Inventors
CPC classification
G11C29/18
PHYSICS
G11C8/04
PHYSICS
G11C2029/1806
PHYSICS
G11C8/16
PHYSICS
G11C29/10
PHYSICS
G06F12/0284
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
G11C7/1006
PHYSICS
G11C7/1012
PHYSICS
International classification
Abstract
The present disclosure relates to a memory device comprising a memory array and a periphery circuitry configured to read data from and/or write data to the memory array, wherein the periphery circuitry comprises a programmable circuitry causing the memory device to access data stored in the memory array in accordance with manifest loop instructions. The programmable circuitry comprises a control logic configured to control the operation of the periphery circuitry in accordance with a set of parameters derived from the manifest loop instructions. The present disclosure further relates to a method for controlling the operation of a memory device and to a processing system comprising the memory device.
Claims
1. A memory device, comprising: a memory array; and a periphery circuitry configured to access the memory array, wherein the periphery circuitry comprises a programmable circuitry configured to cause the memory device to access data stored in the memory array in accordance with manifest loop instructions, the programmable circuitry comprising a control logic configured to control the operation of the periphery circuitry in accordance with a set of parameters derived from the manifest loop instructions.
2. The memory device according to claim 1, wherein the programmable circuitry is configured, during operation, to receive the manifest loop instructions from an instruction set processor and to derive the set of parameters therefrom.
3. The memory device according to claim 1, wherein the set of parameters comprises a memory address pattern (U), a stride (S), and a number of loop iterations (L).
4. The memory device according to claim 3, wherein the programmable circuitry comprises a logic circuitry configured to perform a logical shift operation and to store the memory address pattern (U) and wherein the control logic is configured to shift the stored memory address pattern (U) in the logic circuitry in accordance with the stride parameter (S).
5. The memory device according to claim 4, wherein the logic circuitry comprises one or more rotating shift registers or one or more chains of shift registers.
6. The memory device according to claim 5, wherein the one or more rotating shift registers or the one or more chains of shift registers are hierarchically stacked and wherein the control logic is configured to control the shift registers such that the data stored in the memory array is processed in accordance with the manifest loop instructions.
7. The memory device according to claim 1, wherein the periphery circuitry is arranged to select one or more groups of consecutive rows of the memory array in accordance with the set of parameters.
8. The memory device according to claim 1, wherein the periphery circuitry is arranged to select one or more groups of consecutive columns of the memory array in accordance with the set of parameters.
9. The memory device according to claim 3, wherein the periphery circuitry comprises an output logic arranged to select columns of the memory array, and wherein the programmable circuitry is configured to control the operation of the output logic in accordance with the manifest loop instructions.
10. The memory device according to claim 3, wherein the periphery circuitry further comprises an input logic arranged to select rows of the memory array, and wherein the programmable circuitry is configured to control the operation of the input logic in accordance with the manifest loop instructions.
11. The memory device according to claim 9, wherein the programmable circuitry is configured, during operation, to derive a set of parameters for the input logic and the output logic respectively.
12. The memory device according to claim 11, wherein the memory address pattern (U) comprises a column selection pattern (Uc) and a row selection pattern (Ur).
13. The memory device according to claim 1, wherein the memory array is a two- or a higher-dimensional array.
14. The memory device according to claim 1, wherein the memory array comprises a nonvolatile memory array.
15. The memory device according to claim 1, wherein the memory array and the programmable circuitry are integrated on the same chip.
16. A method for controlling the operation of a memory device according to claim 1, the method comprising: obtaining manifest loop instructions; deriving a set of parameters based on the manifest loop instructions; and controlling the operation of the memory device in accordance with the set of parameters.
17. A processing system comprising a memory device according to claim 1.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0040] Some example embodiments will now be described with reference to the accompanying drawings.
DETAILED DESCRIPTION OF CERTAIN ILLUSTRATIVE EMBODIMENTS
[0051] In-memory processing is an emerging technology aiming at processing data where it is stored, in memory. In-memory processing is based on the integration of storage and computation, e.g., on the same chip, where the computing result is produced within the memory array, which is designed to perform a specific operation such as a bit-wise logical operation, a Boolean vector-matrix multiplication or an analog vector-matrix multiplication. Typically, the memory device is a non-volatile memory (NVM), such as a resistive random access memory (resistive RAM), e.g., oxide-based RAM (OxRAM), a phase change memory (PCM), a spin-transfer torque magnetic RAM (STT-MRAM), or a spin-orbit torque magnetic RAM (SOT-MRAM). By processing the data stored in the memory, the time-consuming and energy-expensive data transfers to and from the main memory are limited to a minimum. Eliminating this overhead allows the data to be processed in real time. However, in-memory processing requires a costly memory array redesign, leading to high non-recurring engineering (NRE) costs, thus making it less practical for many use cases. Aspects of the disclosure, without limitation, may be directed to in-memory processing.
[0052] A standard memory accesses a single word per memory address, which is a serious bottleneck for high-speed applications, such as image processing or video streaming. This bottleneck may be partially alleviated by making the word size very large, such that multiple words can be written or read in a single clock cycle as one large word. However, if only a few bits of such a large word need to be accessed, the memory will still access all bits within the large word, thereby wasting a lot of energy on the unused bits.
[0053] The present disclosure describes a solution according to which the periphery circuitry of the memory macro is enhanced with a programmable circuitry while keeping the memory array unchanged. The programmable circuitry allows parallel addressing of the data stored in one or more parts or segments of the memory array in a single memory cycle. The addressed parts or segments may not necessarily be consecutive or neighbouring. This allows speeding up the memory access and reducing the power at the same time by accessing only the required bits. The programmable circuitry is configured to receive instructions such as manifest loop instructions and even manifest loop nest instructions from an instruction set processor which are used to program the programmable circuitry. Once the programmable circuitry is programmed, the programmable circuitry autonomously accesses and optionally processes the stored data by controlling the reads or writes of bits in accordance with the received instructions. This eliminates the need for the instruction set processor to generate and send each memory address over the memory bus.
[0054] The memory macro and the implementation of the programmable circuitry according to the present disclosure will be explained in more detail below with reference to
[0056] The parameters characterizing the memory array may thus be summarized as follows:
[0057] number of address bits, A
[0058] number of data bits, D
[0059] number of segments, G
[0060] number of word addresses, X, with X=2^A
[0061] number of rows, R, with R=int(X/G)
[0062] number of columns, C, with C=D*G
[0063] number of words, W, with W=R*G (<X)
[0064] number of bits, B, with B=R*C.
[0065] Typically, the parameters A, D, and G characterizing the memory array are derived based on the application requirements. To do so, the source code of the application may be profiled by a code profiler to derive these parameters. Once the parameters A, D, and G are derived, the other parameters characterizing the memory array, e.g., R, C, B, W and X, are derived therefrom as detailed above.
[0066] The periphery circuitry 200 comprises a row and a column logic, also commonly referred to as an input logic 210 and an output logic 240, which respectively drive the word-lines 211 and bit-lines 231 of the memory array to control how data is written into or read from the memory array.
[0067] The programmable circuitry 250, which may include a field programmable gate array (FPGA), comprises a set of programmable registers and a control logic (not shown in the figure). Referring to
[0068] The manifest loop instructions comprise instructions implementing a loop structure indicating how and in what order data should be processed and the loop iterator based conditions to be satisfied. For instance:
[0069] FOR n=1 TO N
[0070]   FOR m=1 TO M
[0071]     A[n,m] = A[n,m−1] + B[n−1,m];
[0072]     IF (n>m)
[0073]       C[n,m] = A[n,m] * B[n−1,m];
[0074] In this example, the loop structure comprises two “for” loops nested in one another. Such a loop structure may be represented by a set of manifest loop instructions because the loop conditions are data-independent, e.g., the loop structure only contains the loop iterators n and m. The strides in this case are an increment of 1 for both n and m. Integer increments higher than 1 are also possible; for example, an increment of 2 corresponds to a stride of 2. For a loop structure to be representable by manifest loop instructions, the increment value of the loop iterators should be a constant and the loop conditions may not contain data-dependent conditions such as if (A[n,m]>0).
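The manifest property of such a loop nest can be illustrated with a short Python sketch (illustrative only; the function name and the tuple encoding of accesses are assumptions, not part of the disclosure). Because the control flow depends only on the loop iterators and constants, the complete access sequence can be enumerated before any data is touched:

```python
# Illustrative model of a manifest loop nest: control flow depends only on
# the loop iterators n, m, so all accesses are statically enumerable.
def manifest_accesses(N, M, stride_n=1, stride_m=1):
    accesses = []
    for n in range(1, N + 1, stride_n):
        for m in range(1, M + 1, stride_m):
            accesses.append(("A", n, m))  # A[n,m] is accessed every iteration
            if n > m:                     # iterator-only condition: still manifest
                accesses.append(("C", n, m))
    return accesses

# For N=M=2 the sequence is fixed regardless of the stored data values.
seq = manifest_accesses(2, 2)
```

A data-dependent condition such as `if A[n, m] > 0` could not be enumerated this way, which is exactly why it disqualifies a loop from being manifest.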
[0075] The instructions may include one or more logic operations, such as logical “AND” and “OR” operations, or one or more arithmetic operations, such as addition, subtraction, multiplication, division. Further, the manifest loop instructions may comprise a stride operation that defines how the data is indexed within the loop structure.
[0076] The memory array is loaded with input data (DIN) via the input data terminal 241. The data may then be processed according to the manifest loop instructions as follows.
[0077] At a first step 410, the programmable circuitry 250 receives the manifest loop instructions via a program interface 251 as well as the control signals EN, CLK, RST, RW, PL and TF. The manifest instructions are in the form of lower-level instructions understandable by the control logic, such as assign, select and shift operations. These instructions may be provided by an instruction set processor or a similar processor, for example. In a following step 420, the programmable circuitry 250 processes the instructions, e.g., the programmable circuitry 250 translates the manifest loop instructions to derive a set of parameters characterizing the loop structure. The parameters include:
[0078] a memory address pattern, U;
[0079] a start memory address, P;
[0080] a stride, S; and
[0081] a number of loop iterations, L.
[0082] The loop iteration parameter L indicates the number of times the manifest loop instructions are to be executed for the corresponding loop iterator, while the other parameters indicate which data is to be processed. The memory address pattern (U) indicates the bit cells to be selected or addressed; in other words, it indicates the sequence of bit cells to which data is written or from which data is read. The start memory address (P) indicates the start location in the memory from which the bit cells are to be selected, e.g., from which row and column of the memory array the bit cells indicated by the memory address pattern (U) are to be selected. The memory address pattern (U) and the start memory address (P) together form the memory address. The stride (S) parameter indicates how to derive the next memory address for a subsequent value of the loop iterator.
[0083] Typically, the first index in the memory address pattern U indicates the start memory address (P). However, in cases where the rows and columns of the memory array are very long it may be useful to provide the start memory address as a separate parameter (P) to maintain the memory address pattern relatively short. Optionally, the resulting start memory address may be masked using a mask pattern, M.
[0084] However, in case the number of bits of the start memory address (P) is very large (e.g., 2048 bits) and the width of the program interface 251 is relatively small (e.g., 32 bits), then 64 cycles will be required for the instruction set processor to program the starting pattern and another 64 for the mask pattern. This is a large programming overhead if only a small section of the memory array is to be addressed. In such cases, the instruction set processor may issue a reset command (RST) 252 to instruct the programmable circuitry 250 to reset the pattern parameters with a single command. The instruction set processor may then program the relevant pattern sections in only a few additional commands. Alternatively, this programming functionality may be performed directly with the reset command (RST) itself, albeit at the cost of more hardware.
[0085] Depending on the manifest loop instructions, the programmable circuitry may need to control the operation of the input logic, the output logic, or both, independently. In the latter case, the memory address pattern U will comprise a row selection pattern (Ur) indicating the pattern of word-lines to be addressed, and a column selection pattern (Uc) indicating the pattern of bit-lines to be addressed, and the parameter P will comprise an initial column (Pc) and an initial row (Pr). Similarly, the parameters L and S may respectively comprise different values for the rows and columns. Likewise, the mask pattern M may be different for the rows and columns.
[0086] The memory address is thus decoded into a row pointer and a column pointer, which effectively select one or more groups of consecutive bits from the memory array in accordance with the values of the derived parameters, for example:
[0087] L=10
[0088] Uc=10100011 // column address pattern
[0089] Pc=start_col // start column location
[0090] Sc=2 // address every second column
[0091] Ur=00000010 // row address pattern
[0092] Pr=start_row // start row location
[0093] Sr=2 // address every second row
[0094] In other words, the memory bits are accessed by a respective unique combination of a row and a column number in the case of a two-dimensional array or by a unique combination of a row, column and a page number in a three-dimensional array, just as two or three orthogonal planes define a point in two- or three-dimensional space.
[0095] Still referring to
[0096] The size of the registers U, P, S, and L may be determined based on the size of the register storing the G parameter. In practice, their size in bits may be chosen to be at least equal to log2(G). For example, if the number of segments G in the memory array is 256, e.g., G=256, the size of the registers U, P, S, and L may be 8 bits.
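As a small illustration of this sizing rule (the helper name is an assumption, not part of the disclosure), the minimum register width follows from the base-2 logarithm of G:

```python
import math

def register_width(G):
    # Minimum width in bits needed to index G segments: ceil(log2(G)).
    return math.ceil(math.log2(G))

# For G=256 segments this gives 8-bit registers, as in the example above.
```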
[0097] In a final step 440, the control logic of the programmable circuitry controls the operation of input and output logic, e.g., its input and output drivers, of the memory macro in accordance with the lower-level instructions and the derived parameters.
[0098] For the example illustrated above, in the first loop iteration, the column pointers will select columns 1, 3, 7 and 8 starting from Pc=start_col, whose value is derived from the manifest loop instructions and the physical memory locations that have been decided by the linker. The column pointers will thus select columns start_col+1, start_col+3, start_col+7 and start_col+8, e.g., the positions in Uc where a “1” is present. In the next iteration, because the stride parameter Sc is set to 2, the column pointers will again select columns 1, 3, 7 and 8 but this time starting from start_col+3, e.g., they will access columns start_col+3, start_col+5, start_col+9 and start_col+10. Because parallel access is enabled, e.g., PL=1, at each iteration four columns are read or written concurrently until the loop condition is satisfied. In this example, the loop condition is satisfied when 10 iterations are completed, e.g., L=10. Similarly, in the first loop iteration, the row pointers will select row 7 starting from Pr=start_row. In the next iteration the row pointers will select row 9, and so on. By doing so, the data stored in the memory is accessed in accordance with the manifest loop instructions. Once the manifest instructions have been executed, e.g., the memory array addressing is completed, the programmable circuitry 250 issues an end of cycle notification (TF) 255 to notify the instruction set processor of the completion of the data processing and outputs the processed data (DOUT) at the data output 243 of the memory macro.
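The column selection in the walkthrough above can be reproduced with a minimal Python model (illustrative only; the function name and the 1-indexed, left-to-right reading of Uc are assumptions based on the example):

```python
def columns_for_iteration(Uc, start_col, Sc, it):
    # Positions in Uc are read 1-indexed, left to right, as in the example.
    offsets = [i + 1 for i, bit in enumerate(Uc) if bit == "1"]
    # Each iteration shifts the whole selection by the column stride Sc.
    return [start_col + off + it * Sc for off in offsets]

# With Uc=10100011 and Sc=2: iteration 0 selects start_col+{1,3,7,8},
# iteration 1 selects start_col+{3,5,9,10}, and so on for L iterations.
```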
[0102] In case the manifest instructions comprise instructions of nested loops, the programmable circuitry may be configured to support several hierarchically stacked row and column pointers or hierarchically stacked shift registers. The lower-layer pointers, e.g., the row and column pointers of the innermost loop structure, then control how the row and column pointers of the outermost loop structure shift at every loop iteration. For example, in the case of two or more nested “for” loops, the rotating shift register corresponding to the inner loop starts again at the initial position, e.g., the start memory address, when the end of the loop, indicated by L, is reached. At that time, the outer loop iterator is also shifted by the value of the stride parameter of that outer loop.
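A minimal Python sketch of this hierarchy (illustrative; the function name and the modulo-size rotation are assumptions) traces the (outer, inner) pointer positions for two nested loops, advancing the outer pointer by its stride each time the inner loop wraps:

```python
def nested_pointer_trace(P_in, S_in, L_in, P_out, S_out, L_out, size):
    # Trace of (outer, inner) pointer positions for a two-level loop nest.
    # All positions wrap modulo the register size ("rotation").
    trace = []
    outer = P_out
    for _ in range(L_out):
        inner = P_in                       # inner register restarts at its init
        for _ in range(L_in):
            trace.append((outer % size, inner % size))
            inner += S_in                  # inner loop: shift by its stride
        outer += S_out                     # inner loop finished: shift outer once
    return trace
```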
[0104] The memory array may be two or three dimensional. A three-dimensional array may be formed by stacking several two-dimensional arrays on top of each other. A single data word may then be accessed by a unique combination of a row, column, and a page number, just as three orthogonal planes define a point in three-dimensional space. Further, the memory array may be extended to have an even higher dimensional memory organization. Mathematically, a memory array may be compared to a K-dimensional space, in which a point, e.g., a memory address, is defined by K crossing orthogonal hyperplanes with each hyperplane having a dimension of K-1. The point coordinates are therefore a vector in K-dimensional space.
[0105] For simultaneously accessing multiple rows, columns or pages, the concept of an address pattern is employed as detailed above. Instead of a single coordinate that defines the position of one hyperplane, address patterns of 0's and 1's are defined for each respective dimension of the memory array. The position of a 1 determines that the corresponding memory hyperplane is selected. Note that in a conventional memory, each pattern would contain a single 1, while all other entries are 0; this is commonly referred to as “one-hot”. As described above, the pattern length for a respective dimension of the memory array may correspond with the number of available hyperplanes in that dimension. Thus, the address patterns may also be referred to as hyperplane patterns.
[0107] Herein, the starting memory address pattern U defines the initial string of 1's and 0's in the first memory cycle. For the next iteration or cycle, the U pattern is shifted by the stride value S. The stride value may be positive or negative, or even zero. This shifting process is repeated the number of times given by the iteration count L. The resulting pattern is masked with the optional mask pattern M to finally form the output pattern, Q, as shown in
[0108] When the stride value would cause a shift outside the available pattern range, the actual shift distance should be taken modulo-N, where N is the number of pattern bits. Bits that are shifted out at one side will be shifted in at the other side (e.g., “rotation”). The mask pattern can be used to confine the dynamic hyperplane activations to a static selection.
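One iteration step of this rotate-and-mask scheme can be sketched in Python (illustrative only; the function name, and the convention that a positive stride rotates the pattern to the right, are assumptions not fixed by the text):

```python
def next_pattern(U, S, M=None):
    # One iteration step: rotate pattern U by stride S modulo the pattern
    # length (bits shifted out one side re-enter on the other), then AND
    # with the optional mask M to form the output pattern Q.
    n = len(U)
    s = S % n                               # out-of-range strides wrap around
    rotated = U[-s:] + U[:-s] if s else U   # right rotation for positive stride
    if M is None:
        return rotated
    return "".join("1" if a == "1" and b == "1" else "0"
                   for a, b in zip(rotated, M))
```

The mask simply confines the rotating (dynamic) selection to a static subset of hyperplanes, as stated above.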
[0109] Because the stride value S can have any value, the hardware implementation of a pattern shifting register can become quite complex, e.g., every register bit needs a very large input multiplexer. In the case of a 32-bit register, the range of the stride value may be [−31:+31]. The stride value can select any register bit as the next value. In this case, 32 multiplexers are needed for selecting any one of the register bits forming a 32-bit left/right shift register multiplexer. The hardware complexity may be reduced by limiting the range of the stride value, S. For example, if the stride value is limited to the range [−3:+3], the number of the required multiplexer inputs will be limited to only 7, which drastically reduces the hardware complexity. This clearly shows there is a trade-off between shifting flexibility and hardware implementation complexity, which must be determined at design time.
[0111] In practice, the number of hyperplane pattern bits may be larger than the number of data words that are supported by the memory data port (e.g., the I/O). That sets a limit to the number of pattern bits that can be 1 simultaneously. For instance, if the memory word size is 32 bits and the memory data port is 128 bits, then at most 128/32=4 words can be accessed at the same time. This means that the number of pattern bits that are simultaneously 1 should be limited to 4, as there is simply no physical bandwidth to push more than 4 data words in and out.
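The bound above is a simple integer division; a one-line Python helper (the name is an assumption) captures it:

```python
def max_parallel_words(port_bits, word_bits):
    # Upper bound on pattern bits that may be '1' at once: the data port
    # can only move this many whole words per cycle.
    return port_bits // word_bits

# A 128-bit port with 32-bit words allows at most 4 simultaneous words.
```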
[0112] The assignment of write data to, and the selection of read data from, the enabled hyperplanes (e.g., columns in a two-dimensional memory) is done by special assign and select units. In conventional memories with single-bit patterns, data assignment is simply done by enabling the bit line driver of the selected column, whereas data post-selection is implemented with a (large) multiplexer. Herein, however, the assign and select are more complicated, as the selection of multiple hyperplanes is enabled. More particularly, the input data words are now assigned one by one to the currently enabled memory hyperplanes, whereas the output data are multiplexed into the output data word in the order in which they appear in the hyperplane pattern. Note that excess words will be rejected, and absent words will be filled with zeros. The complexity of this multiplexer is comparable to that of the left/right shift register multiplexer with a limited number of inputs as described above.
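The assign and select behaviour described above can be modelled in Python (illustrative sketches only; the `_model` names and the use of 0 as an "empty" word are assumptions, not the disclosed hardware):

```python
def aca_assign_model(pattern, words):
    # Assign input words one by one to the enabled pattern positions.
    # Excess input words are rejected; enabled positions beyond the
    # available inputs receive zero words.
    out = [0] * len(pattern)
    it = iter(words)
    for i, bit in enumerate(pattern):
        if bit == "1":
            out[i] = next(it, 0)
    return out

def aca_select_model(pattern, slots, n):
    # Select words from enabled positions in pattern order, then fit them
    # into n output words: surplus words rejected, missing ones zero-filled
    # with the selected words right-aligned.
    picked = [slots[i] for i, bit in enumerate(pattern) if bit == "1"][:n]
    return [0] * (n - len(picked)) + picked
```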
[0113] Further, to enable an arbitrary selection of the memory columns to be accessed simultaneously, each column of the memory array must have its own write driver and read sense amplifier. In current memory technologies, however, only a single row can be enabled at any time, but it is envisaged that future memory technologies may enable writing and/or reading multiple memory cells on the same column. Enabling an arbitrary selection of the memory rows as well will of course considerably complicate the design of the memory cells, bit line drivers and sense amplifiers. In such a case, a multi-valued logic may be used.
[0114] To assess the effect of the proposed memory macro, a test case is shown below comparing a conventional memory macro stt_memory, with single-row, single-segment access, to the proposed memory macro stt_acamem, with single-row, multiple-segment access.
[0115] Test benches have been used to generate random input data to write to the specified words in memory. The same memory locations are then read out again and compared with the expected data. For both writing and reading, the dissipated energy is monitored and the integrated totals are reported when the simulation finishes.
[0116] The energy and power parameters are read from an input file. Also, the array word locations are read from a file.
[0117] For the test, a rectangle [(x1,y1)...(x2,y2)] with (x1,y1)=(21,19), (x2,y2)=(31,23) in a memory array of size [G,R] is accessed. The parameters defining the memory array are set as follows: D=8, G=64, R=64.
[0118] The code block stt_memory implements a conventional memory macro and includes the code blocks for the memory array model and the conventional row and col decoders, respectively. The design hierarchy for such a conventional memory is as follows:
[0119] stt_memory—classical memory interface
[0120]   stt_rowdec—row decoder
[0121]   stt_coldec—column decoder
[0122] The code block stt_acamem implements the proposed memory macro and includes the code blocks for the memory array model and the programmable circuitry, respectively. The design hierarchy is as follows:
[0123] stt_acamem—MRAM memory with ACA wrapper
[0124]   aca_assign—implements a data write into the relevant segments of the memory array
[0125]   aca_select—implements a data read from the relevant segments of the memory array
[0126]   aca_rowcol—implements the row and column decoders -> WL, DL
[0127]   aca_shireg—implements programmable shift or rotation registers
[0128]   stt_analog—implements the memory array and the analog driver/sense amplifier periphery
[0129] The memory matrix contains R rows and G word column groups. One column group, e.g., a segment, accesses a word of D bits simultaneously. Multiple word columns may be accessed at the same time. The number of input and output words that can be supplied or retrieved in one clock cycle is defined by the parameter N. The generics SR and SC define the bit widths of the row and column parameter registers (starting point, stride and loop size), respectively.
[0130] The memory interface of the proposed memory macro is as follows:
TABLE-US-00001
entity stt_acamem is
  generic (R, G, SR, SC, N, D: natural);
  port (
    RST:  in  std_logic;                        -- asynchronous reset
    CK:   in  std_logic;                        -- clock
    EN:   in  std_logic;                        -- enable
    RW:   in  std_logic;                        -- read (0) / write (1)
    Ur:   in  std_logic_vector(R-1 downto 0);   -- row init register
    Pr:   in  std_logic_vector(SR-1 downto 0);  -- row starting point
    Sr:   in  std_logic_vector(SR-1 downto 0);  -- row stride
    Lr:   in  std_logic_vector(SR-1 downto 0);  -- row loop size
    Uc:   in  std_logic_vector(G-1 downto 0);   -- column init register
    Pc:   in  std_logic_vector(SC-1 downto 0);  -- column starting point
    Sc:   in  std_logic_vector(SC-1 downto 0);  -- column stride
    Lc:   in  std_logic_vector(SC-1 downto 0);  -- column loop size
    PL:   in  std_logic;                        -- parallel load
    TF:   out std_logic;                        -- end cycle flag
    DIN:  in  std_logic_vector(N*D-1 downto 0); -- input data
    DOUT: out std_logic_vector(N*D-1 downto 0)  -- output data
  );
end;
[0131] Compared to the classical memory interface stt_memory, which has a single address port, the interface herein has 11 new ports. The aca_assign block takes a number of words from an input word array and assigns them to the output word array according to the bit values in a pointer array. The output array can be longer than the input array.
[0132] The input words DIN may be right-aligned. Other alignments are also possible. The number of assigned output words DOUT can be less than, equal to, or larger than the available number of input words. In the latter case, the associated output words are filled with zeros.
[0133] The interface of the aca_assign code block is as follows:
TABLE-US-00002
entity aca_assign is
  generic (D, G, N: natural);
  port (
    EN:   in  std_logic;                        -- enable
    PT:   in  std_logic_vector(G-1 downto 0);   -- assignment pointer bits
    DIN:  in  std_logic_vector(N*D-1 downto 0); -- input data array
    DOUT: out std_logic_vector(G*D-1 downto 0)  -- output data array
  );
end;
[0134] The enable signal EN enables the memory array for reading or writing, and the memory address pattern PT defines which rows or columns of the memory array are to be selected. The interface is organized to implement either a read or a write, as defined by the value of the signal RW.
[0135] The aca_select instruction selects a number of words from an input word array according to the bit values in a pointer array and concatenates them in the output word array. The output words DOUT are, for instance, right-aligned into the output array. The number of selected input words DIN can be less than, equal to, or larger than the available number of output words. In the first case, empty output words are filled with zeros; in the latter case, the surplus input words are rejected. The interface of the aca_select instruction is as follows:
TABLE-US-00003
entity aca_select is
  generic (D, G, N: natural);
  port (
    EN:   in  std_logic;                        -- enable
    PT:   in  std_logic_vector(G-1 downto 0);   -- selection pointer bits
    DIN:  in  std_logic_vector(G*D-1 downto 0); -- input data array
    DOUT: out std_logic_vector(N*D-1 downto 0)  -- output data array
  );
end;
[0136] wherein N indicates the number of words to be written, G indicates the number of word column groups and D the number of bits in a word. Similarly to above, the interface is organized to implement either a read or a write, as defined by the value of the signal RW.
[0137] The code block aca_shireg implements the shift and rotation logic of the programmable circuitry 250 capable of reading multiple groups of consecutive bit cells simultaneously.
[0138] The interface of the aca_shireg block is as follows:
TABLE-US-00004
entity aca_shireg is
  generic (P, S: natural);
  port (
    RST: in  std_logic;                       -- asynchronous reset
    CLK: in  std_logic;                       -- shift clock
    EN:  in  std_logic;                       -- enable clock
    EC:  in  std_logic;                       -- enable counting
    PL:  in  std_logic;                       -- parallel load (init)
    PD:  in  std_logic_vector(S-1 downto 0);  -- starting point
    SD:  in  std_logic_vector(S-1 downto 0);  -- shift distance
    LD:  in  std_logic_vector(S-1 downto 0);  -- repetition count
    DI:  in  std_logic_vector(P-1 downto 0);  -- parallel load data
    QO:  out std_logic_vector(P-1 downto 0);  -- output data
    TC:  out std_logic                        -- terminal count
  );
end;
[0139] The code block aca_rowcol implements the row and column programmable logic of the programmable circuitry 250. It contains two aca_shireg blocks, one for each row and column pointer generation. The interface of the aca_rowcol code block is as follows:
TABLE-US-00005
entity aca_rowcol is
  generic (R, G, SR, SC: natural);
  port (
    RST: in  std_logic;                       -- asynchronous reset
    CLK: in  std_logic;                       -- shift clock
    EN:  in  std_logic;                       -- enable
    PL:  in  std_logic;                       -- parallel load
    Pr:  in  std_logic_vector(SR-1 downto 0); -- row starting point
    Sr:  in  std_logic_vector(SR-1 downto 0); -- row stride
    Lr:  in  std_logic_vector(SR-1 downto 0); -- row loop size
    Pc:  in  std_logic_vector(SC-1 downto 0); -- column starting point
    Sc:  in  std_logic_vector(SC-1 downto 0); -- column stride
    Lc:  in  std_logic_vector(SC-1 downto 0); -- column loop size
    RD:  in  std_logic_vector(R-1 downto 0);  -- row parallel load data
    CD:  in  std_logic_vector(G-1 downto 0);  -- column parallel load data
    RO:  out std_logic_vector(R-1 downto 0);  -- row output data
    CO:  out std_logic_vector(G-1 downto 0);  -- column output data
    TF:  out std_logic                        -- end of cycle flag
  );
end;
[0140] In this illustration, all ACA-related inputs are initialized to zero, except the parallel load signal PL, which is set to ‘1’ to enable parallel load. In addition, the RST signal must now be taken into account.
[0141] For the conventional stt_memory, the memory addresses are generated in the testbench as follows:
TABLE-US-00006
for y in y1 to y2 loop
  for x in x1 to x2 loop
    address := G*y+x;
    .....
  end loop;
end loop;
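The nested testbench loop maps directly to a short Python sketch (the helper name rectangle_addresses is illustrative only): for a memory with G columns per row, the linear address of element (x, y) inside the rectangle (x1, y1)-(x2, y2) is G*y + x.

```python
def rectangle_addresses(x1, y1, x2, y2, g):
    """Generate the linear addresses the conventional testbench walks
    for the rectangle (x1, y1)-(x2, y2) in a memory with g columns."""
    for y in range(y1, y2 + 1):        # row loop
        for x in range(x1, x2 + 1):    # column loop
            yield g * y + x
```

Assuming the simulated rectangle coordinates 21 19 31 23 are ordered (x1, y1, x2, y2), this walk covers (31-21+1) * (23-19+1) = 55 addresses, starting at 64*19+21 = 1237 for G=64.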
[0142] The memory addresses for the stt_acamem are derived from the values of the parameters stored in the registers.
[0143] The simulation results are as follows:
[0144] run_stt_memory -a 12 -d 8 -g 64 -v ori
[0145] # A=12, D=8, G=64: W=4096, B=32768, R=64, C=512
[0146] # Simulation finished successfully at 2.99712 us
[0147] # Rectangle coordinate values: 21 19 31 23
[0148] # Primitive energy values [fJ]: 1.000 1.000 1.000 1.000 2.000 3.000
[0149] # Number of energy sinks: 1029
[0150] # Write time: 1471.80 ns
[0151] # Read time: 1525.32 ns
[0152] # Total write energy: 354.622 pJ Total write power: 240.944 uW
[0153] # Total read energy: 11.480 pJ Total read power: 7.526 uW
[0154] The same test was performed on the proposed memory stt_acamem, which is defined to have the same size as the conventional memory array. The parameters for the row and column registers of the programmable circuitry have been pre-set as follows:
[0155] Ur<=(others=>‘0’);
[0156] Ur(y1)<=‘1’;
[0157] Pr<=y1;
[0158] Sr<=1;
[0159] Lr<=y2-y1+1;
[0160] Uc<=(others=>‘0’);
[0161] for i in x1 to x1+N-1 loop Uc(i)<=‘1’; end loop;
[0162] Pc<=x1;
[0163] Sc<=N;
[0164] Lc<=x2-x1+1;
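The parameter settings above can be cross-checked against the conventional address walk with a short consistency sketch. Assuming Pc and Sc denote the column starting point and stride of the aca_rowcol interface, and taking the simulated rectangle (x1, y1)-(x2, y2) = (21, 19)-(31, 23) with external word size N=1 and G=64 columns per row, the pointer sequences reproduce exactly the addresses G*y+x generated by the conventional testbench loop.

```python
# Simulation constants (from the run_stt_memory output above).
G = 64
y1, y2, x1, x2, n = 19, 23, 21, 31, 1

rows = range(y1, y2 + 1)      # Pr=y1, Sr=1, Lr=y2-y1+1
cols = range(x1, x2 + 1, n)   # Pc=x1, Sc=N, Lc=x2-x1+1

# Addresses implied by the programmable-circuitry parameters...
aca_addresses = [G * y + x for y in rows for x in cols]
# ...and by the conventional nested testbench loop.
conventional = [G * y + x
                for y in range(y1, y2 + 1)
                for x in range(x1, x2 + 1)]
```

For N=1 the two lists are identical, covering all 55 elements of the rectangle.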
[0165] The programmable circuitry then generates the row and segment pointers in accordance with the parameters above. The simulation results are as follows:
[0166] run_stt_acamem -n 1 -d 8 -g 64 -r 64
[0167] # N=1, D=8, G=64: W=4096, B=32768, R=64, C=512
[0168] # Simulation finished successfully at 2.981064 us
[0169] # Rectangle coordinate values: 21 19 31 23
[0170] # Primitive energy values [fJ]: 1.000 1.000 1.000 1.000 2.000 3.000
[0171] # Number of energy sinks: 1029
[0172] # Write time: 1447.716 ns
[0173] # Read time: 1471.80 ns
[0174] # Total write energy: 352.982 pJ Total write power: 243.820 uW
[0175] # Total read energy: 7.925 pJ Total read power: 5.384 uW
[0176] As can be seen, the write energy is about equal, because it is dominated by the MTJ write current. However, the read energy is down from 11.48 to 7.93 pJ, i.e., a reduction of about 30% is achieved. This is due to the reduced address decoding.
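The quoted saving follows directly from the two simulation results (11.480 pJ conventional versus 7.925 pJ proposed):

```python
# Read-energy reduction from the two simulation runs above.
conventional_pj = 11.480   # conventional stt_memory total read energy
proposed_pj = 7.925        # stt_acamem total read energy (N=1)
reduction = (conventional_pj - proposed_pj) / conventional_pj
print(f"read-energy reduction: {reduction:.0%}")  # about 31%
```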
[0177] In practice, most applications are read-dominated, due to the abundant “data reuse” present in popular matrix, neural network, image, and video kernels. Thus, a gain in read energy is of high interest for the majority of realistic applications.
[0178] If a larger external word size (e.g., N=4) is used, the following results are obtained:
[0179] run_stt_acamem -n 4 -d 8 -g 64 -r 64
[0180] # N=4, D=8, G=64: W=4096, B=32768, R=64, C=512
[0181] # Simulation finished successfully at 0.840264 us
[0182] # Rectangle coordinate values: 21 19 31 23
[0183] # Primitive energy values [fJ]: 1.000 1.000 1.000 1.000 2.000 3.000
[0184] # Number of energy sinks: 1029
[0185] # Write time: 377.32 ns
[0186] # Read time: 401.40 ns
[0187] # Total write energy: 387.745 pJ Total write power: 1027.639 uW
[0188] # Total read energy: 4.821 pJ Total read power: 12.010 uW
[0189] As can be seen, the energy needed to retrieve the same data from the proposed memory macro is only 4.82 pJ, i.e., a reduction of almost 60% is achieved. The read power increased, however, because the data are retrieved in a 4x shorter time. The above results indicate the energy savings inside the memory macro only. Thus, on top of these results, there are additional savings in the conventional processor units or DMA engines generating the addresses, and in the buses transporting the address bits. This is because many sequential accesses, and their corresponding address instruction generation, are now replaced by a large amount of concurrency with the parallel accesses/loads. Hence, the number of bus activations and address instruction generation cycles is significantly reduced.
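The N=4 figures can likewise be cross-checked from the reported numbers: the read energy drops by almost 60% relative to the conventional memory, while the read power rises because the same data is retrieved in a roughly 3.7x shorter read time (1471.80 ns versus 401.40 ns).

```python
# Cross-check of the N=4 simulation figures reported above.
conventional_energy_pj = 11.480   # conventional stt_memory read energy
proposed_energy_pj = 4.821        # stt_acamem read energy at N=4
reduction = 1 - proposed_energy_pj / conventional_energy_pj

# Read power = read energy / read time (pJ and ns converted to SI, result in uW).
read_time_ns = 401.40
read_power_uw = proposed_energy_pj * 1e-12 / (read_time_ns * 1e-9) * 1e6
print(f"reduction: {reduction:.0%}, read power: {read_power_uw:.2f} uW")
```

The computed read power matches the reported 12.010 uW, confirming the energy/time accounting.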
[0190] Although the present invention has been illustrated by reference to specific embodiments, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied with various changes and modifications without departing from the scope thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the scope of the claims are therefore intended to be embraced therein.
[0191] It will furthermore be understood by the reader of this patent application that the words “comprising” or “comprise” do not exclude other elements or steps, that the words “a” or “an” do not exclude a plurality, and that a single element, such as a computer system, a processor, or another integrated unit may fulfil the functions of several means recited in the claims. Any reference signs in the claims shall not be construed as limiting the respective claims concerned. The terms “first”, “second”, “third”, “a”, “b”, “c”, and the like, when used in the description or in the claims are introduced to distinguish between similar elements or steps and are not necessarily describing a sequential or chronological order. Similarly, the terms “top”, “bottom”, “over”, “under”, and the like are introduced for descriptive purposes and not necessarily to denote relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and embodiments of the invention are capable of operating according to the present invention in other sequences, or in orientations different from the one(s) described or illustrated above.