Low-layer memory for a computing platform
10019361 · 2018-07-10
Assignee
Inventors
- Francky Catthoor (Temse, BE)
- Praveen Raghavan (Leefdaal, BE)
- Matthias Hartmann (Kessel-Lo, BE)
- Komalan Manu Perumkunnil (Kerala, IN)
- Jose Ignacio Gomez (Madrid, ES)
- Christian Tenllado (Madrid, ES)
CPC classification
- G06F9/30036 (PHYSICS)
- G06F9/38 (PHYSICS)
- G06F3/0685 (PHYSICS)
- G11C7/10 (PHYSICS)
International classification
- G06F12/00 (PHYSICS)
- G11C7/10 (PHYSICS)
- G06F13/28 (PHYSICS)
- G06F13/00 (PHYSICS)
- G11C11/16 (PHYSICS)
Abstract
The present disclosure relates to low-layer memory for a computing platform. An example embodiment includes a memory hierarchy being directly connectable to a processor. The memory hierarchy includes at least a level 1, referred to as L1, memory structure comprising a non-volatile memory unit as L1 data memory and a buffer structure (L1-VWB). The buffer structure includes a plurality of interconnected wide registers with an asymmetric organization, wider towards the non-volatile memory unit than towards a data path connectable to the processor. The buffer structure and the non-volatile memory unit are arranged for being directly connectable to a processor so that data words can be read directly from either of the L1 data memory and the buffer structure (L1-VWB) by the processor.
Claims
1. A memory hierarchy being directly connectable to a processor, wherein the memory hierarchy comprises at least a level 1, referred to as L1, memory structure comprising a non-volatile memory unit as L1 data memory and a buffer structure (L1-VWB), wherein the memory hierarchy comprises a first decoding structure (S1) configured to allow data transfer between the non-volatile memory unit and the buffer structure, wherein the buffer structure comprises a plurality of interconnected wide registers with an asymmetric organization, wherein a data block size used towards the non-volatile memory unit is wider than a data block size used towards a data path connectable to the processor, and wherein the buffer structure and the non-volatile memory unit are arranged for being directly connectable to a processor so that data words can be read directly from either of the L1 data memory and the buffer structure (L1-VWB) by the processor.
2. The memory hierarchy of claim 1, wherein the data block size used towards the data path connectable to the processor is equal to the data block size used towards the non-volatile memory unit divided by four.
3. The memory hierarchy of claim 1, further comprising a second decoding structure (S2) configured to establish data transfer between the buffer structure (L1-VWB) and a second layer (L2) of a low-layer cache memory.
4. The memory hierarchy of claim 1, wherein the buffer structure comprises a smaller data block size than the non-volatile memory unit.
5. The memory hierarchy of claim 1, wherein the first decoding structure and a second decoding structure, configured to establish data transfer between the buffer structure and a second layer of a low-layer cache memory, each comprise at least one multiplexer and at least one demultiplexer.
6. The memory hierarchy of claim 5, further comprising a controller.
7. The memory hierarchy of claim 6, wherein the controller is configured to perform address tag comparison and to output control information to the first decoding structure or the second decoding structure.
8. The memory hierarchy of claim 7, wherein the buffer structure is configured to retrieve the control information.
9. The memory hierarchy of claim 1, further comprising a second decoding structure (S2) configured to establish data transfer between the buffer structure (L1-VWB) and a second layer (L2) of a low-layer cache memory, wherein the second layer (L2) comprises a plurality of memory subunits.
10. The memory hierarchy of claim 1, wherein the non-volatile memory unit comprises a plurality of memory subunits.
11. The memory hierarchy of claim 1, wherein the memory hierarchy is a cache memory.
12. The memory hierarchy of claim 1, wherein the buffer structure and the non-volatile memory unit are arranged for being directly connectable to the processor so that data words can be written to either of the L1 data memory and the buffer structure (L1-VWB) by the processor.
13. A system, comprising: a memory hierarchy; and a processor connected to the memory hierarchy, wherein the memory hierarchy comprises at least a level 1, referred to as L1, memory structure comprising a non-volatile memory unit as L1 data memory and a buffer structure (L1-VWB), wherein the memory hierarchy comprises a first decoding structure (S1) configured to allow data transfer between the non-volatile memory unit and the buffer structure, wherein the buffer structure comprises a plurality of interconnected wide registers with an asymmetric organization, wherein a data block size used towards the non-volatile memory unit is wider than a data block size used towards a data path connectable to the processor, and wherein the buffer structure and the non-volatile memory unit are arranged for being directly connectable to a processor so that data words can be read directly from either of the L1 data memory and the buffer structure (L1-VWB) by the processor.
14. The system of claim 13, wherein the data block size used towards the data path connectable to the processor is equal to the data block size used towards the non-volatile memory unit divided by four.
15. The system of claim 13, wherein the memory hierarchy further comprises a second decoding structure (S2) configured to establish data transfer between the buffer structure (L1-VWB) and a second layer (L2) of a low-layer cache memory.
16. The system of claim 13, wherein the buffer structure comprises a smaller data block size than the non-volatile memory unit.
17. The system of claim 13, wherein the non-volatile memory unit comprises a plurality of memory subunits.
Description
BRIEF DESCRIPTION OF THE FIGURES
(1) Some embodiments will now be described further, by way of example, with reference to the accompanying drawings, wherein like reference numerals refer to like elements in the various figures.
DETAILED DESCRIPTION
(9) Various embodiments will be described with reference to certain drawings, but they are not limited thereto; they are limited only by the claims.
(10) Furthermore, the terms first, second and the like in the description and in the claims are used for distinguishing between similar elements and not necessarily for describing a sequence, whether temporally, spatially, in ranking or in any other manner. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments described herein are capable of operation in other sequences than described or illustrated herein.
(11) It is to be noticed that the term comprising, used in the claims, should not be interpreted as being restricted to the means listed thereafter; it does not exclude other elements or steps. It is thus to be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, or groups thereof. Thus, the scope of the expression a device comprising means A and B should not be limited to devices consisting only of components A and B. It means that with respect to the described embodiments, the only relevant components of the device are A and B.
(12) Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may be. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.
(13) Similarly it should be appreciated that in the description of various embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment.
(14) Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the embodiments described herein, and form different embodiments, as would be understood by those in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
(15) It should be noted that the use of particular terminology when describing certain features or aspects should not be taken to imply that the terminology is being re-defined herein to be restricted to include any specific characteristics of the features or aspects with which that terminology is associated.
(16) In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
(17) Present embodiments propose a configuration for the low-layer memory of a memory hierarchy for a computing platform that is able to overcome the read limitations of the STT-MRAM by means of an intermediate buffer referred to as a Very Wide Buffer (VWB). The proposed solution also has a beneficial effect on the write limitations.
(18) The main architectural innovations introduced involve a buffer structure (L1-VWB) at the L1 level of the cache hierarchy and the associated modifications, such as block size changes and support structures (a selector network specific to the L1-VWB block size).
(21) Due to the difference in block size, there may be, as already mentioned, two extra selector/decode (MUX/DEMUX) structures S1 and S2 present in the memory hierarchy for data transfer to/from the L1-VWB. Each of these structures has inputs from the cache controller, in addition to the address and data inputs. S1 handles data transfer at the L1 level between the STT-MRAM based L1 data cache and the L1-VWB, and S2 handles data transfer from the L2 cache to the L1-VWB.
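By way of illustration only, the following C sketch shows the two roles such a selector/decode structure plays around a wide L1-VWB register. The widths used (a 512-bit wide register, 128-bit sub-blocks towards the NVM and 32-bit words towards the data path) are assumptions made for this example, not values taken from the claims.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative widths (assumptions, not taken from the disclosure):
 * a 512-bit wide VWB register holding 128-bit sub-blocks towards the
 * NVM/L2 side, from which 32-bit words are selected towards the data path. */
#define VWB_LINE_BYTES   64          /* 512-bit wide register              */
#define SUBBLOCK_BYTES   16          /* 128-bit block towards the NVM/L2   */
#define WORD_BYTES        4          /* 32-bit word towards the processor  */

typedef struct {
    uint8_t bytes[VWB_LINE_BYTES];   /* one wide register of the L1-VWB    */
} vwb_line_t;

/* Selector (MUX) role towards the data path: pick one word out of the
 * wide line, steered by control bits coming from the cache controller. */
static uint32_t vwb_select_word(const vwb_line_t *line, unsigned word_index)
{
    uint32_t word;
    memcpy(&word, &line->bytes[word_index * WORD_BYTES], WORD_BYTES);
    return word;
}

/* Decoder (DEMUX) role towards the NVM: place a sub-block coming from the
 * STT-MRAM L1 data array into the selected slot of the wide line. */
static void vwb_fill_subblock(vwb_line_t *line, unsigned slot,
                              const uint8_t subblock[SUBBLOCK_BYTES])
{
    memcpy(&line->bytes[slot * SUBBLOCK_BYTES], subblock, SUBBLOCK_BYTES);
}
```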
(22) The cache controller handles the allocation of read and write operations to the L1 memory as well as to the L1 buffer structure. It also controls the potential cache eviction schemes from the L1 memory or from the L1 buffer structure and potentially distributes these data transfers over time. Since the L1 buffer structure is devoid of a corresponding tag array, the cache controller ensures accurate data transfer via control signals to the selector networks S1 and S2. This is realized by means of a table that holds the relevant information, such as the cache line location and way index.
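A minimal sketch of such a bookkeeping table is given below in C. The field names, the number of entries and the lookup routine are illustrative assumptions rather than the disclosed implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative controller bookkeeping entry (field names are assumptions):
 * because the L1-VWB carries no tag array of its own, the controller records
 * where each buffered line lives so it can drive the selector networks
 * S1 and S2 with the correct control signals. */
typedef struct {
    bool     valid;        /* entry in use                               */
    bool     dirty;        /* line must be written back on eviction      */
    uint32_t line_addr;    /* cache-line address of the buffered data    */
    uint8_t  way_index;    /* way of the L1 data array it maps to        */
    uint8_t  vwb_slot;     /* which wide register of the L1-VWB holds it */
} vwb_table_entry_t;

#define VWB_ENTRIES 4      /* assumed number of wide registers */

static vwb_table_entry_t vwb_table[VWB_ENTRIES];

/* Check whether a given cache line is currently held in the L1-VWB. */
static int vwb_lookup(uint32_t line_addr)
{
    for (int i = 0; i < VWB_ENTRIES; i++)
        if (vwb_table[i].valid && vwb_table[i].line_addr == line_addr)
            return i;              /* hit: slot index */
    return -1;                     /* miss */
}
```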
(23) The L1-VWB buffer is completely hardware controlled. It has no associated address tag tables to compare hits and misses. The control lines providing input to the selector/decode networks to and from the L1-VWB, together with the L1-VWB itself, manage hits and misses without address tag comparisons.
(24) The load and store policies for the L1 data cache and the corresponding VWB are as follows. For a load operation, both the L1-VWB and the STT-MRAM L1 data cache are always checked first for the data. If the data is present in the L1-VWB, it is a hit and the data is read. If the data is present in the L1 data cache, it is read from the NVM L1 data cache into the processor and it is also written into the buffer L1-VWB. In case the data is not present in the NVM L1 data cache either, it is recorded as an L1 miss and served from the next cache level into both the L1-VWB and the NVM L1 data cache. The cache line containing the data block is then transferred into the processor via the L1-VWB. Evicted data from the L1-VWB is stored in the NVM L1 data cache.
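The load policy can be summarized, by way of example only, in the following C sketch. The back-end hooks (vwb_hit, l1_nvm_hit and so on) are hypothetical names standing in for the hardware access paths; they are not part of the disclosure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks standing in for the hardware paths; their names and
 * signatures are assumptions made for this sketch. */
bool     vwb_hit(uint32_t addr);              /* is the word in the L1-VWB?      */
bool     l1_nvm_hit(uint32_t addr);           /* is it in the STT-MRAM L1 array? */
uint32_t vwb_read(uint32_t addr);
uint32_t l1_nvm_read(uint32_t addr);
void     vwb_fill(uint32_t addr, uint32_t data);   /* promote data into the VWB  */
void     l1_nvm_fill_from_l2(uint32_t addr);       /* refill L1 from next level  */
void     vwb_fill_from_l2(uint32_t addr);          /* refill VWB from next level */

/* Load policy as described above: check the L1-VWB and the NVM L1 data cache
 * first; on a VWB hit, read directly; on an L1 hit, read the word and also
 * promote it into the VWB; on an L1 miss, refill both from L2 and serve the
 * processor through the VWB. */
uint32_t load_word(uint32_t addr)
{
    if (vwb_hit(addr))                       /* VWB hit: fast read */
        return vwb_read(addr);

    if (l1_nvm_hit(addr)) {                  /* L1 hit: read NVM, promote to VWB */
        uint32_t data = l1_nvm_read(addr);
        vwb_fill(addr, data);
        return data;
    }

    /* L1 miss: refill both the NVM L1 data cache and the VWB from the next
     * cache level, then serve the processor via the VWB. */
    l1_nvm_fill_from_l2(addr);
    vwb_fill_from_l2(addr);
    return vwb_read(addr);
}
```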
(25) For a store operation, the data block in the NVM L1 data memory is updated via the L1-VWB if it is already present there; otherwise, it is updated directly by the processor. The L2 data block update is always performed via the NVM L1 data cache. A small write buffer is present to hold the evicted data temporarily while it is being transferred to the L2, when the data block in question has to be replaced. No write-through to the L2 and main memory is present; a write-back policy is implemented. On a write miss, a write-allocate policy is followed for the data cache array and a no-allocate policy for the L1-VWB: the data block is loaded into the cache location from the L2/main memory, and this is followed by the write-hit operation.
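A corresponding sketch of the store policy, under the same assumption of hypothetical hook names, could look as follows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks, as in the load sketch; names are assumptions. */
bool vwb_hit(uint32_t addr);
bool l1_nvm_hit(uint32_t addr);
void vwb_write(uint32_t addr, uint32_t data);     /* update via the L1-VWB        */
void l1_nvm_write(uint32_t addr, uint32_t data);  /* direct update from processor */
void l1_nvm_fill_from_l2(uint32_t addr);          /* write-allocate refill        */

/* Store policy as described above: if the block is already in the VWB, the
 * NVM L1 block is updated via the VWB; otherwise it is updated directly.
 * A write-back (no write-through) policy is used towards L2; on a miss,
 * write-allocate applies to the data cache array but not to the VWB. */
void store_word(uint32_t addr, uint32_t data)
{
    if (vwb_hit(addr)) {              /* update through the VWB */
        vwb_write(addr, data);
        return;
    }

    if (!l1_nvm_hit(addr))            /* write miss: allocate in the L1 array only */
        l1_nvm_fill_from_l2(addr);

    l1_nvm_write(addr, data);         /* then perform the write-hit operation */
}
```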
(26) Regarding performance, the proposed mechanism decouples the read and write hits from the NVM, effectively removing the long-latency operations from the critical path. However, long-latency reads may still happen when the VWB encounters a miss and the processor tries to fetch new data while the promotion of a cache line into the VWB is taking place (since the promotion may take as long as 4 cache cycles). Moreover, long-latency writes can still occur when the VWB encounters a capacity conflict caused by extensive write operations over a short time period or by extensive accesses with no spatial locality.
(27) Reducing the block size of the L1, and thus limiting the data block size transferred to it, would provide more room for prefetching and make the transfer from the STT-MRAM L1 data cache considerably easier.
(28) Almost all modern systems try to utilize some form of parallelization for the efficient use of resources and greater performance. Data level parallelism is exploited by means of vectorization (essentially loop vectorization). Vectorization is a process wherein a program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which performs one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform, for instance, four additions or subtractions. The critical data and loops are identified and vectorized.
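The transformation is illustrated below with a simple C example: the two functions compute the same result, but the second is written so that a vectorizing compiler (for instance GCC at -O3, which enables -ftree-vectorize) can map the loop onto SIMD instructions performing several additions at once. The function names and the use of restrict are choices made for the example.

```c
#include <stddef.h>

/* Scalar form: one addition per iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* The same loop written so a vectorizing compiler can map it onto SIMD
 * instructions that perform several additions at once; 'restrict' removes
 * the aliasing ambiguity that would otherwise block vectorization. */
void add_vectorizable(const float *restrict a, const float *restrict b,
                      float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```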
(29) Application profiling is performed via the assembly files and other related information that can be obtained at compile time. This yields an idea of the total number of load/store operations, the number of cycles associated with each of these operations and the general data flow during the application execution.
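Purely as an illustration of this kind of profiling, the following small C program counts load and store mnemonics in a compiler-generated assembly file. The mnemonic prefixes and the overall approach are assumptions about the tooling, not the disclosed flow.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative profiling pass: count load/store mnemonics in an assembly
 * file to estimate the memory-access profile of a kernel. The "ld"/"st"
 * prefixes are placeholders for a RISC-style ISA (an assumption). */
int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file.s\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    unsigned loads = 0, stores = 0;
    while (fgets(line, sizeof line, f)) {
        char mnem[16] = "";
        if (sscanf(line, " %15s", mnem) == 1) {
            if (strncmp(mnem, "ld", 2) == 0) loads++;
            else if (strncmp(mnem, "st", 2) == 0) stores++;
        }
    }
    fclose(f);
    printf("loads: %u  stores: %u\n", loads, stores);
    return 0;
}
```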
(30) The initial set of compiler transformations (e.g. data block alignment, loop optimization for data locality, data dependency checks/analysis and indirect optimizations) is applied by means of flags at compile time.
(31) The optimization process (i.e. steps 1 and 2) is repeated based on the cause of the penalty: prefetching in case of conflicts, instruction scheduling to save idle time for data access, and vectorization for more data parallelism. The steps are repeated iteratively.
(32) When comparing the contributions of read and write accesses to the total system penalty for an NVM-based design, in order to enable the use of appropriate transformations, one notes that the read contribution by far exceeds that of its write counterpart. With increasingly complex kernels, the write penalty contribution also seems to increase, albeit slightly. However, the clear difference in impact between the two, even in the case of the data cache (as compared to the instruction cache, where reads are much more critical), makes a case for applying prefetching. Here, critical data and loop arrays can be prefetched to the VWB manually, thus reducing the time needed to read them from the NVM.
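The disclosure does not name a specific prefetch intrinsic, so the sketch below uses the GCC/Clang builtin __builtin_prefetch merely to illustrate the pattern of prefetching data a fixed distance ahead of its use; the prefetch distance and the function name are assumptions made for the example.

```c
#include <stddef.h>

/* Illustrative pattern: prefetch the data for a future iteration while the
 * current one is being processed, hiding the read latency of the backing
 * memory. __builtin_prefetch (GCC/Clang) stands in for whatever
 * platform-specific intrinsic would target the L1-VWB. */
#define PREFETCH_DISTANCE 16   /* iterations ahead; an assumed tuning value */

float sum_with_prefetch(const float *data, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0 /* read */, 3 /* high locality */);
        acc += data[i];
    }
    return acc;
}
```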
(33) Apart from the above listed transformations, access pattern optimizations such as instruction scheduling, alignment of data blocks, and data dependency checks and analysis also help in penalty reduction.
(34) It is also attempted to transform conditional jumps in the innermost loops into branch-less equivalents, to estimate branch flow probabilities and to reduce the number of branches taken, thus improving the code for data locality.
(35) Some of these optimizations affect the proposed architecture indirectly because of the shared L2 cache; these are referred to as indirect optimizations.
(36) Typically, these other optimizations are carried out automatically by specifying the individual intrinsic function flags at compile time for the different benchmarks.
(38) A breakdown of the contribution of various code transformations to the reduction of performance penalty reveals that prefetching and vectorization have the largest positive impacts. Other intrinsic functions for alignment, branch prediction and avoiding jumps etc. become more significant as the kernel becomes larger and more complex. Predictably, prefetching is most impactful for the smallest kernels.
(39) On the whole, in an example embodiment adaptive migration of data is carried out in such a way that read and write intensive data blocks are transferred from the NVM DL1 to the L1-VWB. These transformations are steered manually by the use of intrinsic functions to modify the individual kernels. Critical data is prefetched to the L1-VWB manually, and the time taken to read it from the NVM DL1 is thereby reduced. Moreover, store operations are distributed over time and spatial locality is exploited in order to reduce the time taken to write to the NVM DL1. By means of instruction scheduling and vectorization, one aims to reduce the idle time and exploit data parallelism, respectively.
(40) The effect of the size of the L1-VWB on the proposed solution has been studied. It is found that increasing the VWB size helps in reducing the performance penalty further, simply because more data can fit into the VWB as a result of its increased capacity. However, a limit on the VWB size is imposed by technology, circuit-level aspects, cost and energy, and the routing and layout also become cumbersome. Hence, it is found ideal to keep the size of the L1-VWB to around 2 kbit, considering the area gains offered by the NVM. A fully associative search also becomes a big problem as the size of the VWB increases.
(41) While some embodiments have been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative and not restrictive. The foregoing description details certain embodiments. It will be appreciated, however, that no matter how detailed the foregoing appears in text, various embodiments may be practiced in many ways. Enabled embodiments are not limited to the disclosed embodiments.
(42) Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claims, from a study of the drawings, the disclosure and the appended claims. In the claims, the word comprising does not exclude other elements or steps, and the indefinite article a or an does not exclude a plurality. A single processor or other unit may fulfil the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.