Low-layer memory for a computing platform
10019361 · 2018-07-10
Assignee
Inventors
- Francky Catthoor (Temse, BE)
- Praveen Raghavan (Leefdaal, BE)
- Matthias Hartmann (Kessel-Lo, BE)
- Komalan Manu Perumkunnil (Kerala, IN)
- Jose Ignacio Gomez (Madrid, ES)
- Christian Tenllado (Madrid, ES)
CPC classification
- G06F9/30036 (PHYSICS)
- G06F9/38 (PHYSICS)
- G06F3/0685 (PHYSICS)
- G11C7/10 (PHYSICS)
International classification
- G06F12/00 (PHYSICS)
- G11C7/10 (PHYSICS)
- G06F13/28 (PHYSICS)
- G06F13/00 (PHYSICS)
- G11C11/16 (PHYSICS)
Abstract
The present disclosure relates to low-layer memory for a computing platform. An example embodiment includes a memory hierarchy being directly connectable to a processor. The memory hierarchy includes at least a level 1, referred to as L1, memory structure comprising a non-volatile memory unit as L1 data memory and a buffer structure (L1-VWB). The buffer structure includes a plurality of interconnected wide registers with an asymmetric organization, wider towards the non-volatile memory unit than towards a data path connectable to the processor. The buffer structure and the non-volatile memory unit are arranged for being directly connectable to a processor so that data words can be read directly from either of the L1 data memory and the buffer structure (L1-VWB) by the processor.
Claims
1. A memory hierarchy being directly connectable to a processor, wherein the memory hierarchy comprises at least a level 1, referred to as L1, memory structure comprising a non-volatile memory unit as L1 data memory and a buffer structure (L1-VWB), wherein the memory hierarchy comprises a first decoding structure (S1) configured to allow data transfer between the non-volatile memory unit and the buffer structure, wherein the buffer structure comprises a plurality of interconnected wide registers with an asymmetric organization, wherein a data block size used towards the non-volatile memory unit is wider than a data block size used towards a data path connectable to the processor, and wherein the buffer structure and the non-volatile memory unit are arranged for being directly connectable to a processor so that data words can be read directly from either of the L1 data memory and the buffer structure (L1-VWB) by the processor.
2. The memory hierarchy of claim 1, wherein the data block size used towards the data path connectable to the processor is equal to the data block size used towards the non-volatile memory unit divided by four.
3. The memory hierarchy of claim 1, further comprising a second decoding structure (S2) configured to establish data transfer between the buffer structure (L1-VWB) and a second layer (L2) of a low-layer cache memory.
4. The memory hierarchy of claim 1, wherein the buffer structure comprises a smaller data block size than the non-volatile memory unit.
5. The memory hierarchy of claim 1, wherein the first decoding structure and a second decoding structure, configured to establish data transfer between the buffer structure and a second layer of a low-layer cache memory, each comprise at least one multiplexer and at least one demultiplexer.
6. The memory hierarchy of claim 5, further comprising a controller.
7. The memory hierarchy of claim 6, wherein the controller is configured to perform address tag comparison and to output control information to the first decoding structure or the second decoding structure.
8. The memory hierarchy of claim 7, wherein the buffer structure is configured to retrieve the control information.
9. The memory hierarchy of claim 1, further comprising a second decoding structure (S2) configured to establish data transfer between the buffer structure (L1-VWB) and a second layer (L2) of a low-layer cache memory, wherein the second layer (L2) comprises a plurality of memory subunits.
10. The memory hierarchy of claim 1, wherein the non-volatile memory unit comprises a plurality of memory subunits.
11. The memory hierarchy of claim 1, wherein the memory hierarchy is a cache memory.
12. The memory hierarchy of claim 1, wherein the buffer structure and the non-volatile memory unit are arranged for being directly connectable to the processor so that data words can be written to either of the L1 data memory and the buffer structure (L1-VWB) by the processor.
13. A system, comprising: a memory hierarchy; and a processor connected to the memory hierarchy, wherein the memory hierarchy comprises at least a level 1, referred to as L1, memory structure comprising a non-volatile memory unit as L1 data memory and a buffer structure (L1-VWB), wherein the memory hierarchy comprises a first decoding structure (S1) configured to allow data transfer between the non-volatile memory unit and the buffer structure, wherein the buffer structure comprises a plurality of interconnected wide registers with an asymmetric organization, wherein a data block size used towards the non-volatile memory unit is wider than a data block size used towards a data path connectable to the processor, and wherein the buffer structure and the non-volatile memory unit are arranged for being directly connectable to a processor so that data words can be read directly from either of the L1 data memory and the buffer structure (L1-VWB) by the processor.
14. The system of claim 13, wherein the data block size used towards the data path connectable to the processor is equal to the data block size used towards the non-volatile memory unit divided by four.
15. The system of claim 13, wherein the memory hierarchy further comprises a second decoding structure (S2) configured to establish data transfer between the buffer structure (L1-VWB) and a second layer (L2) of a low-layer cache memory.
16. The system of claim 13, wherein the buffer structure comprises a smaller data block size than the non-volatile memory unit.
17. The system of claim 13, wherein the non-volatile memory unit comprises a plurality of memory subunits.
Description
BRIEF DESCRIPTION OF THE FIGURES
(1) Some embodiments will now be described further, by way of example, with reference to the accompanying drawings, wherein like reference numerals refer to like elements in the various figures.
DETAILED DESCRIPTION
(9) Various embodiments will be described with reference to certain drawings, but they are not limited thereto; they are limited only by the claims.
(10) Furthermore, the terms first, second and the like in the description and in the claims are used for distinguishing between similar elements and not necessarily for describing a sequence, whether temporally, spatially, in ranking or in any other manner. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments described herein are capable of operation in other sequences than described or illustrated herein.
(11) It is to be noticed that the term comprising, used in the claims, should not be interpreted as being restricted to the means listed thereafter; it does not exclude other elements or steps. It is thus to be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, or groups thereof. Thus, the scope of the expression a device comprising means A and B should not be limited to devices consisting only of components A and B. It means that with respect to the described embodiments, the only relevant components of the device are A and B.
(12) Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may be. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.
(13) Similarly it should be appreciated that in the description of various embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment.
(14) Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the embodiments described herein, and form different embodiments, as would be understood by those in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
(15) It should be noted that the use of particular terminology when describing certain features or aspects should not be taken to imply that the terminology is being re-defined herein to be restricted to include any specific characteristics of the features or aspects with which that terminology is associated.
(16) In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
(17) Present embodiments propose a configuration for the low-layer memory of a memory hierarchy for a computing platform that is able to overcome the read limitations of the STT-MRAM by means of an intermediate buffer referred to as a Very Wide Buffer (VWB). The proposed solution also has a beneficial effect on the write limitations.
(18) The main architectural innovations introduced involve a buffer structure (L1-VWB) at the L1 level of the cache hierarchy and the associated modifications, such as block size changes and support structures (a selector network specific to the L1-VWB block size).
(21) Due to the difference in block size, there may be, as already mentioned, two extra selector/decode (MUX/DEMUX) structures S1 and S2 present in the memory hierarchy for data transfer to/from the L1-VWB. Each of these structures has inputs from the cache controller, in addition to the address and data inputs. S1 handles data transfer at the L1 level between the STT-MRAM based L1 data cache and the L1-VWB, and S2 handles data transfer from the L2 cache to the L1-VWB.
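By way of illustration only, the following C sketch shows the two roles such a selector/decode structure plays around a wide L1-VWB register. The widths used (a 512-bit wide register, 128-bit sub-blocks towards the NVM and 32-bit words towards the data path) are assumptions made for this example, not values taken from the claims.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative widths (assumptions, not taken from the disclosure):
 * a 512-bit wide VWB register holding 128-bit sub-blocks towards the
 * NVM/L2 side, from which 32-bit words are selected towards the data path. */
#define VWB_LINE_BYTES   64          /* 512-bit wide register              */
#define SUBBLOCK_BYTES   16          /* 128-bit block towards the NVM/L2   */
#define WORD_BYTES        4          /* 32-bit word towards the processor  */

typedef struct {
    uint8_t bytes[VWB_LINE_BYTES];   /* one wide register of the L1-VWB    */
} vwb_line_t;

/* Selector (MUX) role towards the data path: pick one word out of the
 * wide line, steered by control bits coming from the cache controller. */
static uint32_t vwb_select_word(const vwb_line_t *line, unsigned word_index)
{
    uint32_t word;
    memcpy(&word, &line->bytes[word_index * WORD_BYTES], WORD_BYTES);
    return word;
}

/* Decoder (DEMUX) role towards the NVM: place a sub-block coming from the
 * STT-MRAM L1 data array into the selected slot of the wide line. */
static void vwb_fill_subblock(vwb_line_t *line, unsigned slot,
                              const uint8_t subblock[SUBBLOCK_BYTES])
{
    memcpy(&line->bytes[slot * SUBBLOCK_BYTES], subblock, SUBBLOCK_BYTES);
}
```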
(22) The cache controller handles the allocation of read and write operations to the L1 memory as well as to the L1 buffer structure. It also controls the potential cache eviction schemes from the L1 memory or from the L1 buffer structure and potentially distributes these data transfers over time. Since the L1 buffer structure is devoid of a corresponding tag array, the cache controller ensures accurate data transfer via control signals to the selector networks S1 and S2. This is realized by means of a table that holds the relevant information, such as the cache line location and way index.
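A minimal sketch of such a bookkeeping table is given below in C. The field names, the number of entries and the lookup routine are illustrative assumptions rather than the disclosed implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative controller bookkeeping entry (field names are assumptions):
 * because the L1-VWB carries no tag array of its own, the controller records
 * where each buffered line lives so it can drive the selector networks
 * S1 and S2 with the correct control signals. */
typedef struct {
    bool     valid;        /* entry in use                               */
    bool     dirty;        /* line must be written back on eviction      */
    uint32_t line_addr;    /* cache-line address of the buffered data    */
    uint8_t  way_index;    /* way of the L1 data array it maps to        */
    uint8_t  vwb_slot;     /* which wide register of the L1-VWB holds it */
} vwb_table_entry_t;

#define VWB_ENTRIES 4      /* assumed number of wide registers */

static vwb_table_entry_t vwb_table[VWB_ENTRIES];

/* Check whether a given cache line is currently held in the L1-VWB. */
static int vwb_lookup(uint32_t line_addr)
{
    for (int i = 0; i < VWB_ENTRIES; i++)
        if (vwb_table[i].valid && vwb_table[i].line_addr == line_addr)
            return i;              /* hit: slot index */
    return -1;                     /* miss */
}
```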
(23) The L1-VWB buffer is completely hardware controlled. It has no associated address tag tables to compare hits and misses. The control lines providing input to the selector/decode networks to and from the L1-VWB, together with the L1-VWB itself, manage hits and misses without address tag comparisons.
(24) The load and store policies for the L1 data cache and the corresponding VWB are as follows. For a load operation, both the L1-VWB and the STT-MRAM L1 data cache are always checked first for the data. If the data is present in the L1-VWB, it is a hit and the data is read. If the data is present in the L1 data cache, it is read from the NVM L1 data cache into the processor and it is also written into the buffer L1-VWB. In case the data is not present in the NVM L1 data cache either, it is recorded as an L1 miss and served from the next cache level into both the L1-VWB and the NVM L1 data cache. The cache line containing the data block is then transferred into the processor via the L1-VWB. Evicted data from the L1-VWB is stored in the NVM L1 data cache.
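The load policy can be summarized, by way of example only, in the following C sketch. The back-end hooks (vwb_hit, l1_nvm_hit and so on) are hypothetical names standing in for the hardware access paths; they are not part of the disclosure.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks standing in for the hardware paths; their names and
 * signatures are assumptions made for this sketch. */
bool     vwb_hit(uint32_t addr);              /* is the word in the L1-VWB?      */
bool     l1_nvm_hit(uint32_t addr);           /* is it in the STT-MRAM L1 array? */
uint32_t vwb_read(uint32_t addr);
uint32_t l1_nvm_read(uint32_t addr);
void     vwb_fill(uint32_t addr, uint32_t data);   /* promote data into the VWB  */
void     l1_nvm_fill_from_l2(uint32_t addr);       /* refill L1 from next level  */
void     vwb_fill_from_l2(uint32_t addr);          /* refill VWB from next level */

/* Load policy as described above: check the L1-VWB and the NVM L1 data cache
 * first; on a VWB hit, read directly; on an L1 hit, read the word and also
 * promote it into the VWB; on an L1 miss, refill both from L2 and serve the
 * processor through the VWB. */
uint32_t load_word(uint32_t addr)
{
    if (vwb_hit(addr))                       /* VWB hit: fast read */
        return vwb_read(addr);

    if (l1_nvm_hit(addr)) {                  /* L1 hit: read NVM, promote to VWB */
        uint32_t data = l1_nvm_read(addr);
        vwb_fill(addr, data);
        return data;
    }

    /* L1 miss: refill both the NVM L1 data cache and the VWB from the next
     * cache level, then serve the processor via the VWB. */
    l1_nvm_fill_from_l2(addr);
    vwb_fill_from_l2(addr);
    return vwb_read(addr);
}
```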
(25) For a store operation, the data block in the NVM L1 data memory is updated via the L1-VWB if it is already present there; otherwise, it is updated directly by the processor. The L2 data block update is always performed via the NVM L1 data cache. A small write buffer is present to hold the evicted data temporarily while it is being transferred to the L2, when the data block in question has to be replaced. No write-through to the L2 and main memory is present; a write-back policy is implemented. On a write miss, a write-allocate policy is followed for the data cache array and a no-allocate policy for the L1-VWB: the data block is loaded into the cache location from the L2/main memory, and this is followed by the write-hit operation.
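A corresponding sketch of the store policy, under the same assumption of hypothetical hook names, could look as follows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks, as in the load sketch; names are assumptions. */
bool vwb_hit(uint32_t addr);
bool l1_nvm_hit(uint32_t addr);
void vwb_write(uint32_t addr, uint32_t data);     /* update via the L1-VWB        */
void l1_nvm_write(uint32_t addr, uint32_t data);  /* direct update from processor */
void l1_nvm_fill_from_l2(uint32_t addr);          /* write-allocate refill        */

/* Store policy as described above: if the block is already in the VWB, the
 * NVM L1 block is updated via the VWB; otherwise it is updated directly.
 * A write-back (no write-through) policy is used towards L2; on a miss,
 * write-allocate applies to the data cache array but not to the VWB. */
void store_word(uint32_t addr, uint32_t data)
{
    if (vwb_hit(addr)) {              /* update through the VWB */
        vwb_write(addr, data);
        return;
    }

    if (!l1_nvm_hit(addr))            /* write miss: allocate in the L1 array only */
        l1_nvm_fill_from_l2(addr);

    l1_nvm_write(addr, data);         /* then perform the write-hit operation */
}
```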
(26) Regarding performance, the proposed mechanism decouples the read and write hits from the NVM, effectively removing the long-latency operations from the critical path. However, long-latency reads may still happen when the VWB encounters a miss and the processor tries to fetch new data while the promotion of a cache line into the VWB is taking place (since the promotion may take as long as 4 cache cycles). Moreover, long-latency writes can still occur when the VWB encounters a capacity conflict caused by extensive write operations over a short time period or by extensive accesses with no spatial locality.
(27) Reducing the block size of the L1, and thus limiting the data block size transferred to it, would provide more room for prefetching and make the transfer from the STT-MRAM L1 data cache considerably easier.
(28) Almost all modern systems try to utilize some form of parallelization for the efficient use of resources and greater performance. Data level parallelism is exploited by means of vectorization (essentially loop vectorization). Vectorization is a process wherein a program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which performs one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform, for instance, four additions or subtractions. The critical data and loops are identified and vectorized.
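The transformation is illustrated below with a simple C example: the two functions compute the same result, but the second is written so that a vectorizing compiler (for instance GCC at -O3, which enables -ftree-vectorize) can map the loop onto SIMD instructions performing several additions at once. The function names and the use of restrict are choices made for the example.

```c
#include <stddef.h>

/* Scalar form: one addition per iteration. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* The same loop written so a vectorizing compiler can map it onto SIMD
 * instructions that perform several additions at once; 'restrict' removes
 * the aliasing ambiguity that would otherwise block vectorization. */
void add_vectorizable(const float *restrict a, const float *restrict b,
                      float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```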
(29) Application profiling is performed via the assembly files and other related information that can be obtained at compile time. This yields an idea of the total number of load/store operations, the number of cycles associated with each of these operations and the general data flow during the application execution.
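Purely as an illustration of this kind of profiling, the following small C program counts load and store mnemonics in a compiler-generated assembly file. The mnemonic prefixes and the overall approach are assumptions about the tooling, not the disclosed flow.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative profiling pass: count load/store mnemonics in an assembly
 * file to estimate the memory-access profile of a kernel. The "ld"/"st"
 * prefixes are placeholders for a RISC-style ISA (an assumption). */
int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file.s\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    unsigned loads = 0, stores = 0;
    while (fgets(line, sizeof line, f)) {
        char mnem[16] = "";
        if (sscanf(line, " %15s", mnem) == 1) {
            if (strncmp(mnem, "ld", 2) == 0) loads++;
            else if (strncmp(mnem, "st", 2) == 0) stores++;
        }
    }
    fclose(f);
    printf("loads: %u  stores: %u\n", loads, stores);
    return 0;
}
```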
(30) The initial set of compiler transformations (e.g. data block alignment, loop optimization for data locality, data dependency checks/analysis and indirect optimizations) is applied by means of flags at compile time.
(31) The optimization process (i.e. steps 1 and 2) is repeated based on the cause of the penalty: prefetching in case of conflicts, instruction scheduling to save idle time for data access, and vectorization for more data parallelism. The steps are repeated iteratively.
(32) When comparing the contributions of read and write accesses to the total system penalty for an NVM-based design, in order to enable the use of appropriate transformations, one notes that the read contribution by far exceeds that of its write counterpart. With increasingly complex kernels, the write penalty contribution also seems to increase, albeit slightly. However, the clear difference in impact between the two, even in the case of the data cache (as compared to the instruction cache, where reads are much more critical), makes a case for applying prefetching. Here, critical data and loop arrays can be prefetched to the VWB manually, thus reducing the time needed to read them from the NVM.
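The disclosure does not name a specific prefetch intrinsic, so the sketch below uses the GCC/Clang builtin __builtin_prefetch merely to illustrate the pattern of prefetching data a fixed distance ahead of its use; the prefetch distance and the function name are assumptions made for the example.

```c
#include <stddef.h>

/* Illustrative pattern: prefetch the data for a future iteration while the
 * current one is being processed, hiding the read latency of the backing
 * memory. __builtin_prefetch (GCC/Clang) stands in for whatever
 * platform-specific intrinsic would target the L1-VWB. */
#define PREFETCH_DISTANCE 16   /* iterations ahead; an assumed tuning value */

float sum_with_prefetch(const float *data, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0 /* read */, 3 /* high locality */);
        acc += data[i];
    }
    return acc;
}
```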
(33) Apart from the above listed transformations, access pattern optimizations such as instruction scheduling, alignment of data blocks, and data dependency checks and analysis also help in penalty reduction.
(34) It is also attempted to transform conditional jumps in the innermost loops into branch-less equivalents, to estimate branch flow probabilities and to reduce the number of branches taken, thus improving the code for data locality.
(35) Some of these optimizations affect the proposed architecture indirectly because of the shared L2 cache; these are referred to as indirect optimizations.
(36) Typically, these other optimizations are carried out automatically by specifying the individual intrinsic function flags at compile time for the different benchmarks.
(38) A breakdown of the contribution of various code transformations to the reduction of performance penalty reveals that prefetching and vectorization have the largest positive impacts. Other intrinsic functions for alignment, branch prediction and avoiding jumps etc. become more significant as the kernel becomes larger and more complex. Predictably, prefetching is most impactful for the smallest kernels.
(39) On the whole, in an example embodiment adaptive migration of data is carried out in such a way that read and write intensive data blocks are transferred from the NVM DL1 to the L1-VWB. These transformations are steered manually by the use of intrinsic functions to modify the individual kernels. Critical data is prefetched to the L1-VWB manually, and the time taken to read it from the NVM DL1 is thereby reduced. Moreover, store operations are distributed over time and spatial locality is exploited in order to reduce the time taken to write to the NVM DL1. By means of instruction scheduling and vectorization, one aims to reduce the idle time and exploit data parallelism, respectively.
(40) The effect of the size of the L1-VWB on the proposed solution has been studied. It is found that increasing the VWB size helps in reducing the performance penalty further, simply because more data can fit into the VWB as a result of its increased capacity. However, a limit on the VWB size is imposed by technology, circuit-level aspects, cost and energy, and the routing and layout also become cumbersome. Hence, it is found ideal to keep the size of the L1-VWB to around 2 kbit, considering the area gains offered by the NVM. A fully associative search also becomes a big problem as the size of the VWB increases.
(41) While some embodiments have been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative and not restrictive. The foregoing description details certain embodiments. It will be appreciated, however, that no matter how detailed the foregoing appears in text, various embodiments may be practiced in many ways. Enabled embodiments are not limited to the disclosed embodiments.
(42) Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claims, from a study of the drawings, the disclosure and the appended claims. In the claims, the word comprising does not exclude other elements or steps, and the indefinite article a or an does not exclude a plurality. A single processor or other unit may fulfil the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used. A computer program may be stored/distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.