Data processing apparatus having cache and translation lookaside buffer
09684601 · 2017-06-20
Assignee
Inventors
Cpc classification
G06F12/1027
PHYSICS
Y02D10/00
GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
International classification
G06F12/00
PHYSICS
G06F12/1027
PHYSICS
Abstract
A data processing apparatus has a cache and a translation look aside buffer (TLB). A way table is provided for identifying which of a plurality of cache ways stores require data. Each way table entry corresponds to one of the TLB entries of the TLB and identifies, for each memory location of the page associated with the corresponding TLB entry, which cache way stores the data associated with that memory location. Also, the cache may be capable of servicing M access requests in the same processing cycle. An arbiter may select pending access requests for servicing by the cache in a way that ensures that the selected pending access requests specify a maximum of N different virtual page addresses, where N<M.
Claims
1. A method for a data processing apparatus comprising processing circuitry configured to issue access requests for data; a cache configured to provide access to data in response to said access requests; and a translation lookaside buffer (TLB) configured to translate between virtual page addresses specified in said access requests and physical page addresses used by said cache; said method comprising steps of: buffering pending access requests issued by said processing circuitry in an input buffer; and selecting which of said pending access requests from said input buffer should be serviced by said cache in each processing cycle; wherein: said cache is configured to service up to M access requests in the same processing cycle, where M is an integer and M2; and said selecting step selects said pending access requests to ensure that the selected pending access requests selected for servicing by the cache in the same processing cycle specify maximum of N different virtual page addresses, where N is an integer, N<M and N1.
2. A data processing apparatus comprising: processing means for issuing access requests for data; cache means for providing access to data in response to said access requests; translation lookaside buffer (TLB) means for translating between virtual page addresses specified in said access requests and physical page addresses used by said cache means; input buffer means for buffering pending access requests issued by said processing means; and arbitration means for selecting which of said pending access requests from said input buffer should be serviced by said cache means in each processing cycle; wherein: said cache means is configured to service up to M access requests in the same processing cycle, where M is an integer and M2; and said arbitration means is configured to select said pending access requests to ensure that the selected pending access requests selected for servicing by the cache in the same processing cycle specify a maximum of N different virtual page addresses, where N is an integer, N<M, and N1.
3. A data processing apparatus comprising: processing circuitry configured to issue access requests for data; a cache configured to provide access to data in response to said access requests; a translation lookaside buffer (TLB) configured to translate between virtual page addresses specified in said access requests and physical page addresses used by said cache; an input buffer configured to buffer pending access requests issued by said processing circuitry; and an arbiter configured to select which of said pending access requests from said input buffer should be serviced by said cache in each processing cycle; wherein: said cache is configured to service up to M access requests in the same processing cycle, where M is an integer and M2; and said arbiter is configured to select said pending access requests to ensure that the selected pending access requests selected for servicing by the cache in the same processing cycle specify a maximum of N different virtual page addresses, where N is an integer, N<M, and N1.
4. A data processing apparatus according to claim 3, wherein said TLB is configured to translate up to N different virtual page addresses in the same processing cycle.
5. A data processing apparatus according to claim 3, wherein said arbiter is configured to select as candidate access requests the pending access requests specifying one of said N different virtual page addresses, and to select for servicing by said cache at least one of said candidate access requests.
6. A data processing apparatus according to claim 3, wherein N=2.
7. A data processing apparatus according to claim 3, wherein N=1.
8. A data processing apparatus according to claim 7, wherein in each processing cycle: (i) said TLB is configured to translate the virtual page address specified by a primary access request into a physical page address, said primary access request comprising one of said pending access requests from said input buffer; (ii) said arbiter is configured to select as candidate access requests said primary access request and the pending access requests from said input buffer that specify the same virtual page address as said primary access request; and (iii) said arbiter is configured to select at least one of said candidate access requests to be serviced by said cache using the physical page address translated by said TLB.
9. The data processing apparatus according to claim 8, wherein said primary access request is one of: the oldest of said pending access requests; one of said pending access requests having a highest priority; and one of said pending access requests having an access type of the highest priority.
10. The data processing apparatus according to claim 5, wherein said cache has a plurality of cache lines, and if multiple candidate access requests target data stored in the same cache line, then said arbiter is configured to merge said multiple candidate access requests to form a single merged access request.
11. The data processing apparatus according to claim 10, wherein said input buffer is configured to buffer X pending access requests, and said arbiter is configured to merge a maximum of Y candidate access requests into a single merged access request, where X and Y are integers and YX.
12. The data processing apparatus according to claim 3, wherein said cache comprises M banks of cache lines for storing data, wherein each bank is accessible independently and said cache is configured to service up to M access requests directed to different banks in the same processing cycle.
13. The data processing apparatus according to claim 12, wherein for each bank of cache lines having at least one bank candidate access request specifying one of said N different virtual page addresses and targeting a cache line within said bank, said arbiter is configured: (i) to select a first bank candidate access request of said at least one bank candidate access request; (ii) if there are one or more other bank candidate access requests targeting the same cache line as said first bank candidate access request, to merge said one or more other bank candidate access requests with the first bank candidate access request to form a merged access request, and to select said merged access request for servicing by said bank; and (iii) if there are no other bank candidate access requests targeting the same cache line as said first bank candidate access request, to select said first bank candidate access request for servicing by said bank.
14. The data processing apparatus according to claim 13, wherein said first bank candidate access request is one of: the oldest of said at least one bank candidate access request; one of said at least one bank candidate access request having a highest priority; and one of said at least one bank candidate access request having an access type of the highest priority.
15. The data processing apparatus according to claim 3, wherein said input buffer comprises storage circuitry configured to store the pending access requests which are not selected by said arbiter in a current processing cycle, said pending access requests stored in said storage circuitry being available for selection by said arbiter in a following processing cycle.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
DESCRIPTION OF THE EMBODIMENTS
(23)
(24) The processor 4 may issue data access requests for accessing data from the cache 6. An address computation stage 10 is provided for calculating a virtual address of the data required for a particular data access request. The L1 cache 6 in this example is a physically indexed, physically tagged (PIPT) cache, which uses physical addresses to identify the data stored in the cache. The L2 cache (and any further level caches that may be provided) and memory also use physical addresses. A translation lookaside buffer (TLB) 12 is provided to translate the virtual address calculated by the address computation stage 10 into the physical address used by the cache 6 and memory. A way table 14 is also provided corresponding to the TLB 12 for storing way information indicating which cache way of the cache 6 is storing required data. The way table will be described in more detail below.
(25) The apparatus 2 also comprises a store buffer (SB) 15 and a merge buffer (MB) 16. The store buffer 15 is used for handling store requests which have been speculatively issued by the processor 4. To improve performance, a store request may be issued speculatively by the processor 4 before it is actually known whether the store request should be carried out. For example, in an out of order processor, or following a branch instruction, it may not be known whether the instruction associated with the store request is actually needed until a preceding instruction has completed, but to improve performance, the instruction can be executed in advance and then the result of the instruction may be committed only if it turns out that the instruction was actually required. Unlike load requests, store requests will change the architectural state of the cache and so it is not desirable to carry out a speculatively issued store request unless the store request is committed. Therefore, store requests issued speculatively may be placed in the store buffer 15 until it is known that the speculatively issued store request should actually be carried out. It will be appreciated that if the processor 4 does not permit speculative execution of instructions, then the store buffer 15 may be omitted.
(26) When a store request is committed, it is sent to the merge buffer 16. If multiple committed store requests are pending corresponding to the same cache line of the cache 6, then to improve performance and avoid energy intensive memory accesses (e.g. to the L1 cache) the merge buffer 16 may merge these requests to form a single request which can be carried out by the cache 6 in one operation. The merged request in the merge buffer 16 is then serviced by the cache to write the merged data to the cache 6.
(27)
(28) The cache 6 is an n-way set-associative cache. This means that each cache bank provides n possible locations to store cache lines corresponding to one particular address, one in each cache way 24. In the example of
(29) The lower part of
(30)
(31)
(32) The cache 6 has two cache access modes: a standard cache access mode and reduced cache access mode. In the standard access mode, each of the data ways 24 is accessed and the tag data 22 is used to identify which of the data ways stores the required data, as described above.
(33) The cache 6 also has a reduced cache access mode which can be used to save energy in comparison to the standard access mode. The way table 14 contains way information indicating which of the cache ways 24 stores the data associated with a given memory address. Hence, the way information can be used to identify the required data way 24, and so it is not necessary to access the other data ways 24 or the tag array 22.
(34)
(35) Each way table entry 60 of the way table 14 stores way information for each memory location within the page associated with the corresponding TLB entry 62. The way information indicates which of the cache ways 24, if any, stores the data associated with the corresponding memory location. For example, the way table entry #0 identifies the way information for each of the memory locations within the page 30-0, and the way table entry #1 identifies way information for each memory location within page 30-1. The size of the TLB 12 and way table 14 is finite and so only a subset of the pages may have corresponding way table entries 60 and TLB entries 62 at a given time. If a new TLB entry 62 and way table entry 60 is required for a given page, then an existing TLB entry 62 and way table entry 60 may be evicted to make way for the new entries.
(36) The way table entries 60 also maintains validity information indicating whether or not the data associated with each memory location within the corresponding page is stored in the cache.
(37) In the first example (A), the way table entry 60 includes a valid field 70 and a way field 72 for each memory location (line) within the page 30 corresponding to that way table entry 60. In this example, the valid field 70 comprises a single bit which has a first value (e.g. a bit value of 1) if the data from the corresponding memory location is stored in the cache 6, and a second value (for example a bit value of 0) if the data from the corresponding memory location is not stored in the cache 6. The way field 72 comprises a 2-bit value indicating which of the four data ways 24 stores the data from the corresponding memory location. In the example (A), each data value may be placed in any of the four ways.
(38) In the second example (B), the validity information and way information are combined into a single field 74 and the number of ways 24 in which data from a particular memory location may be placed is reduced by one. A 2-bit combined validity/way field 74 is provided for each memory location within the corresponding page 30. One of the four potential values for the combined field 74 (e.g. 0b00) is used to indicate that the data is not stored in the cache (i.e. the validity information). The other three potential values (e.g. 0b01, 0b10, 0b11) are allocated to three of the cache ways 24, to indicate that the data is stored in that cache way (i.e. the way information). The data associated with the corresponding memory location may therefore not be placed in the fourth of the cache ways not represented by the three potential values of the way information. This approach reduces the number of bits required for each way table entry compared to example (A), and since there are fewer bits, energy consumption per way table access is reduced. The impact of this approach on the cache miss rate is minimal because most threads executed by the processor 4 do not utilise all four ways of the L1 cache 6. Moreover, even with the way table representation in example (B), the cache may still be realised as a four-way set associative cache. In particular, a cache with four banks 20 may hold lines 0 to 3 in separate banks and lines 0, 4, 8, . . . , 60 in the same bank. By deeming way 0 invalid for the lines 0 to 3, way 1 invalid for lines 4 to 7 and so on, a conventional thread that exhibits some form of special locality will most likely perceive the cache to be 4-way set-associative.
(39)
(40) It will be appreciated that the examples shown in
(41) In the example of
(42) On the other hand, it is also possible to implement the TLB 12 and way table 14 as a common memory having a number of entries. Each entry of the common memory may comprise both the corresponding TLB entry 62 and the corresponding way table entry 60. In such an embodiment, the common memory would search for an entry having a virtual page ID 44 corresponding to the virtual page ID of the address for the current data access request, and would then return the physical page ID and the corresponding way and validity information from that common memory entry.
(43)
(44) In this way, the way table allows a reduced cache access to be performed when possible, with the standard cache access being used when the way table does not provide the required information. As the standard access mode may still be required, the tag array 22 is still provided despite the fact that the tag array 22 is not necessary for the reduced access mode.
(45) As shown in
(46) In order to maintain the way table information, the way information is updated in response to operations which change the allocation of data in the cache 6. As shown in
(47) On a cache linefill, the corresponding field of the way table entry 60 within the micro way table 14-u or the way table 14-f is updated so that the validity information now indicates that the information is stored in the cache and the way information now indicates which way stores the data. On the other hand, when data is evicted from the cache, then the validity information is updated to indicate that the data is no longer stored in the cache, to indicate that the way information corresponding to that memory location should not be used on a subsequent cache access.
(48) If a micro-way table entry is updated, then the corresponding entry within the further-level way table 14-f is not updated until the corresponding micro-way table entry is evicted from the micro-way table 14-u. Deferring the update of the further-level way table 14-f saves energy, because even if the entry in the micro-way table 14-u is updated multiple times before being evicted, the further-lever way table entry only needs to be updated once.
(49) The capacity of the further-level TLB 12-f and further-level way table 14-f is finite and so these tables may not store entries corresponding to all pages of the address space. Hence, when required an entry from the TLB or the way table can be evicted to make space for another entry corresponding to a different page. The corresponding TLB entry would be fetched from a page table associated with the memory, while a way table entry would be allocated to the way table which would initially be in a reset state in which the validity information indicates that the data associated with all the memory addresses within the corresponding page is not stored in the cache.
(50) Therefore, it is possible that data stored in the cache may have its corresponding way table entry evicted from the further way table 14-f even while the data remains in the cache. When the corresponding way table entry is brought back in to the way table 14-f, it will initially be a reset state even though the corresponding data is stored within the cache. Therefore, the cache 6 may not be able to assume that data, which the validity information indicates is not stored in the cache, is in fact not stored in the cache. Therefore, even when validity information indicates that the data is not stored in the cache, the cache would still perform a normal cache access using the tag array 22 to identify whether the data is actually stored in the cache. With the arrangement of
(51) However, for other applications, it may be possible that cached data may be accessed at intervals over a period of time. In this case, it can be useful to use the update scheme shown in
(52) To identify which way table entry 14 should be updated in response to the way table miss/cache hit, the way table 14 or TLB 12 could perform a search based on the address of the data. However, this would consume extra energy. Therefore, to avoid having to perform a further search, the way table of
(53) Optionally, the way table 14 may also be used to predict whether a L1 cache miss would occur for the desired data. If the validity information for a given memory address indicates that the corresponding data is not stored in the cache, then this could be interpreted as a L1 cache miss, and a L2 cache access or memory access could be initiated without needing to access any of the data ways 24 or tag array 22 of the L1 cache 6, freeing these data ways 24 and tag array 22 for use in servicing other cache accesses, or allowing the data ways 24 and tag array 22 to remain idle to save energy. However, this approach would require that the validity information is guaranteed to be correct. As discussed above, this may not be the case if way table entries are evicted from the way table 14 and then restored to the way table 14 at a later time while the corresponding data remains in the cache. To address this issue, evicted way table entries may be stored in a separate memory or in an additional further-level way table, or entries which have been restored to the way table in a reset state could be marked as an entry for which L1 cache miss prediction should not be performed. However, this would require additional circuitry for storing the evicted entries or the additional miss prediction bits, and the energy consumed by this additional circuitry may not be justified because most L1 caches exhibit very low miss rates and so the advantage of L1 miss prediction may not arise often. Therefore, miss prediction based on the validity information is optional.
(54)
(55)
(56)
(57) As will be discussed below, the cache 6 may service multiple requests directed to different banks within the same processing cycle. If so, then it may be necessary to update multiple fields within the same way table entry, or multiple way table entries, in the same processing cycle. Therefore, the methods of
(58) In summary, by providing a way table for storing way information, power consumption can be reduced by accessing only the way indicated in the way information and not accessing the other ways 24 or the tag array 22. By providing way table entries which group together way information corresponding to a page of memory, and linking each way table entry to a corresponding TLB entry, searching the TLB and way table becomes more efficient since a single search of one of these tables is enough to locate both the corresponding TLB entry and way table entries.
(59)
(60) An input vector 210 is provided for buffering pending requests which await servicing by the cache. An arbiter 230 is provided to select which of the pending requests indicated in the input vector 210 should be serviced in a given processing cycle. Load requests from address computation stages 10-0 to 10-3 are placed directly in the input vector 210 to await servicing by the cache. Meanwhile, speculatively issued store requests are sent to the store buffer 15 to await confirmation that the store requests are to be carried out. Once committed, the store requests are transferred to the merge buffer 16 to await servicing by the cache 6. The merge buffer 16 may merge multiple store requests from the store buffer 15 to form a merge buffer entry. Merge buffer entries are then provided to the input vector 210 to await selection by the arbiter 230.
(61) Typically, the merge buffer 16 would defer sending merge buffer entries to the input vector 210 until it receives a new request and would otherwise overflow unless it issues a merge buffer entry to the input vector 210. That is, if the merge buffer 16 receives a store request from the store buffer 15 that addresses a region not represented by any current merge buffer entry, and there are no free merge buffer entries, then the merge buffer 16 would evict a current merge buffer entry and send it to the input vector 210. By delaying eviction of the merge buffer entry as long as possible, this increases the probability that multiple store requests can be merged and processed in the same cycle, therefore making more efficient use of the cache.
(62)
(63) Each vector entry comprises a valid field 232 which indicates whether the entry represents a valid request. For example, some of the address computation stages 10-0 to 10-3 may not be active in a given cycle, or there may not be a pending merge buffer entry #0 or there may not be enough requests left over from a previous cycle, and so some of the vector entries may be invalid. Each vector entry also has an address field 234 for storing the virtual address associated with the data access request, and a type field 236 which indicates the type of entry (e.g. whether the entry corresponds to a load request or a merge buffer entry). Each vector entry also has an age identifier field 238 (e.g. a reorder buffer identifier or ROB ID field) which stores an age identifier distinguishing different access requests of the same type. The age identifier may be allocated sequentially and may be used, for example, to allow a determination of whether one request is older than another request. It will be appreciated that other information or other age identifiers may also be stored in the vector entries.
(64) Note that the input buffer 210 need not comprise storage locations for all of the vector entries 212. For example, the merge buffer entry and new requests may be read directly from the wires received from the merge buffer 16 and address computation stage 10, so that the input buffer 210 need not actually store any data for these requests. On the other hand, storage can be provided for storing the older requests which were not serviced in a previous cycle. Alternatively, in other examples all the requests may be buffered in storage provided in the input buffer 210.
(65) Turning back to
(66)
(67) Each bank comparator 250 has a bank identifier 252 and a cache line identifier 254. The bank identifier 252-0 examines the address of each candidate request selected by comparator 240, and determines from a portion of the address (typically, part of the page offset portion 46) whether the address corresponds to bank 0. If there are any candidate access request corresponding to bank 0, then the first of these requests (for example, the request in the highest priority vector entry) is selected and any other requests targeting bank 0 are provided to the cache line identifier 254. If the first candidate request targeting bank 0 is a load request, then the cache line identifier 254 checks whether the other candidate requests targeting bank 0 target the same cache line as the first candidate request targeting bank 0. If so, then these requests are merged with the first candidate request for bank 0, and the merged request is provided to a selector 260. If there are no other candidate requests for bank 0 that target the same cache line as the first candidate request for bank 0, or the first candidate request is a merge buffer entry, then the first candidate request selected for bank 0 is provided to the selector 260 (merge buffer entries cannot be merged with load requests).
(68) Meanwhile, the other bank comparators 250-1 to 250-3 perform the same operation as bank comparator 250-0 in respect of banks 1, 2 and 3. The selector 260 receives the requests (which could be a single request or a merged request) selected by each bank comparator 250-0 to 250-3 and then issues for servicing by the cache a maximum of one request for each bank 20. The selected request are then provided to the cache 6 which services the request, either returning a value in response to a load request or a writing value to the cache in response to the store request indicated in a merge buffer entry. The results of the access requests are then written back to the processor via buses 280 as shown in
(69) Hence, as shown in
(70) Note that the selector 260 may limit the total number of requests which can be provided to the cache 6 in the same cycle. The number of results buses 280 that connect the cache 6 to the processor 4 may be limited, and so even if many requests can be merged by the arbiter 230, the selector 260 may have to limit the total number of serviced requests to the number of results buses 280. For example, in
(71) While it would be possible for the cache line identifier 254 of each bank comparator 250 to compare the first candidate request selected for each bank with all of the other candidate requests targeting the same bank, in the example of
(72)
(73) Providing circuitry for handling two different page addresses within the same cycle can be useful because some applications may require memory copy operations in which data is read from one memory location and written to a different memory location. Such operations may often require data access requests to be performed in connection with two different pages of memory, and so it can be more efficient to allow those operations to take place in parallel within the same cycle by enabling simultaneous access requests targeting different pages of memory. Nevertheless, by limiting the maximum number of different pages can be handled in the single cycle to two, the circuitry may still be more efficient than in the case where access request specifying any number of different page addresses can be handled simultaneously. In general, the present technique is applicable in any case where the maximum number of different page addresses which can be handled in the same cycle is less than the total number of access request which can be handled by the cache in parallel.
(74)
(75) At step 304, the virtual page ID specified by the first valid vector entry (containing the primary access request) is sent to the TLB 12. At step 306, the TLB 12 receives the virtual page ID of the first valid vector entry and locates the corresponding TLB entry. The TLB translates the virtual page ID to a physical page ID using the TLB entry and returns the physical page ID to the arbitration unit 230. Also at step 308, the way table 14 accesses the corresponding way table entry and the way table 14 sends the way information contained in the way table entry to the arbitration unit 230. When requests are sent to the cache 6 for servicing, the arbitration unit 230 provides the corresponding physical page ID and way information to the cache 6 for use when servicing the requests. Since all requests serviced in the same cycle share the same page ID, the same physical page ID and way table entry can be used for all the requests.
(76) Meanwhile, at step 310, the arbitration unit 230 selects as candidate access requests any of the load requests or merge buffer entries in the input vector 210 which have the same virtual page ID as the virtual page ID sent to the TLB 12. Any requests having different page IDs are not selected.
(77) At step 320, the bank comparator 250-0 checks whether any of the candidate access requests selected at step 310 require access to bank 0 (these requests are referred to as bank 0 candidate access requests). If there is at least one bank 0 candidate access request, then at step 322 the bank comparator 250-0 checks whether one of these requests is a merge buffer entry. If so, then at step 324 that merge buffer entry selected to be serviced by the cache. If there is no bank 0 candidate merge buffer entry, then at step 324 the bank comparator 250-0 selects the first valid candidate load request that targets bank 0. At step 326, the cache line identifier 250-0 examines the other candidate loads targeting bank 0 and checks whether there are any other bank 0 candidate loads which target the same cache line as the first bank 0 candidate loads selected at step 324. If there are such other loads, then at step 328 the loads targeting the same cache line are merged; and the selector 260 selects the merged request for servicing by the cache. If there are no other bank 0 candidate load requests targeting the same cache line as the first bank 0 candidate load, then at step 330 the first bank 0 candidate load is selected for servicing by the cache.
(78) Meanwhile, at steps 340, 350, and 360, the other bank comparators 250-1, 250-2 and 250-3 perform the same steps as in steps 322 to 330 with respect to the candidate requests selected at step 310 that target banks 1, 2 and 3 respectively. Hence, for each bank, if there is at least one candidate request targeting that bank, then a request will be selected for that bank. All the selected requests are sent to the cache 6, and are serviced by the cache 6 at step 380 using the physical page ID and way table entry obtained at steps 306, 308. At step 382, the cache 6 signals to the way table 14 if an update of the way table information is required (if one of the cache accesses results in a cache line fill, eviction, or micro way table miss/cache hit, as discussed previously), and the way table 14 updates the corresponding fields of the way table entry accordingly. At step 384, any access requests remaining in the input vector 210 which have not been serviced by the cache 6 in this cycle are stored in vector entries #1, #2 or #3 as shown in
(79) The store buffer 15 and merge buffer 16 can be further optimised in the context of an embodiment which allows only one virtual page address to be handled within the same cycle. To allow fully associative searches by up to four loads per cycle, the corresponding comparator arrays for searching the store buffer 15 and merge buffer 16 would usually need to be replicated four times. However, if all requests serviced within one cycle have to access the same page, then the comparator arrays responsible for address comparisons can be separated into one common comparator for a virtual page address and four independent comparators for the remaining address fields. For a system using 32 bit addresses and 4 Kbyte pages, this saves three 20 bit comparisons, thus saving energy.
(80) In summary, the present technique is able to service multiple cache accesses in parallel to one another with only a limited number of TLB ports for performing TLB translations and a limited number of comparators in other portions of the system such as in the arbiter 230 or cache 6. This enables a more efficient system with lower energy consumption.
(81) Further information about the present technique is provided below.
1. Introduction
(82) Modern out-of-order superscalar processors rely on speculative execution, wide issue windows and multiple data paths to exploit instruction-level parallelism (ILP), and on memory hierarchies comprising fast on-chip SRAM caches to improve memory access latency and throughput. As a high percentage of all instructions are memory references (40% on average for SPEC CPU2000 running on ARMv7), it is vital for first-level data caches to support multiple accesses in parallel (see Section 2). This problem is exacerbated by advanced memory speculation and disambiguation techniques, which are fairly common in the latest generation of microprocessors, and by vector extensions, like Intel AVX and ARM NEON, which provide support for the SIMD computation paradigm and can introduce even more demanding requirements for the memory subsystem.
(83) However, finite energy sources and limitations in semiconductor technology scaling result in fixed energy budgets for mobile devices; in addition, cooling and electricity costs are becoming increasingly important in the desktop and server segments. As a consequence, energy efficiency has become a determining factor for microprocessor designs and one of the main obstacles to performance improvements. As caches are one of the main contributors to the on-chip power consumption, cache-level optimisations need to trade-off increased performance with degraded energy efficiency.
(84) This technique addresses the problem of implementing multiple-access L1 data caches in an energy efficient manner. Current high end microarchitectures like Intel's Sandy Bridge and AMD's Bulldozer allow up to two 128-bit loads and one 128-bit store per cycle. Both rely on physical multi-porting and cache banking. The former technique is based on modified SRAM cells with multiple ports; it allows low access latencies, but introduces large energy and area penalties. Contrarily, banking effectively reduces the energy consumption per access, by utilizing several smaller structures, each holding a sub-set of cache lines. While banks can be accessed independently to service multiple memory references, accesses mapping to the same bank need to be serialized (bank conflicts). Upcoming processor generations require even more sophisticated cache interfaces to handle aggressive memory speculation and disambiguation techniques as well as advanced vector extensions (e.g. Intel's AVX2 supporting non-unit strided loads).
(85) The problem of handling multiple concurrent memory accesses has already been analysed in the context of vector machines. Tarantula, CODE and the VT architecture are examples of specialized vector architectures. However, these designs require a significant amount of dedicated hardware, and the implemented solutions, while effective on workloads with abundant data-level parallelism, are not suited for general-purpose microprocessors. Key features of modern caches are high capacity and set-associativity to accommodate large working-sets and reduce miss rates, respectively. Yet, although a particular datum can only be located in one way, the lookup of an n-way set-associative cache requires n tag comparisons and n data-array accesses. Techniques that attempt to save energy by avoiding redundant accesses may be categorized as way prediction, way estimation and way determination schemes. The first group predicts ways based on MRU statistics. While this concept is simple to implement, false predictions require a second cache access to find the desired datum within the previously discarded ways. Other schemes attempt to mitigate this problem by increasing prediction accuracy based on a combination of selective direct-mapping and way-prediction. Way estimation techniques deliver a set of ways instead of a single way. If the desired data resides within the cache, it is guaranteed to be found there. Consequently, cache accesses require no more than one cycle, but may consume energy for several redundant tag comparisons. An alternative group of techniques determines rather than predicts ways. A way determination unit (WDU) stores way information for a set of recently used cache lines in a small buffer. Each line is associated with exactly one way and guaranteed to hit there or miss the whole cache. The WDU requires a fully associative lookup structure including one port per parallel access, resulting in energy efficiency inversely proportional to its size. Consequently, it is limited to systems with high temporal locality of cache accesses designed to service a low number of loads and stores in parallel.
(86) The Multiple Access Low Energy Cache (MALEC) proposed in this technique is based on the observation that consecutive memory references are very likely to access the same page. Consequently, it shares memory address translation results between multiple accesses and simplifies the lookup structures of certain components common in superscalar out-of-order processors (i.e. store and merge buffers). Page-based access grouping furthermore allows the application of a small set of comparators to identify loads accessing the same cache line. Sharing data received from a single cache access among those loads effectively reduces the number of bank conflicts. Moreover, MALEC introduces a novel way determination scheme that simultaneously provides way information to loads and stores accessing the same page. It is capable of re-using address comparisons required for TLB lookups to simplify its own lookup mechanism. With the addition of validity information for way determinations it enables a majority of memory references to completely avoid tag comparisons and directly access desired cache lines. This distinguishes MALEC from other prediction schemes for D-caches that need to verify their results with at least one tag comparison.
2. Motivation
(87) A primary performance metric for superscalar processors is the number of instructions executed per cycle. Our analysis of the SPEC CPU2000 and MediaBench2 benchmark suites showed a severe impact of memory references on this metric. In fact, they constitute 40% of all instructions (ratio loads:stores=2:1). SIMD extensions, which operates on vectors rather than scalars, further intensifies the need for a powerful cache interface.
(88)
(89)
3. Basic Processor Cache Interface
(90) Since MALEC targets out-of-order superscalar processors, it is important to consider a variety of complex components relevant for those designs.
(91) The Store Buffer (SB) shown in
(92)
4. Page-Based Access Grouping
(93)
(94) Stores finishing address computation are directly sent to the SB to reduce the number of address translations per cycle. This does not impose a significant performance penalty, as stores commit to the MB instead of the L1. Contrarily, loads finishing address computation represent four out of eight entries of the so called Input Vector. The remaining entries are up to three loads that could not be serviced in previous cycles and up to one Merge Buffer entry (MBE). Entries are prioritized in the order: old loads, new loads and evicted MBE. Reason for the low priority of evicted MBEs is the fact that stores represented by an MBE are already committed and therefore no longer time critical. At the start of each cycle, the virtual page ID (vPageID, see
(95) Memory accesses selected by the Arbitration unit are sent to the L1 cache and in case of loads also to the SB and MB. The cache itself is unmodified to allow the implementation of highly optimized designs. A special case are sub-banked caches that attempt to save energy by splitting data arrays in smaller independent banks (usually 128 bit wide). MALEC requires those designs to return data from two sub-banks for every read access, instead of only for those that exceed one sub-bank. This effectively doubles the probability for loads to be able to share results read from cache. The designs of SB and MB are slightly modified to reduce the energy impact of additional ports required to service up to four loads in parallel. Specifically, their lookup structure is split into two segments. One to look up address bits corresponding to the vPageID shared among all four loads. The second to compare the remaining bits that identify access specific address regions (i.e. cache lines/sub-banks). As both segments are looked up simultaneously, the MB and SB energy requirements are reduced without introducing additional latencies. Note, as SB and MB are in contrast to the L1 cache virtually tagged, the vPage lookup could actually be performed prior to address translation. However, this would complicate the execution of address translation and data access in separate pipeline stages as it is common for high performance processors.
(96) In summary, MALEC attempts to service multiple instructions in parallel by utilizing techniques like cache banking and merge buffers. In addition, it allows loads to share data read from L1 cache lines and introduces mechanisms to improve energy efficiency. Page-based memory access grouping is utilized to re-use page translation results and simplify address comparisons within the Arbitration Unit, the SB and the MB. The Arbitration Unit is further simplified by the limitation to three consecutive accesses following the highest priority load to a particular bank.
(97) A key concern of MALEC is the latency introduced by its components. To address this, the Input Vector masks the latency of vPageID comparisons by performing them simultaneously to uTLB/TLB accesses. Comparisons between the remaining address bits to identify loads accessing the same cache line are done in parallel within the Arbitration Unit. Consequently, the units overall latency is equivalent to one narrow comparator and some additional control circuitry. Another concern is the scalability of MALEC. In fact, its efficiency is proportional to the number of parallel memory requests generated by the processor. Reason for this is the increasing probability to have multiple accesses within the Input Vector that may share address translations, way information andin case of loads to the same cache linedata read from L1. Furthermore, MALEC would benefit systems with 40 bit or 48 bit of address space even more, because the additional energy needed for vPage comparisons within the Input Vector is outweighed by higher savings due to the reduced number of address translations and the simplified SB and MB lookups. In addition, larger L1 tag-arrays further improve the efficiency of MALEC's way determination scheme (see Section 5). Finally, a major concern for out-of-order processors is the handling of precise exceptions. The amount of speculative state held by MALEC itself is limited to loads within the Input Vector and the Arbitration Unit. All other information, e.g. in form of uWT/WT entries or evicted MB entries, is non-speculative and therefore of no concern for any recovery mechanism.
5. Page-Based Way Determination
(98) Key components of the way determination scheme proposed in this technique are the so called Way Table (WT) and Micro Way Table (uWT) (see
(99) The Arbitration Unit evaluates WT entries by associating ways to groups of desired cache lines and forwarding them to the corresponding cache banks. MALEC utilizes way information by supporting two different access modes for the previously introduced cache design of
(100) Conventional cache access (way unknown):
(101) parallel access to all tag- and all data-arrays
(102) select data associated with matching tag
(103) Reduced cache access (way known and valid):
(104) No tag arrays accessed
(105) access to one specific data-array only
(106) Prerequisite for reduced cache accesses is the accuracy of way information. Updates of uWT and WT are performed on each cache line fill and eviction, whereby the WT is only accessed if no uWT entry was found. The synchronization of uWT and WT is based on full entries transferred during uTLB updates. As update information generated by the cache is physically addressed, uTLB and TLB need to be modified to allow lookups based on physical in addition to virtual PageIDs. Furthermore, the finite number of TLB entries might require the eviction of a page that still has corresponding lines within the cache. Should one of these lines be re-accessed later on, a new WT entry is allocated and all way information invalidated. To compensate for the loss of information, the last read uWT entry is held alive and way information updated on subsequent cache hits if necessary.
(107) Important parameters for way determination schemes are energy consumption, latency and scalability. As uWT/WT accesses and address comparisons within the Arbitration Unit are handled in parallel, latencies introduced by both components may be overlapped. The scheme is designed to simultaneously deliver way predictions for all lines within one page, and is therefore independent of MALEC's actual computation performance. The scheme becomes more energy efficient for wider L1 cache lines and address busses (e.g. 48 bit), because WT entries are reduced in size and savings due to re-used TLB address lookups increased, respectively. Although higher cache associativities would require wider WT entries, the additional energy saved due to way determination actually increases the scheme's efficiency. Finally, as large pages (e.g. 64K) would significantly increase the number of lines hold per WT entry, the scheme requires TLB entries to be quantized in 4 Kbyte segments when entering the uTLB. The WT itself can be segmented into a small number of chunks, each representing data corresponding to a 4 Kbyte address space. By allocating and replacing chunks in a FIFO or LRU manner, their number can be smaller than required to represent full pages.
6. Evaluation and Results
(108) In order to evaluate the impact of MALEC on performance and energy consumption and to compare it to existing schemes, the gem5 Simulator System was extended to support an enhanced processor-cache interface capable of modelling the micro-architectural aspects involved in this study with cycle-level accuracy. Access statistics obtained from gem5 are then combined with energy estimates calculated using CACTI v.6.5 to determine the energy consumption of the data cache subsystem, including both static and dynamic components. In particular, the evaluation includes the energy contribution of the following structures: L1 data cache (including tag&data SRAM arrays and control logic), uTLB+uWT and TLB+WT. While the modelled L1 cache interface includes other structures, like LQ, SB, and MB, their contribution to the overall energy consumption is not taken into account for two reasons: first, L1, uWT and WT account for the majority of transistors of the L1 interface, and therefore its leakage power. Second, the energy contributed by other components like LQ, SB and MB is very similar between MALEC and the analysed baselines. Our simulations show that this is also the case for lower memory levels, i.e. L2 cache and main memory, as MALEC alters the timing of L2 accesses, but does not significantly impact their number or miss rate.
(109) TABLE-US-00001 TABLE 1 Address computations uTLB/TLB Cache Ports per cycle ports per Bank Base1ldst 1 ld/st 1 rd/wt 1 rd/wt Base2ld1st 2 ld + 1 st 1 rd/wt + 2 rd 1 rd/wt + 1 rd MALEC 1 ld + 2 ld/st 1 rd/wt 1 rd/wt
Table 1 characterizes the analyzed baseline and the chosen MALEC configuration in terms of potential address computations per cycle (ld. load, st. store, ld.st. load or store), as well as number of uTLB, TLB and cache ports (rd. read, wt. write, rd/wt. read or write). While Base1ldst is restricted to a single load or store per cycle, Base2ld1st represents a high performance configuration allowing up to two loads and one store in parallel. As the simulated processor is optimized for Base2ld1st, the MALEC configuration introduced in Section 4 is scaled down to achieve a similar performance. This allows fair comparisons particularly between Base2ld1st and MALEC.
(110) TABLE-US-00002 TABLE 2 Component Parameter Processor Single-core, out-of-order, 1 GHz clock, 168 ROB entries, 6 element fetch&dispatch width, 8 element issue width L1 interface 64 TLB entries, 16 uTLB entries, 40 LQ entries, 24 SB entries, 4 MB entries L1 D-cache 32 Kbyte capacity, 64 byte line size, 2 cycle latency, 4 independent banks, 4-way set- associative, physically indexed, physically tagged L2 cache 1 MByte capacity, 16-way set associative CACTI 32 nm technology, design objective low dynamic power, cell type low standby power for data & tag arrays and high performance for peripherals, L1 with ECC
The configuration of the simulated system is based on an ARMv7-compatible single-core out-of-order processor operating at 1 GHz. Relevant configuration parameters of the analyzed processor-cache interface are summarised in Table 2. The benchmark suites utilized in this study, MediaBench 2 (MB2), SPEC CPU2000 Int and FP, represent a set of workloads with a multitude of different memory access behaviours. In case of MB2 we enabled automatic vectorization using the NEON SIMD engine, to increase their pressure on the L1 interface by these benchmarks. In order to reduce simulation times, SimPoint v.3.1 was used to identify the most representative execution phase of each benchmark. Each phase includes 1 billion instructions of the corresponding reference working set.
6.1 Performance Evaluation
(111)
(112) Performance benefits granted by MALEC over Base1ldst originate from two mechanisms: grouping of loads to the same cache line and accessing multiple cache banks in parallel.
(113) 6.2 Energy Evaluation
(114)
(115) As leakage contributes about 50% to the overall energy consumption in the analyzed 32 nm technology library, it is important to account for it.
(116) Alternative way determination schemes may be considered to reduce MALEC's leakage power. Similar to the way table proposed here, a way determination unit (WDU) (see Section 1) can be implemented to support reduced cache accesses as introduced in Section 5. It holds way information of recently accessed cache lines in a small buffer structure.
(117) One approach to directly improve MALEC's leakage power consumption is to reduce the number of uTLB entries and therefore the uWT size. However, this effectively reduces the uWT's coverage, which increases the dynamic energy consumed due to more WT accesses. Our simulations show that for the analyzed MALEC implementation this trade-off leads to overall energy consumption widely independent of the uTLB size; i.e. the energy difference between 4, 8, 16 and 32 entries is no more than 1%. Abandoning the WT completely in favour of lower leakage significantly reduces the number of L1 accesses covered by way determination. A simulation utilizing a 16 entry uWT and no WT achieved an average coverage of approximately 70%, increasing MALEC's energy consumption by 5%.
7. Conclusions
(118) This technique addresses the problem of high energy consumption in L1 cache interfaces to high performance out-of-order superscalar processors. The Multiple Access Low Energy Cache (MALEC) is based on the observation that consecutive memory references are very likely to access the same page. It shares memory address translation results between multiple loads and stores, simplifies store and merge buffer lookup structures and shares L1 data among loads accessing the same cache line. Furthermore, it introduces a novel way determination scheme that simultaneously provides way information for all cache lines mapping to a single page. MALEC is evaluated based on simulations of a 32 nm implementation utilizing a 32 KByte, 4-way set-associative L1 data cache with 64 Byte lines and an aggressive out-of-order processor to execute SPEC CPU2000 and Media-Bench2 benchmarks. Compared to a basic cache interface, capable of servicing 1 load or store per cycle, the chosen MALEC configuration achieves 9% speedup using 12% less energy. Contrarily, a conventional interface that achieves only a slightly higher performance than MALEC consumes 51% more instead of 12% less energy than the baseline.
(119) Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.