Systems and methods for efficient cacheline handling based on predictions
11138121 · 2021-10-05
Assignee
Inventors
- Erik Ernst Hagersten (Uppsala, SE)
- Andreas Karl Sembrant (Uppsala, SE)
- David Black-Schaffer (Uppsala, SE)
CPC classification
G06F2212/6046
PHYSICS
G06F12/126
PHYSICS
G06F12/0846
PHYSICS
International classification
G06F12/126
PHYSICS
G06F12/0846
PHYSICS
Abstract
A data management method for a processor to which a first cache, a second cache, and a behavior history table are allocated includes tracking reuse information of learning cache lines stored in at least one of the first cache and the second cache; recording the reuse information in the behavior history table; and determining a placement policy with respect to future operations that are to be performed on a plurality of cache lines stored in the first cache and the second cache, based on the reuse information in the behavior history table.
Claims
1. A data management method for a multi-core processor system including a plurality of processor cores, a plurality of caches, and a behavior history table, the plurality of caches including first caches of a first cache level and second caches of a second cache level, the method comprising: tracking reuse information of learning cache lines stored in at least one cache of the first caches or the second caches; recording the reuse information in the behavior history table; determining a placement policy with respect to future operations that are to be performed on a plurality of cache lines stored in the first caches or the second caches, based on the reuse information in the behavior history table, wherein the second cache level is at a higher level than the first cache level, each of the first caches is private to a corresponding processor core among the plurality of processor cores, the second caches are shared among the plurality of processor cores, the reuse information includes a plurality of reuse counters corresponding, respectively, to the plurality of processor cores, and each reuse counter, from among the plurality of reuse counters, corresponds to a different one processor core from among the plurality of processor cores; incrementing the reuse counters each time learning cache lines stored in the second caches are accessed by the corresponding processor core; calculating a total reuse count by adding up the reuse counters; and recording the total reuse count by a separate counter in the behavior history table.
2. The method of claim 1, wherein the method further comprises: updating, by the multi-core processor system, at least one behavior counter from among a plurality of behavior counters included in the reuse information each time a type of usage corresponding to the at least one behavior counter occurs with respect to at least one of the learning cache lines.
3. The method of claim 2, wherein the method further comprises: updating, by the multi-core processor system, at least one behavior counter from among the plurality of behavior counters each time at least one of the learning cache lines is accessed by a read request.
4. The method of claim 1, further comprising: storing the determined placement policy in the behavior history table.
5. The method of claim 1, further comprising: randomly selecting at least some cache lines among the plurality of cache lines stored in at least one cache of the first caches or the second caches as the learning cache lines.
6. The method of claim 1, wherein the plurality of processor cores includes a first core and a second core, and wherein the first core has a shorter access time to at least one cache of the first caches than the second core, and the second core has a shorter access time to at least one cache of the second caches than the first core.
7. A multi-core processor system comprising: a plurality of processor cores; a plurality of caches; and a behavior history table, the plurality of caches including first caches of a first cache level and second caches of a second cache level higher than the first cache level, the first and second cache levels being different cache levels with respect to each other, wherein at least one processor core from among the plurality of processor cores is configured to determine a placement policy with respect to future operations that are to be performed on a plurality of cache lines stored in at least one cache of the first caches or the second caches, based on reuse information recorded in the behavior history table, wherein the reuse information is information about reuse of learning cache lines stored in at least one cache of the first caches or the second caches, the reuse information including a plurality of reuse counters corresponding, respectively, to the plurality of processor cores, wherein each reuse counter, from among the plurality of reuse counters, corresponds to a different one processor core from among the plurality of processor cores, wherein each of the first caches is private to a corresponding processor core among the plurality of processor cores, wherein the reuse counters are incremented each time the learning cache lines stored in the second caches are accessed by the corresponding processor core, and wherein a total reuse count calculated by adding up the reuse counters is recorded by a separate counter in the behavior history table.
8. The system of claim 7, wherein the at least one processor core is configured to update at least one reuse counter from among the plurality of reuse counters each time a type of usage corresponding to the at least one reuse counter occurs with respect to at least one of the learning cache lines.
9. The system of claim 7, wherein the behavior history table includes at least one unused counter, and the at least one processor core is configured to update the at least one unused counter each time at least one of the learning cache lines is replaced before a single reuse of the learning cache lines has occurred.
10. The system of claim 7, wherein the behavior history table includes a policy field that stores a policy in accordance with the placement policy determined by the at least one processor core.
11. The system of claim 7, wherein each of the learning cache lines is extended with a reuse information field that is configured to store reuse information of the learning cache lines.
12. The system of claim 7, wherein each learning cache line is extended with a learning bit indicating that the learning cache line is a learning cache line.
13. A non-transitory computer-readable storage medium comprising instructions that, when executed by at least one processor core of a multi-core processor system including a plurality of processor cores, a plurality of caches, and a behavior history table, the plurality of caches including first caches of a first cache level and second caches of a second cache level, cause the at least one processor core to perform operations including, tracking reuse information of learning cache lines stored in at least one cache of the first caches or the second caches; recording the reuse information in the behavior history table; determining a placement policy with respect to future operations that are to be performed on a plurality of cache lines stored in the first caches or the second caches, based on the reuse information in the behavior history table, wherein the second cache level is at a higher level than the first cache level, each of the first caches is private to a corresponding processor core among the plurality of processor cores, the second caches are shared among the plurality of processor cores, the reuse information includes a plurality of reuse counters corresponding, respectively, to the plurality of processor cores, and each reuse counter, from among the plurality of reuse counters, corresponds to a different one processor core from among the plurality of processor cores; incrementing the reuse counters each time learning cache lines stored in the second caches are accessed by the corresponding processor core; calculating a total reuse count by adding up the reuse counters; and recording the total reuse count by a separate counter in the behavior history table.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The above and other features and advantages of example embodiments of the inventive concepts will become more apparent by describing in detail example embodiments of the inventive concepts with reference to the attached drawings. The accompanying drawings are intended to depict example embodiments of the inventive concepts and should not be interpreted to limit the intended scope of the claims. The accompanying drawings are not to be considered as drawn to scale unless explicitly noted.
DETAILED DESCRIPTION
(15) As is traditional in the field of the inventive concepts, embodiments are described, and illustrated in the drawings, in terms of functional blocks, units and/or modules. Those skilled in the art will appreciate that these blocks, units and/or modules are physically implemented by electronic (or optical) circuits such as logic circuits, discrete components, microprocessors, hard-wired circuits, memory elements, wiring connections, and the like, which may be formed using semiconductor-based fabrication techniques or other manufacturing technologies. In the case of the blocks, units and/or modules being implemented by microprocessors or similar, they may be programmed using software (e.g., microcode) to perform various functions discussed herein and may optionally be driven by firmware and/or software. Alternatively, each block, unit and/or module may be implemented by dedicated hardware, or as a combination of dedicated hardware to perform some functions and a processor (e.g., one or more programmed microprocessors and associated circuitry) to perform other functions. Also, each block, unit and/or module of the embodiments may be physically separated into two or more interacting and discrete blocks, units and/or modules without departing from the scope of the inventive concepts. Further, the blocks, units and/or modules of the embodiments may be physically combined into more complex blocks, units and/or modules without departing from the scope of the inventive concepts.
(16) During an execution of an application, many costly operations upon, and movements of, data units are performed. The cost of some of these operations depends on which operations have been applied to the data unit previously, e.g., a read request to a data unit will lower the cost of a subsequent write operation to the data unit if the data unit is brought into the L1 cache in a writable state. Furthermore, operations applied to private regions can be handled more desirably or, alternatively, optimally if it is known that the region is likely to stay private in the future. Also, the cost of data movements depends upon placement decisions made by previous operations. For example, if data likely to be reused is placed in a faster cache than data that is less likely to be reused, future operations are more likely to find the requested data in the faster cache. Accordingly, it would be desirable to provide systems and methods that predict future operations and enable more desirable or, alternatively, optimal choices to be made for the current operations.
(17) According to at least some example embodiments of the inventive concepts, a Future Behavior Prediction (FBP) mechanism can be used to predict such future operations. According to at least one example embodiment of the inventive concepts, FBP is built from a combination of some or all of the following five components:
(18) 1. Identifying dataset: The behavior may be tracked for each individual cache line. Another alternative is to track the behavior for a group of cache lines that are believed to have a similar behavior, here referred to as a dataset. According to at least one example embodiment of the inventive concepts, data units located close to each other in the address space are determined to belong to the same dataset. For example, according to at least some example embodiments, the address space may be divided into N different groups of contiguous addresses. Further, the N groups of addresses may correspond, respectively, to N datasets such that data units having addresses included in a particular group, from among the N groups of addresses, are considered to belong to the dataset, from among the N datasets, to which the particular group corresponds. According to at least one example embodiment of the inventive concepts, each dataset may be identified with assistance from the programmer, the compiler and/or a runtime system. According to at least one example embodiment of the inventive concepts, the Program Counter (PC) value (i.e., the value or instruction address stored in the PC) identifying the instruction that brings a cache line into the cache hierarchy from memory, or from a cache level higher than a specific FBP level threshold, is used to identify the dataset it belongs to. According to at least another example embodiment of the inventive concepts, the PC value that caused a TLB fault for the page where the data resides is used to identify the dataset of that page. According to at least another example embodiment of the inventive concepts, the PC value that caused a CLB miss at a certain CLB level for a region where the data resides is used to identify the dataset of that region. According to at least another example embodiment of the inventive concepts, the PC value of an instruction that generated at least one of the “cache line requests” that initiated a hardware prefetch stream is used to identify the dataset. According to at least another example embodiment of the inventive concepts, call stack information (for example, the identity of the PC values of the last function calls) is used to identify the dataset. According to at least one example embodiment of the inventive concepts, two or more of the above schemes are combined for identifying a dataset. Those skilled in the art will realize that, in order to save storage space, both the call stack and the PC value may be represented by some subset of their address bits or by some other transformation function using their address bits as an input. According to at least one example embodiment of the inventive concepts, the dataset is identified by a dataset identifier (DID). According to at least one example embodiment of the inventive concepts, the DID is composed, at least in part, of some bits from a PC value, some call stack information and/or some address bits of an address range.
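As a concrete illustration of the PC- and call-stack-based identification schemes above, the following sketch derives a DID from the low bits of the fetching instruction's PC value combined with a folded call-stack signature. This is a minimal sketch under assumed parameters: the bit widths, the two-deep call stack, and the folding function are illustrative choices, not taken from the text.

```python
# Hypothetical sketch: forming a dataset identifier (DID) from PC bits
# plus a small call-stack signature. Widths are illustrative assumptions.

def make_did(fetch_pc: int, call_stack_pcs: list, pc_bits: int = 12) -> int:
    """Combine the low bits of the fetching PC with a folded call-stack hash."""
    pc_part = fetch_pc & ((1 << pc_bits) - 1)   # low PC bits identify the instruction
    stack_sig = 0
    for pc in call_stack_pcs[-2:]:              # last two call sites, as one option
        stack_sig ^= (pc >> 2) & 0xF            # fold each call-site PC to 4 bits
    return (stack_sig << pc_bits) | pc_part

# Example: line fetched by the instruction at 0x401A2C, reached via two calls.
did = make_did(0x401A2C, [0x400F10, 0x4011B4])
```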
(19) 2. Detecting special usage: One or more types of special usage of a cache line may be detected and recorded. For example, according to at least one example embodiment of the inventive concepts, the number of special usages of a certain type (e.g., read accesses) of the cache line or a dataset is tracked and recorded by a counter counting the number of times that special usage occurs for a cache line or dataset. Every possible type of cache line usage may be recorded as a special usage. Types of such special usage to be tracked and recorded include, but are not limited to, read accesses, write accesses, cache allocations, cache evictions, cache eviction of a cache line that has never been reused, conversion of a region from a private region to a shared region, conversion of a cache line from read-only to writable, the number of cache lines currently residing in the cache hierarchy, or the number of regions or pages currently residing in the cache hierarchy. According to at least one example embodiment of the inventive concepts, the reuse information consists of a single reuse bit that records whether a cache line, region or page has been accessed at all after its initial installation (or, storage) at a specific level. According to at least one example embodiment of the inventive concepts, the reuse for a cache line at a specific cache level is determined by looking at the cache line's reuse information when the cache line is replaced. Someone skilled in the art understands that many more special usage types can be tracked and recorded and that enumerating a complete list is unnecessary. According to at least one example embodiment of the inventive concepts, some special usages of cache lines are recorded per core while other special usages are recorded for the entire system. Modern computers are often equipped with a multitude of event counters capable of counting a large number of different hardware events. All such events could also be recorded by the described mechanism.
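The single-reuse-bit variant described above can be sketched as follows: the bit is cleared when a line is installed at a level, set on any subsequent access, and sampled when the line is replaced. The class and counter names are hypothetical; real hardware would keep the bit alongside the cache tags rather than in a software object.

```python
# Illustrative model of a per-line reuse bit sampled at replacement time.

class CacheLine:
    def __init__(self, tag: int):
        self.tag = tag
        self.reuse_bit = False          # cleared at installation

    def access(self) -> None:
        self.reuse_bit = True           # any access after install counts as reuse

def on_replacement(line: CacheLine, reused: int, unused: int):
    """Fold one eviction into (reused, unused) tallies for this level."""
    if line.reuse_bit:
        return reused + 1, unused
    return reused, unused + 1
```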
(20) 3. Selective learning: Sometimes, recording every special usage for all cache lines could be too costly. According to at least one example embodiment of the inventive concepts, so-called learning cache lines are selected and special usage(s) are collected only for these cache lines. According to at least one example embodiment of the inventive concepts, learning cache lines are selected randomly. According to at least one example embodiment of the inventive concepts, only cache lines belonging to certain pages, regions or other types of address ranges (which may be referred to as learning pages, learning regions or learning address ranges) are learning cache lines. According to at least one example embodiment of the inventive concepts, each such learning page, region or address range is selected randomly. According to at least one example embodiment of the inventive concepts, each such page, region or address range is marked as a learning address range or as a learning cache line. The learning cache lines may also be selected based on which dataset (DID) they belong to. According to at least one example embodiment of the inventive concepts, all cache lines are learning cache lines. Several of the selection methods described above could also be combined.
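One possible selection scheme, sketched below, combines random per-line sampling with always-learning regions; the sampling rate and the region granularity (4 kB pages) are assumptions for illustration.

```python
import random

LEARNING_FRACTION = 1 / 64              # assumed sampling rate

def is_learning_line(address: int, learning_regions: set,
                     region_shift: int = 12) -> bool:
    """Decide at install time whether a line becomes a learning cache line."""
    if (address >> region_shift) in learning_regions:
        return True                      # lines in learning regions always learn
    return random.random() < LEARNING_FRACTION
```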
(21) According to at least one example embodiment of the inventive concepts, learning cache lines are operated upon in a special way. For example, a learning cache line may be installed in all cache levels, while the rest of the cache lines will be installed only in the levels identified by a certain placement policy, e.g., a placement policy associated with their DID. According to at least one example embodiment of the inventive concepts, special usage is detected only for learning cache lines, as described above in “2. Detecting special usage”.
(22) 4. Recording special reuse: When a special usage of a learning cache line is detected, this detection is recorded in a Behavior History Table (BHT). According to at least one example embodiment of the inventive concepts, a Behavior History Table (BHT) is used to record the data reuse. The BHT collects reuse information from learning cache lines at different cache levels. In one implementation, each entry in the BHT is associated with a BHT identifier (BHTI), at least part of which is a dataset identifier (DID). Each BHT entry has some number of behavior counters (BC), which are updated each time a corresponding special usage for the dataset associated with the BHT entry is recorded. A BHT may be organized as an associative storage indexed by some of the BHTI bits and tagged by some of the BHTI bits. A BHT may also be organized as a table indexed by some BHTI bits, but with no tag.
(23) When a special usage of a learning cache line is detected, an associated BHT entry is selected, at least in part by using the DID associated with the cache line. The behavior counter (BC) of the selected BHT entry corresponding to the special usage detected is incremented or decremented.
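The sketch below models one possible BHT organization consistent with the description: a set-associative table indexed by low DID bits and tagged by the DID, with saturating behavior counters per entry. The geometry, the replacement choice, and the event names are illustrative assumptions.

```python
# Hypothetical set-associative BHT with saturating behavior counters.

NUM_SETS, NUM_WAYS, CTR_MAX = 256, 4, 63    # assumed geometry, 6-bit counters

class BHTEntry:
    def __init__(self, did: int):
        self.did = did
        self.counters = {}                   # special-usage name -> count

    def bump(self, event: str) -> None:
        self.counters[event] = min(self.counters.get(event, 0) + 1, CTR_MAX)

class BHT:
    def __init__(self):
        self.sets = [[] for _ in range(NUM_SETS)]

    def record(self, did: int, event: str) -> None:
        ways = self.sets[did % NUM_SETS]     # index by low DID bits
        for entry in ways:
            if entry.did == did:             # tag match
                entry.bump(event)
                return
        if len(ways) == NUM_WAYS:            # simple FIFO replacement, assumed
            ways.pop(0)
        entry = BHTEntry(did)
        entry.bump(event)
        ways.append(entry)

bht = BHT()
bht.record(did=0x3A2, event="l2_reuse")      # a learning line was reused in L2
```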
(24) 5. History-based policy: Based on the reuse information collected in a BHT, a policy can be determined for future operations on certain cache lines, regions, pages or other address ranges. The policy can, for example, be based on the assumption that the counter values collected for a dataset will be representative of the dataset's future behavior. For example, the counters for one or many BHT entries can be examined periodically, and policies for future accesses to the datasets corresponding to one or several BHT entries can be determined. For example, for a dataset Z identified by a DID that has shown good reuse (e.g., reuse equal to or above a threshold value, which may be set based on empirical analysis) at cache level X but not at cache level Y, the corresponding future policy is to install the dataset Z in cache level X, but not cache level Y. In another example, if a dataset A, identified by a DID, has shown more frequent reuse than a dataset B, identified by a different DID, when accessing a cache with variable latency (e.g., a non-uniform cache architecture (NUCA)), then the future policy is to install dataset A in a faster portion of the cache and dataset B in a slower portion of the cache. In yet another example, for a dataset C, identified by a DID, that has shown better reuse than a dataset D, identified by a different DID, and where dataset C has been identified as being accessed mostly by a CPU P, the future policy is to install dataset C in a cache, or a portion of a cache, with a shorter access time with respect to CPU P, after which the appropriate placement for dataset D is determined.
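As a worked example of deriving such a policy, the sketch below turns per-level reuse counts into a one-bit install/bypass decision per cache level. The threshold of 48 follows the numeric example given further below; everything else is assumed.

```python
REUSE_THRESHOLD = 48    # assumed tuning knob, matching the later example

def placement_policy(per_level_reuse: dict) -> dict:
    """Map cache level -> install? from observed reuse counts for a dataset."""
    return {lvl: cnt >= REUSE_THRESHOLD for lvl, cnt in per_level_reuse.items()}

# A dataset reused often in levels 1 and 2 but rarely in level 3 is
# installed in levels 1 and 2 and bypasses level 3:
policy = placement_policy({1: 60, 2: 51, 3: 2})   # {1: True, 2: True, 3: False}
```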
(25) The most recent policy decision for each BHT entry can be stored with the BHT entry. For example, before making an installation decision for a cache line of a dataset A identified by a specific DID, that DID can be used to find the corresponding entry in the BHT, and its most recent policy can be used to guide the installation of the cache line.
(27) A behavior history table (BHT) 870 has also been added. Each entry in the table 870 contains a dataset identifier DID 871, identifying the dataset associated with each table entry and used, for example, as an address tag to allow for associative lookups in the BHT 870 structure, and some reuse information collected for that dataset. In this implementation example, counters counting the number of learning cache lines with reuses at each level (1R, 2R, 3R . . . ) are shown (873, 875). Also shown are counters counting the number of unused learning cache lines (1U, 2U, . . . ) (874, 876). Based on the counter values, a placement policy for the dataset is selected. The current placement policy is stored in the policy field, POL 872. According to at least one example embodiment of the inventive concepts, the policy is represented by one bit for each level in the cache hierarchy indicating whether or not the dataset identified by DID 871 should be installed in that cache level. Those skilled in the art will understand that a similar functionality can be achieved using a multitude of different implementation choices.
(29) According to at least one example embodiment of the inventive concepts, FBP is used to make placement decisions for a cache hierarchy with four cache levels: 4 kB, 32 kB, 256 kB and 8 MB, respectively. Each cache entry is extended to store a learning bit (L), one or more reuse bits and a dataset identifier consisting of the 12 lowest bits of the PC value that brought the cache line from memory into the cache hierarchy. The BHT is organized as a set-associative cache with 256 sets of four ways each. A BHT entry contains a 6-bit DID tag, a 4-bit policy field (each bit corresponding to one of the four cache levels) and two 6-bit counters, U and R, for each cache level. When either of the two counters reaches its maximum value or, alternatively, a threshold value, a decision is made to install data in the corresponding cache level if the corresponding R counter value is higher than the threshold, e.g., 48. Over a wide set of applications, FBP according to these embodiments is shown to make substantially fewer installations at each cache level. On average, FBP performs fewer than 50% of the installs compared with a standard cache hierarchy with no placement policy.
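One way to model the saturating U/R counter pair and the decision rule described above is shown below. The restart of both counters after a decision is an assumption; the field widths and threshold follow the example in the preceding paragraph.

```python
CTR_MAX, THRESHOLD = 63, 48     # 6-bit counters, threshold from the text

def tick(r: int, u: int, reused: bool, install_bit: bool):
    """Fold one evicted learning line into (R, U) for one cache level.

    Returns the updated (R, U, install_bit) state; when either counter
    saturates, the install bit for this level is re-decided.
    """
    r, u = (r + 1, u) if reused else (r, u + 1)
    if r >= CTR_MAX or u >= CTR_MAX:    # a counter saturated: make a decision
        install_bit = r > THRESHOLD
        r = u = 0                       # assumed: restart the learning epoch
    return r, u, install_bit
```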
(31) A behavior history table (BHT) 970, similar to the one described above, may also be used.
(33) As earlier discussed, a dataset may be identified at least in part by a PC value of an instruction that generated at least one of the “cache line requests” that caused a hardware prefetch stream to start. This dataset will select learning accesses like any other dataset and learn the best placement strategy across the cache levels for the prefetched dataset, similarly to any other dataset described in accordance with one or more example embodiments of the inventive concepts.
(34) So-called non-uniform cache architectures (NUCA) are becoming more common. NUCA refers to a multiprocessor system where one or more cache levels are logically shared between the cores, but physically distributed among them. In a NUCA system, a core will have a shorter access time to “its slice” of the shared NUCA cache than to some other slice of the NUCA shared cache.
(35) It would be beneficial if cache lines could be placed close to the core accessing them. It would also be beneficial to place the most frequently reused cache lines in the L2 cache rather than in the L3 cache.
(36) A NUCA Aware Placement algorithm (NAP) is a specialized implementation of FBP targeting desirable or, alternatively, optimal cache line placements in NUCA systems. The initial NAP description targets a tag-less NUCA system.
(37) A NAP identifies the dataset of each region with a DID.
(39) According to at least one example embodiment of the inventive concepts, there is one reuse counter per core, shown as C1, C2, C3 and C4.
(40) The per-core reuse counters of a NAP entry are incremented each time a learning cache line in L2 or L3 associated with the entry is accessed by the corresponding core. According to at least one example embodiment of the inventive concepts, each counter is incremented only for accesses of a certain type, for example only for read accesses. According to at least one example embodiment of the inventive concepts, each counter is incremented for all accesses and not just for learning cache lines.
(41) The size counters of a NAP entry are incremented each time a data unit associated with the entry is brought into the cache system and decremented each time a data unit (e.g., a data unit associated with the entry) is evicted from the cache system. According to at least one example embodiment of the inventive concepts, the size counter of a NAP entry is incremented/decremented each time a CLB region associated with the entry is allocated/evicted at some level of the CLB hierarchy. According to at least one example embodiment of the inventive concepts, the size counter of a NAP entry is incremented/decremented each time a page associated with the entry is allocated/evicted at some level of the TLB hierarchy. According to at least one example embodiment of the inventive concepts, the allocation and eviction of some other data entity associated with the entry will increment and decrement the size counter.
(42) According to at least one example embodiment of the inventive concepts, an NHT entry contains an “unused” counter U 1216. The “unused” counter 1216 is incremented each time a data unit that has never been reused at a certain cache level is evicted from that cache level. According to at least one example embodiment of the inventive concepts, the unused counter is incremented each time a data unit that has never been reused at certain cache levels is evicted past a certain cache level, for example when the data unit has never been reused in the L2 or L3 levels and is evicted to a cache level higher than L3 or to memory. The unused counter 1216 can be used to determine that a dataset should bypass the L2/L3 caches and only be installed in the L1 cache.
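Taking paragraphs (39) through (42) together, an NHT entry might be modeled as below, with per-core reuse counters (C1 . . . C4), a size counter and an unused counter. The method names and update hooks are illustrative assumptions.

```python
# Hypothetical model of one NHT (NAP history table) entry.

class NHTEntry:
    def __init__(self, did: int, num_cores: int = 4):
        self.did = did
        self.reuse = [0] * num_cores    # one reuse counter per core (C1..C4)
        self.size = 0                   # data units currently resident
        self.unused = 0                 # evicted without any reuse

    def on_learning_access(self, core: int) -> None:
        self.reuse[core] += 1           # L2/L3 learning line touched by `core`

    def on_allocate(self) -> None:
        self.size += 1                  # data unit brought into the cache system

    def on_evict(self, was_reused: bool) -> None:
        self.size = max(self.size - 1, 0)
        if not was_reused:
            self.unused += 1            # candidate evidence for L2/L3 bypass

    def total_reuse(self) -> int:
        return sum(self.reuse)          # may be kept in a separate counter
```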
(43) Periodically, the placement policy in the NUCA hierarchy is reassessed based on data collected in the NHT 1210. This could, for example, be after a certain number of instructions have executed, after a number of memory accesses have been performed, after some number of cycles of execution, or when some counter has reached a threshold or, alternatively, predetermined value. Someone skilled in the art would appreciate that many other ways of determining the next placement reassessment could be used.
(44) During the placement reassessment, NHT entries are ordered according to some priority. According to at least one example embodiment of the inventive concepts, NHT entries are ordered by their total reuse count in relation to their size, e.g., by dividing their total reuse count by their size count or by estimating the relationship in some other way. According to at least one example embodiment of the inventive concepts, the total reuse count can be calculated by adding up the individual per-core reuse counters 1211, 1212, 1213, 1214. According to at least one example embodiment of the inventive concepts, the total reuse count is recorded by a separate counter in each NHT entry.
(45) During the placement reassessment, a placement policy for each dataset in the NHT is determined in some priority order, where each dataset corresponds to an NHT entry. The highest priority dataset is placed in the cache with the lowest cost function with respect to the core or cores accessing the dataset. According to at least one example embodiment of the inventive concepts, the cost function takes the latency and/or the communication cost from the core to the cache into account. According to at least one example embodiment of the inventive concepts, the power estimate for an access from the core to the cache is taken into account. According to at least one example embodiment of the inventive concepts, the estimated size of the dataset is taken into account. If the dataset size is deemed appropriate to fit into the selected cache, a portion of that cache proportional to the size of the dataset is marked as being used. If the dataset size is deemed too large to fit into the selected cache, the entire cache is marked as used and the remaining portion of the dataset is fitted into the cache with the second lowest cost function, and so on until the entire dataset has been fitted. According to at least one example embodiment of the inventive concepts, the fraction of the dataset fitted into each cache is recorded as the placement policy for the dataset, for example 25% of the dataset is placed in the L2 cache of CPU1's slice, 25% of the dataset is placed in the L2 cache of CPU2's slice and 50% of the dataset is placed in the L3 cache of CPU1's slice. When the highest priority dataset has been placed, the second highest priority dataset is placed in the caches not yet marked as used, and so on until all datasets not deemed to bypass L2/L3 have been placed, as sketched below.
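The reassessment in paragraphs (44) and (45) amounts to a greedy packing pass: order datasets by reuse density, then fill slices in order of increasing access cost for the core that uses each dataset. The sketch below is one possible rendering; the data structures, the single-core affinity per dataset, and the capacity accounting are assumptions.

```python
def reassess_placement(datasets, capacity, cost):
    """Greedy NUCA placement sketch.

    datasets: list of dicts {"did", "reuse", "size", "core"}
    capacity: dict slice_name -> free capacity (in data units)
    cost:     dict (core, slice_name) -> access cost (latency/energy estimate)
    Returns:  dict did -> list of (slice_name, fraction of dataset placed there)
    """
    # Priority order: total reuse in relation to size (paragraph (44)).
    order = sorted(datasets,
                   key=lambda d: d["reuse"] / max(d["size"], 1),
                   reverse=True)
    placement = {}
    for ds in order:
        remaining = ds["size"]
        parts = []
        # Try slices from cheapest to most expensive for this dataset's core.
        for s in sorted(capacity, key=lambda s: cost[(ds["core"], s)]):
            if remaining == 0:
                break
            take = min(remaining, capacity[s])
            if take > 0:
                capacity[s] -= take             # mark this portion as used
                parts.append((s, take / ds["size"]))
                remaining -= take
        placement[ds["did"]] = parts
    return placement

# Example mirroring the 25%/25%/50% split described above:
caps = {"L2@cpu1": 256, "L2@cpu2": 256, "L3@cpu1": 2048, "L3@cpu2": 2048}
costs = {("cpu1", "L2@cpu1"): 1, ("cpu1", "L2@cpu2"): 3,
         ("cpu1", "L3@cpu1"): 5, ("cpu1", "L3@cpu2"): 7}
plan = reassess_placement(
    [{"did": 0x12, "reuse": 900, "size": 1024, "core": "cpu1"}], caps, costs)
```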
(46) According to at least one example embodiment of the inventive concepts, some datasets will be determined to bypass the L2/L3 NUCA caches and will not be placed in any of those caches. According to at least one example embodiment of the inventive concepts, the remaining datasets are placed according to some dataset size distribution between the caches. According to at least one example embodiment of the inventive concepts, the placement will strive to achieve the same ratio between the dataset size placed in each cache and that cache's actual size. According to at least one example embodiment of the inventive concepts, the placement will strive to achieve the same cache pressure between the cache slices, where cache pressure can, for example, be measured as the number of evictions from the cache per unit time in relation to its size. According to at least one example embodiment of the inventive concepts, the placement will strive towards a desired or, alternatively, predetermined relationship between the cache pressure for caches at one level (e.g., L2) and some other level (e.g., L3). According to at least one example embodiment of the inventive concepts, the placement strives to achieve the same replacement age for cache lines replaced from all the caches, where replacement age is defined as how long a cache line remains unused in the cache until it is replaced. According to at least one example embodiment of the inventive concepts, the placement will strive towards a desired or, alternatively, predetermined relationship between the replacement age for caches at one level (e.g., L2) and some other level (e.g., L3).
(47) The newly determined placement is recorded as a new placement policy associated with each dataset, e.g., in a policy field of the NHT entry of the corresponding dataset 1211 and/or in separate policy tables similar to 1022 or 1023, or with some other representation. Future installations of data into the NUCA hierarchy will adhere to the placement policy, for example 25% of the dataset is installed in the L2 cache of CPU1's slice, 25% of the dataset is installed in the L2 cache of CPU2's slice and 50% of the dataset is installed in the L3 cache of CPU1's slice.
(48) According to at least one example embodiment of the inventive concepts, the size and reuse frequency of each dataset are estimated. Periodically, a new global placement decision is made. First, the dataset with the highest reuse per size is placed in its most favorable spot. Then the dataset with the second highest reuse per size is placed, and so on until all known datasets have been placed using a simple eager packing algorithm. The goal of the placement is to place datasets with the highest reuse probability close to the core using them.
(49) While this discussion has centered around predicting future access patterns for a dataset and leveraging this prediction to achieve an efficient NUCA placement with respect to a cache hierarchy similar to the ones described above, the techniques described herein are applicable more broadly.
(51) For clarity, most descriptions herein generally describe techniques for how a cache line is located and returned to a requesting CPU. The descriptions do not describe in detail the various ways in which a requested word contained within the cache line is selected and returned to the CPU. However, various methods for selecting a requested word contained within a cache line and returning the requested cache line to the CPU are known by those skilled in the art.
(52) For clarity, most descriptions herein describing the handling of data of cache hierarchies describe exclusive cache hierarchies. Those skilled in the art would understand that one or more example embodiments of the inventive concepts can be extended to also cover inclusive memory hierarchies and non-exclusive memory hierarchies.
(53) One or more example embodiments of the inventive concepts described above are useful in association with both uni-processor systems and multi-processor systems, such as those illustrated and described above.
(54) Although described above in the context of certain example computer architectures, caching exists in many other settings within, as well as outside, the example computer systems described above.
(55) The methods or flow charts provided in the present application may be implemented in a computer program, software, or firmware tangibly embodied in a computer-readable storage medium for execution by a general purpose computer or a processor.
(56) Example embodiments of the inventive concepts having thus been described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the intended spirit and scope of example embodiments of the inventive concepts, and all such modifications as would be obvious to one skilled in the art are intended to be included within the scope of the following claims.