MEMORY CONTROLLER ZERO CACHE
20230236985 · 2023-07-27
Abstract
In one embodiment, a controller in a microprocessor, the controller configured to manage accesses to dynamic random access memory (DRAM), the controller comprising: a first table configured to track cache lines that have been written to zero for a plurality of first memory regions; and a second table configured to track the cache lines that have been written to zero for a plurality of second memory regions, wherein each of the plurality of second memory regions comprises a group of the plurality of first memory regions where all of the cache lines within each of the plurality of first memory regions within the group have been written to zero.
Claims
1. A controller in a microprocessor, the controller configured to manage accesses to dynamic random access memory (DRAM), the controller comprising: a first table configured to track cache lines that have been written to zero for a plurality of first memory regions; and a second table configured to track the cache lines that have been written to zero for a plurality of second memory regions, wherein each of the plurality of second memory regions comprises a group of the plurality of first memory regions where all of the cache lines within each of the plurality of first memory regions within the group have been written to zero.
2. The controller of claim 1, wherein the first table comprises plural entries corresponding to the plurality of first memory regions, wherein each entry comprises a tag corresponding to an address of one of the plurality of first memory regions and a bitmask corresponding to each of the cache lines within the one of the plurality of first memory regions.
3. The controller of claim 2, wherein each bit of the bitmask is set when a corresponding one of the cache lines is written to zero.
4. The controller of claim 2, wherein each of the plurality of first memory regions comprises a memory page.
5. The controller of claim 4, wherein the tag comprises upper physical address bits that specify a full memory page, and wherein the bitmask is based on a size of the memory page and a size of each of the cache lines.
6. The controller of claim 2, wherein based on all bits of the bitmask corresponding to each of the cache lines within the one of the plurality of first memory regions being set, information of the corresponding entry is migrated into the second table.
7. The controller of claim 2, wherein the second table comprises plural entries corresponding to the plurality of second memory regions, wherein each entry comprises a tag corresponding to an address of one of the plurality of second memory regions and a bitmask.
8. The controller of claim 7, wherein the bitmask of each entry of the second table corresponds to the group of the plurality of first memory regions contained within the second memory region associated with the entry, and wherein all bits of the bitmasks of the group are set.
9. The controller of claim 8, wherein the group comprises the cache lines set for more than one memory page.
10. The controller of claim 9, wherein the tag comprises upper physical address bits that specify a full and contiguous N megabyte (MB) region, where N is sufficient storage for more than one memory page, wherein the bitmask is based on a quantity of the first memory regions.
11. A method performed by a controller in a microprocessor, the controller configured to manage accesses to dynamic random access memory (DRAM), the method comprising: tracking cache lines in a first table that have been written to zero for a plurality of first memory regions; and tracking the cache lines that have been written to zero for a plurality of second memory regions in a second table, wherein each of the plurality of second memory regions comprises a group of the plurality of first memory regions where all of the cache lines within each of the plurality of first memory regions within the group have been written to zero.
12. The method of claim 11, wherein the first table comprises plural entries corresponding to the plurality of first memory regions, wherein each entry comprises a tag corresponding to an address of one of the plurality of first memory regions and a bitmask corresponding to each of the cache lines within the one of the plurality of first memory regions.
13. The method of claim 12, further comprising setting each bit of the bitmask when a corresponding one of the cache lines is written to zero.
14. The method of claim 12, wherein each of the plurality of first memory regions comprises a memory page.
15. The method of claim 14, wherein the tag comprises upper physical address bits that specify a full memory page, and wherein the bitmask is based on a size of the memory page and a size of each of the cache lines.
16. The method of claim 12, wherein based on all bits of the bitmask corresponding to each of the cache lines within the one of the plurality of first memory regions being set, further comprising migrating information of the corresponding entry into the second table.
17. The method of claim 12, wherein the second table comprises plural entries corresponding to the plurality of second memory regions, wherein each entry comprises a tag corresponding to an address of one of the plurality of second memory regions and a bitmask.
18. The method of claim 17, wherein the bitmask of each entry of the second table corresponds to the group of the plurality of first memory regions contained within the second memory region associated with the entry, and wherein all bits of the bitmasks of the group are set.
19. The method of claim 18, wherein the group comprises the cache lines set for more than one memory page.
20. The method of claim 19, wherein the tag comprises upper physical address bits that specify a full and contiguous N megabyte (MB) region, where N is sufficient storage for more than one memory page, wherein the bitmask is based on a quantity of the first memory regions.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] Various aspects of the invention can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, with emphasis instead being placed upon clearly illustrating the principles of the present invention. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.
DETAILED DESCRIPTION
[0016] Certain embodiments of a cache line zero tracking method of a multi-core microprocessor, and associated systems and devices, are disclosed that augment the functionality of systems that use knowledge of zero values to mitigate system memory accesses. These embodiments employ a cache line zero tracking controller (hereinafter also referred to simply as a controller) in conjunction with plural tables (e.g., first and second tables), where the first table is used to track zero valued cache lines for a given memory region or memory page, and the second table is populated with information for each of the tracked memory pages (where all cache lines per page are zero valued) to enable tracking, per row, of a group of multiple pages that are zero valued.
[0017] Digressing briefly, since zeroing out cache lines may occur over plural blocks, for instance several megabytes at a time, the cost of storing information or state for the many cache lines that make up a group of multiple pages (e.g., 12 megabytes of information or metadata) in a single lookup table may offset the benefits associated with mitigating accesses to system memory (e.g., dynamic random access memory, or DRAM). In contrast, certain embodiments of a cache line zero tracking method make use of plural tables of varying granularity that facilitate the efficient storage and access of information while mitigating DRAM accesses.
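The storage tradeoff above can be made concrete with back-of-envelope arithmetic; the sizes below (64-byte cache lines, 4 KB pages, 2 MB regions) are illustrative values consistent with the examples later in this description, not requirements:

```python
# Illustrative metadata-cost arithmetic (example sizes, not a requirement).
LINE_SIZE = 64                  # bytes per cache line
PAGE_SIZE = 4 * 1024            # bytes per memory page
REGION_SIZE = 2 * 1024 * 1024   # bytes per region (group of pages)
GROUP_BYTES = 12 * 1024 * 1024  # e.g., 12 MB zeroed at a time

# A single flat lookup table tracking every cache line in the 12 MB group:
flat_bits = GROUP_BYTES // LINE_SIZE          # one bit per 64-byte line

# The two-table scheme instead keeps one mask per in-progress 4 KB page
# and one mask per 2 MB region of fully zeroed pages.
page_mask_bits = PAGE_SIZE // LINE_SIZE       # bits per first-table entry
region_mask_bits = REGION_SIZE // PAGE_SIZE   # bits per second-table entry

print(flat_bits, page_mask_bits, region_mask_bits)  # 196608 64 512
```

A flat bitmap for the 12 MB group costs roughly 24 KiB of state, whereas the two-level scheme needs only a 64-bit mask per page still being zeroed plus a 512-bit mask per completed 2 MB region.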
[0018] Having summarized certain features of a cache line zero tracking method of the present invention, reference will now be made in detail to the description of a cache line zero tracking method as illustrated in the drawings. While a cache line zero tracking method will be described in connection with these drawings, there is no intent to limit it to the embodiment or embodiments disclosed herein. That is, while the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail sufficient for an understanding of persons skilled in the art. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed. On the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” (and similarly with “comprise”, “comprising”, and “comprises”) mean including (comprising), but not limited to.
[0019] Various units, modules, circuits, logic, or other components may be described as “configured to” perform a task or tasks. In such contexts, “configured to” is a broad recitation of structure generally meaning “having circuitry or another physical structure that” performs, or is capable of performing, the task or tasks during operations. The circuitry may be dedicated circuitry, or more general processing circuitry operating under the control of coded instructions. That is, terms like “unit”, “module”, “circuit”, “logic”, and “component” may be used herein, in describing certain aspects or features of various implementations of the invention. It will be understood by persons skilled in the art that the corresponding features are implemented utilizing circuitry, whether it be dedicated circuitry or more general purpose circuitry operating under micro-coded instruction control.
[0020] Further, the unit/module/circuit/logic/component can be configured to perform the task even when the unit/module/circuit/logic/component is not currently in operation. Reciting a unit/module/circuit/logic/component that is configured to perform one or more tasks is expressly intended not to invoke 35 U.S.C. § 112(f) for that unit/module/circuit/logic/component. In this regard, persons of ordinary skill in the art will appreciate that the specific structure or interconnections of the circuit elements will typically be determined by a compiler of a design automation tool, such as a register transfer language (RTL) compiler. RTL compilers operate upon scripts that closely resemble assembly language code, to compile the script into a form that is used for the layout or fabrication of the ultimate circuitry.
[0021] That is, integrated circuits (such as those of the present invention) are designed using higher-level software tools to model the desired functional operation of a circuit. As is well known, “Electronic Design Automation” (EDA) is a category of software tools for designing electronic systems, such as integrated circuits. EDA tools are also used for programming design functionality into field-programmable gate arrays (FPGAs). Hardware description languages (HDLs), like Verilog and VHDL (VHSIC Hardware Description Language), are used to create high-level representations of a circuit, from which lower-level representations and ultimately actual wiring can be derived. Indeed, since a modern semiconductor chip can have billions of components, EDA tools are recognized as essential for their design. In practice, a circuit designer specifies operational functions using a programming language like C/C++. An EDA software tool converts that specified functionality into RTL. Then, a synthesis tool converts the RTL, expressed in a hardware description language (e.g., Verilog), into a discrete netlist of gates. This netlist defines the actual circuit that is produced by, for example, a foundry. Indeed, these tools are well known and understood for their role and use in facilitating the design process of electronic and digital systems, and therefore need not be described herein.
[0023] In the illustrated embodiment, a three-level cache system is employed, which includes a level-one (L1) cache, a level-two (L2) cache, and a level-three (L3) cache (also referred to as a last-level cache (LLC)). The L1 cache is separated into a data cache and an instruction cache, respectively denoted as L1D and L1I. The L2 cache also resides on the core, meaning that both the L1 cache and the L2 cache are in the same circuitry as the core of each slice. That is, each core of each slice has its own dedicated L1D, L1I, and L2 caches. Outside of the core, but within each slice, is an L3 cache. In one embodiment, the L3 cache 130_0 through 130_7 (also collectively referred to herein as 130) is a distributed cache, meaning that, in this example eight-core architecture, ⅛th of the L3 cache resides in slice0 102_0, ⅛th of the L3 cache resides in slice1 102_1, etc. In one embodiment, each L1 cache is 32 KB in size, each L2 cache is 256 KB in size, and each slice of the L3 cache is 2 megabytes in size. Thus, the total size of the L3 cache is 16 megabytes. Note that other individual or aggregate cache sizes may be used in some embodiments.
[0024] Bus interface logic 120_0 through 120_7 is provided in each slice in order to manage communications from the various circuit components among the different slices. As illustrated in
[0025] To better illustrate certain inter and intra communications of some of the circuit components, the following example will be presented. This example illustrates communications associated with a hypothetical load miss in the core6 cache. That is, this hypothetical assumes that the processing core6 110_6 is executing code that requests a load for data at hypothetical address 1000. When such a load request is encountered, the system first performs a lookup in L1D 114_6 to see if that data exists in the L1D cache. Assuming that the data is not in the L1D cache, then a lookup is performed in the L2 cache 112_6. Again, assuming that the data is not in the L2 cache, then a lookup is performed to see if the data exists in the L3 cache. As mentioned above, the L3 cache is a distributed cache, so the system first needs to determine which slice of the L3 cache the data should reside in, if in fact it resides in the L3 cache. As is known, this process can be performed using a hashing function, which is merely the exclusive ORing of bits, to get a three-bit address (sufficient to identify which slice—slice 0 through slice 7—the data is stored in).
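The slice-selection hash described above can be sketched as follows. The specific folding scheme is an assumption for illustration, since the description states only that address bits are exclusive-ORed down to a three-bit slice index:

```python
def l3_slice(phys_addr: int, num_slices: int = 8, line_bits: int = 6) -> int:
    """Fold a physical address down to a slice index by XORing address bits.

    Hypothetical sketch: a real design selects specific bit positions to
    balance traffic; here we simply XOR successive 3-bit groups of the
    cache-line address together to obtain a value in 0..num_slices-1.
    """
    index_bits = (num_slices - 1).bit_length()  # 3 bits for 8 slices
    line_addr = phys_addr >> line_bits          # drop the 64-byte line offset
    s = 0
    while line_addr:
        s ^= line_addr & (num_slices - 1)       # XOR in the low index bits
        line_addr >>= index_bits
    return s
```

Because every core computes the same function on the same physical address, a line at hypothetical address 1000 always maps to the same L3 slice regardless of which core issues the request.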
[0026] In keeping with the example, assume this hashing function results in an indication that the data, if present in the L3 cache, would be present in that portion of the L3 cache residing in slice7. A communication is then made from the L2 cache of slice6 102_6 through bus interfaces 120_6 and 120_7 to the L3 cache present in slice7 102_7. This communication is denoted in the figure by the encircled number 1. If the data was present in the L3 cache, then it would be communicated back from the L3 cache 130_7 to the L2 cache 112_6. However, and in this example, assume that the data is not in the L3 cache either, resulting in a cache miss. Consequently, a communication is made from the L3 cache 130_7 through bus interface7 120_7 through the un-core bus interface 162 to the off-chip memory 180, through the memory controller 164. This communication is denoted in the figure by the encircled number 2. A cache line that includes the data residing at address 1000 is then communicated, in one embodiment, from the off-chip memory 180 back through memory controller 164 and un-core bus interface 162 into the L3 cache 130_7. This communication is denoted in the figure by the encircled number 3. After that data is written into the L3 cache, it is then communicated to the requesting core, core6 110_6 through the bus interfaces 120_7 and 120_6. This communication is denoted in the figure by the encircled number 4. At this point, once the load request has been completed, in one embodiment, that data will reside in each of the caches L3, L2, and L1D. Note that the inclusion policy implied above for
[0027] Having generally described an example environment in which an embodiment of a cache line zero tracking method may be implemented, attention is directed to
[0028] The memory controller 164, as described above, is coupled to DRAM 180, and controls accesses to DRAM 180. The memory controller 164 comprises a cache line zero tracking controller 200 (hereinafter, simply referred to as controller 200), which may be implemented as a state machine or microcontroller. The memory controller 164 further comprises plural (e.g., two) tables, including table1 202 and table2 204, which as explained below, are of different tracking granularity. The controller 200 in conjunction with the table1 202 and table2 204 are used to track zero cache lines in granularity of cache lines per memory region (or memory page) and a group of memory regions (plural memory pages), respectively, as described further below. In some embodiments, additional tables may be used. The DRAM 180 is arranged as a plurality of DRAM blocks 206. In one embodiment, the size of a DRAM block 206 corresponds to the size of a smallest page supported by the microprocessor 100 virtual memory system, though in some embodiments, other sized blocks may be used. The system software, among other things, sanitizes portions of the DRAM 180, including entire DRAM blocks 206. Operating systems may sanitize memory in the granularity of a page (e.g., 4 KB) whose size is determined according to the virtual memory system supported by the microprocessor 100. Further, operating systems may sanitize memory according to plural pages collectively comprising a group, which may be of sizes 2 MB, 16 MB, 1 GB, etc., as should be appreciated by one having ordinary skill. In the description that follows, the table1 202 is referred to as a 4 kilobyte (KB) table, and the table2 204 is referred to as a 2 megabyte (MB) table, the values of 4 KB and 2 MB corresponding to the cache lines of the memory regions that each entry of the respective table tracks. Note that 4 KB and 2 MB are example values used for illustration, and that in some embodiments, other values may be used for these respective tables.
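A minimal sketch of the two tables' entry shapes, assuming the example values above (4 KB pages, 2 MB regions) and a 64-byte cache line (an assumed line size, consistent with common practice but not stated in this paragraph); the field names are illustrative, not from the specification:

```python
LINE_SIZE = 64                  # assumed bytes per cache line
PAGE_SIZE = 4 * 1024            # table1 tracks one 4 KB page per entry
REGION_SIZE = 2 * 1024 * 1024   # table2 tracks one 2 MB region per entry

LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE       # 64 bits in a table1 bitmask
PAGES_PER_REGION = REGION_SIZE // PAGE_SIZE   # 512 bits in a table2 bitmask

class Table1Entry:
    """One 4 KB region: tag holds the upper physical address bits of the page."""
    def __init__(self, addr: int):
        self.tag = addr // PAGE_SIZE
        self.bitmask = 0  # bit i set => cache line i of the page is zero

    def full(self) -> bool:
        return self.bitmask == (1 << LINES_PER_PAGE) - 1

class Table2Entry:
    """One 2 MB region: tag holds the upper physical address bits of the region."""
    def __init__(self, addr: int):
        self.tag = addr // REGION_SIZE
        self.bitmask = 0  # bit j set => 4 KB page j of the region is all zero

    def full(self) -> bool:
        return self.bitmask == (1 << PAGES_PER_REGION) - 1
```

The bitmask widths fall directly out of the geometry: a 4 KB page holds 64 cache lines of 64 bytes, and a 2 MB region holds 512 pages of 4 KB.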
[0030] Accordingly, certain embodiments of a cache line zero tracking method make use of two tables, the 4 KB table1 202 and the 2 MB table2 204, which represent something of a hybrid of the above-described approaches. Referring to
[0031] With respect to
[0032] Read requests from the interconnect 190 initiate a lookup into both the 4 KB table1 202 and the 2 MB table2 204. If there is a hit in either table 202, 204 (e.g., the page has a row and the corresponding bit is set), the memory controller 164 returns zero data immediately without performing a DRAM access. Further, when rows are evicted from either of the tables 202, 204, a state machine (e.g., the controller 200) iterates over the set bits and issues the corresponding zero-writes to DRAM 180.
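The read-path check just described can be sketched as follows, using plain dictionaries keyed by page and region tags as stand-ins for the hardware tables (the dictionary representation and function name are illustrative):

```python
LINE_SIZE = 64
PAGE_SIZE = 4 * 1024
REGION_SIZE = 2 * 1024 * 1024

# Stand-ins for the hardware tables: tag -> bitmask of zeroed units.
table1 = {}  # page tag   -> 64-bit mask, one bit per cache line
table2 = {}  # region tag -> 512-bit mask, one bit per 4 KB page

def read_is_zero(addr: int) -> bool:
    """Return True when the controller can answer the read with zero data
    without performing a DRAM access (a hit with the relevant bit set)."""
    page_tag = addr // PAGE_SIZE
    line = (addr % PAGE_SIZE) // LINE_SIZE
    region_tag = addr // REGION_SIZE
    page = (addr % REGION_SIZE) // PAGE_SIZE
    if table1.get(page_tag, 0) & (1 << line):
        return True   # hit in the 4 KB table
    if table2.get(region_tag, 0) & (1 << page):
        return True   # hit in the 2 MB table
    return False
```

A miss in both tables falls through to a normal DRAM read; likewise, evicting a row would walk its set bits and issue the deferred zero-writes to DRAM, as described above.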
[0033] Explaining the memory controller 164 further, the 4 KB table1 202 comprises a table of 4 KB regions, where each row or entry 300 is used to track which cache lines within the region have been written to zero. Once an entire row 300 (e.g., the entire 4 KB region) has been zeroed out, as indicated by the bitmask 304, that row is removed from the 4 KB table1 202 and its information is migrated (e.g., becomes a new row) into the 2 MB table2 204 with just one of the 512 bits set. As the operating system proceeds to zero out the next adjacent 4 KB page, or another 4 KB page in the same 2 MB region, that row is likewise removed from the 4 KB table1 202 and its information migrated to the 2 MB table2 204 (e.g., the bit in the row of the 2 MB table2 204 that corresponds to that same 2 MB region is set). Note that, initially, the tables 202, 204 are empty. As each row of the 4 KB table1 202 is removed (e.g., a row is removed once all of its bits are set), the row information is migrated either as a new row (when the location is not contained within an existing 2 MB region of the 2 MB table2 204) or as an update (bit set) to the corresponding bit in the 512-bit map (bitmask 310) when the location is contained within an existing 2 MB region.
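The write-to-zero and migration flow described above can be sketched end to end; as before, dictionaries stand in for the hardware tables, and eviction of full rows back to DRAM is omitted for brevity:

```python
LINE_SIZE = 64
PAGE_SIZE = 4 * 1024
REGION_SIZE = 2 * 1024 * 1024
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE   # 64 lines per 4 KB page
PAGE_FULL = (1 << LINES_PER_PAGE) - 1     # all 64 line bits set

table1 = {}  # page tag   -> bitmask of zeroed cache lines
table2 = {}  # region tag -> bitmask of fully zeroed 4 KB pages

def note_zero_write(addr: int) -> None:
    """Record that the cache line containing addr was written to zero."""
    page_tag = addr // PAGE_SIZE
    line = (addr % PAGE_SIZE) // LINE_SIZE
    table1[page_tag] = table1.get(page_tag, 0) | (1 << line)
    if table1[page_tag] == PAGE_FULL:
        # The entire 4 KB page is now zero: remove the table1 row and set
        # the page's bit in the (possibly new) table2 row for its region.
        del table1[page_tag]
        region_tag = addr // REGION_SIZE
        page = (addr % REGION_SIZE) // PAGE_SIZE
        table2[region_tag] = table2.get(region_tag, 0) | (1 << page)
```

Zeroing all 64 lines of a page thus collapses a 64-bit table1 row into a single bit of the table2 row for the enclosing 2 MB region.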
[0034] In view of the above description, it should be appreciated by one having ordinary skill in the art that one embodiment of a method performed by a controller in a microprocessor, the controller configured to manage accesses to dynamic random access memory (DRAM), the method denoted in
[0035] Any process descriptions or blocks in flow diagrams should be understood as representing modules, segments, logic, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process, and alternate implementations are included within the scope of the embodiments in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in different order, or one or more of the blocks may be omitted, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present disclosure. For instance, the method 400 in
[0036] While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive; the invention is not limited to the disclosed embodiments. Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims.
[0037] Note that various combinations of the disclosed embodiments may be used, and hence reference to an embodiment or one embodiment is not meant to exclude features from that embodiment from use with features from other embodiments. In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality.