MEMORY INCLUSIVITY MANAGEMENT IN COMPUTING SYSTEMS
20220414001 · 2022-12-29
Inventors
- Ishwar AGARWAL (Redmond, WA, US)
- George Zacharias Chrysos (Portland, OR, US)
- Oscar Rosell Martinez (Barcelona, ES)
Cpc classification
G06F12/0833
PHYSICS
International classification
G06F12/0831
PHYSICS
Abstract
Techniques of memory inclusivity management are disclosed herein. One example technique includes receiving a request from a core of the CPU to write a block of data corresponding to a first cacheline to a swap buffer at a memory. In response to the request, the method can include retrieving metadata corresponding to the first cacheline that includes a bit encoding a status value indicating whether the memory block at the memory currently contains data of the first cacheline or data corresponding to a second cacheline, the first and second cachelines alternately sharing the swap buffer at the memory. When the decoded status value indicates that the memory block at the first memory currently contains the data corresponding to the first cacheline, an instruction is transmitted to the memory controller to directly write the block of data to the memory block at the first memory.
Claims
1. A method of memory inclusivity management in a computing device having a central processing unit (CPU) with multiple cores sharing a system level cache (SLC) managed by a SLC controller, a first memory managed by a memory controller, and a second memory separate from the first memory and interfaced with the CPU, the method comprising: receiving, at the SLC controller, a request from a core of the CPU to write a block of data corresponding to a first cacheline to a memory block at the first memory configured to cache data for the CPU; and in response to receiving the request to write from the core, at the SLC controller, retrieving, from the SLC, metadata corresponding to the first cacheline stored at the SLC, the metadata including a bit encoding a status value indicating whether the memory block at the first memory currently contains data corresponding to the first cacheline; decoding the status value of the bit in the retrieved metadata corresponding to the first cacheline to determine whether the memory block at the first memory currently contains the data corresponding to the first cacheline or data corresponding to a second cacheline alternately sharing the memory block at the first memory with the first cacheline; and when the decoded status value indicates that the memory block at the first memory currently contains the data corresponding to the first cacheline, transmitting the block of data to the memory controller along with an instruction to directly write the block of data to the memory block at the first memory.
2. The method of claim 1, further comprising: when the decoded status value indicates that the memory block at the first memory currently does not contain the data corresponding to the first cacheline, transmitting the block of data to the memory controller along with an indicator indicating that the memory block at the first memory may not currently contain the data corresponding to the first cacheline.
3. The method of claim 1, further comprising: when the decoded status value indicates that the memory block at the first memory currently does not contain the data corresponding to the first cacheline in the request, transmitting the block of data to the memory controller along with an indicator indicating that the memory block at the first memory may not currently contain the data corresponding to the first cacheline; and upon receiving the block of data and the indicator, at the memory controller, determining a location at the second memory currently storing the data of the first cacheline without writing the received block of data to the memory block at the first memory.
4. The method of claim 1, further comprising: when the decoded status value indicates that the memory block at the first memory currently does not contain the data corresponding to the first cacheline, transmitting the block of data to the memory controller along with an indicator indicating that the memory block at the first memory may not currently contain the data corresponding to the first cacheline; and upon receiving the block of data and the indicator, at the memory controller, retrieving data currently stored in the memory block at the first memory; determining, based on the retrieved data, a location at the second memory currently storing data of the first cacheline; and forwarding the block of data to be stored at the determined location at the second memory without writing the block of data to the memory block at the first memory.
5. The method of claim 1 wherein: the request is a first request; the bit is a first bit of the metadata; and the method further includes: receiving, at the SLC controller, a second request to read data of the second cacheline from the memory block of the first memory; and upon receiving the second request, at the SLC controller, retrieving a copy of the data of the second cacheline from the memory controller; and modifying a status value of a second bit of the metadata stored at the SLC to indicate that the memory block at the first memory contains the data of the second cacheline.
6. The method of claim 1 wherein: the request is a first request; the bit is a first bit of the metadata; and the method further includes: receiving, at the SLC controller, a second request to read data of the second cacheline from the memory block of the first memory; and upon receiving the second request, at the SLC controller, retrieving a copy of the data of the second cacheline from the memory controller; storing the retrieved copy of the data of the second cacheline at the SLC; and modifying a status value of a second bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently contains the data of the second cacheline.
7. The method of claim 1 wherein: the request is a first request; the bit is a first bit of the metadata; and the method further includes: receiving, at the SLC controller, a second request to read data of the second cacheline from the memory block of the first memory; and upon receiving the second request, at the SLC controller, modifying the status value of the first bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently may not contain the data of the first cacheline.
8. The method of claim 1 wherein: the request is a first request; the bit is a first bit of the metadata; and the method further includes: receiving, at the SLC controller, a second request to read data of a second cacheline from the memory block of the first memory; and upon receiving the second request, at the SLC controller, retrieving a copy of the data of the second cacheline from the memory controller; modifying the status value of the first bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently does not contain the data of the first cacheline; and modifying a status value of a second bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently contains the data of the second cacheline.
9. The method of claim 1, further comprising: in response to receiving the request to write from the core, hashing at least a portion of the request such that the data and the metadata of the first cacheline and the second cacheline are stored in a single SLC slice in the SLC.
10. A computing device, comprising: a central processing unit (CPU) with multiple cores, a system level cache (SLC) shared by the multiple cores, and a SLC controller configured to manage the SLC; a first memory operatively coupled to the CPU; a memory controller configured to manage the first memory; and a second memory separate from the first memory and interfaced with the CPU, wherein the SLC controller including instructions executable to cause the SLC controller to: receive a request from a core of the CPU to write a block of data corresponding to a first cacheline to a memory block at the first memory configured to cache data for the CPU; and in response to receiving the request to write from the core, retrieve, from the SLC, metadata corresponding to the first cacheline stored at the SLC, the metadata including a bit encoding a status value indicating whether the memory block at the first memory currently contains data corresponding to the first cacheline; decode the status value of the bit in the retrieved metadata corresponding to the first cacheline to determine whether the memory block at the first memory currently contains the data corresponding to the first cacheline or data corresponding to a second cacheline alternately sharing the memory block at the first memory with the first cacheline; and when the decoded status value indicates that the memory block at the first memory currently contains the data corresponding to the first cacheline, transmit the block of data to the memory controller along with an instruction to directly write the block of data to the memory block at the first memory.
11. The computing device of claim 10 wherein the SLC controller includes additional instructions executable to cause the SLC controller to transmit the block of data to the memory controller along with an indicator indicating that the memory block at the first memory may not currently contain the data corresponding to the first cacheline when the decoded status value indicates that the memory block at the first memory currently does not contain the data corresponding to the first cacheline.
12. The computing device of claim 10 wherein: the request is a first request; the bit is a first bit of the metadata; and the SLC controller includes additional instructions executable to cause the SLC controller to: receive, at the SLC controller, a second request to read data of the second cacheline from the memory block of the first memory; and upon receiving the second request, retrieve a copy of the data of the second cacheline from the memory controller; and modify a status value of a second bit of the metadata stored at the SLC to indicate that the memory block at the first memory contains the data of the second cacheline.
13. The computing device of claim 10 wherein: the request is a first request; the bit is a first bit of the metadata; and the SLC controller includes additional instructions executable to cause the SLC controller to: receive, at the SLC controller, a second request to read data of the second cacheline from the memory block of the first memory; and upon receiving the second request, at the SLC controller, retrieve a copy of the data of the second cacheline from the memory controller; store the retrieved copy of the data of the second cacheline at the SLC; and modify a status value of a second bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently contains the data of the second cacheline.
14. The computing device of claim 10 wherein: the request is a first request; the bit is a first bit of the metadata; and the SLC controller includes additional instructions executable to cause the SLC controller to: receive, at the SLC controller, a second request to read data of the second cacheline from the memory block of the first memory; and upon receiving the second request, at the SLC controller, modify the status value of the first bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently may not contain the data of the first cacheline.
15. The computing device of claim 10 wherein: the request is a first request; the bit is a first bit of the metadata; and the SLC controller includes additional instructions executable to cause the SLC controller to: receive, at the SLC controller, a second request to read data of a second cacheline from the memory block of the first memory; and upon receiving the second request, at the SLC controller, modify the status value of the first bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently does not contain the data of the first cacheline; and modify a status value of a second bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently contains the data of the second cacheline.
16. The computing device of claim 10 wherein the SLC controller includes additional instructions executable to cause the SLC controller to hash at least a portion of the request such that the data and the metadata of the first cacheline and the second cacheline are stored in a single SLC slice in the SLC in response to receiving the request to write from the core.
17. A method of memory inclusivity management in a computing device having a central processing unit (CPU) with multiple cores sharing a system level cache (SLC) managed by a SLC controller, a first memory managed by a memory controller, and a second memory separate from the first memory and interfaced with the CPU, the method comprising: receiving, at the SLC controller, a request from a core of the CPU to write a block of data corresponding to a system memory address to a memory block at the first memory; and in response to receiving the request to write from the core, at the SLC controller, retrieving, from the SLC, metadata including one or more bits individually encoding a status value indicating whether the memory block at the first memory currently contains data corresponding to the system memory address or data corresponding to one or more additional system memory addresses alternately sharing the memory block at the first memory; determining, based on the retrieved metadata from the SLC, whether the memory block at the first memory currently contains the data corresponding to the system address in the request; and in response to determining that the memory block at the first memory currently contains the data corresponding to the system address, transmitting the block of data to the memory controller along with an instruction to directly write the block of data to the memory block at the first memory.
18. The method of claim 17, further comprising: in response to determining that the memory block at the first memory currently does not contain the data corresponding to the system memory, transmitting the block of data to the memory controller along with an indicator indicating that the memory block at the first memory may not currently contain the data corresponding to the system memory in the request to write.
19. The method of claim 17 wherein: the request is a first request; the system address is a first system address; and the method further includes: receiving, at the SLC controller, a second request to read data of a second system address from the memory block of the first memory; and upon receiving the second request, at the SLC controller, retrieving a copy of the data of the second system address from the memory controller; and modifying a value of one of the one or more bits corresponding to the second memory address in the metadata to indicate that the memory block at the first memory contains the data of the second memory address.
20. The method of claim 17 wherein: the request is a first request; the system address is a first system address; and the method further includes: receiving, at the SLC controller, a second request to read data of a second memory address from the memory block of the first memory; and upon receiving the second request, at the SLC controller, modifying the status value of the first bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently does not contain the data of the first cacheline; and modifying a status value of a second bit of the metadata stored at the SLC to indicate that the memory block at the first memory currently contains the data of the second cacheline.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0014]
[0015]
[0016]
[0017]
[0018]
DETAILED DESCRIPTION
[0019] Certain embodiments of systems, devices, components, modules, routines, data structures, and processes for memory inclusivity management are described below. In the following description, specific details of components are included to provide a thorough understanding of certain embodiments of the disclosed technology. A person skilled in the relevant art will also understand that the technology can have additional embodiments. The technology can also be practiced without several of the details of the embodiments described below with reference to
[0020] As used herein, the term “distributed computing system” generally refers to an interconnected computer system having multiple network nodes that interconnect a plurality of servers or hosts to one another and/or to external networks (e.g., the Internet). The term “network node” generally refers to a physical network device. Example network nodes include routers, switches, hubs, bridges, load balancers, security gateways, or firewalls. A “host” generally refers to a physical computing device. In certain embodiments, a host can be configured to implement, for instance, one or more virtual machines, virtual switches, or other suitable virtualized components. For example, a host can include a server having a hypervisor configured to support one or more virtual machines, virtual switches, or other suitable types of virtual components. In other embodiments, a host can be configured to execute suitable applications directly on top of an operating system.
[0021] A computer network can be conceptually divided into an overlay network implemented over an underlay network in certain implementations. An “overlay network” generally refers to an abstracted network implemented over and operating on top of an underlay network. The underlay network can include multiple physical network nodes interconnected with one another. An overlay network can include one or more virtual networks. A “virtual network” generally refers to an abstraction of a portion of the underlay network in the overlay network. A virtual network can include one or more virtual end points referred to as “tenant sites” individually used by a user or “tenant” to access the virtual network and associated computing, storage, or other suitable resources. A tenant site can host one or more tenant end points (“TEPs”), for example, virtual machines. The virtual networks can interconnect multiple TEPs on different hosts. Virtual network nodes in the overlay network can be connected to one another by virtual links individually corresponding to one or more network routes along one or more physical network nodes in the underlay network. In other implementations, a computer network can only include the underlay network.
[0022] Also used herein, the term “near memory” generally refers to memory that is physically closer to a processor (e.g., a CPU) than other “far memory” located at a distance from the processor. For example, near memory can include one or more DDR SDRAM dies that are incorporated into an Integrated Circuit (IC) component package with one or more CPU dies via an interposer and/or through silicon vias. In contrast, far memory can include additional memory on remote computing devices, accelerators, memory buffers, or smart I/O devices that the CPU can interface with via CXL or other suitable types of protocols. For instance, in datacenters, multiple memory devices on multiple servers/server blades may be pooled to be allocatable to a single CPU on one of the servers/server blades. The CPU can access such allocated far memory via a computer network in the datacenters.
[0023] In certain implementations, a CPU can include multiple individual processors or cores integrated into an electronic package. The cores can individually include one or more arithmetic logic units, floating-point units, L1/L2 cache, and/or other suitable components. The electronic package can also include one or more peripheral components configured to facilitate operations of the cores. Examples of such peripheral components can include QuickPath® Interconnect controllers, system level cache or SLC (e.g., L3 cache) shared by the multiple cores in the CPU, snoop agent pipeline, SLC controllers configured to manage the SLC, and/or other suitable components.
[0024] Also used herein, a “cacheline” generally refers to a unit of data transferred between cache (e.g., L1, L2, or SLC) and memory (e.g., near or far memory). A cacheline can include 32, 64, 128, or other suitable numbers of bytes. A core can read or write an entire cacheline when any location in the cacheline is read or written. In certain implementations, multiple cachelines can be configured to alternately share a memory block at the near memory when the near memory is configured as a swap buffer instead of a dedicated cache for the CPU. The multiple cachelines that alternately share a memory block at the near memory can be referred to as a cache set. As such, at different times, the memory block at the near memory can contain data for one of the multiple cachelines but not the others.
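As a rough illustration of the sharing described above, the following Python sketch models one near-memory block that holds the data of exactly one cacheline of a cache set at a time, with the others residing elsewhere. The class name `SwapBlock`, the method `access`, and the dictionary standing in for far memory are assumptions made for illustration and do not appear in the disclosure.

```python
class SwapBlock:
    """One near-memory block alternately shared by the cachelines of a cache set."""

    def __init__(self, cache_set, initial_data):
        # cache_set: identifiers of the cachelines sharing this block (e.g., "A".."D").
        # initial_data: mapping from identifier to the bytes of that cacheline.
        self.cache_set = tuple(cache_set)
        self.resident = self.cache_set[0]            # cacheline currently in near memory
        self.near_data = initial_data[self.resident]
        # Hypothetical stand-in for far memory: holds every non-resident cacheline.
        self.far = {c: initial_data[c] for c in self.cache_set[1:]}

    def access(self, cacheline):
        """Return the data of `cacheline`, swapping it into near memory if needed."""
        if cacheline not in self.cache_set:
            raise KeyError(cacheline)
        if cacheline != self.resident:
            # Swap: move the resident cacheline out and bring the requested one in.
            self.far[self.resident] = self.near_data
            self.near_data = self.far.pop(cacheline)
            self.resident = cacheline
        return self.near_data


block = SwapBlock("ABCD", {c: f"data-{c}".encode() for c in "ABCD"})
block.access("C")   # C is swapped into the block; A moves to the far-memory stand-in
```

At any moment only `block.resident` is held in the near-memory block, which is the condition the inclusivity tracking described in this disclosure is meant to record.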
[0025] In certain implementations, multiple cachelines of a cache set can be configured (e.g., via hashing) to be stored in a single SLC memory space referred to as an SLC slice, with each SLC slice individually having a data array and a tag array. The data array can be configured to store a copy of data for the individual cachelines while the tag array can include multiple bits configured to indicate certain attributes of the data stored in the corresponding data array. For example, in accordance with embodiments of the disclosed technology, the tag array can be configured to include a validity bit and an inclusivity bit for each cacheline. In other embodiments, the tag array can include the inclusivity bit without the validity bit or can have other suitable bits and/or configurations. As described in more detail herein, the inclusivity bits can be used to monitor inclusivity status in the cache system so that operations in the computing device can be modified accordingly.
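The tag-array bits described above, together with the write and read handling recited in the claims, might be sketched as follows. This is a minimal Python model under stated assumptions, not the disclosed hardware: `TagEntry`, `CacheSetTags`, `on_read`, `on_write`, and the returned strings `"DIRECT_WRITE"` and `"WRITE_MAY_MISS"` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TagEntry:
    valid: bool = False       # validity bit: the SLC holds a copy of the cacheline
    inclusive: bool = False   # inclusivity bit: the near-memory block holds this cacheline


@dataclass
class CacheSetTags:
    """Tag array for the cachelines of one cache set, kept in a single SLC slice."""
    entries: dict = field(default_factory=dict)

    def on_read(self, cacheline):
        # Reading a cacheline brings it into the shared near-memory block, so
        # every other cacheline of the set is marked as no longer held there.
        for name, entry in self.entries.items():
            entry.inclusive = (name == cacheline)
        self.entries[cacheline].valid = True   # a copy is now stored at the SLC

    def on_write(self, cacheline):
        # Decode the inclusivity bit to choose how the write is sent to the
        # memory controller.
        if self.entries[cacheline].inclusive:
            return "DIRECT_WRITE"     # block holds this cacheline: write directly
        return "WRITE_MAY_MISS"       # block may hold another cacheline: send indicator


tags = CacheSetTags({"first": TagEntry(), "second": TagEntry()})
tags.on_read("first")
first_write = tags.on_write("first")    # near memory holds "first"
tags.on_read("second")
second_write = tags.on_write("first")   # near memory now holds "second"
```

In this model, the indicator returned in the second case corresponds to the situation in which the memory controller would determine where the cacheline currently resides instead of overwriting the shared block.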
[0026]
[0027] As shown in
[0028] The hosts 106 can individually be configured to provide computing, storage, and/or other suitable cloud or other suitable types of computing services to the users 101. For example, as described in more detail below with reference to
[0029] The client devices 102 can each include a computing device that facilitates access by the users 101 to computing services provided by the hosts 106 via the underlay network 108. In the illustrated embodiment, the client devices 102 individually include a desktop computer. In other embodiments, the client devices 102 can also include laptop computers, tablet computers, smartphones, or other suitable computing devices. Though three users 101 are shown in
[0030]
[0031] In
[0032] Components within a system may take different forms within the system. As one example, a system comprising a first component, a second component and a third component can, without limitation, encompass a system that has the first component being a property in source code, the second component being a binary compiled library, and the third component being a thread created at runtime. The computer program, procedure, or process may be compiled into object, intermediate, or machine code and presented for execution by one or more processors of a personal computer, a network server, a laptop computer, a smartphone, and/or other suitable computing devices.
[0033] Equally, components may include hardware circuitry. A person of ordinary skill in the art would recognize that hardware may be considered fossilized software, and software may be considered liquefied hardware. As just one example, software instructions in a component may be burned to a Programmable Logic Array circuit or may be designed as a hardware circuit with appropriate integrated circuits. Equally, hardware may be emulated by software. Various implementations of source, intermediate, and/or object code and associated data may be stored in a computer memory that includes read-only memory, random-access memory, magnetic disk storage media, optical storage media, flash memory devices, and/or other suitable computer readable storage media excluding propagated signals.
[0034] As shown in
[0035] The CPU 132 can include a microprocessor, caches, and/or other suitable logic devices. The memory 134 can include volatile and/or nonvolatile media (e.g., ROM, RAM, magnetic disk storage media, optical storage media, flash memory devices, and/or other suitable storage media) and/or other types of computer-readable storage media configured to store data received from, as well as instructions for, the CPU 132 (e.g., instructions for performing the methods discussed below with reference to
[0036] The source host 106a and the destination host 106b can individually contain instructions in the memory 134 executable by the CPUs 132 to cause the individual CPUs 132 to provide a hypervisor 140 (identified individually as first and second hypervisors 140a and 140b) and an operating system 141 (identified individually as first and second operating systems 141a and 141b). Even though the hypervisor 140 and the operating system 141 are shown as separate components, in other embodiments, the hypervisor 140 can operate on top of the operating system 141 executing on the hosts 106 or a firmware component of the hosts 106.
[0037] The hypervisors 140 can individually be configured to generate, monitor, terminate, and/or otherwise manage one or more virtual machines 144 organized into tenant sites 142. For example, as shown in
[0038] Also shown in
[0039] The virtual machines 144 can be configured to execute one or more applications 147 to provide suitable cloud or other suitable types of computing services to the users 101 (
[0040] Communications of each of the virtual networks 146 can be isolated from other virtual networks 146. In certain embodiments, communications can be allowed to cross from one virtual network 146 to another through a security gateway or otherwise in a controlled fashion. A virtual network address can correspond to one of the virtual machines 144 in a particular virtual network 146. Thus, different virtual networks 146 can use one or more virtual network addresses that are the same. Example virtual network addresses can include IP addresses, MAC addresses, and/or other suitable addresses. To facilitate communications among the virtual machines 144, virtual switches (not shown) can be configured to switch or filter packets directed to different virtual machines 144 via the network interface card 136 and facilitated by the packet processor 138.
[0041] As shown in
[0042] In certain implementations, a packet processor 138 can be interconnected to and/or integrated with the NIC 136 to facilitate network traffic operations for enforcing communications security, performing network virtualization, translating network addresses, maintaining/limiting a communication flow state, or performing other suitable functions. In certain implementations, the packet processor 138 can include a Field-Programmable Gate Array (“FPGA”) integrated with the NIC 136.
[0043] An FPGA can include an array of logic circuits and a hierarchy of reconfigurable interconnects that allow the logic circuits to be “wired together” like logic gates by a user after manufacturing. As such, a user 101 can configure logic blocks in FPGAs to perform complex combinational functions, or merely simple logic operations, to synthesize equivalent functionality executable in hardware at much faster speeds than in software. In the illustrated embodiment, the packet processor 138 has one interface communicatively coupled to the NIC 136 and another interface coupled to a network switch (e.g., a Top-of-Rack or “TOR” switch). In other embodiments, the packet processor 138 can also include an Application Specific Integrated Circuit (“ASIC”), a microprocessor, or other suitable hardware circuitry.
[0044] In operation, the CPU 132 and/or a user 101 (
[0045] As such, once the packet processor 138 identifies an inbound/outbound packet as belonging to a particular flow, the packet processor 138 can apply one or more corresponding policies in the flow table before forwarding the processed packet to the NIC 136 or TOR 112. For example, as shown in
[0046] The second TOR 112b can then forward the packet to the packet processor 138 at the destination hosts 106b and 106b′ to be processed according to other policies in another flow table at the destination hosts 106b and 106b′. If the packet processor 138 cannot identify a packet as belonging to any flow, the packet processor 138 can forward the packet to the CPU 132 via the NIC 136 for exception processing. In another example, when the first TOR 112a receives an inbound packet, for instance, from the destination host 106b via the second TOR 112b, the first TOR 112a can forward the packet to the packet processor 138 to be processed according to a policy associated with a flow of the packet. The packet processor 138 can then forward the processed packet to the NIC 136 to be forwarded to, for instance, the application 147 or the virtual machine 144.
[0047] In certain embodiments, the memory 134 can include both near memory 170 and far memory 172 (shown in
[0048] In certain implementations, L1, L2, SLC, and the near memory 170 can form a cache system with multiple levels of caches in a hierarchical manner. For example, a core in the CPU 132 can attempt to locate a cacheline in L1, L2, SLC, and the near memory 170 in a sequential manner. However, when the near memory 170 is configured as a swap buffer for the far memory 172 instead of being a dedicated cache memory for the CPU 132, maintaining inclusivity in the cache system may be difficult. One solution for the foregoing difficulty is to configure the cache system to enforce inclusivity in all levels of the caches via back invalidation. Such invalidation, though, can introduce substantial operational complexity and increase execution latency because a frequently used cacheline may be invalidated due to read/write operations in the swap buffer. Thus, enforcing inclusivity in the host 106 may negatively impact system performance.
[0049] Several embodiments of the disclosed technology can address the foregoing impact on system performance when implementing the near memory as a swap buffer in the computing device. In certain embodiments, sections of data (e.g., one or more cachelines) that alternately share a memory block of the near memory 170 can be grouped into a dataset or cache set. A hash function can be implemented at, for example, a SLC controller such that all cachelines in a cache set are stored in a single SLC slice. During operation, the SLC controller can be configured to track a status of inclusivity in the cache system when reading or writing data to the cachelines and to modify operations in the cache system in accordance with the status of the inclusivity in the cache system, as described in more detail below with reference to
[0050]
[0051] In the illustrated embodiment, the CPU 132 can include multiple cores 133 (illustrated as Core 1, Core 2, . . . , Core N) individually having L1/L2 cache 139. The host 106 can also include a SLC controller 150 operatively coupled to the CPU 132 and configured to manage operations of SLC 151. In the illustrated embodiment, the SLC 151 is partitioned into multiple SLC slices 154 (illustrated as SLC Slice 1, SLC Slice 2, . . . , SLC Slice M) individually configured to contain data and metadata of one or more datasets such as cache sets 158. Each cache set 158 can include a tag array 155 and a data array 156 (only one cache set 158 is illustrated for brevity). Though only one cache set 158 is shown as being stored at SLC Slice M in
[0052] In certain implementations, the memory controller 135 can be configured to operate the near memory 170 as a swap buffer 137 for the far memory 172 instead of being a dedicated cache memory for the CPU 132. As such, the CPU 132 can continue caching data in the near memory 170 while the near memory 170 and the far memory 172 are exposed to the operating system 141 (
[0053] In certain implementations, several bits in the metadata portion 159 in the near memory 170 can be configured to indicate (1) which section of the range of system memory the near memory 170 currently holds; and (2) locations of additional sections of the range of system memory in the far memory 172. In the example with four sections of system memory, eight bits in the metadata portion 159 in the near memory 170 can be configured to indicate the foregoing information. For instance, a first pair of bits (Bit 1 and Bit 2) can be configured to indicate which section is currently held in the near memory 170 as follows:
TABLE-US-00001
Bit 1    Bit 2    Section ID
0        0        A
0        1        B
1        0        C
1        1        D
As such, the memory controller 135 can readily determine that the near memory 170 contains data from section A of the system memory when Bit 1 and Bit 2 contain zero and zero, respectively, as illustrated in
[0054] While the first two bits correspond to the near memory 170, the additional six bits can be subdivided into three pairs individually corresponding to a location in the far memory 172. For instance, the second, third, and fourth pairs can each correspond to a first, second, or third location 172a-172c in the far memory 172, as follows:
TABLE-US-00002
First pair (Bit 1 and Bit 2)     Near memory
Second pair (Bit 3 and Bit 4)    First location in far memory
Third pair (Bit 5 and Bit 6)     Second location in far memory
Fourth pair (Bit 7 and Bit 8)    Third location in far memory
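The two tables above can be read together as a single eight-bit field. The following is a minimal sketch of a decoder, assuming the bits are supplied in order from Bit 1 to Bit 8 and following the pair-to-location table as written. The names `decode_metadata`, `locate`, and the slot labels are hypothetical and serve only as illustration.

```python
# Section encoding from the first table: (Bit n, Bit n+1) -> section ID.
SECTION_BY_PAIR = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}

# Slot order from the second table: first pair, second pair, and so on.
SLOTS = (
    "near memory",
    "first location in far memory",
    "second location in far memory",
    "third location in far memory",
)


def decode_metadata(bits):
    """Return a mapping from each slot to the section it currently holds.

    `bits` is a sequence of eight values (Bit 1 through Bit 8), each 0 or 1.
    """
    if len(bits) != 8 or any(bit not in (0, 1) for bit in bits):
        raise ValueError("expected eight bits, each 0 or 1")
    pairs = [(bits[i], bits[i + 1]) for i in range(0, 8, 2)]
    return {slot: SECTION_BY_PAIR[pair] for slot, pair in zip(SLOTS, pairs)}


def locate(bits, section):
    """Return the slot whose bit pair encodes `section`."""
    for slot, held in decode_metadata(bits).items():
        if held == section:
            return slot
    raise LookupError(f"section {section} is not encoded in the metadata")


# Example: near memory holds B, and A sits in the first far memory location.
example_bits = (0, 1, 0, 0, 1, 0, 1, 1)
layout = decode_metadata(example_bits)
```

In this example, `locate(example_bits, "A")` returns the first far memory location because the second pair contains (0, 0).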
[0055] As such, the memory controller 135 can readily determine where data from a particular section of the system memory is in the far memory 172 even though the data is not currently in the near memory 170. For instance, when the second pair (i.e., Bit 3 and Bit 4) contains (1, 1), the memory controller 135 can be configured to determine that data corresponding to Section D of the system memory is in the third location 172c in the far memory 172. When the third pair (i.e., Bit 5 and Bit 6) contains (1, 0), the memory controller 135 can be configured to determine that data corresponding to Section C of the system memory is in the second location 172b in the far memory 172. When the fourth pair (i.e., Bit 7 and Bit 8) contains (0, 1), the memory controller 135 can be configured to determine that data corresponding to Section B of the system memory is in the first location 172a in the far memory 172, as illustrated in
[0056] Using the data from the metadata portion 159 in the near memory 170, the memory controller 135 can be configured to manage swap operations between the near memory 170 and the far memory 172 using the near memory 170 as a swap buffer 137. For example, during a read operation, the CPU 132 can issue a command to the memory controller 135 to read data corresponding to section A when such data is not currently residing in the SLC 151, L1, or L2 cache. In response, the memory controller 135 can be configured to read from the near memory 170 to retrieve data from both the data portion 157 and the metadata portion 159 of the near memory 170. The memory controller 135 can then be configured to determine which section of the system memory the retrieved data corresponds to using the tables above, and whether the determined section matches a target section to be read. For example, when the target section is section A, and the first two bits from the metadata portion 159 contain (0, 0), then the memory controller 135 can be configured to determine that the retrieved data is from section A (e.g., “A data 162a”). Thus, the memory controller 135 can forward the retrieved A data 162a to a requesting entity, such as an application executed by the CPU 132.
[0057] On the other hand, when the first two bits from the metadata portion 159 contain (0, 1) instead of (0, 0), the memory controller 135 can be configured to determine that the retrieved data belongs to section B (referred to as “B data 162b”), not A data 162a. The memory controller 135 can then continue to examine the additional bits in the metadata portion 159 to determine which pair of bits contains (0, 0). For example, when the second pair (Bit 3 and Bit 4) from the metadata portion 159 contains (0, 0), then the memory controller 135 can be configured to determine that A data 162a is located at the first location 172a in the far memory 172. In response, the memory controller 135 can be configured to read A data 162a from the first location 172a in the far memory 172 and provide the A data 162a to the requesting entity. The memory controller 135 can then be configured to write the retrieved A data 162a into the near memory 170 and the previously retrieved B data 162b to the first location 172a in the far memory 172. The memory controller 135 can also be configured to modify the bits in the metadata portion 159 in the near memory 170 to reflect the swapping between section A and section B. Though particular mechanisms are described above to implement the swapping operations between the near memory 170 and the far memory 172, in other implementations, the memory controller 135 can be configured to perform the swapping operations in other suitable manners.
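The read path with swapping described above can be sketched as a small software model. This is an illustrative sketch only, not the memory controller 135 itself: the class `SwapBufferModel`, its fields, and the use of a Python list in place of the encoded metadata bits are assumptions made for readability.

```python
class SwapBufferModel:
    """Toy model of one near memory block and three far memory locations.

    Slot 0 stands for the near memory block (the swap buffer); slots 1-3
    stand for the first, second, and third far memory locations.
    """

    def __init__(self, initial_order=("A", "B", "C", "D")):
        # Each slot stores (section name, payload). The section names play
        # the role of the metadata bits described in the text.
        self.slots = [(name, f"{name} data") for name in initial_order]
        self.far_reads = 0  # counts accesses that had to go to far memory

    def metadata(self):
        """Section held by each slot, in slot order."""
        return [name for name, _ in self.slots]

    def read(self, target):
        """Return the payload for `target`, swapping it into near memory."""
        near_name, near_payload = self.slots[0]
        if near_name == target:
            return near_payload  # near memory already holds the target

        # Find the far slot holding the target (ValueError if unknown).
        far_slot = self.metadata().index(target)
        self.far_reads += 1
        payload = self.slots[far_slot][1]

        # Swap: the target moves into near memory and the evicted section
        # takes over the far slot the target came from.
        self.slots[0], self.slots[far_slot] = self.slots[far_slot], self.slots[0]
        return payload


# Near memory holds B and A is in the first far location, as in the text.
model = SwapBufferModel(("B", "A", "C", "D"))
first = model.read("A")   # far read followed by a swap
second = model.read("A")  # now served from near memory
```

After the first read, the modeled metadata lists A in the near memory slot and B in the first far location, mirroring the bit update described above.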
[0058] As shown in
[0059] Using the inclusivity bits, the SLC controller 150 can be configured to monitor inclusivity status in the cache system such as the swap buffer 137 and modify operations in the host 106 accordingly. For example, as shown in
[0060] Upon receiving the request 160 to read data A 162a, the memory controller 135 can be configured to determine whether data A 162a is currently in the swap buffer 137 using metadata in the metadata portion 159, as described above. In the illustrated example, data A 162a is indeed in the swap buffer 137. As such, the memory controller 135 reads data A 162a from the near memory 170 and transmits data A 162a to the SLC controller 150, as shown in
[0061] As shown in
[0062] Under other operational scenarios, however, certain intervening operations may cause the swap buffer 137 to contain data for other sections instead of for section A. For example, as shown in
[0063] In response to determining that data B 162b is currently not available at the SLC Slice M, the SLC controller 150 can be configured to request a copy of data B 162b from the memory controller 135. In response, the memory controller 135 can perform the swap operations described above to read data B 162b from the first location 172a in the far memory 172, store a copy of data B 162b in the swap buffer 137, provide a copy of data B 162b to the SLC controller 150, and write a copy of data A 162a to the first location 172a in the far memory 172. Upon receiving the copy of data B 162b, the SLC controller 150 can be configured to set the validity and inclusivity bits for section B as true while modifying the inclusivity bit for section A to not true, as shown in
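The bit updates described above can be sketched as follows. This is a hedged illustration rather than the SLC controller 150 itself: `TagEntry`, `CacheSetTags`, `on_fill`, and `can_write_directly` are hypothetical names, and clearing the inclusivity bit of every other section in the cache set (not only section A) is an assumption based on the sections sharing a single swap buffer block. The direct-write check follows the condition stated in the Abstract.

```python
from dataclasses import dataclass


@dataclass
class TagEntry:
    valid: bool = False      # validity bit: the SLC holds a copy of the cacheline
    inclusive: bool = False  # inclusivity bit: the swap buffer holds this section


class CacheSetTags:
    """Per-cache-set tag metadata tracked by a modeled SLC controller."""

    def __init__(self, sections=("A", "B", "C", "D")):
        self.entries = {name: TagEntry() for name in sections}

    def on_fill(self, section):
        """Record that `section` was read through the swap buffer into the SLC."""
        for name, entry in self.entries.items():
            if name == section:
                entry.valid = True
                entry.inclusive = True
            else:
                # Assumption: only one section can occupy the shared swap
                # buffer block, so every other section loses inclusivity.
                # Its validity bit is left unchanged.
                entry.inclusive = False

    def can_write_directly(self, section):
        """True when the swap buffer is known to hold `section`."""
        return self.entries[section].inclusive


tags = CacheSetTags()
tags.on_fill("A")  # data A read into the SLC; swap buffer holds A
tags.on_fill("B")  # intervening read of data B swaps B into the buffer
```

After both fills, section A remains valid in the SLC but is no longer marked inclusive, while section B is both valid and inclusive.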
[0064] As shown in
[0065] Several embodiments of the disclosed technology above can thus improve system performance of the host 106 when the near memory 170 is used as a swap buffer 137 instead of a dedicated cache for the CPU 132. Using performance simulations, the inventors have recognized that large numbers of operations in a host 106 do not involve intervening read/write operations such as those shown in
[0066]
[0067] As shown in
[0068]
[0069] As shown in
[0070]
[0071] Depending on the desired configuration, the processor 304 can be of any type including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 304 can include one or more levels of caching, such as a level-one cache 310 and a level-two cache 312, a processor core 314, and registers 316. An example processor core 314 can include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP Core), or any combination thereof. An example memory controller 318 can also be used with processor 304, or in some implementations memory controller 318 can be an internal part of processor 304.
[0072] Depending on the desired configuration, the system memory 306 can be of any type including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.) or any combination thereof. The system memory 306 can include an operating system 320, one or more applications 322, and program data 324. As shown in
[0073] The computing device 300 can have additional features or functionality, and additional interfaces to facilitate communications between basic configuration 302 and any other devices and interfaces. For example, a bus/interface controller 330 can be used to facilitate communications between the basic configuration 302 and one or more data storage devices 332 via a storage interface bus 334. The data storage devices 332 can be removable storage devices 336, non-removable storage devices 338, or a combination thereof. Examples of removable storage and non-removable storage devices include magnetic disk devices such as flexible disk drives and hard-disk drives (HDD), optical disk drives such as compact disk (CD) drives or digital versatile disk (DVD) drives, solid state drives (SSD), and tape drives to name a few. Example computer storage media can include volatile and nonvolatile, removable, and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. The term “computer readable storage media” or “computer readable storage device” excludes propagated signals and communication media.
[0074] The system memory 306, removable storage devices 336, and non-removable storage devices 338 are examples of computer readable storage media. Computer readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other media which can be used to store the desired information, and which can be accessed by computing device 300. Any such computer readable storage media can be a part of computing device 300. The term “computer readable storage medium” excludes propagated signals and communication media.
[0075] The computing device 300 can also include an interface bus 340 for facilitating communication from various interface devices (e.g., output devices 342, peripheral interfaces 344, and communication devices 346) to the basic configuration 302 via bus/interface controller 330. Example output devices 342 include a graphics processing unit 348 and an audio processing unit 350, which can be configured to communicate to various external devices such as a display or speakers via one or more A/V ports 352. Example peripheral interfaces 344 include a serial interface controller 354 or a parallel interface controller 356, which can be configured to communicate with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device, etc.) or other peripheral devices (e.g., printer, scanner, etc.) via one or more I/O ports 358. An example communication device 346 includes a network controller 360, which can be arranged to facilitate communications with one or more other computing devices 362 over a network communication link via one or more communication ports 364.
[0076] The network communication link can be one example of communication media. Communication media can typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and can include any information delivery media. A “modulated data signal” can be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) and other wireless media. The term computer readable media as used herein can include both storage media and communication media.
[0077] The computing device 300 can be implemented as a portion of a small-form factor portable (or mobile) electronic device such as a cell phone, a personal data assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application specific device, or a hybrid device that includes any of the above functions. The computing device 300 can also be implemented as a personal computer including both laptop computer and non-laptop computer configurations.
[0078] From the foregoing, it will be appreciated that specific embodiments of the disclosure have been described herein for purposes of illustration, but that various modifications may be made without deviating from the disclosure. In addition, many of the elements of one embodiment may be combined with other embodiments in addition to or in lieu of the elements of the other embodiments. Accordingly, the technology is not limited except as by the appended claims.