Patent classifications
G06F2212/2542
A METHOD AND APPARATUS TO USE DRAM AS A CACHE FOR SLOW BYTE-ADDRESSIBLE MEMORY FOR EFFICIENT CLOUD APPLICATIONS
Various embodiments are generally directed to virtualized systems. A first guest memory page may be identified based at least in part on a number of accesses to a page table entry for the first guest memory page in a page table by an application executing in a virtual machine (VM) on the processor, the first guest memory page corresponding to a first byte-addressable memory. The execution of the VM and the application on the processor may be paused. The first guest memory page may be migrated to a target memory page in a second byte-addressable memory, the target memory page comprising one of a target host memory page and a target guest memory page, the second byte-addressable memory having an access speed faster than an access speed of the first byte-addressable memory.
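The selection and migration steps in this abstract can be illustrated with a minimal sketch. It assumes a per-page counter of page table entry accesses is already available; the names `GuestPage`, `select_hot_pages`, `migrate_hot_pages`, the threshold, and the `free_fast_slots` limit are all hypothetical and do not come from the patent. Pausing and resuming the VM is represented only by log entries.

```python
from dataclasses import dataclass


@dataclass
class GuestPage:
    page_id: int
    pte_access_count: int  # accesses observed on this page's page table entry
    tier: str              # "slow" or "fast" byte-addressable memory


def select_hot_pages(pages, threshold):
    """Return slow-tier pages whose PTE access count meets the threshold."""
    return [p for p in pages if p.tier == "slow" and p.pte_access_count >= threshold]


def migrate_hot_pages(pages, threshold, free_fast_slots):
    """Pause, migrate the hottest pages into the fast tier, then resume.

    Assumption: the hottest pages are migrated first and the fast tier
    has room for at most `free_fast_slots` pages.
    """
    log = ["pause_vm"]
    hot = sorted(select_hot_pages(pages, threshold),
                 key=lambda p: p.pte_access_count, reverse=True)
    for page in hot[:free_fast_slots]:
        page.tier = "fast"
        log.append(f"migrate page {page.page_id}")
    log.append("resume_vm")
    return log


pages = [
    GuestPage(0, 5, "slow"),
    GuestPage(1, 120, "slow"),
    GuestPage(2, 300, "slow"),
    GuestPage(3, 500, "fast"),
]
log = migrate_hot_pages(pages, threshold=100, free_fast_slots=1)
```

With one free slot, only page 2 (the most frequently accessed slow-tier page) moves; page 1 stays in the slow tier until space is available.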
Hardware-Assisted Memory Disaggregation with Recovery from Network Failures Using Non-Volatile Memory
Techniques for implementing hardware-assisted memory disaggregation with recovery from network failures/problems are provided. In one set of embodiments, a hardware controller of a computer system can maintain a copy of a “remote memory” of the computer system (i.e., a section of the physical memory address space of the computer system that maps to a portion of the physical system memory of a remote computer system) in a local backup memory. The backup memory may be implemented using a non-volatile memory that is slower, but also less expensive, than conventional dynamic random-access memory (DRAM). Then, if the hardware controller is unable to retrieve data in the remote memory from the remote computer system within a specified time window due to, e.g., a network failure or other problem, the hardware controller can retrieve the data from the backup memory, thereby avoiding a hardware error condition (and potential application/system crash).
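The fallback path described above can be modeled in software as follows. This is only a sketch: the real mechanism is a hardware controller, and the classes `RemoteMemory` and `HardwareController`, the `RemoteTimeout` exception, and the write-through policy for keeping the backup copy current are assumptions made for illustration.

```python
class RemoteTimeout(Exception):
    """Raised when the remote system does not answer within the time window."""


class RemoteMemory:
    """Hypothetical stand-in for memory hosted on a remote computer system."""

    def __init__(self):
        self.cells = {}
        self.reachable = True

    def write(self, addr, data):
        if not self.reachable:
            raise RemoteTimeout(f"write to {addr:#x} timed out")
        self.cells[addr] = data

    def read(self, addr):
        if not self.reachable:
            raise RemoteTimeout(f"read from {addr:#x} timed out")
        return self.cells[addr]


class HardwareController:
    def __init__(self, remote):
        self.remote = remote
        self.backup = {}  # models the slower, non-volatile local backup memory

    def write(self, addr, data):
        # Assumption: every write also updates the local backup copy.
        self.backup[addr] = data
        self.remote.write(addr, data)

    def read(self, addr):
        """Return (data, source); fall back to the backup on a timeout."""
        try:
            return self.remote.read(addr), "remote"
        except RemoteTimeout:
            return self.backup[addr], "backup"


remote = RemoteMemory()
ctrl = HardwareController(remote)
ctrl.write(0x1000, b"payload")
first = ctrl.read(0x1000)    # served by the remote system
remote.reachable = False     # simulate a network failure
second = ctrl.read(0x1000)   # served by the local backup memory
```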
Ternary content addressable memory-enhanced cache coherency acceleration
A system and method for cache coherency within multiprocessor environments is provided. Each node controller of a plurality of nodes within a multiprocessor system receives cache coherency protocol requests from local processor sockets and other node controller(s). A ternary content addressable memory (TCAM) accelerator in the node controller determines whether a cache coherency protocol request comprises a snoop request and, if so, searches the TCAM based on an address within the request. In response to detecting only one match between an entry of the TCAM and the received snoop request, a response is sent to the requesting local processor without having to access a coherency directory.
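A ternary match compares only the address bits selected by a mask and treats the remaining bits as "don't care". The sketch below shows that lookup and the single-match fast path. The entry layout (`value`, `mask`, `state`), the request dictionary, and the behavior for non-snoop requests and for zero or multiple matches (falling back to a directory lookup) are assumptions, not details taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TcamEntry:
    value: int
    mask: int   # 1 bits are compared; 0 bits are "don't care"
    state: str  # coherency state to report on a hit (hypothetical field)


def tcam_search(entries, address):
    """Return every entry whose masked value equals the masked address."""
    return [e for e in entries if (address & e.mask) == (e.value & e.mask)]


def handle_request(entries, request, directory_lookup):
    """Answer from the TCAM on exactly one match, else use the directory."""
    if request["type"] == "snoop":
        matches = tcam_search(entries, request["address"])
        if len(matches) == 1:
            return ("tcam", matches[0].state)
    # Assumed fallback: non-snoop requests and ambiguous or missing
    # matches are resolved through the coherency directory.
    return ("directory", directory_lookup(request["address"]))


entries = [
    TcamEntry(value=0x1000, mask=0xF000, state="invalid"),
    TcamEntry(value=0x2000, mask=0xF000, state="shared"),
    TcamEntry(value=0x2040, mask=0xFFF0, state="exclusive"),
]


def directory_lookup(address):
    return "dir:" + hex(address)


single = handle_request(entries, {"type": "snoop", "address": 0x1234}, directory_lookup)
multi = handle_request(entries, {"type": "snoop", "address": 0x2044}, directory_lookup)
```

Address `0x1234` matches only the first entry, so the response comes straight from the TCAM. Address `0x2044` matches two entries, so this sketch defers to the directory.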
ACCELERATION MANAGEMENT NODE, ACCELERATION NODE, CLIENT, AND METHOD
Embodiments of the present application provide an acceleration management node. The acceleration management node separately receives acceleration device information of all acceleration devices. The acceleration device information includes an algorithm type, an acceleration bandwidth or non-uniform memory access architecture (NUMA). The acceleration management node obtains an invocation request from a client. The acceleration management node queries the acceleration device information to determine, from all the acceleration devices of the at least one acceleration node, a target acceleration device matching the invocation request. The acceleration management node further instructs a target acceleration node to respond to the invocation request.
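The matching step performed by the management node might look like the sketch below. The record fields mirror the three kinds of device information named in the abstract (algorithm type, acceleration bandwidth, NUMA); the names `AccelDevice` and `pick_device`, the request parameters, and the tie-break rule (prefer the device with the most free bandwidth) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccelDevice:
    node: str                   # acceleration node hosting the device
    device_id: str
    algorithm: str              # algorithm type the device accelerates
    free_bandwidth_gbps: float  # remaining acceleration bandwidth
    numa_node: int


def pick_device(devices, algorithm, min_bandwidth_gbps,
                numa_node: Optional[int] = None):
    """Return the best device matching an invocation request, or None."""
    candidates = [
        d for d in devices
        if d.algorithm == algorithm
        and d.free_bandwidth_gbps >= min_bandwidth_gbps
        and (numa_node is None or d.numa_node == numa_node)
    ]
    if not candidates:
        return None
    # Assumed policy: prefer the device with the most free bandwidth.
    return max(candidates, key=lambda d: d.free_bandwidth_gbps)


devices = [
    AccelDevice("node-a", "fpga0", "aes", 40.0, 0),
    AccelDevice("node-a", "fpga1", "zlib", 25.0, 1),
    AccelDevice("node-b", "fpga2", "aes", 10.0, 1),
]
target = pick_device(devices, "aes", min_bandwidth_gbps=5.0)
target_numa1 = pick_device(devices, "aes", min_bandwidth_gbps=5.0, numa_node=1)
```

The management node would then instruct `target.node` to respond to the client's invocation request.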
Object memory instruction set
Embodiments of the present invention are directed to an instruction set of an object memory fabric. This object memory fabric instruction set can be used to define arbitrary, parallel functionality such as: direct object address manipulation and generation without the overhead of complex address translation and software layers to manage differing address space; direct object authentication with no runtime overhead that can be set based on secure 3rd party authentication software; object related memory computing in which, as objects move between nodes, the computing can move with them; and parallelism that is dynamic and transparent based on scale and activity. These instructions are divided into three conceptual classes: memory reference including load, store, and special memory fabric instructions; control flow including fork, join, and branches; and execute including arithmetic and comparison instructions.
Runtime allocation and utilization of persistent memory as volatile memory
The described technologies enable a computing device to allocate at least a portion of its persistent memory as volatile memory during runtime. At least some implementations create a file in the persistent memory of the computing device. The file is created in the persistent memory of the computing device during runtime of a virtual machine (VM) hosted by the computing device. The file may be allocated to the VM. The file allocated to the VM may be used as volatile memory. For example, the VM may use the file to store temporary data (e.g., volatile data). In some implementations, the temporary data is associated with an application executing in the VM.
Identification of a computing device accessing a shared memory
A method for identifying, in a system including two or more computing devices that are able to communicate with each other, with each computing device having a cache and connected to a corresponding memory, a computing device accessing one of the memories, includes monitoring memory access to any of the memories; monitoring cache coherency commands between computing devices; and identifying the computing device accessing one of the memories by using information related to the memory access and cache coherency commands.
Techniques for Concurrently Supporting Virtual NUMA and CPU/Memory Hot-Add in a Virtual Machine
Techniques for concurrently supporting virtual non-uniform memory access (virtual NUMA) and CPU/memory hot-add in a virtual machine (VM) are provided. In one set of embodiments, a hypervisor of a host system can compute a node size for a virtual NUMA topology of the VM, where the node size indicates a maximum number of virtual central processing units (vCPUs) and a maximum amount of memory to be included in each virtual NUMA node. The hypervisor can further build and expose the virtual NUMA topology to the VM. Then, at a time of receiving a request to hot-add a new vCPU or memory region to the VM, the hypervisor can check whether all existing nodes in the virtual NUMA topology have reached the maximum number of vCPUs or maximum amount of memory, per the computed node size. If so, the hypervisor can create a new node with the new vCPU or memory region and add the new node to the virtual NUMA topology.
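The hot-add check can be sketched as below. The abstract states only that a new node is created when every existing node has reached the computed node size; placing the new vCPU or memory region in the first node that still has room is an assumption, as are the names `VNode` and `VNumaTopology` and the assumption that a hot-added memory region never exceeds the per-node maximum.

```python
from dataclasses import dataclass


@dataclass
class VNode:
    vcpus: int = 0
    memory_mb: int = 0


class VNumaTopology:
    def __init__(self, max_vcpus, max_memory_mb):
        # The node size: per-node maximum vCPU count and memory amount.
        self.max_vcpus = max_vcpus
        self.max_memory_mb = max_memory_mb
        self.nodes = []

    def hot_add_vcpu(self):
        """Add one vCPU; return the index of the node that received it."""
        for i, node in enumerate(self.nodes):
            if node.vcpus < self.max_vcpus:
                node.vcpus += 1
                return i
        # All existing nodes are full: create a new node for the vCPU.
        self.nodes.append(VNode(vcpus=1))
        return len(self.nodes) - 1

    def hot_add_memory(self, size_mb):
        """Add a memory region; return the index of the node that received it."""
        for i, node in enumerate(self.nodes):
            if node.memory_mb + size_mb <= self.max_memory_mb:
                node.memory_mb += size_mb
                return i
        # No node can hold the region: create a new node for it.
        self.nodes.append(VNode(memory_mb=size_mb))
        return len(self.nodes) - 1


topo = VNumaTopology(max_vcpus=2, max_memory_mb=4096)
vcpu_placements = [topo.hot_add_vcpu() for _ in range(3)]
mem_placements = [topo.hot_add_memory(4096), topo.hot_add_memory(1024)]
```

The third vCPU lands in a newly created node because the first node already holds the maximum of two.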
LOCK-FREE WORK-STEALING THREAD SCHEDULER
Systems and methods are provided for lock-free thread scheduling. Threads may be placed in a ring buffer shared by all computer processing units (CPUs), e.g., in a node. A thread assigned to a CPU may be placed in the CPU's local run queue. However, when a CPU's local run queue is cleared, that CPU checks the shared ring buffer to determine if any threads are waiting to run on that CPU, and if so, the CPU pulls a batch of threads related to that ready-to-run thread to execute. If not, an idle CPU randomly selects another CPU to steal threads from, and the idle CPU attempts to dequeue a thread batch associated with the CPU from the shared ring buffer. Polling may be handled through the use of a shared poller array to dynamically distribute polling across multiple CPUs.
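The order of decisions an idle CPU makes can be shown with a single-threaded simulation. This sketch is not lock-free and does not model the poller array; it only illustrates the lookup order (local run queue, then own batches in the shared buffer, then a randomly chosen victim's batches). The `Scheduler` class, its method names, and the use of a `deque` in place of a ring buffer are assumptions.

```python
import random
from collections import deque


class Scheduler:
    def __init__(self, num_cpus, rng=None):
        self.local = [deque() for _ in range(num_cpus)]  # per-CPU run queues
        self.shared = deque()  # stands in for the shared ring buffer
        self.rng = rng or random.Random()

    def submit_batch(self, cpu, threads):
        """Place a batch of threads intended for `cpu` in the shared buffer."""
        self.shared.append((cpu, list(threads)))

    def _take_batch_for(self, cpu):
        """Remove and return the oldest batch associated with `cpu`, if any."""
        idx = next((i for i, (target, _) in enumerate(self.shared)
                    if target == cpu), None)
        if idx is None:
            return None
        batch = self.shared[idx][1]
        del self.shared[idx]
        return batch

    def next_thread(self, cpu):
        """Return the next thread for `cpu`, refilling or stealing if idle."""
        if not self.local[cpu]:
            batch = self._take_batch_for(cpu)
            if batch is None:
                others = [c for c in range(len(self.local)) if c != cpu]
                if others:
                    victim = self.rng.choice(others)  # random steal target
                    batch = self._take_batch_for(victim)
            if batch:
                self.local[cpu].extend(batch)
        return self.local[cpu].popleft() if self.local[cpu] else None


sched = Scheduler(num_cpus=2)
sched.submit_batch(0, ["a", "b"])
sched.submit_batch(0, ["c"])
order = [sched.next_thread(1), sched.next_thread(0),
         sched.next_thread(1), sched.next_thread(0)]
```

With two CPUs the victim choice is deterministic: CPU 1 has nothing of its own, so it steals the first batch destined for CPU 0, while CPU 0 picks up the remaining batch.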
SYSTEM COHERENCY PROTOCOL
Embodiments herein describe a coherency protocol for a distributed computing topology that permits large stalls on various interfaces. In one embodiment, the computing topology includes multiple boards which each contain multiple processors. When a particular core on a processor wants access to data that is not currently stored in its cache, the core can first initiate a request to search for the cache line in the caches for other cores on the same processor. If the cache line is not found, the cache coherency protocol permits the processor to then broadcast a request to the other processors on the same board. If a processor on the same board does not have the data, the processor can then broadcast the request to the other boards in the system. The processors in those boards can then search their caches to identify the data.
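The three-stage search (same processor, then same board, then other boards) can be sketched as a nested lookup. The nested-dictionary topology and the function name `find_cache_line` are assumptions made for illustration; a real protocol would exchange messages rather than inspect caches directly.

```python
def find_cache_line(topology, board, proc, core, line):
    """Search for `line` in widening scopes and report where it was found.

    `topology` maps board -> processor -> core -> set of cached lines.
    Returns (scope, board, processor, core) or None if no cache holds it.
    """
    # Stage 1: caches of the other cores on the same processor.
    for c, lines in topology[board][proc].items():
        if c != core and line in lines:
            return ("same_processor", board, proc, c)
    # Stage 2: broadcast to the other processors on the same board.
    for p, cores in topology[board].items():
        if p == proc:
            continue
        for c, lines in cores.items():
            if line in lines:
                return ("same_board", board, p, c)
    # Stage 3: broadcast to the other boards in the system.
    for b, procs in topology.items():
        if b == board:
            continue
        for p, cores in procs.items():
            for c, lines in cores.items():
                if line in lines:
                    return ("other_board", b, p, c)
    return None


topology = {
    "board0": {
        "cpu0": {"core0": set(), "core1": {0x40}},
        "cpu1": {"core0": {0x80}},
    },
    "board1": {
        "cpu0": {"core0": {0xC0}},
    },
}
near = find_cache_line(topology, "board0", "cpu0", "core0", 0x40)
mid = find_cache_line(topology, "board0", "cpu0", "core0", 0x80)
far = find_cache_line(topology, "board0", "cpu0", "core0", 0xC0)
```

Each stage runs only if the previous one fails, so a line held nearby is found without any board-level or system-level broadcast.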