G06F2212/2542

Memory Setting Method and Apparatus
20220317889 · 2022-10-06 ·

A memory setting method and apparatus the method including obtaining, by a processor that is in a non-uniform memory access architecture (NUMA) system and that has at least two memories, performance of the at least two memories upon the processor starting, and setting, based on the performance of the at least two memories, at least one of the at least two memories as a local memory, and setting, based on the performance, at least one of the at least two memories as a remote memory, where performance of the local memory is better than performance of the remote memory.

Systems and methods for improving cache efficiency and utilization

Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache coupled to the processing resources. The cache controller is configured to control cache priority by determining whether default settings or an instruction will control cache operations for the cache.

TASK SCHEDULING FOR MACHINE-LEARNING WORKLOADS
20230136661 · 2023-05-04 ·

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, are described for scheduling tasks of ML workloads. A system receives requests to perform the workloads and determines, based on the requests, resource requirements to perform the workloads. The system includes multiple hosts and each host includes multiple accelerators. The system determines a quantity of hosts assigned to execute tasks of the workload based on the resource requirement and the accelerators for each host. For each host in the quantity of hosts, the system generates a task specification based on a memory access topology of the host. The specification specifies the task to be executed at the host using resources of the host that include the multiple accelerators. The system provides the task specifications to the hosts and performs the workloads when each host executes assigned tasks specified in the task specifications for the host.

Hardware-Assisted Memory Disaggregation with Recovery from Network Failures Using Non-Volatile Memory
20230205649 · 2023-06-29 ·

Techniques for implementing hardware-assisted memory disaggregation with recovery from network failures/problems are provided. In one set of embodiments, a hardware controller of a computer system can maintain a copy of a “remote memory” of the computer system (i.e., a section of the physical memory address space of the computer system that maps to a portion of the physical system memory of a remote computer system) in a local backup memory. The backup memory may be implemented using a non-volatile memory that is slower, but also less expensive, than conventional dynamic random-access memory (DRAM). Then, if the hardware controller is unable to retrieve data in the remote memory from the remote computer system within a specified time window due to, e.g., a network failure or other problem, the hardware controller can retrieve the data from the backup memory, thereby avoiding a hardware error condition (and potential application/system crash).

Method and system for constructing persistent memory index in non-uniform memory access architecture
11687392 · 2023-06-27 · ·

A method for constructing a persistent memory index in a non-uniform memory access architecture includes: maintaining partial persistent views in a persistent memory and maintaining a global volatile view in a DRAM; an underlying persistent memory index processing a request in a foreground thread when cold data is accessed; when hot data is accessed, reading a key-value pair for a piece of hot data in the global volatile view in response to a query operation carried in the request, and in response to an insert/update/delete operation carried in the request, updating a local partial persistent view and the global volatile view; and in response to a hotspot migration, a background thread generating new partial persistent views and a new global volatile view, and recycling the partial persistent views and the global volatile view for old hot data into the underlying persistent memory index.

Techniques for concurrently supporting virtual NUMA and CPU/memory hot-add in a virtual machine
11687356 · 2023-06-27 · ·

Techniques for concurrently supporting virtual non-uniform memory access (virtual NUMA) and CPU/memory hot-add in a virtual machine (VM) are provided. In one set of embodiments, a hypervisor of a host system can compute a node size for a virtual NUMA topology of the VM, where the node size indicates a maximum number of virtual central processing units (vCPUs) and a maximum amount of memory to be included in each virtual NUMA node. The hypervisor can further build and expose the virtual NUMA topology to the VM. Then, at a time of receiving a request to hot-add a new vCPU or memory region to the VM, the hypervisor can check whether all existing nodes in the virtual NUMA topology have reached the maximum number of vCPUs or maximum amount of memory, per the computed node size. If so, the hypervisor can create a new node with the new vCPU or memory region and add the new node to the virtual NUMA topology.

GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT

Described herein is a graphics processing unit (GPU) configured to receive an instruction having multiple operands, where the instruction is a single instruction multiple data (SIMD) instruction configured to use a bfloat16 (BF16) number format and the BF16 number format is a sixteen-bit floating point format having an eight-bit exponent. The GPU can process the instruction using the multiple operands, where to process the instruction includes to perform a multiply operation, perform an addition to a result of the multiply operation, and apply a rectified linear unit function to a result of the addition.

Data deployment determination apparatus, data deployment determination program, and data deployment determination method
09842049 · 2017-12-12 · ·

A data deployment determination apparatus includes a correlation information creation processor that creates correlation information in which addresses indicating areas in a first memory are correlated with frequency information on memory accesses for the respective addresses, from trace information on a memory access to the first memory, a time reduction calculation processor that calculates, for each of the addresses, time reduction in memory accesses to data stored in the first memory based on the correlation information when data stored in the first memory is stored in a second memory which is a memory having a larger bandwidth than the first memory, and a data deployment determination processor that determines that first data stored in the address of which the time reduction is larger than the time reduction corresponding to second data stored in the address is to be stored in the second memory in preference to the second data.

DATA ACCESS BETWEEN COMPUTING NODES

Technology for an apparatus is described. The apparatus can receive a command to copy data. The command can indicate a first address, a second address and an offset value. The apparatus can determine a first non-uniform memory access (NUMA) domain ID for the first address and a second NUMA domain ID for the second address. The apparatus can identify a first computing node with memory that corresponds to the first NUMA domain ID and a second computing node with memory that corresponds to the second NUMA domain ID. The apparatus can generate an instruction for copying data in a first memory range of the first computing node to a second memory range of the second computing node. The first memory range can be defined by the first address and the offset value and the second memory range can be defined by the second address and the offset value.

GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT

Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.