G06F9/5016

Neural network operation reordering for parallel execution

Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
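
A minimal sketch of the reordering idea, assuming a greedy per-engine timeline model; the names (Op, finish_times, reorder) and the cost model are illustrative, not from the disclosure:

    from dataclasses import dataclass, field

    @dataclass
    class Op:
        name: str
        engine: str                  # execution engine assigned by the compiler
        cost: int                    # hardware-usage / duration estimate
        deps: list = field(default_factory=list)

    def finish_times(order):
        # Each engine runs its ops in program order; an op starts once its
        # engine is free and all of its dependencies have finished.
        engine_free, done = {}, {}
        for op in order:
            start = max([engine_free.get(op.engine, 0)] +
                        [done[d] for d in op.deps])
            done[op.name] = start + op.cost
            engine_free[op.engine] = done[op.name]
        return done

    def makespan(order):
        return max(finish_times(order).values())

    def valid(order):
        # An order is legal if every op appears after all of its dependencies.
        seen = set()
        for op in order:
            if any(d not in seen for d in op.deps):
                return False
            seen.add(op.name)
        return True

    def reorder(order):
        # Hoist one op past non-dependencies when that shrinks the schedule,
        # standing in for "identify a runtime inefficiency ... reorder".
        best = list(order)
        for i in range(1, len(order)):
            for j in range(i):
                cand = list(order)
                cand.insert(j, cand.pop(i))
                if valid(cand) and makespan(cand) < makespan(best):
                    best = cand
        return best

    ops = [Op("a", "pe", 4),
           Op("b", "pe", 4, ["a"]),
           Op("c", "act", 2, ["b"]),
           Op("d", "act", 2)]
    print(makespan(ops))           # 12: 'd' is stuck behind 'c' on the act engine
    print(makespan(reorder(ops)))  # 10: 'd' hoisted into the engine's idle gap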

Systems and methods for simulation of dynamic systems

A highly parallelized implementation of parallel tempering for simulating dynamic systems, such as quantum processors, is provided. Replica exchange is facilitated by synchronizing grid-level memory. Particular implementations for simulating quantum processors by representing cells of qubits and couplers in grid-, block-, and thread-level memory are discussed. Parallel tempering of such dynamic systems can be assisted by modifying replicas based on isoenergetic cluster moves (ICMs). ICMs are generated via secondary replicas that are maintained alongside the primary replicas; the secondary replicas may be exchanged between blocks and/or generated dynamically by blocks without necessarily being exchanged. Certain refinements, such as exchanging energies and temperatures through grid-level memory, are also discussed.
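
As a rough, single-threaded illustration of the underlying parallel tempering loop (the disclosure's GPU mapping of replicas to grid-, block-, and thread-level memory and its ICM moves are omitted), a toy Ising-chain version might look like the following; all names are invented:

    import math, random

    def energy(spins, J=1.0):
        # 1-D Ising chain: E = -J * sum_i s_i * s_{i+1}
        return -J * sum(s * t for s, t in zip(spins, spins[1:]))

    def metropolis_sweep(spins, beta):
        for i in range(len(spins)):
            trial = list(spins)
            trial[i] = -trial[i]
            dE = energy(trial) - energy(spins)
            if dE <= 0 or random.random() < math.exp(-beta * dE):
                spins = trial
        return spins

    def parallel_tempering(n=16, betas=(0.1, 0.5, 1.0, 2.0), sweeps=200):
        replicas = [[random.choice((-1, 1)) for _ in range(n)] for _ in betas]
        for _ in range(sweeps):
            replicas = [metropolis_sweep(r, b) for r, b in zip(replicas, betas)]
            # Replica exchange between neighbouring temperatures, accepted
            # with probability min(1, exp((beta_i - beta_j) * (E_i - E_j))).
            for i in range(len(betas) - 1):
                dB = betas[i] - betas[i + 1]
                dE = energy(replicas[i]) - energy(replicas[i + 1])
                if random.random() < min(1.0, math.exp(dB * dE)):
                    replicas[i], replicas[i + 1] = replicas[i + 1], replicas[i]
        return replicas

    coldest = parallel_tempering()[-1]     # replica at the largest beta
    print(energy(coldest))                 # should be near the ground state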

ELASTICALLY MANAGING WORKERS OF MULTI-WORKER WORKLOADS ON ACCELERATOR DEVICES

The disclosure herein describes elastically managing the execution of workers of multi-worker workloads on accelerator devices. A first worker of a workload is executed on an accelerator device during a first time interval. A first context switch point is identified when the first worker is in a first worker state. At the identified context switch point, a first memory state of the first worker is stored in a host memory and the accelerator device is configured with a second memory state of a second worker of the workload. The second worker is executed during a second time interval, and a second context switch point is identified at the end of the second time interval when the second worker is in a state equivalent to the first worker state. During the intervals, collective communication operations between the workers are accumulated and, at the second context switch point, the accumulated operations are performed.
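
A toy sketch of the time-slicing and deferred-collective idea, with invented names and a scalar "gradient" standing in for real worker state:

    class Worker:
        def __init__(self, name, grad):
            self.name = name
            self.grad = grad
            self.saved_state = {"weights": 0.0}   # checkpoint kept in host memory

        def train_interval(self, device_memory, pending):
            # One time slice on the accelerator: rather than issuing its
            # collective (e.g. an all-reduce) immediately, the worker queues it.
            pending.append(self.grad)

    def run_elastic(workers, steps=3):
        for _ in range(steps):
            pending = []                          # accumulated collective ops
            for w in workers:
                device_memory = w.saved_state     # context switch in: restore
                w.train_interval(device_memory, pending)
                w.saved_state = device_memory     # context switch out: save
            # Second context switch point: all workers are at equivalent
            # states, so the deferred collective runs once for the group.
            avg = sum(pending) / len(pending)
            for w in workers:
                w.saved_state["weights"] -= avg

    ws = [Worker("w0", 0.1), Worker("w1", 0.3)]
    run_elastic(ws)
    print(ws[0].saved_state["weights"])           # about -0.6 after 3 steps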

METHOD AND SYSTEM FOR ALLOCATING GRAPHICS PROCESSING UNIT PARTITIONS FOR A COMPUTER VISION ENVIRONMENT
20230236887 · 2023-07-27

Techniques described herein relate to a method for allocating graphics processing unit partitions for a computer vision environment. The method includes obtaining, by a computer vision (CV) manager, an initial graphics processing unit (GPU) partition allocation request associated with a CV workload; in response to obtaining the initial GPU partition allocation request: obtaining CV workload information associated with the CV workload; obtaining first CV environment configuration information associated with the GPU partition allocation request; generating an optimal GPU partition allocation based on the first CV environment configuration information and the CV workload information using a GPU partition model; and initiating performance of the CV workload in a CV environment based on the optimal GPU partition allocation.
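
A hypothetical sketch of the allocation step: here the "GPU partition model" is reduced to a memory-fit heuristic over MIG-style partition sizes, and every field name is invented:

    PARTITIONS = {"1g.10gb": 10, "2g.20gb": 20, "3g.40gb": 40, "7g.80gb": 80}

    def optimal_partition(cv_workload, cv_environment):
        # Estimate demand from CV workload information plus environment
        # configuration, then pick the smallest partition that fits.
        need_gb = (cv_workload["model_gb"] * cv_workload["streams"]
                   * cv_environment["headroom"])
        for name, gb in sorted(PARTITIONS.items(), key=lambda kv: kv[1]):
            if gb >= need_gb:
                return name
        return max(PARTITIONS, key=PARTITIONS.get)   # fall back to largest

    workload = {"model_gb": 4, "streams": 3}         # e.g. 3 camera streams
    environment = {"headroom": 1.5}
    print(optimal_partition(workload, environment))  # -> '2g.20gb'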

ATTRIBUTING ERRORS TO INPUT/OUTPUT PERIPHERAL DRIVERS
20230236917 · 2023-07-27

A process includes determining, by an operating system agent of a computer system, a first profile that is associated with an input/output (I/O) peripheral of the computer system. The first profile is associated with an error register of the I/O peripheral, and the first profile represents a configuration of the computer system that is associated with the I/O peripheral. The process includes, responsive to a notification of an error being associated with the I/O peripheral, determining, by the operating system agent, a second profile that is associated with the I/O peripheral. The second profile is associated with the error register. Moreover, responsive to the notification of the error, the process includes comparing, by a baseboard management controller of the computer system, the second profile to the first profile. Based on the comparison, the process includes determining, by the baseboard management controller, whether the error is attributable to a driver for the I/O peripheral.
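
One speculative reading of the comparison step, with invented profile fields: if the error register changed while the configuration captured in the profile did not, the error is attributed to the driver rather than to the configuration:

    def capture_profile(device):
        # Profile = error-register snapshot plus configuration state.
        return {"error_register": device["error_register"],
                "bar_config": device["bar_config"],
                "msi_enabled": device["msi_enabled"]}

    def attributable_to_driver(first_profile, second_profile):
        config_unchanged = all(first_profile[k] == second_profile[k]
                               for k in first_profile if k != "error_register")
        error_logged = (second_profile["error_register"]
                        != first_profile["error_register"])
        return error_logged and config_unchanged

    before = capture_profile({"error_register": 0x0,
                              "bar_config": "64-bit", "msi_enabled": True})
    after = capture_profile({"error_register": 0x4,
                             "bar_config": "64-bit", "msi_enabled": True})
    print(attributable_to_driver(before, after))   # True: config did not change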

SAFE CRITICAL SECTION OPERATIONS FOR VIRTUAL MACHINES WITH VIRTUAL CENTRAL PROCESSING UNIT OVERCOMMIT
20230236901 · 2023-07-27

Safe critical section operations for virtual machines with virtual central processing unit overcommit are provided by: in response to identifying a preempting task to run on a first physical central processing unit (PCPU) from a second PCPU, setting a status of a flag in a virtual memory used by a first virtual central processing unit (VCPU) running on the first PCPU to indicate that the preempting task will interrupt the first VCPU; in response to initiating execution of a read-side critical section operation scheduled by the first VCPU to run on the first PCPU, checking the status of the flag in the virtual memory; and in response to the status of the flag being positive: exiting the first VCPU to a hypervisor; executing, by the hypervisor, the preempting task on the first PCPU; and after completing the preempting task, continuing execution of the read-side critical section operation.
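
A conceptual sketch of the flag protocol, with Python functions standing in for guest, hypervisor, and shared virtual memory; every name is illustrative:

    class SharedPage:
        def __init__(self):
            self.preempt_pending = False   # the flag in guest virtual memory

    def mark_preemption(page):
        # Runs on behalf of the second PCPU: warn the first VCPU that a
        # preempting task is about to interrupt it.
        page.preempt_pending = True

    def read_side_critical_section(page, work, run_preempting_task):
        # Check the flag when initiating the critical section.
        if page.preempt_pending:
            # Exit to the hypervisor, let the preempting task run first,
            # then resume; the critical section itself is never interrupted.
            run_preempting_task()
            page.preempt_pending = False
        return work()

    page = SharedPage()
    mark_preemption(page)
    print(read_side_critical_section(page,
                                     work=lambda: "reader finished safely",
                                     run_preempting_task=lambda: None))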

DYNAMIC GPU-ENABLED VIRTUAL MACHINE PROVISIONING ACROSS CLOUD PROVIDERS
20230236902 · 2023-07-27

Systems and methods are provided for dynamic GPU-enabled VM provisioning across cloud service providers. An example method can include providing a VM pool that includes a GPU-optimized VM and a non-GPU-optimized VM operating in different clouds. A control plane can receive an indication that a user has submitted a machine-learning workload request, determine whether a GPU-optimized VM is available and instruct the non-GPU-optimized VM to send the workload to the GPU-optimized VM in a peer-to-peer manner. The GPU-optimized VM computes the workload and returns a result to the requesting VM. The control plane can instantiate a new GPU-optimized VM (or terminate it when the workload is complete) to dynamically maintain a desired number of available GPU-optimized VMs.
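
A simplified, synchronous sketch of the control-plane flow; classes and fields are invented stand-ins for any provider's actual APIs:

    class GpuVM:
        def __init__(self, name):
            self.name = name
            self.busy = False

        def compute(self, workload):
            # The non-GPU-optimized VM would send the workload here
            # peer-to-peer; this toy just returns a tagged result.
            return f"{self.name}:result({workload})"

    class ControlPlane:
        def __init__(self, desired_free=1):
            self.pool = [GpuVM("gpu-0")]
            self.desired_free = desired_free

        def free_vms(self):
            return [vm for vm in self.pool if not vm.busy]

        def handle_request(self, workload):
            if not self.free_vms():
                # Instantiate a new GPU-optimized VM on demand.
                self.pool.append(GpuVM(f"gpu-{len(self.pool)}"))
            vm = self.free_vms()[0]
            vm.busy = True
            result = vm.compute(workload)
            vm.busy = False
            # Terminate surplus idle GPU VMs down to the desired count.
            while len(self.free_vms()) > self.desired_free:
                self.pool.remove(self.free_vms()[-1])
            return result

    cp = ControlPlane()
    cp.pool[0].busy = True                 # pretend gpu-0 is mid-workload
    print(cp.handle_request("train-job"))  # scales out to gpu-1 and runs there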

Technology for moving data between virtual machines without copies

A processor comprises a core, a cache, and a ZCM manager in communication with the core and the cache. In response to an access request from a first software component, wherein the access request involves a memory address within a cache line, the ZCM manager is to (a) compare an OTAG associated with the memory address against a first ITAG for the first software component, (b) if the OTAG matches the first ITAG, complete the access request, and (c) if the OTAG does not match the first ITAG, abort the access request. Also, in response to a send request from the first software component, the ZCM manager is to change the OTAG associated with the memory address to match a second ITAG for a second software component. Other embodiments are described and claimed.
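
A toy software model of the OTAG/ITAG checks (real hardware enforces these in the cache and memory path, and the identifiers here are invented):

    class ZcmManager:
        def __init__(self):
            self.otag = {}                    # cache-line address -> owner tag

        def access(self, itag, addr):
            # (a)/(b)/(c): complete only if the requester's ITAG matches
            # the line's OTAG; otherwise abort.
            if self.otag.get(addr) != itag:
                raise PermissionError("ITAG/OTAG mismatch: access aborted")
            return f"access to line {addr:#x} completed"

        def send(self, sender_itag, addr, receiver_itag):
            # A send retags the line to the receiver: ownership moves,
            # the data is never copied.
            if self.otag.get(addr) != sender_itag:
                raise PermissionError("sender does not own this line")
            self.otag[addr] = receiver_itag

    zcm = ZcmManager()
    zcm.otag[0x1000] = 1          # software component 1 owns the line
    print(zcm.access(1, 0x1000))  # completes
    zcm.send(1, 0x1000, receiver_itag=2)
    print(zcm.access(2, 0x1000))  # component 2 can now access it; 1 cannot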

Trims for memory performance targets of applications
11714547 · 2023-08-01

A memory sub-system can receive a definition of a performance target for each of a number of applications that use the memory sub-system for storage. The memory sub-system can create a plurality of partitions according to the definitions and assign each of the partitions to a block group. The memory sub-system can operate each block group with a trim tailored to the performance target corresponding to that block group and application.
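
A hypothetical mapping from per-application performance targets to per-block-group trim settings; the preset names and values are invented for illustration:

    TRIM_PRESETS = {
        "low_latency":  {"cell_mode": "SLC", "program_time_us": 50},
        "balanced":     {"cell_mode": "TLC", "program_time_us": 300},
        "high_density": {"cell_mode": "QLC", "program_time_us": 800},
    }

    def build_block_groups(app_targets):
        groups = {}
        for app, target in app_targets.items():
            # One partition per definition, assigned to a block group that is
            # operated with a trim tailored to that application's target.
            groups[app] = {"partition": f"part-{app}",
                           "trim": TRIM_PRESETS[target]}
        return groups

    print(build_block_groups({"dashcam": "high_density",
                              "nav": "low_latency"}))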

Inter-server memory pooling

A memory allocation device for deployment within a host server computer includes control circuitry, a first interface to a local processing unit and local operating memory disposed within the host computer, and a second interface to a remote computer. The control circuitry allocates a first portion of the local memory to a first process executed by the local processing unit and transmits, to the remote computer via the second interface, a request to allocate a first portion of a remote memory disposed within the remote computer to a second process executed by the local processing unit. The control circuitry further receives, via the first interface, instructions to store data at a memory address within the first portion of the remote memory and transmits those instructions to the remote computer via the second interface.
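
A sketch of the two interfaces under stated assumptions: the remote transport is faked with an in-process object, addresses are toy offsets, and all names are invented:

    class RemoteServer:
        # Stands in for the remote computer reached via the second interface.
        def __init__(self):
            self.memory, self.next_base = {}, 0

        def allocate(self, size):
            base = self.next_base
            self.next_base += size
            return base

        def store(self, addr, data):
            self.memory[addr] = bytes(data)

    class MemoryAllocationDevice:
        def __init__(self, local_size, remote):
            self.local = bytearray(local_size)   # reached via first interface
            self.remote = remote                 # reached via second interface
            self.next_local = 0
            self.placement = {}                  # process id -> "local"/"remote"

        def alloc_local(self, pid, size):
            base = self.next_local
            self.next_local += size
            self.placement[pid] = "local"
            return base

        def alloc_remote(self, pid, size):
            # Forward the allocation request to the remote computer.
            self.placement[pid] = "remote"
            return self.remote.allocate(size)

        def store(self, pid, addr, data):
            if self.placement[pid] == "local":
                self.local[addr:addr + len(data)] = data
            else:
                self.remote.store(addr, data)    # relay over second interface

    remote = RemoteServer()
    dev = MemoryAllocationDevice(4096, remote)
    dev.alloc_local("proc-1", 1024)
    base = dev.alloc_remote("proc-2", 1024)
    dev.store("proc-2", base, b"hello")
    print(remote.memory[base])                   # b'hello' landed remotely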