G06F2212/2542

Object memory data flow triggers
11579774 · 2023-02-14

Embodiments of the invention provide systems and methods for managing processing, memory, storage, network, and cloud computing to significantly improve the efficiency and performance of processing nodes. More specifically, embodiments of the present invention are directed to an instruction set of an object memory fabric. This instruction set can include trigger instructions defined in the metadata of a particular memory object. Each trigger instruction can comprise a single instruction and action that, on a reference to a specific object, initiates or performs defined actions such as pre-fetching other objects or executing a trigger program.
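
As a rough illustration of the described trigger mechanism, here is a minimal Python sketch in which referencing an object fires prefetch triggers stored in its metadata. The `MemoryObject`/`ObjectStore` names and the trigger encoding are invented for illustration, not taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryObject:
    oid: str
    data: bytes = b""
    triggers: list = field(default_factory=list)  # trigger instructions kept in metadata

class ObjectStore:
    def __init__(self):
        self.objects: dict[str, MemoryObject] = {}
        self.prefetched: set[str] = set()

    def reference(self, oid: str) -> MemoryObject:
        obj = self.objects[oid]
        for action, target in obj.triggers:       # each trigger: one instruction + action
            if action == "prefetch":
                self.prefetched.add(target)       # pre-fetch a related object
        return obj

store = ObjectStore()
store.objects["a"] = MemoryObject("a", triggers=[("prefetch", "b")])
store.objects["b"] = MemoryObject("b")
store.reference("a")
assert "b" in store.prefetched                    # referencing "a" prefetched "b"
```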

Acceleration management node, acceleration node, client, and method
11579907 · 2023-02-14

Embodiments of the present application provide an acceleration management node. The acceleration management node separately receives acceleration device information from all acceleration devices. The acceleration device information includes an algorithm type, an acceleration bandwidth, or a non-uniform memory access (NUMA) architecture attribute. The acceleration management node obtains an invocation request from a client and queries the acceleration device information to determine, from all the acceleration devices of at least one acceleration node, a target acceleration device matching the invocation request. The acceleration management node then instructs the target acceleration node to respond to the invocation request.
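
A minimal sketch of the matching step, assuming a simple device record with algorithm type, bandwidth, and NUMA node; the field names and the tie-breaking rule (prefer the requested NUMA node, then the highest bandwidth) are assumptions, not the patent's data model.

```python
from dataclasses import dataclass

@dataclass
class AccelDevice:
    node: str           # acceleration node hosting the device
    algorithm: str      # e.g. "crypto", "compress"
    bandwidth: float    # available acceleration bandwidth
    numa_node: int

def match_device(devices, request):
    """Pick a device whose algorithm matches and whose bandwidth suffices,
    preferring the requested NUMA node, then the highest bandwidth."""
    candidates = [d for d in devices
                  if d.algorithm == request["algorithm"]
                  and d.bandwidth >= request["bandwidth"]]
    if not candidates:
        return None
    return min(candidates,
               key=lambda d: (d.numa_node != request.get("numa_node"), -d.bandwidth))

devices = [AccelDevice("n1", "crypto", 10.0, 0), AccelDevice("n2", "crypto", 40.0, 1)]
best = match_device(devices, {"algorithm": "crypto", "bandwidth": 20.0, "numa_node": 1})
print(best.node)   # n2
```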

Identification of a computing device accessing a shared memory
20180004666 · 2018-01-04

A method identifies which computing device is accessing a memory in a system of two or more computing devices that can communicate with each other, where each computing device has a cache and is connected to a corresponding memory. The method includes monitoring memory accesses to any of the memories, monitoring cache coherency commands between the computing devices, and identifying the computing device accessing one of the memories by using information related to the memory accesses and the cache coherency commands.
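
The identification step might be modeled as correlating memory-access events with snooped coherency commands on the same address; the event schema and the time-window heuristic below are hypothetical.

```python
def identify_accessor(mem_events, coherency_events, window=5):
    """For each memory access (time, addr), find a coherency command to the same
    address within `window` time units and attribute the access to its source device."""
    attributed = []
    for t, addr in mem_events:
        for ct, caddr, device in coherency_events:
            if caddr == addr and abs(ct - t) <= window:
                attributed.append((addr, device))
                break
    return attributed

mem = [(100, 0x1000)]
coh = [(98, 0x1000, "cpu0"), (250, 0x2000, "cpu1")]
print(identify_accessor(mem, coh))   # [(4096, 'cpu0')]
```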

Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format

Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, including a ray tracing operation and a matrix multiply operation, and a second processing cluster coupled to the first processing cluster. The first processing cluster includes a floating-point unit configured to process an instruction using the bfloat16 (BF16) format, in which a multiplier multiplies the second and third source operands while an accumulator adds the first source operand to the output of the multiplier.
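
A worked model of the BF16 multiply-accumulate: bfloat16 keeps the top 16 bits of a float32, so it can be emulated by masking the low bits. Real hardware rounds to nearest-even rather than truncating as this sketch does.

```python
import struct

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def bf16_fma(acc: float, b: float, c: float) -> float:
    # multiplier takes the second and third source operands;
    # the accumulator adds the first source operand to the product
    return acc + to_bf16(b) * to_bf16(c)

print(bf16_fma(1.0, 1.5, 2.0))   # 4.0 (1.5 and 2.0 are exactly representable in BF16)
```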

Lock-free work-stealing thread scheduler

Systems and methods are provided for lock-free thread scheduling. Threads may be placed in a ring buffer shared by all computer processing units (CPUs), e.g., in a node. A thread assigned to a CPU may be placed in that CPU's local run queue. When a CPU's local run queue empties, the CPU checks the shared ring buffer to determine whether any threads are waiting to run on it; if so, the CPU pulls a batch of threads related to that ready-to-run thread and executes them. If not, the idle CPU randomly selects another CPU and attempts to dequeue a thread batch associated with the selected CPU from the shared ring buffer. Polling may be handled through a shared poller array that dynamically distributes polling across multiple CPUs.
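
A policy-level sketch of this scheduling flow, with Python deques standing in for the lock-free shared ring buffer and per-CPU run queues; real lock-freedom needs atomic operations that plain Python does not model.

```python
import random
from collections import deque

class Scheduler:
    def __init__(self, n_cpus, batch=4):
        self.shared = deque()                          # shared "ring buffer" of (cpu, thread)
        self.local = [deque() for _ in range(n_cpus)]  # per-CPU run queues
        self.batch = batch

    def submit(self, thread, cpu):
        self.shared.append((cpu, thread))

    def next_thread(self, cpu):
        if not self.local[cpu]:
            self._pull(owner=cpu, into=cpu)            # refill from the shared buffer
        if not self.local[cpu]:                        # still idle: steal from a random victim
            victim = random.randrange(len(self.local))
            self._pull(owner=victim, into=cpu)
        return self.local[cpu].popleft() if self.local[cpu] else None

    def _pull(self, owner, into):
        """Dequeue up to one batch of threads assigned to `owner` into `into`'s queue."""
        leftover = deque()
        while self.shared and len(self.local[into]) < self.batch:
            cpu, thread = self.shared.popleft()
            if cpu == owner:
                self.local[into].append(thread)
            else:
                leftover.append((cpu, thread))         # not ours: goes back to the buffer
        self.shared.extend(leftover)

sched = Scheduler(n_cpus=2)
for i in range(6):
    sched.submit(f"t{i}", cpu=i % 2)
print(sched.next_thread(0))   # 't0' (pulled along with the rest of CPU 0's batch)
```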

Multi-uplink device enumeration and management

A device includes a plurality of ports and a plurality of capability registers, each corresponding to a respective one of the plurality of ports. The device connects to one or more processors of a host device through the plurality of ports, and each port comprises a respective protocol stack to support a link between that port and the host device according to a particular interconnect protocol. Each capability register comprises a respective set of fields for use in configuring the link between its corresponding port and one of the one or more processors of the host device. The fields include a field to indicate an association between the port and a particular processor, a field to indicate a port identifier for the port, and a field to indicate the total number of ports of the device.
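
One plausible encoding of such a capability register as a packed bitfield; the bit widths and field order here are invented for illustration.

```python
def pack_port_cap(proc_assoc: int, port_id: int, total_ports: int) -> int:
    """Pack the three described fields into one register value
    (4-bit processor association, 8-bit port id, 8-bit port count)."""
    assert proc_assoc < 16 and port_id < 256 and total_ports < 256
    return (proc_assoc << 16) | (port_id << 8) | total_ports

def unpack_port_cap(reg: int):
    return (reg >> 16) & 0xF, (reg >> 8) & 0xFF, reg & 0xFF

reg = pack_port_cap(proc_assoc=1, port_id=2, total_ports=4)
print(hex(reg), unpack_port_cap(reg))   # 0x10204 (1, 2, 4)
```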

Techniques for increasing the isolation of workloads within a multiprocessor instance
11693708 · 2023-07-04

In various embodiments, an isolation application determines processor assignment(s) based on a performance cost estimate. The performance cost estimate is associated with an estimated level of cache interference arising from executing a set of workloads on a set of processors. Subsequently, the isolation application configures at least one processor in the set to execute at least a portion of a first workload from the set of workloads based on the processor assignment(s). Advantageously, because the isolation application generates the processor assignment(s) based on the performance cost estimate, it can reduce interference in a non-uniform memory access (NUMA) multiprocessor instance.
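
A greedy sketch of cost-driven assignment, where a pairwise `interference` table stands in for the patent's cache-interference performance cost estimate; the greedy strategy is one simple choice, not the patent's algorithm.

```python
from itertools import combinations

def total_cost(assignment, interference):
    """Sum interference between workloads co-located on the same processor."""
    by_proc = {}
    for workload, proc in assignment.items():
        by_proc.setdefault(proc, []).append(workload)
    return sum(interference[(a, b)]
               for procs in by_proc.values()
               for a, b in combinations(procs, 2))

def assign(workloads, n_procs, interference):
    """Place each workload on the processor that adds the least estimated cost."""
    assignment = {}
    for w in workloads:
        best = min(range(n_procs),
                   key=lambda p: total_cost({**assignment, w: p}, interference))
        assignment[w] = best
    return assignment

interference = {("w0", "w1"): 5.0, ("w0", "w2"): 0.1, ("w1", "w2"): 0.1}
print(assign(["w0", "w1", "w2"], 2, interference))   # keeps w0 and w1 apart
```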

Task scheduling for machine-learning workloads
11544113 · 2023-01-03

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, are described for scheduling tasks of ML workloads. A system receives requests to perform the workloads and determines, based on the requests, the resource requirements to perform them. The system includes multiple hosts, each with multiple accelerators. The system determines the quantity of hosts assigned to execute tasks of a workload based on the resource requirements and the accelerators of each host. For each assigned host, the system generates a task specification based on the memory access topology of that host; the specification specifies the tasks to be executed at the host using the host's resources, including its accelerators. The system provides the task specifications to the hosts, and the workloads are performed as each host executes its assigned tasks.
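
A sketch of the sizing and per-host task-specification steps under simple assumptions; the spec fields and the accelerator-to-NUMA-node topology mapping are placeholders, not the system's actual schema.

```python
import math

def hosts_needed(required_accels: int, accels_per_host: int) -> int:
    """Quantity of hosts implied by the resource requirement."""
    return math.ceil(required_accels / accels_per_host)

def task_spec(host: str, topology: dict) -> dict:
    # bind each task to an accelerator and the NUMA node it sits on,
    # per the host's memory access topology
    return {"host": host,
            "tasks": [{"accelerator": acc, "numa_node": node}
                      for acc, node in topology.items()]}

n = hosts_needed(required_accels=10, accels_per_host=4)            # -> 3 hosts
specs = [task_spec(f"host{i}", {"acc0": 0, "acc1": 1}) for i in range(n)]
print(n, specs[0])
```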

Method and system for constructing persistent memory index in non-uniform memory access architecture
20220413952 · 2022-12-29

A method for constructing a persistent memory index in a non-uniform memory access architecture includes: maintaining partial persistent views in persistent memory and a global volatile view in DRAM; having an underlying persistent memory index process a request in a foreground thread when cold data is accessed; when hot data is accessed, reading the key-value pair for that hot data from the global volatile view in response to a query operation carried in the request, and updating a local partial persistent view and the global volatile view in response to an insert/update/delete operation carried in the request; and, in response to a hotspot migration, having a background thread generate new partial persistent views and a new global volatile view and recycle the partial persistent views and the global volatile view for the old hot data into the underlying persistent memory index.
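
A simplified read/update path over the two structures, assuming a fixed hot-key set; persistence, the per-NUMA-node partial views, and hotspot migration are elided.

```python
class HybridIndex:
    def __init__(self, hot_keys):
        self.pm_index = {}           # stands in for the underlying persistent memory index
        self.volatile_view = {}      # DRAM "global volatile view" of hot key-value pairs
        self.hot_keys = set(hot_keys)

    def get(self, key):
        if key in self.hot_keys:               # hot data: served from the DRAM view
            return self.volatile_view.get(key)
        return self.pm_index.get(key)          # cold data: underlying persistent index

    def put(self, key, value):
        if key in self.hot_keys:
            # hot updates go to the volatile view (plus an append to a local
            # partial persistent view, elided here)
            self.volatile_view[key] = value
        else:
            self.pm_index[key] = value         # cold updates hit the persistent index

idx = HybridIndex(hot_keys={"k1"})
idx.put("k1", "v1")
idx.put("k9", "v9")
print(idx.get("k1"), idx.get("k9"))   # v1 v9
```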

High bandwidth memory system with crossbar switch for dynamically programmable distribution scheme

A system comprises a processor coupled to a plurality of memory units. Each of the plurality of memory units includes a request processing unit and a plurality of memory banks. Each request processing unit includes a plurality of decomposition units and a crossbar switch, the crossbar switch communicatively connecting each of the plurality of decomposition units to each of the plurality of memory banks. The processor includes a plurality of processing elements and a communication network communicatively connecting the plurality of processing elements to the plurality of memory units. At least a first processing element of the plurality of processing elements includes a control logic unit and a matrix compute engine. The control logic unit is configured to access the plurality of memory units using a dynamically programmable distribution scheme.
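
The dynamically programmable distribution scheme might be pictured as a runtime-selectable address-to-bank mapping; the two schemes and parameter names below are illustrative only.

```python
class RequestRouter:
    def __init__(self, n_banks, scheme="interleave", block=64):
        self.n_banks, self.scheme, self.block = n_banks, scheme, block

    def bank_for(self, addr: int) -> int:
        if self.scheme == "interleave":        # fine-grained block interleaving
            return (addr // self.block) % self.n_banks
        if self.scheme == "contiguous":        # carve the address space into ranges
            return (addr >> 20) & (self.n_banks - 1)   # n_banks must be a power of two
        raise ValueError(self.scheme)

router = RequestRouter(n_banks=8)
print([router.bank_for(a) for a in (0, 64, 128, 512)])   # [0, 1, 2, 0]
```

Reprogramming the scheme at runtime (e.g., switching from `"interleave"` to `"contiguous"`) changes how requests spread across banks without touching the compute side, which is the flexibility the crossbar arrangement is meant to enable.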