G06F2212/2542

INTEGRATED HETEROGENEOUS PROCESSING UNITS
20170315847 · 2017-11-02

According to an example, an instruction may be received to run a kernel of an application on an apparatus having a first processing unit integrated with a second processing unit. In addition, an application profile for the application may be created at a runtime of the application kernel on the second processing unit, in which the application profile identifies an affinity of the application kernel to be run on either the first processing unit or the second processing unit and identifies a characterization of an input data set of the application. The application profile may also be stored in a data store.
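
As a rough illustration of the profiling flow, the following Python sketch records a profile after a kernel run. The `ApplicationProfile` schema, the size-based affinity rule, and the in-memory `profile_store` are hypothetical stand-ins, not the patent's actual structures.

```python
import time
from dataclasses import dataclass

@dataclass
class ApplicationProfile:
    app_name: str
    preferred_unit: str    # affinity: "first" (e.g., CPU) or "second" (e.g., GPU)
    input_size_bytes: int  # characterization of the input data set
    runtime_seconds: float

# Stand-in for the data store named in the abstract.
profile_store: dict[str, ApplicationProfile] = {}

def run_kernel_and_profile(app_name, kernel, input_data: bytes) -> ApplicationProfile:
    """Run the kernel on the second processing unit and record a profile."""
    start = time.perf_counter()
    kernel(input_data)  # kernel executes on the second processing unit
    elapsed = time.perf_counter() - start
    # Hypothetical affinity rule: inputs over 1 MiB favor the second unit.
    preferred = "second" if len(input_data) > 1 << 20 else "first"
    profile = ApplicationProfile(app_name, preferred, len(input_data), elapsed)
    profile_store[app_name] = profile  # persist the profile in the data store
    return profile
```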

LOCALITY-AWARE SCHEDULING FOR NIC TEAMING
20170315840 · 2017-11-02

Some embodiments provide a method for distributing packets processed at multiple sockets across a team of network interface controllers (NICs) in a processing system. The method of some embodiments uses existing distribution (or selection) algorithms for distributing traffic across the NICs of a NIC team (spanning several sockets), but augments them to prioritize local NICs over remote NICs. When active NICs local to the socket associated with a packet are available, the method of some embodiments uses the selection algorithm to select from an array of those active local NICs. When no active NICs local to the socket are available, the method instead selects from an array of the remaining active NICs on the NIC team.
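
A minimal Python sketch of the locality-aware selection, assuming a `Nic` record with a socket field and a flow hash as the existing distribution algorithm; both are illustrative, not the patent's data structures.

```python
from dataclasses import dataclass

@dataclass
class Nic:
    name: str
    socket: int   # socket the NIC is local to
    active: bool

def select_nic(team: list[Nic], packet_socket: int, flow_hash: int) -> Nic:
    """Prefer active NICs local to the packet's socket; otherwise fall back."""
    local = [n for n in team if n.active and n.socket == packet_socket]
    # Fall back to all remaining active NICs when no local NIC is available.
    candidates = local or [n for n in team if n.active]
    # The "existing selection algorithm" is modeled as hash modulo array size.
    return candidates[flow_hash % len(candidates)]
```

The same selection algorithm runs unchanged in both cases; only the candidate array differs, which is the augmentation the abstract describes.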

Scheduling method for virtual processors based on the affinity of NUMA high-performance network buffer resources

A scheduling method for virtual processors based on the affinity of NUMA high-performance network buffer resources, including: in a NUMA architecture, when the network interface card (NIC) of a virtual machine is started, obtaining the distribution of the NIC's buffer across the NUMA nodes; determining each NUMA node's affinity for the NIC buffer on the basis of the affinity relationships among the NUMA nodes; determining a target NUMA node by combining the buffer distribution across the nodes with each node's affinity for the buffer; and scheduling the virtual processor onto a CPU of the target NUMA node. This solves the problem, in NUMA architectures, of suboptimal affinity between the virtual machine's VCPU and the NIC buffer, thereby increasing the speed at which the VCPU processes network packets.
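
One way to picture the target-node computation is the Python sketch below; the buffer distribution and the inter-node affinity matrix are assumed inputs that a hypervisor would supply.

```python
def pick_target_node(buffer_pages: dict[int, int],
                     affinity: dict[tuple[int, int], float],
                     nodes: list[int]) -> int:
    """Score each candidate node by its affinity-weighted reach to the NIC buffer."""
    def score(node: int) -> float:
        # buffer_pages maps holder node -> pages of NIC buffer resident there;
        # affinity[(a, b)] is node a's affinity for memory on node b.
        return sum(pages * affinity[(node, holder)]
                   for holder, pages in buffer_pages.items())
    return max(nodes, key=score)

# The VCPU is then scheduled onto a CPU of the returned node.
```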

Memory distribution across multiple non-uniform memory access nodes
09785581 · 2017-10-10

A system, method, and apparatus for determining memory distribution across multiple non-uniform memory access processing nodes are disclosed. An apparatus includes processing nodes, each including processing units and main memory serving as local memory. A bus connects the processing units of each processing node to the main memory of the other processing nodes, which serves as shared memory. Access to local memory has lower memory access latency than access to shared memory. The processing nodes execute threads distributed across the processing nodes and detect the memory accesses made from each processing node for each thread. The processing nodes determine locality values for each thread that represent the fraction of memory accesses made from each processing node, and determine processing time values for the threads over a sampling period. The processing nodes then compute weighted locality values for the threads and determine a memory distribution across the processing nodes based on the weighted locality values.
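
The weighted-locality computation might look like the following Python sketch, where per-thread access counts and CPU times over one sampling period are assumed inputs.

```python
def memory_distribution(access_counts: dict[str, dict[int, int]],
                        cpu_time: dict[str, float],
                        nodes: list[int]) -> dict[int, float]:
    """Return the fraction of memory to place on each node."""
    weighted = {n: 0.0 for n in nodes}
    for thread, counts in access_counts.items():
        total = sum(counts.values()) or 1
        for node in nodes:
            locality = counts.get(node, 0) / total  # fraction of accesses from node
            weighted[node] += locality * cpu_time[thread]  # weight by processing time
    norm = sum(weighted.values()) or 1.0
    return {n: w / norm for n, w in weighted.items()}
```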

TECHNOLOGIES FOR QUALITY OF SERVICE BASED THROTTLING IN FABRIC ARCHITECTURES

Technologies for quality of service based throttling in a fabric architecture include a network node of a plurality of network nodes interconnected across the fabric architecture via an interconnect fabric. The network node includes a host fabric interface (HFI) configured to facilitate the transmission of data to/from the network node, monitor quality of service levels of resources of the network node used to process and transmit the data, and detect a throttling condition based on a result of the monitored quality of service levels. The HFI is further configured to generate and transmit a throttling message to one or more of the interconnected network nodes in response to having detected a throttling condition. The HFI is additionally configured to receive a throttling message from another of the network nodes and perform a throttling action on one or more of the resources based on the received throttling message. Other embodiments are described herein.
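
A compact Python sketch of the detect/notify/react loop: the QoS metrics, thresholds, and message fields are illustrative assumptions; in the abstract these duties belong to the host fabric interface (HFI).

```python
from dataclasses import dataclass

@dataclass
class ThrottleMessage:
    source_node: int
    resource: str
    severity: float  # how far past its threshold the resource is

def check_and_notify(node_id: int, qos_levels: dict[str, float],
                     thresholds: dict[str, float]) -> list[ThrottleMessage]:
    """Detect throttling conditions and build messages for peer nodes."""
    return [ThrottleMessage(node_id, res, level / thresholds[res])
            for res, level in qos_levels.items()
            if level > thresholds[res]]

def apply_throttle(msg: ThrottleMessage, rate_limits: dict[str, float]) -> None:
    """React to a received message by throttling the named resource."""
    rate_limits[msg.resource] = rate_limits.get(msg.resource, 1.0) / msg.severity
```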

MEMORY DISPOSITION DEVICE, MEMORY DISPOSITION METHOD, AND RECORDING MEDIUM STORING MEMORY DISPOSITION PROGRAM
20220050779 · 2022-02-17

A memory disposition device for a computer system in which a plurality of nodes exists, each node including a pair of a processor and a memory, includes: at least one memory configured to store instructions; and at least one processor configured to execute the instructions to: determine the node in which a memory area to be mapped is disposed; and duplicate the memory area and dispose it, based on the determination result, in the local memory of the node in which a process operates. The at least one processor is further configured to invalidate maintenance of cache coherency between the nodes and to invalidate the process's access to the remote memory.
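
The duplicate-and-localize step could be sketched in Python as below; the page bookkeeping is a stand-in for OS memory-policy machinery, and the coherency behavior is only noted in comments.

```python
def localize_mapping(memory_area: bytes, home_node: int, process_node: int,
                     node_memory: dict[int, list[bytes]]) -> bytes:
    """Duplicate a remote memory area into the local memory of the process's node."""
    if home_node == process_node:
        return memory_area  # already local: map it directly
    replica = bytes(memory_area)  # duplicate into the local node's memory
    node_memory[process_node].append(replica)
    # Per the abstract, cache coherency between the copies is deliberately not
    # maintained, and the process's access to the remote original is invalidated.
    return replica
```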

RING PROTOCOL FOR LOW LATENCY INTERCONNECT SWITCH

Methods, systems, and apparatus for implementing low latency interconnect switches between CPUs, and associated protocols. CPUs are configured to be installed on a main board including multiple CPU sockets linked in communication via CPU socket-to-socket interconnect links, forming a CPU socket-to-socket ring interconnect. The CPUs are also configured to transfer data between one another by sending data via the socket-to-socket interconnects. Data may be transferred using a packetized protocol, such as QPI, and the CPUs may also be configured to support coherent memory transactions across CPUs.
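
A small Python sketch of shortest-direction routing on such a socket ring; the hop-count logic is illustrative, and the packetized protocol itself (e.g., QPI) is not modeled.

```python
def ring_route(src: int, dst: int, num_sockets: int) -> list[int]:
    """Return the socket-to-socket path taking the shorter way around the ring."""
    cw = (dst - src) % num_sockets              # clockwise hop count
    step = 1 if cw <= num_sockets - cw else -1  # choose the shorter direction
    path, cur = [src], src
    while cur != dst:
        cur = (cur + step) % num_sockets
        path.append(cur)
    return path

# e.g., ring_route(0, 3, 4) -> [0, 3]: one counter-clockwise hop on 4 sockets.
```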

NON-UNIFORM MEMORY ACCESS SUPPORT IN A VIRTUAL ENVIRONMENT
20170235679 · 2017-08-17

Methods, systems, and computer program products for configuring devices in a virtual environment are described. An example method includes determining a NUMA node assigned to a virtual machine. A guest of the virtual machine probes a root bus to detect a first device coupled to the root bus. The first device is assigned, based on the determined NUMA node, a first address range of the virtual machine. The guest is then notified of an expander coupled to the root bus, and the expander is probed to detect an additional root bus. The guest probes the additional root bus to detect a second device, which is assigned, based on the determined NUMA node, a second address range.
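
The probe-and-assign flow might be pictured as the Python sketch below, where `Bus`, its device list, the expander link, and the 256 MiB windows are all hypothetical stand-ins for the guest's PCI probing.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Bus:
    devices: list                  # device names detected by probing this root bus
    expander: "Bus | None" = None  # an expander exposing an additional root bus

def assign_addresses(root: Bus, numa_node: int, windows: count) -> dict:
    """Probe root buses, following expanders, and assign node-local ranges."""
    assignments = {}
    bus = root
    while bus is not None:
        for dev in bus.devices:
            base = next(windows) << 28            # hypothetical 256 MiB windows
            assignments[dev] = (numa_node, base)  # range tied to the NUMA node
        bus = bus.expander  # the expander reveals the next root bus to probe
    return assignments

# e.g., assign_addresses(Bus(["nic0"], Bus(["gpu0"])), numa_node=1, windows=count(1))
```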

Scale-out non-uniform memory access

A computing system that uses a Scale-Out NUMA ("soNUMA") architecture, programming model, and/or communication protocol provides for low-latency, distributed in-memory processing. Using soNUMA, a programming model is layered directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, the OS, and the fabric, soNUMA uses a remote memory controller: an architecturally exposed hardware block integrated into the node's local coherence hierarchy.
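
In the spirit of the remote memory controller (RMC), here is a minimal Python model of posting a remote read through a work queue and polling a completion queue; the queue layout and completion format are assumptions, since the real RMC is a hardware block.

```python
from collections import deque

work_queue = deque()        # application -> RMC requests
completion_queue = deque()  # RMC -> application completions

def remote_read(node: int, offset: int, length: int) -> None:
    """Post a read request for the RMC to service against a remote node."""
    work_queue.append(("READ", node, offset, length))

def rmc_poll(remote_memory: dict) -> None:
    """Model the RMC draining the work queue and posting completions."""
    while work_queue:
        op, node, offset, length = work_queue.popleft()
        data = bytes(remote_memory[node][offset:offset + length])
        completion_queue.append((op, node, data))
```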

Reducing cache interference based on forecasted processor use
11429525 · 2022-08-30

In various embodiments, a predictive assignment application computes a forecasted amount of processor use for each workload included in a set of workloads using a trained machine-learning model. Based on the forecasted amounts of processor use, the predictive assignment application computes a performance cost estimate associated with an estimated level of cache interference arising from executing the set of workloads on a set of processors. Subsequently, the predictive assignment application determines processor assignment(s) based on the performance cost estimate. At least one processor included in the set of processors is subsequently configured to execute at least a portion of a first workload that is included in the set of workloads based on the processor assignment(s). Advantageously, because the predictive assignment application generates the processor assignment(s) based on the forecasted amounts of processor use, it can reduce cache interference in a non-uniform memory access (NUMA) microprocessor instance.
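
As one way to picture the cost-driven placement, the sketch below reduces the trained machine-learning model to a `forecast` callable and approximates cache-interference cost as the forecast load already placed on a processor; both reductions are illustrative assumptions.

```python
from typing import Callable

def assign(workloads: list, processors: list,
           forecast: Callable[[str], float]) -> dict:
    """Greedily place each workload on the processor with the least forecast load."""
    load = {p: 0.0 for p in processors}  # forecast processor use already placed
    assignment = {}
    # Place the heaviest forecast workloads first so they spread out.
    for w in sorted(workloads, key=forecast, reverse=True):
        p = min(processors, key=lambda q: load[q])  # lowest interference estimate
        assignment[w] = p
        load[p] += forecast(w)
    return assignment
```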