G06F8/458

Low-overhead detection techniques for synchronization problems in parallel and concurrent software

The techniques described herein detect, categorize, and diagnose synchronization issues, and may provide improved performance and issue resolution. For example, in an embodiment, a method may comprise detecting occurrence of synchronization performance problems in software code, wherein at least some detected synchronization performance problems occur when a contention rate for software locks is low, determining a cause of the synchronization performance problems, and modifying the software code to remedy the cause so as to improve synchronization performance of the software code.

Deep Neural Networks Compiler for a Trace-Based Accelerator

A method of compiling neural network code to executable instructions for execution by a computational acceleration system having a memory circuit and one or more acceleration circuits having a maps data buffer and a kernel data buffer is disclosed, such as for execution by an inference engine circuit architecture which includes a matrix-matrix (MM) accelerator circuit having multiple operating modes to provide a complete matrix multiplication. A representative compiling method includes generating a list of neural network layer model objects; fusing available functions and layers in the list; selecting a cooperative mode, an independent mode, or a combined cooperative and independent mode for execution; selecting a data movement mode and an ordering of computations which reduces usage of the memory circuit; generating an ordered sequence of load objects, compute objects, and store objects; and converting the ordered sequence of load objects, compute objects, and store objects into the executable instructions.
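
The steps listed in this abstract can be pictured with a toy pipeline. The following is only a minimal sketch under stated assumptions: the layer names, the conv-plus-relu fusion rule, the buffer names, and the textual instruction format are all hypothetical, and the mode selection and memory-usage optimization steps are omitted.

```python
def fuse(layers):
    """Merge each 'relu' into a directly preceding 'conv' (assumed fusion rule)."""
    fused = []
    for name in layers:
        if name == "relu" and fused and fused[-1] == "conv":
            fused[-1] = "conv+relu"
        else:
            fused.append(name)
    return fused


def schedule(fused):
    """Build an ordered sequence of load, compute, and store objects."""
    ops = []
    for i, name in enumerate(fused):
        ops.append(("load", f"maps{i}"))          # input feature maps
        if name.startswith("conv"):
            ops.append(("load", f"kernel{i}"))    # only conv layers have kernels here
        ops.append(("compute", name))
        ops.append(("store", f"out{i}"))
    return ops


def emit(ops):
    """Convert the ordered objects into (hypothetical) textual instructions."""
    return [f"{kind.upper()} {arg}" for kind, arg in ops]


program = emit(schedule(fuse(["conv", "relu", "pool"])))
```

Here the three-layer list collapses to two fused layers before any load, compute, or store object is generated, which is the point of fusing first.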

THREAD-LOCAL RETURN STRUCTURE FOR ASYNCHRONOUS STATE MACHINE
20220066759 · 2022-03-03 ·

Reuse of a thread-local return data structure to prevent a return data structure from being allocated every time asynchronous functions return. The system returns thread operation from the asynchronous function back to the caller function in a manner that the return data structure can be reused for future asynchronous function returns within that same thread. To do so, the system first accesses data that was generated by the asynchronous function in response to the caller function placing the function call to the asynchronous function. To determine if reuse is appropriate, the system determines that the caller function will use the return data structure as populated only once. If so, the system populates the reusable thread-local return data structure and returns that data structure to the caller.
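
The reuse decision described in this abstract might be sketched as follows. This is an illustration only, not the patented implementation: the `Result` class, the `make_result` helper, and the `single_use` flag (standing in for the determination that the caller uses the populated structure only once) are hypothetical names.

```python
import threading


class Result:
    """Hypothetical return structure carrying an asynchronous function's value."""
    __slots__ = ("value",)

    def __init__(self):
        self.value = None


_tls = threading.local()  # holds at most one reusable Result per thread


def make_result(value, single_use):
    """Return a populated Result, reusing the thread-local one when safe."""
    if not single_use:
        # The caller may keep the structure, so it gets its own allocation.
        fresh = Result()
        fresh.value = value
        return fresh
    cached = getattr(_tls, "result", None)
    if cached is None:
        cached = _tls.result = Result()  # allocated once per thread
    cached.value = value                 # repopulate in place
    return cached


first = make_result(1, single_use=True)
first_value = first.value                # the caller consumes the value immediately
second = make_result(2, single_use=True)
reused = first is second                 # same object, repopulated
kept = make_result(3, single_use=False)
```

Because the second call overwrites the structure returned by the first, reuse is only safe when the caller has already consumed the value, which is why the single-use check comes first.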

SYSTEM AND METHOD TO ACCELERATE REDUCE OPERATIONS IN GRAPHICS PROCESSOR

Embodiments described herein provide a system, method, and apparatus to accelerate reduce operations in a graphics processor. One embodiment provides an apparatus including one or more processors, the one or more processors including a first logic unit to perform a merged write, barrier, and read operation in response to a barrier synchronization request from a set of threads in a work group, synchronize the set of threads, and report a result of an operation specified in association with the barrier synchronization request.
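
A software analogy of the merged write, barrier, and read operation is sketched below. It assumes a sum reduction and uses CPU threads in place of GPU work-group threads; the `ReduceBarrier` class and its `sync_add` method are hypothetical names, and the hardware logic unit in the abstract is not modeled.

```python
import threading


class ReduceBarrier:
    """Single-use barrier that also sums the values contributed by each thread."""

    def __init__(self, parties):
        self._partial = 0
        self._lock = threading.Lock()
        self._barrier = threading.Barrier(parties)

    def sync_add(self, value):
        with self._lock:
            self._partial += value   # write: contribute to the reduction
        self._barrier.wait()         # barrier: wait for every thread's write
        return self._partial         # read: all threads see the final result


barrier = ReduceBarrier(parties=4)
results = []


def worker(v):
    results.append(barrier.sync_add(v))


threads = [threading.Thread(target=worker, args=(v,)) for v in (1, 2, 3, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread makes one call instead of a separate write, barrier, and read, which is the merging the abstract describes.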

Application interface on multiple processors
11106504 · 2021-08-31 ·

A method and an apparatus that execute a parallel computing program in a programming language for a parallel computing architecture are described. The parallel computing program is stored in memory in a system with parallel processors and allocates threads between a host processor and a GPU. The programming language includes an API that allows an application to make calls to allocate execution of the threads between the host processor and the GPU. The programming language includes host function data tokens for host functions performed in the host processor and kernel function data tokens for compute kernel functions performed in one or more compute processors, e.g., GPUs or CPUs, separate from the host processor.

DATA FLOW PROCESSING METHOD AND RELATED DEVICE
20210232394 · 2021-07-29 ·

The present disclosure relates to data flow processing methods and devices. One example method includes obtaining a dependency relationship and an execution sequence of operating a data flow by a plurality of processing units, generating synchronization logic based on the dependency relationship and the execution sequence, and inserting the synchronization logic into an operation pipeline of each of the plurality of processing units to generate executable code.
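
One way to picture the insertion step is the sketch below. It assumes that synchronization logic takes the form of `signal` and `wait` pseudo-operations placed around operations whose producer and consumer run on different processing units; the unit names, operation names, and the `insert_sync` helper are hypothetical.

```python
def insert_sync(pipelines, deps):
    """Insert wait/signal pseudo-ops for dependencies that cross units.

    pipelines: {unit: [op, ...]} in execution order.
    deps: [(producer_op, consumer_op), ...].
    """
    owner = {op: unit for unit, ops in pipelines.items() for op in ops}
    out = {}
    for unit, ops in pipelines.items():
        code = []
        for op in ops:
            # Wait before consuming data produced on another unit.
            for src, dst in deps:
                if dst == op and owner[src] != unit:
                    code.append(f"wait({src})")
            code.append(op)
            # Signal after producing data consumed on another unit.
            for src, dst in deps:
                if src == op and owner[dst] != unit:
                    code.append(f"signal({op})")
        out[unit] = code
    return out


pipelines = {"dma": ["load_a", "load_b"], "alu": ["add_ab"]}
deps = [("load_a", "add_ab"), ("load_b", "add_ab")]
code = insert_sync(pipelines, deps)
```

Dependencies within a single unit need no synchronization in this sketch, because that unit's own execution sequence already orders them.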

LOOP LOCK RESERVATION
20210303373 · 2021-09-30 ·

Embodiments relate to a system, program product, and method for implementing loop lock reservations and, more specifically, for holding a lock reservation across some or all of the iterations of a loop and, under certain conditions, temporarily causing a running thread to yield the reservation and allow other threads to enter the lock.
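
The idea of holding a lock across loop iterations and yielding it only under certain conditions might look like the following minimal sketch. The `reserved_loop` function and its `should_yield` callback are hypothetical; the callback stands in for whatever condition an embodiment uses to decide that other threads should be let in.

```python
import threading


def reserved_loop(lock, n, should_yield):
    """Run n iterations under one lock reservation, yielding only when asked."""
    acquisitions = 0
    total = 0
    lock.acquire()
    acquisitions += 1
    try:
        for i in range(n):
            total += i                       # critical-section work
            if should_yield(i) and i + 1 < n:
                lock.release()               # temporarily yield the reservation
                lock.acquire()               # then take it back and continue
                acquisitions += 1
    finally:
        lock.release()
    return total, acquisitions


lock = threading.Lock()
total, acquisitions = reserved_loop(lock, 5, lambda i: i == 2)
```

A loop that locked and unlocked on every iteration would acquire the lock five times here; the reserved version acquires it twice, once at entry and once after the single yield.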

Efficient profiling-based lock management in just-in-time compilers

Aspects of the present disclosure describe techniques for managing locks in just-in-time (JIT) compiled code in a software application. An example method generally includes profiling locks during execution of the JIT compiled code. Locks are generally profiled by identifying locks on resources accessed by the JIT compiled code, and recording access information for each of the identified locks. When a safepoint is reached during execution of the JIT compiled code, one or more locks eligible for conversion to a biased lock are identified based on the recorded access information for each of the identified locks. Each respective lock of the one or more eligible locks is converted to a biased lock based on a current lock status of the respective lock.
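
The profiling and safepoint steps can be sketched roughly as below. The eligibility rule (a lock acquired by exactly one thread) and the status check (skip locks that are currently held) are assumptions made for illustration, as are the names `record_access` and `at_safepoint`; the abstract does not specify either criterion.

```python
from collections import defaultdict

# Recorded access information: lock name -> ids of threads that acquired it.
access_log = defaultdict(set)


def record_access(lock_name, thread_id):
    """Profile one lock acquisition made by JIT compiled code."""
    access_log[lock_name].add(thread_id)


def at_safepoint(currently_held):
    """Return locks to bias: single-thread locks that are not held right now."""
    return sorted(
        name
        for name, threads in access_log.items()
        if len(threads) == 1 and name not in currently_held
    )


record_access("cache", 1)
record_access("cache", 1)
record_access("queue", 1)
record_access("queue", 2)
record_access("log", 3)
biased = at_safepoint(currently_held={"log"})
```

In this run only `cache` is converted: `queue` was acquired by two threads, and `log` is eligible by profile but is held at the safepoint.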

CODE COMPILATION FOR SCALING ACCELERATORS

A computer system comprises a work accelerator and a gateway enabling the transfer of data to the accelerator from external storage. The accelerator executes a first compiled code sequence to perform computations on data transferred to the accelerator from the gateway. The first compiled code sequence comprises a synchronisation instruction indicating a barrier between a compute phase in which compute instructions are executed and an exchange phase, wherein execution of the synchronisation instruction causes an indication of a pre-compiled data exchange synchronisation point to be transferred to the gateway. The gateway comprises a streaming engine storing a second compiled code sequence in the form of a set of data transfer instructions executable by the streaming engine to perform data transfer operations to stream data through the gateway in the exchange phase, wherein the first and second compiled code sequences are generated as a related set at compile time.