Patent classifications
G06F8/458
Reduced memory consumption of compiler-transformed asynchronous methods
An asynchronous method is implemented in a manner that reduces the amount of runtime overhead needed to execute the asynchronous method. The data elements needed to suspend an asynchronous method to await completion of an asynchronous operation, to resume the asynchronous method at a resumption point, and to provide a completion status of the caller of the asynchronous method are consolidated into one or two reusable objects. An asynchronous method may be associated with a distinct object pool of reusable objects. The size of a pool and the total size of all pools can be configured statically or dynamically based on runtime conditions.
Device profiling in GPU accelerators by using host-device coordination
System and method of compiling a program having a mixture of host code and device code to enable Profile Guided Optimization (PGO) for device code execution. An exemplary integrated compiler can compile source code programmed to be executed by a host processor (e.g., CPU) and a co-processor (e.g., a GPU) concurrently. The compilation can generate an instrumented executable code which includes: profile instrumentation counters for the device functions; and instructions for the host processor to allocate and initialize device memory for the counters and to retrieve collected profile information from the device memory to generate instrumentation output. The output is fed back to the compiler for compiling the source code a second time to generate optimized executable code for the device functions defined in the source code.
Architecture for virtual instructions
A system including a machine learning accelerator (MLA) hardware configured to perform machine-learning operations according to native instructions; an interpreter computing module configured to: generate, based on virtual instructions, machine language instructions configured to be processed by a processing hardware implementing the interpreter computing module; and cause the processing hardware to perform machine-learning operations according to the machine language instructions; and a compiler computing module associated with the MLA hardware, the compiler computing module configured to: receive instructions for performing an inference using a machine-learning model; based on the received instructions: generate the native instructions configured to be processed by the MLA hardware, the native instructions specifying first machine-learning operations associated with performing the inference; and generate the virtual instructions configured to be processed by the interpreter computing module, the virtual instructions specifying second machine-learning operations associated with performing the inference.
METHOD OF REORDERING CONDITION CHECKS
Described is a computer-implemented method of reordering condition checks. Two or more condition checks in computer code that may be reordered within the code are identified. It is determined that the execution frequency of a later one of the condition checks is satisfied at a greater frequency than a preceding one of the condition checks. It is determined that there is an absence of side effects in the two or more condition checks. The values of the condition checks are propagated and abstract interpretation is performed on the values that are propagated. It is determined that the condition checks are exclusive of each other, and the condition checks are reordered within the computer code.
LOW-OVERHEAD DETECTION TECHNIQUES FOR SYNCHRONIZATION PROBLEMS IN PARALLEL AND CONCURRENT SOFTWARE
The techniques described herein may provide techniques to detect, categorize, and diagnose synchronization issues that provide improved performance and issue resolution. For example, in an embodiment, a method may comprise detecting occurrence of synchronization performance problems in software code, when at least some detected synchronization performance problems occur when a contention rate for software locks is low, determining a cause of the synchronization performance problems, and modifying the software code to remedy the cause of the synchronization performance problems so as to improve synchronization performance of the software code.
Compiler-optimized context switching with compiler-inserted data table for in-use register identification at a preferred preemption point
Compiler-optimized context switching may include receiving an instruction indicating a preferred preemption point comprising an instruction address; storing the preferred preemption point in a data structure; determining, based on the data structure, that the preferred preemption point has been reached by a first thread; determining that preemption of the first thread for a second thread has been requested; and performing a context switch to the second thread.
METHOD AND SYSTEM TO ENABLE PRINT FUNCTIONALITY IN HIGH-LEVEL SYNTHESIS (HLS) DESIGN PLATFORMS
This disclosure generally relates to high-level synthesis (HLS) platforms, and, more particularly, enable print functionality in high-level synthesis (HLS) platforms. The recent availability FPGA-HLS is a great success due to availability of compilers for FPGAs as opposed to hardware description languages (HDLs) that requires special skills. However, the compilers within the HLS design platform includes limited support for all the standard libraries, wherein features like print functionality is not supported. The invention discloses techniques to enable print functionality in HLS design platforms based on source-to-source transformations and stream combining scheme. In addition to enabling print functionality, the invention also discloses a formatter technique to receive-format FPGA data into human interpretable data.
Loop lock reservation
Embodiments relate to a system, program product, and method for implementing loop lock reservations, and, more specifically, for holding a lock reservation across some or all of the iterations of a loop, and under certain conditions, temporarily effect a running thread to yield the reservation and allow other threads to enter the lock.
SYNCHRONIZATION MECHANISMS FOR A MULTI-CORE PROCESSOR
Systems, apparatuses and methods suitable for optimizing synchronization mechanisms for multi-core processors are provided. The synchronizing mechanisms may be optimized by receiving a command stream which comprises a plurality of commands including one or more wait commands, wherein each wait command has an associated state and one or more associated conditions; sequentially processing each command in the command stream until a wait command is reached; checking the state associated with the wait command to be processed, wherein if said state is a blocking state, further processing of commands in the command stream is paused until each of said wait command's associated conditions are met, and wherein if said state is a non-blocking state, the next command in the command stream is retrieved and processed.
REDUCING COMPILER TYPE CHECK COSTS THROUGH THREAD SPECULATION AND HARDWARE TRANSACTIONAL MEMORY
Systems, apparatuses and methods may provide for technology that generates a first compiler output based on input code that includes dynamically typed variable information and generates a second compiler output based on the input code, wherein the second compiler output includes type check code to verify one or more type inferences associated with the first compiler output. The technology may also execute the first compiler output and the second compiler output in parallel via different threads.