G06F8/445

MANAGEMENT OF THE UNTRANSLATED TO TRANSLATED CODE STEERING LOGIC IN A DYNAMIC BINARY TRANSLATION BASED PROCESSOR

A processor comprising an instruction execution circuit to execute a second code stored at a second address of a memory, wherein the second code is translated from a first code stored at a first address of the memory, and a translation table (TT) controller coupled to a translation table to store a TT entry comprising a mapping between the first address and the second address and an attribute field comprising an attribute value associated with execution of the second code, wherein the TT controller is to monitor execution of the second code by the instruction execution circuit and update, based on a performance metric of the execution, the attribute value of the TT entry.
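
As a rough software analogy for the translation table described in this abstract, the following sketch models a TT entry and an attribute update driven by a performance metric. The class names, the cycle-count metric, the `budget` parameter, and the +1/-1 scoring rule are all assumptions made for illustration; the abstract does not specify them, and the claimed mechanism is a hardware controller rather than Python code.

```python
from dataclasses import dataclass


@dataclass
class TTEntry:
    first_addr: int     # address of the original (untranslated) code
    second_addr: int    # address of the translated code
    attribute: int = 0  # hypothetical score derived from observed performance


class TranslationTable:
    def __init__(self):
        self.entries = {}

    def add(self, first_addr, second_addr):
        self.entries[first_addr] = TTEntry(first_addr, second_addr)

    def lookup(self, first_addr):
        """Steer from an untranslated address to its translation, if any."""
        entry = self.entries.get(first_addr)
        return entry.second_addr if entry else None

    def record_execution(self, first_addr, cycles, budget=100):
        """Update the attribute value from a performance metric (assumed: cycles)."""
        entry = self.entries[first_addr]
        entry.attribute += 1 if cycles <= budget else -1
        return entry.attribute


tt = TranslationTable()
tt.add(0x1000, 0x8000)
for observed_cycles in (80, 250, 60):
    tt.record_execution(0x1000, observed_cycles)
```

A consumer of such a table could, for example, stop steering to a translation whose attribute falls below some threshold; that policy is likewise an assumption, not something stated in the abstract.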

Technologies for indirectly calling vector functions

Technologies for indirectly calling vector functions include a compute device that includes a memory device to store source code and a compiler module. The compiler module is to identify a set of declarations of vector variants for scalar functions in the source code, generate a vector variant address map for each set of vector variants, generate an offset map for each scalar function, and identify, in the source code, an indirect call to the scalar functions, wherein the indirect call is to be vectorized. The compiler module is also to determine, based on a context of the indirect call, a vector variant to be called and store, in object code and in association with the indirect call, an offset into one of the vector variant address maps based on (i) the determined vector variant to be called and (ii) the offset map that corresponds to each scalar function.

Compiler for translating between a virtual image processor instruction set architecture (ISA) and target hardware having a two-dimensional shift array structure
10599407 · 2020-03-24

A method is described that includes translating higher level program code, including higher level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates from an orthogonal coordinate system, into lower level instructions that target a hardware architecture having an array of execution lanes and a shift register array structure that is able to shift data along two different axes. The translating includes replacing the higher level instructions having the instruction format with lower level shift instructions that shift data within the shift register array structure.
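
The replacement of coordinate-based pixel accesses with shift instructions can be pictured with a small simulation. This is only a sketch under stated assumptions: the mnemonics (`SHIFT_LEFT` and so on), the zero fill at the array edge, and the decomposition into unit shifts are hypothetical and are not taken from the patent.

```python
def shift(plane, dx, dy, fill=0):
    """Return a plane where lane (x, y) holds the value formerly at (x+dx, y+dy)."""
    h, w = len(plane), len(plane[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x + dx, y + dy
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = plane[sy][sx]
    return out


# Hypothetical unit-shift instructions and the (dx, dy) each one applies.
UNIT_SHIFTS = {
    "SHIFT_LEFT": (1, 0),
    "SHIFT_RIGHT": (-1, 0),
    "SHIFT_UP": (0, 1),
    "SHIFT_DOWN": (0, -1),
}


def lower_load(dx, dy):
    """Translate 'read the pixel at offset (dx, dy)' into unit shift instructions."""
    ops = ["SHIFT_LEFT" if dx > 0 else "SHIFT_RIGHT"] * abs(dx)
    ops += ["SHIFT_UP" if dy > 0 else "SHIFT_DOWN"] * abs(dy)
    return ops


def run(ops, plane):
    """Apply the lowered shift instructions to the simulated register array."""
    for op in ops:
        plane = shift(plane, *UNIT_SHIFTS[op])
    return plane


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ops = lower_load(1, 1)
shifted = run(ops, image)
```

After the shifts, every lane finds its neighbor's pixel in its own register, so no lane needs to address memory by coordinate.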

Preprocessing tensor operations for optimal compilation
10592213 · 2020-03-17

Techniques to preprocess tensor operations prior to code generation to optimize compilation are disclosed. A computer readable representation of a linear algebra or tensor operation is received. A code transformation software component performs transformations including output reduction and fraction removal. The result is a set of linear equations of a single variable with integer coefficients. Such a set lends itself to more efficient code generation during compilation by a code generation software component. Use cases disclosed include targeting a machine learning hardware accelerator, receiving code in the form of an intermediate language generated by a cross-compiler with multiple front ends supporting multiple programming languages, and cloud deployment and execution scenarios.
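
The abstract does not define "fraction removal". One plausible reading, sketched here purely as an assumption, is scaling a linear expression by the least common multiple of its coefficients' denominators so that only integer coefficients remain. The function name and the example coefficients are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def remove_fractions(coeffs):
    """Scale rational coefficients to integers; return (scale, integer coefficients)."""
    fracs = [Fraction(c) for c in coeffs]
    # Least common multiple of all denominators.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in fracs), 1)
    return scale, [int(f * scale) for f in fracs]


# Coefficients of (1/2)*i + (3/4)*j + 5, scaled to 2*i + 3*j + 20.
scale, ints = remove_fractions([Fraction(1, 2), Fraction(3, 4), 5])
```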

LATENCY SCHEDULING MECHANISM
20200065073 · 2020-02-27

An apparatus to facilitate instruction scheduling is disclosed. The apparatus includes one or more processors to receive a block of instructions, divide the block of instructions into a plurality of sub-blocks based on a register pressure bounded by a predetermined threshold, and schedule instructions in each of the plurality of sub-blocks for processing.
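
A minimal sketch of the block-splitting step, assuming a deliberately crude pressure estimate: the number of distinct registers referenced within a sub-block. A real scheduler would more likely track live ranges; the greedy cut rule, the function name, and the instruction encoding below are all illustrative assumptions.

```python
def split_by_pressure(block, threshold):
    """Greedily cut a block so each sub-block references at most `threshold` registers.

    block: list of (instruction_name, registers_used) pairs.
    """
    sub_blocks, current, regs = [], [], set()
    for name, used in block:
        needed = regs | set(used)
        if current and len(needed) > threshold:
            # Adding this instruction would exceed the bound: start a new sub-block.
            sub_blocks.append(current)
            current, needed = [], set(used)
        current.append(name)
        regs = needed
    if current:
        sub_blocks.append(current)
    return sub_blocks


block = [
    ("i0", ["r0", "r1"]),
    ("i1", ["r1", "r2"]),
    ("i2", ["r3", "r4"]),
    ("i3", ["r4"]),
]
parts = split_by_pressure(block, threshold=3)
```

Each resulting sub-block could then be handed to a latency-oriented scheduler on its own, since its register demand is already bounded.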

Method and device for processing an irregular application

A method and a device for processing an irregular application are disclosed. The method comprises: determining M classes of tasks of the irregular application; executing the M classes of tasks in parallel, wherein each task has a respective index; for the i-th task in the x-th class of tasks of the M classes of tasks: when the i-th task is executed to a rendezvous, stalling the i-th task, and determining a rule corresponding to the i-th task; inspecting the current state of the i-th task according to the rule corresponding to the i-th task so as to steer the continued execution of the i-th task. According to the embodiments of the present disclosure, irregular applications can be correctly and automatically executed with high performance in a manner of fine-grained pipeline parallelism.

Application interface on multiple processors
10534647 · 2020-01-14

A method and an apparatus that execute a parallel computing program in a programming language for a parallel computing architecture are described. The parallel computing program is stored in memory in a system with parallel processors. The parallel computing program is stored in a memory to allocate threads between a host processor and a GPU. The programming language includes an API to allow an application to make calls using the API to allocate execution of the threads between the host processor and the GPU. The programming language includes host function data tokens for host functions performed in the host processor and kernel function data tokens for compute kernel functions performed in one or more compute processors, e.g., GPUs or CPUs, separate from the host processor.

Vector processing system
10509653 · 2019-12-17

Vector processing systems and methods disclosed herein generate efficient vector initialization code that leverages performance advantages of single instruction, multiple data (SIMD) instructions and immediate operands. In some embodiments, a vector processing system scans existing code for initialization syntax that specifies values which match one or more target patterns. Where the vector processing system identifies one or more of these target patterns within the specified values, the vector processing system generates enhanced vector initialization code. This enhanced vector initialization code is configured to outperform vector initialization code that sequentially loads discrete values to discrete channels within a vector register.
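
The pattern-matching idea can be illustrated with a toy planner that inspects initializer values and picks a cheaper strategy when they match a target pattern. The two patterns shown (all lanes equal, constant stride) and the strategy names are assumptions chosen for illustration; the abstract does not list the patterns the disclosed system recognizes.

```python
def plan_vector_init(values):
    """Choose an initialization strategy for a vector register from its initializer."""
    if len(set(values)) == 1:
        # Every lane holds the same value: one broadcast of an immediate suffices.
        return ("BROADCAST_IMM", values[0])
    steps = {b - a for a, b in zip(values, values[1:])}
    if len(steps) == 1:
        # Constant stride: derive lanes from a lane-index vector, scale, and add.
        return ("IOTA_SCALE_ADD", values[0], steps.pop())
    # No pattern matched: fall back to loading each lane separately.
    return ("LOAD_PER_LANE", tuple(values))


plans = [plan_vector_init(v) for v in ([7, 7, 7, 7], [0, 2, 4, 6], [3, 1, 4, 1])]
```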

Synchronization instruction insertion method and apparatus
11934832 · 2024-03-19

This application discloses example synchronization instruction insertion methods and example apparatuses. One example method includes obtaining a first program block comprising one or more statements, where each of the one or more statements includes one or more function instructions. A first function instruction and a second function instruction between which data dependency exists in the first program block can then be determined. A synchronization instruction pair between a first statement including the first function instruction and a second statement including the second function instruction can then be inserted.
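
The insertion step might look like the following sketch, which simplifies each statement to a single instruction with read and write sets and treats only read-after-write as a data dependency. The `set_flag`/`wait_flag` names, the pairing by integer id, and the placement directly after the producer and before the consumer are hypothetical choices, not details from the application.

```python
def insert_sync(statements):
    """Insert a set/wait pair around each producer-consumer dependency.

    statements: list of (name, reads, writes) tuples in program order.
    """
    after, before = {}, {}
    pair_id = 0
    for i, (_, _, writes_i) in enumerate(statements):
        for j in range(i + 1, len(statements)):
            reads_j = statements[j][1]
            if set(writes_i) & set(reads_j):  # statement j reads what i wrote
                after.setdefault(i, []).append(f"set_flag({pair_id})")
                before.setdefault(j, []).append(f"wait_flag({pair_id})")
                pair_id += 1
    out = []
    for i, (name, _, _) in enumerate(statements):
        out += before.get(i, []) + [name] + after.get(i, [])
    return out


program = [
    ("load_a", [], ["a"]),
    ("load_b", [], ["b"]),
    ("add", ["a", "b"], ["c"]),
]
synced = insert_sync(program)
```

Here `add` waits on both loads, while the two loads carry no dependency on each other and so remain free to run concurrently.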

GPU wave-to-wave optimization
11928754 · 2024-03-12

This disclosure provides systems, devices, apparatus, and methods, including computer programs encoded on storage media, for GPU wave-to-wave optimization. A graphics processor may execute a shader program for a first wave associated with a draw call or a compute kernel. The graphics processor may identify at least one first indication for the first wave associated with the draw call or the compute kernel. The graphics processor may store the at least one first indication for the first wave to a memory location. The graphics processor may execute the shader program for at least one second wave associated with the draw call or the compute kernel. The execution of the shader program for the at least one second wave may be based on the shader program for the at least one second wave reading the memory location to retrieve the at least one first indication.