G06F8/445

DEVICES, METHODS, AND MEDIA FOR EFFICIENT DATA DEPENDENCY MANAGEMENT FOR IN-ORDER ISSUE PROCESSORS

Methods, devices and media for efficient data dependency management for in-order issue processors are described. In various embodiments described herein, methods, devices and media are disclosed that provide techniques for managing RAW data dependencies between instructions in a constrained hardware environment. The described techniques include initial wait station allocation of write instructions, followed by wait station allocation conflict resolution methods that use a greedy algorithm to optimize a cost function based on the estimated latency of a single instruction. Efficient compilation and reduced execution time may be achieved in some embodiments. Methods and devices for compiling source code are described, as well as devices for executing the compiled machine code and media for storing compiled machine code.

LOCATE NEURAL NETWORK PERFORMANCE HOT SPOTS
20220350619 · 2022-11-03 ·

Embodiments for locating performance hot spots include collecting sample data having instruction addresses, the sample data being for a neural network model and determining instructions in the instruction addresses that are performance hot spots. A listing file is used to map the instructions of the sample data that are performance hot spots to locations in a lower-level intermediate representation. A mapping file is used to map the locations of the lower-level intermediate representation that are performance hot spots to operations in one or more higher-level representations, one or more of the operations corresponding to the performance hot spots, the mapping file being generated from compiling the neural network model.

Quantum instruction compiler for optimizing hybrid algorithms

A compiler for a gate-based superconducting quantum computer compiles hybrid classical/quantum algorithms for quantum processing cells with different configurations. The compiler inputs the algorithm and outputs code in a target language executable by a quantum processing cell of a quantum processing system that can execute the algorithm. The compiler includes various functionality, such as: parsing, analyzing control flows, addressing, compressing, and translating. The compiler optimizes algorithms in various manners using the functionality. Some optimizations include addressing efficiently, compressing based on simulations, and translating for efficient execution of parametric functions. The compiler may function in the environment of a cloud quantum computing system. The cloud quantum computing system may receive algorithms from remote access nodes for execution on local classical and quantum computing systems.

LOAD LATENCY AMELIORATION USING BUNCH BUFFERS
20230031902 · 2023-02-02 · ·

Techniques for task processing based on load latency amelioration using bunch buffers are disclosed. A two-dimensional array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the compute elements is provided on a cycle-by-cycle basis. The control is enabled by a stream of wide control words generated by the compiler. Sets of control word bits are loaded into buffers. Each buffer is associated with and coupled to a unique compute element within the array of compute elements. The sets of control word bits provide operational control for the compute element with which it is associated. Operations are executed within the array of elements. The operations are based on a selected set of control word bits which comprise a control word bunch.

COMPUTER-READABLE RECORDING MEDIUM STORING PROGRAM AND INFORMATION PROCESSING METHOD
20230087152 · 2023-03-23 · ·

A recording medium stores a program for generating a source code that indicates processing on a sparse matrix and for causing a computer to execute a process including: acquiring second codes by optimizing, with a convex polyhedral model, a first code in which loop processing on a matrix is written in a static control part format; converting the second codes into source code candidates, based on sparse matrix information that indicates a variable that represents a non-zero element of the sparse matrix, expression information that indicates an operation expression that corresponds to a function included in the second codes, and data type information that indicates a type to be used for the variable; and selecting the source code from among the source code candidates in accordance with evaluation of processing performance for the sparse matrix in a case where each of the source code candidates is used.

TECHNIQUES FOR PARALLEL EXECUTION
20220342673 · 2022-10-27 ·

Apparatuses, systems, and techniques to identify instructions for advanced execution. In at least one embodiment, a processor performs one or more instructions that have been identified by a compiler to be speculatively performed in parallel.

COMPILER DEVICE, INSTRUCTION GENERATION METHOD, PROGRAM, COMPILING METHOD, AND COMPILER PROGRAM
20230131430 · 2023-04-27 ·

A compiler device, for generating an instruction sequence to be executed by an arithmetic processing device, includes at least one memory and at least one processor. The at least one processor is configured to receive a first instruction sequence for a first process and a second instruction sequence for a second process to be executed after the first process; generate third instructions, each third instruction being generated by merging a first instruction included in the first instruction sequence and a second instruction included in the second instruction sequence; and generate a third instruction sequence by concatenating the third instructions, instructions included in the first instruction sequence that are not merged into the third instructions, and instructions other than the second instruction among the plurality of instructions included in the second instruction sequence that are not merged into the one or more third instructions.

Method of using multidimensional blockification to optimize computer program and device thereof

Disclosed embodiments relate to a method and device for optimizing compilation of source code. The proposed method receives a first intermediate representation code of a source code and analyses each basic block instruction of the plurality of basic block instructions contained in the first intermediate representation code for blockification. In order to blockify the identical instructions, the one or more groups of basic block instructions are assessed for eligibility of blockification. Upon determining as eligible, the group of basic block instructions are blockified using one of one dimensional SIMD vectorization and two-dimensional SIMD vectorization. The method further generates a second intermediate representation of the source code which is translated to executable target code with more efficient processing capacity.

COMPILER-BASED INPUT SYNCHRONIZATION FOR PROCESSOR WITH VARIANT STAGE LATENCIES

The technology disclosed provides a system that comprises a processor with computing units on an integrated circuit substrate. The processor is configured to map a program across multiple hardware stages with each hardware stage executing a corresponding operation of the program at a different stage latency dependent on an operation type and an operand format. The system further comprises a runtime logic that configures the compute units with configuration data. The configuration data causes first and second producer hardware stages in a given compute unit to execute first and second data processing operations and produce first and second outputs at first and second stage latencies, and synchronizes consumption of the first and second outputs by a consumer hardware stage in the given compute unit for execution of a third data processing operation by introducing a register storage delay that compensates for a difference between the first and second stage latencies.

System for an instruction set agnostic runtime architecture
09823939 · 2017-11-21 · ·

A system for an agnostic runtime architecture. The system includes a close to bare metal JIT conversion layer, a runtime native instruction assembly component included within the conversion layer for receiving instructions from a guest virtual machine, and a runtime native instruction sequence formation component included within the conversion layer for receiving instructions from native code. The system further includes a dynamic sequence block-based instruction mapping component included within the conversion layer for code cache allocation and metadata creation, and is coupled to receive inputs from the runtime native instruction assembly component and the runtime native instruction sequence formation component, and wherein the dynamic sequence block-based instruction mapping component receives resulting processed instructions from the runtime native instruction assembly component and the runtime native instruction sequence formation component and allocates the resulting processed instructions to a processor for execution.