G06F8/445

SYSTEMS AND METHODS FOR OPTIMIZING NESTED LOOP INSTRUCTIONS IN PIPELINE PROCESSING STAGES WITHIN A MACHINE PERCEPTION AND DENSE ALGORITHM INTEGRATED CIRCUIT

In one embodiment, a method for improving a performance of an integrated circuit includes implementing one or more computing devices executing a compiler program that: (i) evaluates a target instruction set intended for execution by an integrated circuit; (ii) identifies one or more nested loop instructions within the target instruction set based on the evaluation; (iii) evaluates whether a most inner loop body within the one or more nested loop instructions comprises a candidate inner loop body that requires a loop optimization that mitigates an operational penalty to the integrated circuit, based on one or more executional properties of the most inner loop body; and (iv) implements the loop optimization, which modifies the target instruction set to include loop optimization instructions that control, at runtime, an execution and a termination of the most inner loop body, thereby mitigating the operational penalty to the integrated circuit.
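A minimal sketch of the compiler pass this abstract describes, under invented assumptions: loops are represented as a tiny dataclass, the "executional property" that triggers optimization is a hypothetical short-body criterion, and the applied optimization is full unrolling (one plausible way to control execution and termination without per-iteration branch overhead). All names and thresholds here are illustrative, not the patent's.

```python
# Sketch: find the innermost loop in a nest, test whether it is a
# candidate for optimization, and fully unroll it. Illustrative only.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Loop:
    trip_count: int                                # iterations, assumed known at compile time
    body: List[Union["Loop", str]] = field(default_factory=list)  # sub-loops or opcode names

def innermost(loop: Loop) -> Loop:
    """Descend until a loop whose body contains no nested loop is found."""
    for stmt in loop.body:
        if isinstance(stmt, Loop):
            return innermost(stmt)
    return loop

def needs_optimization(loop: Loop, max_ops: int = 4) -> bool:
    """Hypothetical candidate test: a tiny body makes loop-control overhead dominant."""
    return all(not isinstance(s, Loop) for s in loop.body) and len(loop.body) <= max_ops

def unroll(loop: Loop) -> List[str]:
    """Replace the loop with trip_count copies of its body (no branch at runtime)."""
    return [op for _ in range(loop.trip_count) for op in loop.body]

nest = Loop(trip_count=8, body=[Loop(trip_count=4, body=["load", "mul", "store"])])
inner = innermost(nest)
flat = unroll(inner) if needs_optimization(inner) else None
```

A real pass would instead emit explicit loop-control instructions (e.g. a hardware-loop setup) rather than unrolling; the structure of the detection step is the same.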

ADAPTIVE COMPILATION OF QUANTUM COMPUTING JOBS

Systems, computer-implemented methods, and computer program products to facilitate adaptive compilation of quantum computing jobs are provided. According to an embodiment, a system can comprise a memory that stores computer executable components and a processor that executes the computer executable components stored in the memory. The computer executable components can comprise a selection component that selects a quantum device to execute a quantum program based on one or more run criteria. The computer executable components can further comprise an adaptive compilation component that modifies the quantum program based on one or more attributes of the quantum device to generate a modified quantum program compilation of the quantum program.
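A toy sketch of the two components named in this abstract, not tied to any vendor SDK: device records, the run criteria (enough qubits, then shortest queue), and the gate-rewrite table are all invented for illustration.

```python
# Hypothetical device catalog; "basis" is the device's native gate set.
devices = [
    {"name": "dev_a", "qubits": 5,  "queue": 12, "basis": {"h", "cx", "rz"}},
    {"name": "dev_b", "qubits": 27, "queue": 3,  "basis": {"sx", "cx", "rz"}},
]

# Invented rewrite rules: express a gate in another basis (H = RZ·SX·RZ up to phase).
REWRITES = {"h": ["rz", "sx", "rz"]}

def select_device(program_qubits, devices):
    """Selection component: enough qubits, then shortest queue."""
    candidates = [d for d in devices if d["qubits"] >= program_qubits]
    return min(candidates, key=lambda d: d["queue"])

def adapt(program, device):
    """Adaptive compilation component: rewrite gates into the device's basis."""
    out = []
    for gate in program:
        if gate in device["basis"]:
            out.append(gate)
        else:
            out.extend(REWRITES[gate])
    return out

program = ["h", "cx", "rz"]            # gate names only; qubit operands omitted
device = select_device(2, devices)
compiled = adapt(program, device)
```

Real transpilers also remap qubits to the device topology and reschedule around calibration data; this shows only the select-then-adapt shape.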

LOOP NEST REVERSAL
20200371763 · 2020-11-26

Systems, apparatuses and methods may provide for technology to identify in user code, a nested loop which would result in cache memory misses when executed. The technology further reverses an order of iterations of a first inner loop in the nested loop to obtain a modified nested loop. Reversing the order of iterations increases a number of times that cache memory hits occur when the modified nested loop is executed.
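A toy cache simulator illustrating why the reversal helps: with an LRU cache smaller than the array, a forward re-scan evicts each element just before it is needed again, while reversing the inner loop on alternate passes (a serpentine traversal) starts each pass at the lines the previous pass left in cache. Cache size and array length are arbitrary illustrative values.

```python
from collections import OrderedDict

def scan(order_per_pass, cache_size=4):
    """Count cache hits for a sequence of passes over element addresses."""
    cache, hits = OrderedDict(), 0
    for pass_order in order_per_pass:
        for addr in pass_order:
            if addr in cache:
                hits += 1
                cache.move_to_end(addr)          # refresh LRU position
            else:
                if len(cache) >= cache_size:
                    cache.popitem(last=False)    # evict least recently used
                cache[addr] = True
    return hits

N, passes = 8, 4
forward = [list(range(N)) for _ in range(passes)]                   # original nest
serpentine = [list(range(N))[::pow(-1, p)] for p in range(passes)]  # inner loop reversed on odd passes

# With cache_size < N, the forward re-scan gets zero hits, while the
# serpentine scan reuses the cached tail of each previous pass.
```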

APPLICATION INTERFACE ON MULTIPLE PROCESSORS
20200285521 · 2020-09-10

A method and an apparatus that execute a parallel computing program in a programming language for a parallel computing architecture are described. The parallel computing program is stored in memory in a system with parallel processors. The parallel computing program is stored in a memory to allocate threads between a host processor and a GPU. The programming language includes an API to allow an application to make calls using the API to allocate execution of the threads between the host processor and the GPU. The programming language includes host function data tokens for host functions performed in the host processor and kernel function data tokens for compute kernel functions performed in one or more compute processors, e.g., GPUs or CPUs, separate from the host processor.
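A hypothetical sketch of the token-based allocation this abstract describes, with no real GPU API involved: functions are tagged with host or kernel tokens, and a tiny runtime decides where each call runs. The decorator names, the `Runtime` class, and the "gpu0" target are all invented.

```python
HOST, KERNEL = "host", "kernel"

def host_fn(fn):
    """Mark a function with a host function data token."""
    fn.token = HOST
    return fn

def kernel_fn(fn):
    """Mark a function with a compute kernel function data token."""
    fn.token = KERNEL
    return fn

class Runtime:
    def __init__(self):
        self.log = []                    # records where each call was dispatched

    def call(self, fn, *args):
        # Allocation policy: host tokens run on the CPU, kernel tokens on a device.
        device = "cpu" if fn.token == HOST else "gpu0"
        self.log.append((fn.__name__, device))
        return fn(*args)

@host_fn
def prepare(data):
    return [x * 2 for x in data]

@kernel_fn
def reduce_sum(data):
    return sum(data)

rt = Runtime()
total = rt.call(reduce_sum, rt.call(prepare, [1, 2, 3]))
```

In the actual architecture described, kernel functions would be compiled for and enqueued on GPUs or CPUs through the API; the simulation keeps only the host/kernel token split.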

SYSTEM AND METHOD TO PERFORM PARALLEL PROCESSING ON A DISTRIBUTED DATASET

Disclosed is a system to perform parallel processing on a distributed dataset. The system includes a receiving module for receiving a dataset along with a set of functions; a partitioning module for partitioning the dataset into a set of distributed datasets; a distributing module for distributing the set of distributed datasets amongst a set of computing nodes; a determining module for determining an applicability of the functions on the distributed dataset; an executing module for executing one or more functions applicable on the distributed dataset; and a generating module for generating processed data for the distributed dataset based upon the execution of the one or more functions.
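A structural sketch of the modules named in this abstract, using plain Python in place of real cluster machinery; the node count and the "applicability" test (whether the function accepts the chunk) are illustrative choices.

```python
def partition(dataset, n_parts):
    """Partitioning module: split into n_parts roughly equal chunks."""
    k, m = divmod(len(dataset), n_parts)
    return [dataset[i*k + min(i, m):(i+1)*k + min(i+1, m)] for i in range(n_parts)]

def distribute(chunks, nodes):
    """Distributing module: assign one chunk per node."""
    return {node: chunk for node, chunk in zip(nodes, chunks)}

def applicable(fn, chunk):
    """Determining module: a function applies if it accepts the chunk."""
    try:
        fn(chunk)
        return True
    except TypeError:
        return False

def execute(assignment, functions):
    """Executing + generating modules: run every applicable function per node."""
    return {node: [fn(chunk) for fn in functions if applicable(fn, chunk)]
            for node, chunk in assignment.items()}

data = list(range(10))
chunks = partition(data, 3)
result = execute(distribute(chunks, ["n1", "n2", "n3"]), [sum, len])
```

A real system would ship the functions to remote nodes and gather results; here the node dictionary stands in for the cluster.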

Applications for hardware accelerators in computing systems
10747516 · 2020-08-18

An example method of implementing an application for a hardware accelerator having a programmable device coupled to memory is disclosed. The method includes compiling source code of the application to generate logical circuit descriptions of kernel circuits; determining resource availability in a dynamic region of programmable logic of the programmable device, the dynamic region exclusive of a static region of the programmable logic programmed with a host interface configured to interface a computing system having the hardware accelerator; determining resource utilization by the kernel circuits in the dynamic region; determining fitting solutions of the kernel circuits within the dynamic region, each of the fitting solutions defining connectivity of the kernel circuits to banks of the memory; adding a memory subsystem to the application based on a selected fitting solution of the fitting solutions; and generating a kernel image configured to program the dynamic region to implement the kernel circuits and the memory subsystem.
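An illustrative first-fit sketch of the flow above: compare each kernel circuit's resource utilization against the dynamic region's availability, then bind each fitted kernel to a memory bank as its connectivity. The resource numbers, bank names, and greedy policy are invented; real flows solve this with placement-aware heuristics.

```python
# Hypothetical availability of the dynamic region (after the static region
# with the host interface is accounted for), and the memory banks.
DYNAMIC_REGION = {"lut": 100_000, "bram": 200}
BANKS = ["DDR0", "DDR1"]

kernels = [
    {"name": "k_fir", "lut": 40_000, "bram": 80},
    {"name": "k_fft", "lut": 35_000, "bram": 90},
    {"name": "k_big", "lut": 50_000, "bram": 60},   # won't fit after the first two
]

def fit(kernels, region, banks):
    """First-fit: accept each kernel that still fits, and assign it a bank."""
    remaining = dict(region)
    solution = []                                   # (kernel, bank) connectivity pairs
    for k in kernels:
        if k["lut"] <= remaining["lut"] and k["bram"] <= remaining["bram"]:
            remaining["lut"] -= k["lut"]
            remaining["bram"] -= k["bram"]
            solution.append((k["name"], banks[len(solution) % len(banks)]))
    return solution, remaining

solution, left = fit(kernels, DYNAMIC_REGION, BANKS)
```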

ANNOTATIONS FOR PARALLELIZATION OF USER-DEFINED FUNCTIONS WITH FLEXIBLE PARTITIONING
20200233661 · 2020-07-23

Annotations can be placed in source code to indicate properties for user-defined functions. A wide variety of properties can be implemented to provide information that can be leveraged when constructing a query execution plan for the user-defined function and associated core database relational operations. A flexible range of permitted partition arrangements can be specified via the annotations. Other supported properties include expected sorting and grouping arrangements, ensured post-conditions, and behavior of the user-defined function.
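A sketch of annotation-driven planning under invented names: a decorator attaches partitioning and sorting properties to a user-defined function, and a planner reads them to decide whether the UDF can run partition-parallel as-is or needs a repartition first.

```python
def udf(partition_by=None, sorted_by=None, deterministic=True):
    """Annotation: attach declared properties to a user-defined function."""
    def wrap(fn):
        fn.props = {"partition_by": partition_by,
                    "sorted_by": sorted_by,
                    "deterministic": deterministic}
        return fn
    return wrap

@udf(partition_by="customer_id", sorted_by="ts")
def sessionize(rows):
    """Example UDF; body omitted, only its annotations matter to the planner."""
    ...

def plan(fn, available_partitioning):
    """Permit direct parallel execution only if the data is already
    partitioned compatibly with the UDF's declared requirement."""
    need = fn.props["partition_by"]
    if need is None or need == available_partitioning:
        return "parallel"
    return "repartition-then-parallel"
```

In the described system these properties also feed post-condition and grouping information into the query optimizer; this shows only the partitioning decision.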

COMPILER FOR TRANSLATING BETWEEN A VIRTUAL IMAGE PROCESSOR INSTRUCTION SET ARCHITECTURE (ISA) AND TARGET HARDWARE HAVING A TWO-DIMENSIONAL SHIFT ARRAY STRUCTURE
20200201612 · 2020-06-25

A method is described that includes translating higher level program code including higher level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates from an orthogonal coordinate system into lower level instructions that target a hardware architecture having an array of execution lanes and a shift register array structure that is able to shift data along two different axes. The translating includes replacing the higher level instructions having the instruction format with lower level shift instructions that shift data within the shift register array structure.
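A sketch of that lowering step: a higher-level access at coordinate offset (dx, dy) becomes shift instructions that move data across the two-dimensional array so every execution lane sees the needed neighbor locally. The mnemonics (`SHIFT_X`, `SHIFT_Y`, `READ_ACC`) are invented for illustration.

```python
def lower_stencil(offsets):
    """Translate (dx, dy) coordinate offsets into relative SHIFT ops plus a READ.
    Shifts are emitted relative to the array's current skew, not the origin."""
    ops, cur = [], (0, 0)
    for dx, dy in offsets:
        sx, sy = dx - cur[0], dy - cur[1]
        if sx:
            ops.append(f"SHIFT_X {sx}")
        if sy:
            ops.append(f"SHIFT_Y {sy}")
        ops.append("READ_ACC")                 # consume the shifted value
        cur = (dx, dy)
    return ops

# 3-tap horizontal blur: accesses pixel(x-1, y), pixel(x, y), pixel(x+1, y)
code = lower_stencil([(-1, 0), (0, 0), (1, 0)])
```

Emitting shifts relative to the current skew (rather than shifting back to the origin each time) is what makes the lowered sequence cheap on such hardware.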

Latency scheduling mechanism
10691430 · 2020-06-23

An apparatus to facilitate instruction scheduling is disclosed. The apparatus includes one or more processors to receive a block of instructions, divide the block of instructions into a plurality of sub-blocks based on a register pressure bounded by a predetermined threshold, and schedule instructions in each of the plurality of sub-blocks for processing.
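A sketch of the splitting step with a deliberately crude pressure model: walk the block tracking the set of live values, and start a new sub-block whenever the estimate exceeds the threshold. The instruction encoding (defined values, last-use values) and the reset-at-boundary behavior are simplifying assumptions.

```python
def split_by_pressure(block, threshold):
    """block: list of (defs, last_uses) per instruction.
    Cut a sub-block boundary when live-value count exceeds threshold."""
    sub_blocks, current, live = [], [], set()
    for defs, uses in block:
        live |= set(defs)
        live -= set(uses)                 # values whose last use is this instruction
        current.append((defs, uses))
        if len(live) > threshold:         # register pressure bound exceeded
            sub_blocks.append(current)
            current, live = [], set()     # assume values are spilled at the boundary
    if current:
        sub_blocks.append(current)
    return sub_blocks

block = [(["a"], []), (["b"], []), (["c"], []), (["d"], ["a", "b"]), (["e"], [])]
subs = split_by_pressure(block, threshold=2)
```

The scheduler described would then order instructions within each sub-block, knowing pressure inside it stays bounded.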

Method and apparatus for detecting inter-instruction data dependency

Embodiments of the present invention disclose a method and an apparatus for detecting inter-instruction data dependency. The method comprises: comparing a thread number corresponding to a historical access operation with a thread number corresponding to a write access operation, and terminating the detection if the thread number corresponding to the write access operation is less than the thread number corresponding to the historical access operation, which indicates existence of data dependency for a to-be-detected instruction; or comparing a thread number corresponding to a historical write access operation with a thread number corresponding to a read access operation, and terminating the detection if the thread number corresponding to the read access operation is less than the thread number corresponding to the historical write access operation, which indicates existence of data dependency for the to-be-detected instruction.
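A sketch of that check, assuming thread numbers encode intended program order: per address, record the highest thread number of historical accesses and writes, and flag a dependency when a lower-numbered thread writes after any access, or reads after a write, by a higher-numbered thread. The trace format is invented for illustration.

```python
def detect(trace):
    """trace: list of (thread_no, op, addr) with op in {'read', 'write'}.
    Returns ('dependency', thread, addr) on the first violation, else
    ('independent', None, None)."""
    last_access, last_write = {}, {}            # addr -> highest thread number seen
    for tid, op, addr in trace:
        if op == "write":
            # Write vs. any historical access (covers write-after-read/write).
            if addr in last_access and tid < last_access[addr]:
                return ("dependency", tid, addr)
            last_access[addr] = max(tid, last_access.get(addr, tid))
            last_write[addr] = max(tid, last_write.get(addr, tid))
        else:
            # Read vs. historical write (covers read-after-write).
            if addr in last_write and tid < last_write[addr]:
                return ("dependency", tid, addr)
            last_access[addr] = max(tid, last_access.get(addr, tid))
    return ("independent", None, None)

ok = detect([(0, "write", 100), (1, "read", 100)])   # in program order: fine
bad = detect([(2, "write", 100), (1, "read", 100)])  # thread 1 reads data thread 2 wrote
```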