Patent classifications
G06F8/445
MEMORY-BOUND SCHEDULING
Certain aspects of the present disclosure provide techniques for generating execution schedules, comprising receiving a data flow graph for a process, where data flow graph comprises a plurality of nodes and a plurality of edge; generating a topological ordering for the data flow graph based at least in part on memory utilization of the process; generating a first modified topological ordering by inserting, into the topological ordering, one or more new nodes corresponding to memory access based on a predefined memory capacity; allocating units of memory in the memory based on the first modified topological ordering; and generating a second modified topological ordering by rearranging one or more nodes in the first modified topological ordering, where the second modified topological ordering enables increased parallel utilization of a plurality of hardware components.
Application interface on multiple processors
A method and an apparatus that execute a parallel computing program in a programming language for a parallel computing architecture are described. The parallel computing program is stored in memory in a system with parallel processors. The parallel computing program is stored in a memory to allocate threads between a host processor and a GPU. The programming language includes an API to allow an application to make calls using the API to allocate execution of the threads between the host processor and the GPU. The programming language includes host function data tokens for host functions performed in the host processor and kernel function data tokens for compute kernel functions performed in one or more compute processors, e.g., GPUs or CPUs, separate from the host processor.
SYSTEMS AND METHODS FOR OPTIMIZING NESTED LOOP INSTRUCTIONS IN PIPELINE PROCESSING STAGES WITHIN A MACHINE PERCEPTION AND DENSE ALGORITHM INTEGRATED CIRCUIT
In one embodiment, a method for improving a performance of an integrated circuit includes implementing one or more computing devices executing a compiler program that: (i) evaluates a target instruction set intended for execution by an integrated circuit; (ii) identifies one or more nested loop instructions within the target instruction set based on the evaluation; (iii) evaluates whether a most inner loop body within the one or more nested loop instructions comprises a candidate inner loop body that requires a loop optimization that mitigates an operational penalty to the integrated circuit based on one or more executional properties of the most inner loop instruction; and (iv) implements the loop optimization that modifies the target instruction set to include loop optimization instructions to control, at runtime, an execution and a termination of the most inner loop body thereby mitigating the operational penalty to the integrated circuit.
METHOD AND APPARATUS FOR FUNCTIONAL UNIT ASSIGNMENT
There is provided a method, apparatus and network node for determining an optimal functional unit for currently scheduled instructions. Embodiments divide the function unit assignment problem into separate assessments and subsequently reconciling any conflicts arising from different conclusions for the optimal functional unit. The separate assessments of the optimal functional unit relate to assessment of the optimal functional unit in terms of instruction bundling and the optimal functional unit in terms of latency. For the selection of the best functional unit in terms of instruction bundling, consideration regarding maximizing the size of instruction bundle is considered while taking into account the priority of instructions in the available queue. For the selection of the best functional unit in terms of latency, the most important successor of instruction node is primarily considered.
PRE-INSTRUCTION SCHEDULING REMATERIALIZATION FOR REGISTER PRESSURE REDUCTION
Examples are disclosed herein that relate to performing rematerialization operation(s) on program source code prior to instruction scheduling. In one example, a method includes prior to performing instruction scheduling on program source code, for each basic block of the program source code, determining a register pressure at a boundary of the basic block, determining whether the register pressure at the boundary is greater than a target register pressure, based on the register pressure at the boundary being greater than the target register pressure, identifying one or more candidate instructions in the basic block suitable for rematerialization to reduce the register pressure at the boundary, and performing a rematerialization operation on at least one of the one or more candidate instructions to reduce the register pressure at the boundary to be less than the target register pressure.
RESCHEDULING JIT COMPILATION BASED ON JOBS OF PARALLEL DISTRIBUTED COMPUTING FRAMEWORK
A computer-implemented method is provided for compilation rescheduling from among four compilation levels comprising level 1, level 2, level 3, and level 4 on a parallel distributed computing framework running processes for a plurality of jobs of a virtual machine. The method bypasses a program analysis overhead that includes measuring a compiled method execution time by identifying completed compilation levels of a Just In Time compilation. The method finds a repetition of a same process in the processes for the plurality of jobs of the virtual machine from profiles by comparing main class names, virtual machine parameters, and Jar file types therein. The method applies a compilation scheduling for the same process a next time the same process runs based on a result of the checking the transition, by (i) compiling at the level 1 at least some methods for the same process responsive to the virtual machine finishing without compiling the at least some methods for the same process at the level 4 after compiling the at least some of the methods at a level in between the level 1 and the level 4, and (ii) compiling at the level 4 at least a subset of the methods earlier than an original scheduled time responsive to at least the subset of the methods compiled at the level 4 being infrequently invoked below a threshold amount.
Compile-time scheduling
Scheduling of the operations of an integrated circuit device such as a hardware accelerator, including scheduling of movement of data into and out of the accelerator, can be performed by a compiler that produces program code for the accelerator. The compiler can produce a graph that represents operations to be performed by the accelerator. Using the graph, the compiler can determine estimated execution times for the operations represented by each node in the graph. The compiler can schedule operations by determining an estimated execution time for set of dependent operations that depend from an operation. The compiler can then select an operation that has a shortest estimated execution time from among a set of operations and which has a set of dependent operations that has a longest estimated execution time as compared to other sets of dependent operations.
LARGE LOOKUP TABLES FOR AN IMAGE PROCESSOR
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for supporting large lookup tables on an image processor. One of the methods includes receiving an input kernel program for an image processor having a two-dimensional array of execution lanes, a shift-register array, and a plurality of memory banks. If the kernel program has an instruction that reads a lookup table value for a lookup table partitioned across the plurality of memory banks, the instruction in the kernel program are replaced with a sequence of instructions that, when executed by an execution lane, causes the execution lane to read a first value from a local memory bank and a second value from the local memory bank on behalf of another execution lane belonging to a different group of execution lanes.
INFORMATION PROCESSING METHOD AND COMPUTER-READABLE RECORDING MEDIUM HAVING STORED THEREIN OPTIMIZATION PROGRAM
An information processing method executed by a computer, the method includes executing a target program to acquire number of executions for each of a plurality of program codes; selecting a combination of program codes related to a plurality of assignment statements from among program codes related to assignment statements having a higher number of executions based on the acquired number of executions; when the target program is changed, executing the changed target program to calculate an execution accuracy and an operation time so that parallel processing using an SIMD operation function is executed for each of the program codes related to the plurality of assignment statements included in the selected combination; and searching for the combination so that the calculated execution accuracy and operation time satisfy a predetermined condition.
Systems and methods for optimizing nested loop instructions in pipeline processing stages within a machine perception and dense algorithm integrated circuit
In one embodiment, a method for improving a performance of an integrated circuit includes implementing one or more computing devices executing a compiler program that: (i) evaluates a target instruction set intended for execution by an integrated circuit; (ii) identifies one or more nested loop instructions within the target instruction set based on the evaluation; (iii) evaluates whether a most inner loop body within the one or more nested loop instructions comprises a candidate inner loop body that requires a loop optimization that mitigates an operational penalty to the integrated circuit based on one or more executional properties of the most inner loop instruction; and (iv) implements the loop optimization that modifies the target instruction set to include loop optimization instructions to control, at runtime, an execution and a termination of the most inner loop body thereby mitigating the operational penalty to the integrated circuit.