Patent classifications
G06F8/452
Optimize control-flow convergence on SIMD engine using divergence depth
There are provided a system, a method and a computer program product for selecting an active data stream (a lane) while running Single Program Multiple Data code on a Single Instruction Multiple Data machine. The machine runs an instruction stream over input data streams, increments the lane depth counters of all active lanes when the thread-PC reaches a branch operation, and updates the lane-PC of each active lane according to the targets of the branch operation. An instruction of the instruction stream includes a barrier indicating a convergence point for all lanes to join. In response to a lane reaching a barrier, the machine evaluates whether all lane-PCs are set to the same thread-PC. If the lane-PCs are not set to the same thread-PC, it selects an active lane from the plurality of lanes; otherwise, it increments the lane-PCs of all the lanes and then selects an active lane from the plurality of lanes.
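The branch and barrier handling described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented mechanism: the names Lane, on_branch, on_barrier and select_thread_pc are hypothetical, and the selection policy (prefer the greatest depth, break ties by the lowest lane-PC) is an assumption, since the abstract does not say how the active lane is chosen.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    pc: int         # lane-PC: the next instruction this lane wants to run
    depth: int = 0  # divergence depth counter


def on_branch(lanes, thread_pc, targets):
    """Thread-PC reached a branch: bump the depth of every active lane
    and move its lane-PC to that lane's branch target.

    targets maps lane index -> target PC (hypothetical input form).
    """
    for i, lane in enumerate(lanes):
        if lane.pc == thread_pc:  # lane is active at this instruction
            lane.depth += 1
            lane.pc = targets[i]


def select_thread_pc(lanes):
    """Assumed policy: follow the deepest lane; ties go to the lowest PC."""
    chosen = max(lanes, key=lambda lane: (lane.depth, -lane.pc))
    return chosen.pc


def on_barrier(lanes, thread_pc):
    """A lane reached a barrier at thread_pc; return the next thread-PC."""
    if all(lane.pc == thread_pc for lane in lanes):
        # Every lane has converged, so all of them step past the barrier.
        for lane in lanes:
            lane.pc += 1
    # Otherwise the lagging lanes keep their PCs and one of them is picked.
    return select_thread_pc(lanes)
```

With four lanes that split two ways at a branch, the first call to on_barrier switches execution to the lanes that have not yet arrived, and the second call, made once all lane-PCs match, advances every lane past the barrier.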
Methods and apparatus for executing data-dependent threads in parallel
Methods and apparatus for parallel processing are provided. A multicore processor is described. The multicore processor may include a distributed memory unit with memory nodes coupled to the processor's cores. The cores may be configured to execute parallel threads, and at least one of the threads may be data-dependent on at least one of the other threads. The distributed memory unit may be configured to proactively send shared memory data from a thread that produces the shared memory data to one or more of the threads.
System and method for compiling high-level language code into a script executable on a blockchain platform
A computer-implemented method (and corresponding system) is provided that enables or facilitates the execution of a portion of source code, written in a high-level language (HLL), on a blockchain platform. The method and system can include a blockchain compiler, arranged to convert a portion of high-level source code into a form that can be used with a blockchain platform. This may be the Bitcoin blockchain or an alternative. The method can include: receiving the portion of source code as input; and generating an output script comprising a plurality of op codes. The op codes are a subset of op codes that are native to a functionally-restricted, blockchain scripting language. The outputted script is arranged and/or generated such that, when executed, the script provides, at least in part, the functionality specified in the source code. The blockchain scripting language is restricted such that it does not natively support complex control-flow constructs or recursion via jump-based loops or other recursive programming constructs. The step of generating the output script may comprise unrolling at least one looping construct provided in the source code. The method may further comprise providing or using an interpreter or virtual machine arranged to convert the output script into a form that is executable on a blockchain platform.
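The loop-unrolling step can be illustrated with a small sketch. It assumes a loop whose iteration count is known at compile time and emits the loop body that many times as a flat op-code list, because the target scripting language has no jumps. The helper names unroll_loop and run are hypothetical, and run is only a toy stack evaluator for this example, not a Bitcoin Script interpreter; OP_0, OP_1 and OP_ADD are standard Script op codes.

```python
def unroll_loop(body, iterations):
    """Return a flat op-code list that repeats `body` a fixed number of times."""
    if iterations < 0:
        raise ValueError("iteration count must be non-negative")
    return [op for _ in range(iterations) for op in body]


def run(script):
    """Toy evaluator: supports small numeric pushes (OP_<n>) and OP_ADD only."""
    stack = []
    for op in script:
        if op == "OP_ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op.startswith("OP_") and op[3:].isdigit():
            stack.append(int(op[3:]))
        else:
            raise ValueError(f"unsupported op code: {op}")
    return stack


# High-level source (conceptually): x = 0; repeat 3 times: x = x + 1
script = ["OP_0"] + unroll_loop(["OP_1", "OP_ADD"], 3)
result = run(script)
```

The unrolled script grows linearly with the iteration count, which is the price paid for having no loop construct in the target language.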
Transposing a matrix using a streaming engine
Software instructions are executed on a processor within a computer system to configure a streaming engine to operate in either a linear mode or a transpose mode. A stream of addresses is generated using an address generator, in which the stream of addresses includes consecutive nested loop iterations for at least a first loop and a second loop. While in the linear mode, the first loop is treated as the inner loop. While in the transpose mode, the second loop is treated as the inner loop. A matrix can be fetched from memory in the linear mode to provide row-wise vectors. A matrix can be fetched from the memory in the transpose mode to provide column-wise vectors.
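The two addressing modes can be sketched in software as a pair of nested loops whose roles are swapped. This is an illustrative model only: stream_addresses and the (count, stride) pairs are hypothetical names, and a row-major matrix with unit-sized elements is assumed.

```python
def stream_addresses(base, first, second, transpose=False):
    """Yield addresses for two nested loops.

    first and second are (count, stride) pairs. In linear mode the first
    loop is the inner loop; in transpose mode the second loop is.
    """
    inner, outer = (second, first) if transpose else (first, second)
    for o in range(outer[0]):
        for i in range(inner[0]):
            yield base + o * outer[1] + i * inner[1]


ROWS, COLS = 2, 3
first = (COLS, 1)      # first loop: step along a row
second = (ROWS, COLS)  # second loop: step down a column (row-major layout)

linear = list(stream_addresses(0, first, second))
transposed = list(stream_addresses(0, first, second, transpose=True))
# linear     -> [0, 1, 2, 3, 4, 5]  (row-wise order)
# transposed -> [0, 3, 1, 4, 2, 5]  (column-wise order)
```

Only the loop nesting changes between the modes; the counts and strides stay the same, so the transpose costs no extra data movement in this model.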
Systems and methods for optimizing nested loop instructions in pipeline processing stages within a machine perception and dense algorithm integrated circuit
In one embodiment, a method for improving a performance of an integrated circuit includes implementing one or more computing devices executing a compiler program that: (i) evaluates a target instruction set intended for execution by an integrated circuit; (ii) identifies one or more nested loop instructions within the target instruction set based on the evaluation; (iii) evaluates whether a most inner loop body within the one or more nested loop instructions comprises a candidate inner loop body that requires a loop optimization that mitigates an operational penalty to the integrated circuit based on one or more executional properties of the most inner loop instruction; and (iv) implements the loop optimization that modifies the target instruction set to include loop optimization instructions to control, at runtime, an execution and a termination of the most inner loop body thereby mitigating the operational penalty to the integrated circuit.
System and method of loop vectorization by compressing indices and data elements from iterations based on a control mask
Loop vectorization methods and apparatus are disclosed. An example method includes generating a first control mask for a set of iterations of a loop by evaluating a condition of the loop, wherein generating the first control mask includes setting a bit of the first control mask to a first value when the condition indicates that an operation of the loop is to be executed, and setting the bit of the first control mask to a second value when the condition indicates that the operation of the loop is to be bypassed. The example method also includes compressing indexes corresponding to the set of iterations of the loop according to the first control mask.
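A scalar model of mask generation and compression might look like the sketch below. The function names control_mask and compress are hypothetical, and the choice of 1 for "execute" and 0 for "bypass" is an assumption; real hardware would do this with vector compare and compress instructions rather than Python lists.

```python
def control_mask(values, condition):
    """One bit per iteration: 1 if the loop operation runs, 0 if it is bypassed."""
    return [1 if condition(v) else 0 for v in values]


def compress(items, mask):
    """Pack the items whose mask bit is set into contiguous positions."""
    return [item for item, bit in zip(items, mask) if bit]


data = [5, -2, 7, 0, -9, 3]
mask = control_mask(data, lambda v: v > 0)         # [1, 0, 1, 0, 0, 1]
active_indices = compress(range(len(data)), mask)  # [0, 2, 5]
active_data = compress(data, mask)                 # [5, 7, 3]
```

After compression the surviving iterations sit next to each other, so the loop body can run on dense vectors instead of partly idle ones.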
Alternative loop limits for accessing data in multi-dimensional tensors
Methods, systems, and apparatus for accessing an N-dimensional tensor are described. In some implementations, a method includes, for each of one or more first iterations of a first nested loop, performing iterations of a second nested loop that is nested within the first nested loop until a first loop bound for the second nested loop is reached. A number of iterations of the second nested loop for the one or more first iterations of the first nested loop is limited by the first loop bound in response to the second nested loop having a total number of iterations that exceeds a value of a hardware property of the computing system. After a penultimate iteration of the first nested loop has completed, one or more iterations of the second nested loop are performed for a final iteration of the first nested loop until an alternative loop bound is reached.
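The alternative-bound idea can be sketched as follows, assuming the hardware property is a fixed width (for example, a number of compute units) and the total iteration count does not divide evenly by it. The names tiled_iterations, total and hw_width are hypothetical and do not come from the abstract.

```python
import math


def tiled_iterations(total, hw_width):
    """Split `total` inner iterations into outer-loop chunks.

    Every outer iteration except the last uses the first loop bound
    (hw_width); the final one uses the alternative bound, which covers
    only the remaining elements.
    """
    outer_count = math.ceil(total / hw_width)
    alt_bound = total - (outer_count - 1) * hw_width
    for outer in range(outer_count):
        bound = alt_bound if outer == outer_count - 1 else hw_width
        yield [outer * hw_width + inner for inner in range(bound)]


chunks = list(tiled_iterations(total=10, hw_width=4))
# chunks -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Switching the bound for the final outer iteration avoids touching elements past the end of the tensor without a per-element range check.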
Neural network operation reordering for parallel execution
Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
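A toy model can show why reordering helps. It assumes each execution engine runs its operations in the listed order and that an operation waits for its dependencies; the inefficiency is an engine sitting idle behind a blocked operation. Everything here is hypothetical: the engine names "pe" and "act", the (engine, duration, deps) tuple form, and the brute-force search, which stands in for whatever heuristic a real compiler would use.

```python
from itertools import permutations


def makespan(order, ops):
    """Finish time of the whole schedule.

    ops maps name -> (engine, duration, deps); `order` must list every
    operation after its dependencies.
    """
    engine_free, finish = {}, {}
    for name in order:
        engine, duration, deps = ops[name]
        start = max([engine_free.get(engine, 0)] + [finish[d] for d in deps])
        finish[name] = start + duration
        engine_free[engine] = finish[name]
    return max(finish.values(), default=0)


def is_valid(order, ops):
    """True if every operation appears after all of its dependencies."""
    seen = set()
    for name in order:
        if any(dep not in seen for dep in ops[name][2]):
            return False
        seen.add(name)
    return True


def best_order(order, ops):
    """Brute-force search over valid orders (small examples only)."""
    candidates = [p for p in permutations(order) if is_valid(p, ops)]
    return list(min(candidates, key=lambda p: makespan(p, ops)))


ops = {
    "A": ("pe", 4, []),      # matrix multiply on one engine
    "B": ("act", 2, ["A"]),  # activation that needs A's result
    "C": ("act", 3, []),     # independent work for the same engine as B
}
original = ["A", "B", "C"]
reordered = best_order(original, ops)
```

In the original order the "act" engine idles until A finishes and C runs last, giving a makespan of 9; moving C ahead of B fills that idle time and brings the makespan down to 6.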
Transforming loops in program code based on a capacity of a cache
An electronic device acquires, from program code, two or more program code loops having specified data dependencies. The electronic device places each of the program code loops into a corresponding blocking loop, each blocking loop including at least one blocking loop induction variable that is incremented by a corresponding block size and used to specify a number of iterations for at least one internal loop induction variable of the respective program code loop. The electronic device fuses the blocking loops into a fused loop by placing all of the blocking loops in the fused loop and replacing the blocking loop induction variables of the blocking loops with a fused loop induction variable that is incremented by the corresponding block size and used to specify the number of iterations for respective internal loop induction variables in the blocking loops.
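The blocking-then-fusing transformation can be sketched on two simple dependent loops. This is an assumed example, not code from the patent: the arrays, the function names, and the block size are hypothetical, and in practice the block size would be picked so that one block of each array fits in the cache.

```python
def separate_loops(a):
    """Original form: the second loop reads what the first loop wrote."""
    n = len(a)
    b, c = [0] * n, [0] * n
    for i in range(n):
        b[i] = a[i] * 2
    for i in range(n):
        c[i] = b[i] + 1
    return c


def fused_blocked_loops(a, block):
    """Both loops placed under one fused loop that advances by `block`."""
    n = len(a)
    b, c = [0] * n, [0] * n
    for start in range(0, n, block):  # fused loop induction variable
        stop = min(start + block, n)  # bounds for the internal loops
        for i in range(start, stop):  # first loop, one block
            b[i] = a[i] * 2
        for i in range(start, stop):  # second loop, same block
            c[i] = b[i] + 1
    return c


data = list(range(10))
assert fused_blocked_loops(data, block=4) == separate_loops(data)
```

Because the second loop consumes each block of b right after the first loop produces it, the block is still likely to be in the cache, which is the point of sizing the block to the cache capacity.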