Patent classifications
G06F8/445
PARALLEL PROCESSING WITH SWITCH BLOCK EXECUTION
Techniques for parallel processing based on parallel processing with switch block execution are disclosed. An array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the compute elements is provided on a cycle-by-cycle basis. Control is enabled by a stream of wide control words generated by the compiler. A plurality of compute elements is initialized within the array with a switch statement. The switch statement is mapped into a primitive operation in each element of the plurality of compute elements. The initializing is based on a control word from the stream of control words. Each of the primitive operations is executed in an architectural cycle. A result is returned for the switch statement. The returning is determined by a decision variable.
PARALLEL PROCESSING USING HAZARD DETECTION AND MITIGATION
Techniques for parallel processing using hazard detection and mitigation are disclosed. An array of compute elements is accessed. Each compute element within the array of compute elements is known to a compiler and is coupled to its neighboring compute elements within the array of compute elements. Control for the compute elements is provided on a cycle-by-cycle basis. Control is enabled by a stream of wide control words generated by the compiler. Memory access operations are tagged with precedence information. The tagging is contained in the control words. The tagging is provided by the compiler at compile time. Memory access operations are monitored. The monitoring is based on the precedence information and a number of architectural cycles of the cycle-by-cycle basis. The tagging is augmented at run time, based on the monitoring. Memory access data is held before promotion, based on the monitoring.
Software optimization for multicore systems
A computer-implemented method and non-transitory computer readable medium for software optimization. The method comprises accessing program code having a plurality of software tasks and generating a first mapping of the software tasks to a plurality of processors of a target system having a plurality of physical communication links. A simulation of the target system is executed according to the first mapping. During the simulation, the software tasks cause data transfers over simulated communication links corresponding to the physical communication links. The data transfers are modeled in phases during the simulation and data rates of the simulated physical links are distributed across corresponding active data transfers. A second mapping of the software tasks to the plurality of processors is generated based on a result of the simulation.
Compiler architecture for programmable application specific integrated circuit based network devices
A processing network including a plurality of lookup and decision engines (LDEs) each having one or more configuration registers and a plurality of on-chip routers forming a matrix for routing the data between the LDEs, wherein each of the on-chip routers is communicatively coupled with one or more of the LDEs. The processing network further including an LDE compiler stored on a memory and communicatively coupled with each of the LDEs, wherein the LDE compiler is configured to generate values based on input source code that when programmed into the configuration registers of the LDEs cause the LDEs to implement the functionality defined by the input source code.
Systems and methods for automatic computer code parallelization
A system and method for automatic parallelization of computer code includes: measuring a performance of a computer program; identifying slow code of the computer program; implementing a computer code analysis of the computer program including: implementing a dependence analysis; implementing a side effect analysis of the computer program; constructing a dependency analysis basic block (DABB) graph for blocks of the code: a graphical representation of one or more possible paths through a respective disparate block of code; constructing a versioned dependency graph that optimizes a performance of the computer program; generating a metaprogram based on the versioned dependency graph; and automatically executing parallelization of the computer program at runtime based on the metaprogram.
TUNING OF LOOP ORDERS IN BLOCKED DENSE BASIC LINEAR ALGEBRA SUBROUTINES
An example includes a sequence generator to generate a plurality of sequence pairs, a first one of the sequence pairs including: (i) a first input sequence representing first accesses to first tensors in a first loop nest of a first computer program, and (ii) a first output sequence representing a first tuned loop nest corresponding to the first accesses to the first tensors in the first loop nest; a model trainer to train a recurrent neural network based on the sequence pairs as training data, the recurrent neural network to be trained to tune loop ordering of a second computer program based on a second input sequence representing second accesses to a second tensor in a second loop nest of the second computer program; and a memory interface to store, in memory, a trained model corresponding to the recurrent neural network.
General purpose distributed data parallel computing using a high level language
General-purpose distributed data-parallel computing using a high-level language is disclosed. Data parallel portions of a sequential program that is written by a developer in a high-level language are automatically translated into a distributed execution plan. The distributed execution plan is then executed on large compute clusters. Thus, the developer is allowed to write the program using familiar programming constructs in the high level language. Moreover, developers without experience with distributed compute systems are able to take advantage of such systems.
APPARATUS AND METHOD FOR EFFICIENTLY ACCESSING MEMORY WHEN PERFORMING A HORIZONTAL DATA REDUCTION
Apparatus and method for optimizing shader execution. For example, one embodiment of a graphics processing apparatus comprises: a plurality of execution units to execute shader programs; optimization detection circuitry and/or logic to identify one or more portions of shader program code to be optimized including one or more reduction operations which require read/write memory operations and associated barrier operations; and optimization circuitry and/or logic to optimize the shader program code by converting a plurality of the read/write memory operations to read/write register operations and removing one or more barrier operations to generate optimized shader program code; the execution units to execute the optimized shader program code.
TECHNIQUES FOR TRANSFORMING SERIAL PROGRAM CODE INTO KERNELS FOR EXECUTION ON A PARALLEL PROCESSOR
A compiler generates an accelerated version of a serial computer program that can be executed on a parallel processor. The compiler analyzes the serial computer program and generates a graph of nodes connected by edges. Each node corresponds to an operation or value set forth in the serial computer program. Each incoming edge corresponds to an operand that is specified or generated in the serial computer program. The compiler partitions the graph of nodes into two different types of partitions; a first type of partition includes one or more nodes that correspond to one or more pointwise operations, and a second type of partition includes one node that corresponds to one operation that is performed efficiently via a library. For each partition, the compiler configures a sequence of kernels that can be executed on the parallel processor to perform the operations associated with the computer program in an accelerated fashion.
Apparatus and method for efficiently accessing memory when performing a horizontal data reduction
Apparatus and method for optimizing shader execution. For example, one embodiment of a graphics processing apparatus comprises: a plurality of execution units to execute shader programs; optimization detection circuitry and/or logic to identify one or more portions of shader program code to be optimized including one or more reduction operations which require read/write memory operations and associated barrier operations; and optimization circuitry and/or logic to optimize the shader program code by converting a plurality of the read/write memory operations to read/write register operations and removing one or more barrier operations to generate optimized shader program code; the execution units to execute the optimized shader program code.