Patent classifications
G06F9/30069
Systolic multiply delayed accumulate processor architecture
Systems and methods are provided to perform multiplication-delayed-addition operations in a systolic array to increase clock speeds, reduce circuit area, and/or reduce dynamic power consumption. Each processing element in the systolic array can have a pipeline configured to perform a multiplication during a first systolic interval and to perform an accumulation during a second systolic interval. The multiplication result from the first systolic interval can be stored in a delay register for use by the accumulator during the second systolic interval. A skip detection circuit can be used to skip one or more of the multiplication, storing in the delay register, and the addition during skip conditions for improved energy efficiency.
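The two-interval pipeline described above can be sketched in software as follows. This is only an illustrative model, not the claimed circuit: the `ProcessingElement` class, its field names, and the `run` helper are hypothetical, and the skip condition is assumed here to be "either operand is zero", which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessingElement:
    """Hypothetical model of one systolic cell with a multiply / delayed-add pipeline."""
    weight: float
    delay_register: Optional[float] = None  # product held between systolic intervals
    accumulator: float = 0.0
    skipped: int = 0  # multiplications skipped so far

    def step(self, activation: Optional[float]) -> None:
        """Advance one systolic interval; None means no new input arrives."""
        # Second stage: accumulate the product stored during the previous interval.
        if self.delay_register is not None:
            self.accumulator += self.delay_register
        # First stage: multiply the new input, or skip when the product would be zero
        # (assumed skip condition). A skip leaves nothing to store or add later.
        if activation is None or activation == 0 or self.weight == 0:
            if activation is not None:
                self.skipped += 1
            self.delay_register = None
        else:
            self.delay_register = activation * self.weight


def run(pe: ProcessingElement, activations) -> float:
    """Feed all activations, then run one extra interval to drain the delay register."""
    for a in activations:
        pe.step(a)
    pe.step(None)
    return pe.accumulator
```

With a weight of 2.0 and inputs 1.0, 0.0, 3.0, the zero input is skipped and each remaining product is added one interval after it is computed.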
Optimized trampoline design for fast software tracing
Tracing computer software program execution includes copying a software instruction at an instrumentation point within an original instruction stream, and replacing the software instruction with a jump instruction. The jump instruction branches to a multi-level trampoline that includes at least a first-level trampoline specific to an associated software tracing probe, and a second-level trampoline generic to plural software tracing probes. The first-level trampoline preserves partial CPU state and branches to the second-level trampoline, passing it software tracing probe identifying information. The second-level trampoline preserves a remainder of the CPU state, implements software tracing operations in accordance with the software tracing probe, restores the CPU state that it previously preserved, and returns program control to the first-level trampoline. Either the first-level or second-level trampoline may execute or emulate the original instruction. The first-level trampoline restores the CPU state that it previously preserved, and returns program control to the original instruction stream.
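The control flow of the two-level trampoline can be imitated in a few lines of Python. This is a loose simulation under stated assumptions, not real binary instrumentation: the CPU state is modeled as a dictionary, the split of registers between the two levels (`r0`/`r1` versus `r2`/`r3`) is arbitrary, and every name (`second_level`, `make_first_level`, `handlers`) is hypothetical. Here the first level is assumed to execute the displaced instruction.

```python
def second_level(probe_id, cpu, handlers, log):
    """Generic trampoline shared by all probes."""
    saved = {reg: cpu[reg] for reg in ("r2", "r3")}  # preserve the remainder of the state
    handlers[probe_id](cpu, log)                     # tracing work; may clobber registers
    cpu.update(saved)                                # restore what this level saved


def make_first_level(probe_id, original_instruction, handlers, log):
    """Build the probe-specific trampoline that replaces one instrumented instruction."""
    def trampoline(cpu):
        saved = {reg: cpu[reg] for reg in ("r0", "r1")}  # preserve partial state
        second_level(probe_id, cpu, handlers, log)       # pass the probe identity along
        cpu.update(saved)                                # restore partial state
        original_instruction(cpu)                        # run the displaced instruction
    return trampoline


def noisy_handler(cpu, log):
    """Example probe handler that records a hit and clobbers every register."""
    log.append("probe-A")
    for reg in cpu:
        cpu[reg] = 0


def increment_r0(cpu):
    """Stand-in for the original instruction copied out of the instruction stream."""
    cpu["r0"] += 1
```

Keeping the first level small is the point of the split: only the probe identity and a partial save are duplicated per probe, while the bulk of the save, trace, and restore logic exists once.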
SYSTEMS AND METHODS TO SKIP INCONSEQUENTIAL MATRIX OPERATIONS
Disclosed embodiments relate to systems and methods to skip inconsequential matrix operations. In one example, a processor includes decode circuitry to decode an instruction having fields to specify an opcode and locations of first source, second source, and destination matrices, the opcode indicating that the processor is to multiply each element at row M and column K of the first source matrix with a corresponding element at row K and column N of the second source matrix, and accumulate a resulting product with previous contents of a corresponding element at row M and column N of the destination matrix, the processor to skip multiplications that, based on detected values of corresponding multiplicands, would generate inconsequential results; scheduling circuitry to schedule execution of the instruction; and execution circuitry to execute the instruction as per the opcode.
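The multiply-accumulate semantics described above can be illustrated with a short software sketch. The function name `matmul_accumulate_skip` is hypothetical, matrices are plain nested lists, and "inconsequential" is assumed here to mean a product with a zero multiplicand; the abstract leaves the exact criterion open.

```python
def matmul_accumulate_skip(a, b, c):
    """Compute C[m][n] += A[m][k] * B[k][n] in place, skipping zero-operand products.

    Returns the number of multiplications that were skipped.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    skipped = 0
    for m in range(rows):
        for k in range(inner):
            a_mk = a[m][k]
            for n in range(cols):
                b_kn = b[k][n]
                # Assumed skip rule: a zero multiplicand cannot change C[m][n].
                if a_mk == 0 or b_kn == 0:
                    skipped += 1
                    continue
                c[m][n] += a_mk * b_kn
    return skipped
```

The result is identical to an ordinary multiply-accumulate; only the amount of arithmetic performed changes.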
Structured Weight Based Sparsity In An Artificial Neural Network
A system and method are provided for improving power performance and lowering memory requirements of an artificial neural network by packing memory using several structured sparsity mechanisms. The invention applies to neural network (NN) processing engines adapted to implement mechanisms to search for structured sparsity in weights and activations, resulting in considerably reduced memory usage. The sparsity guided training mechanism synthesizes and generates structured sparsity weights. A compiler mechanism within a software development kit (SDK) manipulates structured weight domain sparsity to generate a sparse set of static weights for the NN. The structured sparsity static weights are loaded into the NN after compilation and utilized by both the structured weight domain sparsity mechanism and the structured activation domain sparsity mechanism. The application of structured sparsity lowers the span of search options and creates a relatively loose coupling between the data and control planes.
SKIP-OVER OFFSET BRANCH PREDICTION
A system includes a branch predictor and a processing circuit configured to perform a plurality of operations including storing a skip-over offset value in the branch predictor. The skip-over offset value defines a number of search addresses of the branch predictor to be skipped. The operations further include searching the branch predictor for a branch prediction. Responsive to finding the branch prediction, the searching of the branch predictor is re-indexed based on the skip-over offset value associated with the branch prediction.
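One possible reading of the re-indexing step is sketched below. The abstract does not say how the offset is applied, so this model makes several assumptions: the predictor is a dictionary keyed by search address, the search otherwise advances one address at a time, and a hit redirects the search to the predicted target plus the stored skip-over offset. The `Prediction` class and `search` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    target: int            # predicted target search address
    skip_over_offset: int  # number of search addresses to skip at the target


def search(predictor, start, max_lookups):
    """Return the search addresses that are actually looked up."""
    visited = []
    address = start
    for _ in range(max_lookups):
        visited.append(address)
        hit = predictor.get(address)
        if hit is None:
            address += 1  # no prediction: try the next sequential search address
        else:
            # Assumed re-indexing rule: jump to the target, then skip ahead by the offset.
            address = hit.target + hit.skip_over_offset
    return visited
```

For a predictor holding one entry at address 2 with target 10 and offset 3, a search from address 0 looks up 0, 1, 2 and then resumes at 13, never spending lookups on addresses 10 through 12.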
HYBRID AND EFFICIENT APPROACH TO ACCELERATE COMPLICATED LOOPS ON COARSE-GRAINED RECONFIGURABLE ARRAYS (CGRA) ACCELERATORS
A coarse-grained reconfigurable array includes a processing element array, instruction memory circuitry, data memory circuitry, and an instruction fetch unit. The processing element array includes a number of processing elements. The instruction memory circuitry is coupled to the processing element array and configured to store a set of instructions. During each one of a number of processing cycles, the instruction memory circuitry provides instructions from the set of instructions to the processing elements. The instruction fetch unit is coupled to the processing element array and the instruction memory circuitry and configured to receive a result of a conditional instruction evaluated by one of the processing elements and provide instruction fetch signals based at least in part on the result of the conditional instruction, such that only instructions associated with a correct branch of the conditional instruction are provided to the processing elements.
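The effect of the instruction fetch unit can be illustrated with a small selection function. This is a simplified sketch, not the disclosed hardware: instructions are assumed to be stored as `(branch, operation)` pairs, where the tag is `None` for unconditional instructions and `"then"` or `"else"` for the two sides of a conditional. The tagging scheme and the `fetch` name are hypothetical.

```python
def fetch(instructions, condition_result):
    """Select the operations to issue, given the evaluated conditional result.

    Only unconditional instructions and those on the correct branch are returned.
    """
    taken = "then" if condition_result else "else"
    return [op for branch, op in instructions if branch is None or branch == taken]


# Example instruction memory contents for one loop body with a conditional.
PROGRAM = [
    (None, "load"),
    ("then", "add"),
    ("else", "sub"),
    (None, "store"),
]
```

Calling `fetch(PROGRAM, True)` yields only `load`, `add`, and `store`, so the processing elements never receive the instruction from the wrong branch.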
Optimized Trampoline Design For Fast Software Tracing
Tracing computer software program execution includes copying a software instruction at an instrumentation point within an original instruction stream, and replacing the software instruction with a jump instruction. The jump instruction branches to a multi-level trampoline that includes at least a first-level trampoline specific to an associated software tracing probe, and a second-level trampoline generic to plural software tracing probes. The first-level trampoline preserves partial CPU state and branches to the second-level trampoline, passing it software tracing probe identifying information. The second-level trampoline preserves a remainder of the CPU state, implements software tracing operations in accordance with the software tracing probe, restores the CPU state that it previously preserved, and returns program control to the first-level trampoline. Either the first-level or second-level trampoline may execute or emulate the original instruction. The first-level trampoline restores the CPU state that it previously preserved, and returns program control to the original instruction stream.
Dependency skipping in a load-compare-jump sequence of instructions by incorporating compare functionality into the jump instruction and auto-finishing the compare instruction
A method of performing instructions in a computer processor architecture includes determining that a load instruction is being dispatched. Destination related data of the load instruction is written into a mapper of the architecture. A determination that a compare immediate instruction is being dispatched is made. A determination that a branch conditional instruction is being dispatched is made. The branch conditional instruction is configured to wait until the load instruction produces a result before the branch conditional instruction issues and executes. The branch conditional instruction skips waiting for a finish of the compare immediate instruction.
Method performed by a microcontroller for managing a NOP instruction and corresponding microcontroller
Disclosed herein is a method for managing NOP instructions in a microcontroller, the method comprising duplicating all jump instructions causing a NOP instruction to form a new instruction set; inserting an internal NOP instruction into each of the jump instructions; when a jump instruction is executed, executing a subsequent instruction of the new instruction set; and executing the internal NOP instruction when an execution of the subsequent instruction is skipped.
Systems and methods to skip inconsequential matrix operations
Disclosed embodiments relate to systems and methods to skip inconsequential matrix operations. In one example, a processor includes decode circuitry to decode an instruction having fields to specify an opcode and locations of first source, second source, and destination matrices, the opcode indicating that the processor is to multiply each element at row M and column K of the first source matrix with a corresponding element at row K and column N of the second source matrix, and accumulate a resulting product with previous contents of a corresponding element at row M and column N of the destination matrix, the processor to skip multiplications that, based on detected values of corresponding multiplicands, would generate inconsequential results; scheduling circuitry to schedule execution of the instruction; and execution circuitry to execute the instruction as per the opcode.