Patent classifications
G06F9/30032
SYSTEMS AND METHODS FOR EXECUTING A PROGRAMMABLE FINITE STATE MACHINE THAT ACCELERATES FETCHLESS COMPUTATIONS AND OPERATIONS OF AN ARRAY OF PROCESSING CORES OF AN INTEGRATED CIRCUIT
Systems and methods for fetchless acceleration of convolutional loops on an integrated circuit include identifying, by a compiler, finite state machine (FSM) initialization parameters based on computational requirements of a computational loop; initializing a programmable FSM based on the FSM initialization parameters, wherein the FSM initialization parameters include a loop iteration parameter identifying a number of computation cycles of the computational loop; executing the programmable FSM to enable fetchless computations by: generating a plurality of computational loop control signals including a distinct computation loop control signal for each of the number of computation cycles of the computational loop based on the loop iteration parameter; and controlling an execution of a plurality of computation cycles of a computational circuit performing the computational loop based on transmitting the plurality of computational loop control signals until the number of computation cycles of the computation loop are completed.
Arithmetic unit for deep learning acceleration
Embodiments of a device include an integrated circuit, a reconfigurable stream switch formed in the integrated circuit, and an arithmetic unit coupled to the reconfigurable stream switch. The arithmetic unit has a plurality of inputs and at least one output, and the arithmetic unit is solely dedicated to performance of a plurality of parallel operations. Each one of the plurality of parallel operations carries out a portion of the formula: output=AX+BY+C.
Modular gated multiplier circuitry and multiplication technique
Various implementations described herein are related to a device having multiplier circuitry with an array of summation result cells that holds summation bit values for shifted arrays added together. The device may include latch circuitry having one or more gated elements disposed between the summation result cells, and the gated elements may be adapted to provide a portion of the summation bit values based on a gating signal.
Random number generator, random number generating circuit, and random number generating method
A random number generator, a random number generating circuit, and a random number generating method are provided. The random number generating circuit includes the random number generator and executes the random number generating method. The random number generator includes a shift register having N storage elements and a combinational logic circuit. The N storage elements receive a random seed in a static state and repetitively perform a bit shift operation in a plurality of clock cycles. The combinational logic circuit generates an output sequence based on the random seed and a random bitstream received from an external source.
System and method for convolving image with sparse kernels
An image processing system for convolving an image includes processing circuitry that is configured to retrieve the image including a set of rows, a merged kernel, multiple skip values and a pixel base address. The merged kernel includes all non-zero coefficients of a set of kernels. Each skip value corresponds to a location offset of each non-zero coefficient with respect to a previous non-zero coefficient. Further, the processing circuitry is configured to execute a multiply-accumulate (MAC) instruction and a load instruction parallelly in one clock cycle for multiple times, on the set of rows and the merged kernel to convolve the image with the merged kernel. Each row on which the MAC and load instructions are executed is associated with a corresponding non-zero coefficient and a corresponding skip value. The load instruction is executed based on the pixel base address, the corresponding skip value, and a width of each row.
Low latency matrix multiply unit
Methods, systems, and apparatus for a matrix multiply unit implemented as a systolic array of cells are disclosed. The matrix multiply unit may include cells arranged in columns of the systolic array. Two chains of weight shift registers per column of the systolic array are in the matrix multiply unit. Each weight shift register is connected to only one chain and each cell is connected to only one weight shift register. A weight matrix register per cell is configured to store a weight input received from a weight shift register. A multiply unit is coupled to the weight matrix register and configured to multiply the weight input of the weight matrix register with a vector data input in order to obtain a multiplication result.
PROCESSOR WITH TABLE LOOKUP UNIT
A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.
SYSTEM AND METHOD FOR PERFORMING COMPUTATIONS FOR DEEP NEURAL NETWORKS
A computation unit for performing a computation of a neural network layer is disclosed. A number of processing element (PE) units are arranged in an array. First input values are provided in parallel in an input dimension of the array during a first processing period, and a second input values are provided in parallel in the input dimension during a second processing period. Computations are performed by the PE units based on stored weight values. An adder coupled to the first set of PE units generates a first sum of results of the computations by the first set of PE units during the first processing cycle, and generates a second sum of results of the computations during the second processing cycle. A first accumulator coupled to the first adder stores the first sum, and further shifts the first sum to a second accumulator prior to storing the second sum.
Programmable slave circuit on a communication bus
A programmable slave circuit on a communication bus is provided. In a non-limiting example, the communication bus can be a radio frequency front-end (RFFE) bus operating based on a master-slave topology and the programmable slave circuit can be an RFFE slave circuit on the RFFE bus. The programmable slave circuit is configured to receive a high-level command(s) (e.g., a macro word) over the communication bus. A processing circuit in the programmable slave circuit is programmed to generate a low-level command(s) (e.g., a bitmap word) for controlling a coupled circuit(s) based on the high-level command(s). In this regard, it is possible to program or reprogram the processing circuit, for example via over-the-air (OTA) updates, based on the high-level command(s) to be supported, thus making it possible to flexibly customize the programmable slave circuit according to operating requirements and configurations.
LOOK-UP TABLE READ
A digital data processor includes an instruction memory storing instructions specifying data processing operations and a data operand field, an instruction decoder coupled to the instruction memory for recalling instructions from the instruction memory and determining the operation and the data operand, and an operational unit coupled to a data register file and an instruction decoder to perform an operation upon an operand corresponding to an instruction decoded by the instruction decoder and storing results of the operation. The operational unit is configured to perform a table recall in response to a look up table read instruction by recalling data elements from a specified location and adjacent location to the specified location, in a specified number of at least one table and storing the recalled data elements in successive slots in a destination register. Recalled data elements include at least one interpolated data element in the adjacent location.