Patent classifications
G06F9/30141
Efficient inter-thread communication between hardware processing threads of a hardware multithreaded processor by selective aliasing of register blocks
A hardware multithreaded processor including a register file, a thread controller, and aliasing circuitry. The thread controller is configured to assign each of multiple hardware processing threads to a corresponding one of multiple register block sets in which each register block set includes at least two of multiple register blocks and in which each register block includes at least two registers. The aliasing circuitry is programmable to redirect a reference provided by a first hardware processing thread to a register of a register block assigned to a second hardware processing thread. The reference may be a register number in an instruction issued by the first hardware processing thread. The register number is converted by the aliasing circuitry to a register file address locating a register of the register block assigned to the second hardware processing thread. The aliasing circuitry may include a programmable register for one or more threads.
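The register-number-to-address conversion described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the block sizes, the class name AliasedRegisterFile, and the per-thread block map are hypothetical and are not taken from the patent.

```python
# Hypothetical sizes chosen for illustration only.
REGS_PER_BLOCK = 4      # each register block holds at least two registers
BLOCKS_PER_THREAD = 2   # each register block set holds at least two blocks


class AliasedRegisterFile:
    """Toy model of a register file with per-thread block aliasing."""

    def __init__(self, num_threads):
        self.regs = [0] * (num_threads * BLOCKS_PER_THREAD * REGS_PER_BLOCK)
        # Default mapping: thread t, local block b -> its own physical block.
        self.block_map = [
            [t * BLOCKS_PER_THREAD + b for b in range(BLOCKS_PER_THREAD)]
            for t in range(num_threads)
        ]

    def alias(self, thread, local_block, target_thread, target_block):
        # Program the (assumed) alias register: redirect one local block of
        # `thread` to a block assigned to `target_thread`.
        self.block_map[thread][local_block] = (
            target_thread * BLOCKS_PER_THREAD + target_block
        )

    def address(self, thread, reg_num):
        # Convert a register number from an instruction into a file address.
        local_block, offset = divmod(reg_num, REGS_PER_BLOCK)
        return self.block_map[thread][local_block] * REGS_PER_BLOCK + offset

    def write(self, thread, reg_num, value):
        self.regs[self.address(thread, reg_num)] = value

    def read(self, thread, reg_num):
        return self.regs[self.address(thread, reg_num)]
```

In this model, after rf.alias(0, 1, 1, 0), a write by thread 0 to its register 5 lands in the register that thread 1 reads as register 1, so the two threads communicate without a copy.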
SYSTEM AND METHOD OF ROTATING VECTOR INPUT
A device includes a processor that includes a rotation vector register file, a second vector register file, and multiply-accumulate circuitry (MAC). The rotation vector register file includes a rotation vector register. The rotation vector register file is configured to rotate data in the rotation vector register. The second vector register file includes a source vector register. The MAC is configured to receive first input data from the rotation vector register file and second input data from the source vector register.
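One way to picture the data flow is a loop in which the MAC multiplies the rotation register by the source register lane by lane, accumulates, and then the rotation register is rotated by one element. The sketch below assumes a rotate-by-one step and lane-wise accumulation; the function names are hypothetical and the abstract does not specify either detail.

```python
def rotate(vec, k=1):
    """Rotate a list left by k positions (models the rotation register)."""
    if not vec:
        return vec
    k %= len(vec)
    return vec[k:] + vec[:k]


def rotating_mac(rot_reg, src_reg, steps):
    """Accumulate rot_reg[i] * src_reg[i] per lane, rotating between steps."""
    acc = [0] * len(rot_reg)
    for _ in range(steps):
        # MAC: first input from the rotation register, second from the source.
        acc = [a + r * s for a, r, s in zip(acc, rot_reg, src_reg)]
        # Rotate the data held in the rotation vector register.
        rot_reg = rotate(rot_reg)
    return acc
```

With rot_reg = [1, 2, 3, 4] and src_reg = [1, 0, 0, 0], four steps bring every element of the rotation register past lane 0, so that lane accumulates their sum.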
Vector Gather with a Narrow Datapath
Systems and methods are disclosed for vector gather with a narrow datapath. For example, some methods may include reading b bits of a vector of indices into a first operand buffer; reading b bits of the vector of source data into a second operand buffer, including an element indexed by a first index stored in the first operand buffer; checking whether other indices stored in the first operand buffer point to elements of the vector of source data stored in the second operand buffer; during a single clock cycle, copying a plurality of elements stored in the second operand buffer that are pointed to by indices stored in the first operand buffer to a third operand buffer; and updating flags in a completion flags buffer corresponding to those indices to indicate that handling of those indices has completed.
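The steps listed above can be sketched in software as follows. The sketch assumes the datapath width is expressed as a number of elements rather than bits, that all indices are in range, and that the function name narrow_gather and the returned read count are illustrative additions, not part of the disclosure.

```python
def narrow_gather(source, indices, width):
    """Gather source[indices[i]] using buffers that hold `width` elements."""
    n = len(indices)
    result = [None] * n          # stands in for the third operand buffer
    done = [False] * n           # completion flags buffer
    source_reads = 0             # number of source chunks fetched

    for base in range(0, n, width):
        idx_buf = indices[base:base + width]        # first operand buffer
        for k in range(len(idx_buf)):
            if done[base + k]:
                continue
            # Fetch the chunk of source data containing the pending element.
            chunk_start = (idx_buf[k] // width) * width
            src_buf = source[chunk_start:chunk_start + width]  # second buffer
            source_reads += 1
            # Serve every pending index that points into this chunk at once
            # (the "single clock cycle" copy in the abstract).
            for j, idx in enumerate(idx_buf):
                in_chunk = chunk_start <= idx < chunk_start + len(src_buf)
                if not done[base + j] and in_chunk:
                    result[base + j] = src_buf[idx - chunk_start]
                    done[base + j] = True
    return result, source_reads
```

For source values 10 through 17, indices [0, 5, 1, 4], and a width of 4, the four elements are gathered with two source reads instead of four, because indices 0 and 1 share one chunk and indices 5 and 4 share another.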
EXECUTION ELISION OF INTERMEDIATE INSTRUCTION BY PROCESSOR
A method for operation of a processor core is provided. First instruction data, taken from a first instruction, is consulted to determine whether a second instruction has execution data that matches the first instruction data. In response to determining that the second instruction has execution data that matches the first instruction data, prior data on which the first instruction depends is copied into the second instruction. After receiving an availability indication of the prior data, both the first instruction and the second instruction are woken for execution, without requiring execution of the first instruction before waking of the second instruction. The second instruction is executed using the prior data, thereby skipping over the first instruction. A computer system and a processor core configured to operate according to the method are also disclosed herein.
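A loose software analogy is shown below. It assumes, purely for illustration, that the first instruction is a register move and that "matching" means the second instruction reads the register the move writes; the abstract does not state either, and the Instr type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str
    srcs: list


def elide(first, second):
    """If `second` reads what `first` (assumed a move) writes, point it at
    the prior data that `first` itself depends on."""
    if first.dest in second.srcs:
        prior = first.srcs[0]
        second.srcs = [prior if s == first.dest else s for s in second.srcs]
    return second


def ready(instr, available):
    """An instruction can wake once all of its sources are available."""
    return all(s in available for s in instr.srcs)
```

After elide(mov, add) with mov writing r2 from r1 and add reading r2 and r4, the add depends on r1 directly, so both instructions become ready as soon as r1 is available and the add no longer waits for the move to execute.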
POWER EFFICIENT MULTI-BIT STORAGE SYSTEM
Disclosed herein are embodiments related to a power efficient multi-bit storage system. In one configuration, the multi-bit storage system includes a first storage circuit, a second storage circuit, a prediction circuit, and a clock gating circuit. In one aspect, the first storage circuit updates a first output bit according to a first input bit, in response to a trigger signal, and the second storage circuit updates a second output bit according to a second input bit, in response to the trigger signal. In one aspect, the prediction circuit generates a trigger enable signal indicating whether at least one of the first output bit or the second output bit is predicted to change a state. In one aspect, the clock gating circuit generates the trigger signal based on the trigger enable signal.
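The gating behavior can be modeled in a few lines. The sketch assumes the prediction is a simple comparison of each input bit with the currently stored output bit, which is only one possible realization; the class name and the trigger counter are hypothetical.

```python
class GatedMultiBitStorage:
    """Toy model of storage circuits that share one gated trigger signal."""

    def __init__(self, width):
        self.q = [0] * width       # output bits of the storage circuits
        self.trigger_count = 0     # how many times the trigger actually fired

    def clock_edge(self, d):
        # Prediction circuit (assumed): enable only if some bit would change.
        trigger_enable = any(di != qi for di, qi in zip(d, self.q))
        # Clock gating circuit: generate the trigger only when enabled.
        if trigger_enable:
            self.q = list(d)
            self.trigger_count += 1
        return trigger_enable
```

When the inputs repeat the stored values, no trigger is generated, which is where the power saving in such a scheme would come from.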
SPECIAL PURPOSE NEURAL NETWORK TRAINING CHIP
Methods, systems, and apparatus including a special-purpose hardware chip for training neural networks are described. The special-purpose hardware chip may include a scalar processor configured to control computational operation of the special-purpose hardware chip. The chip may also include a vector processor configured to have a 2-dimensional array of vector processing units which all execute the same instruction in a single-instruction, multiple-data (SIMD) manner and communicate with each other through load and store instructions of the vector processor. The chip may additionally include a matrix multiply unit that is coupled to the vector processor and configured to multiply at least one two-dimensional matrix with a second one-dimensional vector or two-dimensional matrix in order to obtain a multiplication result.
Microprocessor including an efficiency logic unit
An example design structure tangibly embodied in a machine readable medium includes a first arithmetic logic unit (ALU) to perform fixed point instructions using at least two general registers to read data from a first and second general register of a plurality of general registers and write a result in at least a third general register of the plurality of general registers. The design structure includes a second ALU to perform non-updating fixed point instructions using at least two general registers to only read data from the general registers. The design structure includes an efficiency logic unit coupled to the first ALU and the second ALU. The efficiency logic unit is to receive an instruction and determine whether the received instruction is an updating fixed point instruction or a non-updating fixed point instruction based on a number of general registers to be used to execute the received instruction.
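The routing decision can be sketched as a simple classifier. The threshold used here (three or more general registers means two sources plus a destination, hence an updating instruction) is an assumption made for illustration, as are the function names and the tuple encoding of an instruction.

```python
def classify(num_general_registers):
    """Classify an instruction by how many general registers it uses."""
    # Assumption: three or more registers implies a written destination.
    if num_general_registers >= 3:
        return "updating"
    return "non-updating"


def dispatch(instruction):
    """Route an (opcode, registers) pair to one of the two ALUs."""
    _opcode, registers = instruction
    if classify(len(registers)) == "updating":
        return "first ALU"    # reads two registers and writes a third
    return "second ALU"       # only reads general registers
```

Under this assumption, an add with three register operands goes to the first ALU, while a two-operand compare goes to the second.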
CIRCUIT FOR VERIFYING THE CONTENT OF REGISTERS
In accordance with an embodiment, a method verifies contents of a plurality of registers having two first registers, where each of the plurality of registers is configured to store a data word and a verification bit. The method includes determining whether a value of the verification bit of each respective register of the plurality of registers corresponds to the data word of its respective register. The data words stored in the two first registers are selected so that the bits of a same rank of the two first registers include two complementary bits, each bit of a common binary word is associated with a respective register of the plurality of registers, and the value of the verification bit of each respective register depends on the data word of the respective register and the bit of the common binary word associated with the respective register.
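The check can be illustrated with a small model. The abstract says only that the verification bit depends on the data word and on the associated bit of the common binary word; the sketch below assumes that dependence is the parity of the data word XORed with that bit, and all names are hypothetical.

```python
def parity(word):
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


def verification_bit(word, common_bit):
    # Assumed rule: parity of the data word XOR the associated common bit.
    return parity(word) ^ common_bit


def verify(registers, common_word):
    """registers is a list of (data_word, verification_bit) pairs; bit i of
    common_word is the bit associated with register i."""
    return all(
        vbit == verification_bit(word, (common_word >> i) & 1)
        for i, (word, vbit) in enumerate(registers)
    )


def complementary(word_a, word_b, width):
    """True if the two words differ in every bit position of the given width."""
    return (word_a ^ word_b) == (1 << width) - 1
```

Choosing the two first registers as bitwise complements, for example 0b10100101 and 0b01011010 for an 8-bit word, gives each bit rank one 0 and one 1, as the abstract requires.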
Loading operands and outputting results from a multi-dimensional array using only a single side
A computational array is implemented in which all operands and results are loaded or output from a single side of the array. The computational array comprises a plurality of cells arranged in n rows and m columns, each configured to produce a processed value based upon a weight value and an activation value. The cells receive weight and activation values via colinear weight and activation transmission channels that each extend across a first side edge of the computational array to provide weight values and activation values to the cells of the array. In addition, result values produced at a top cell of each of the m columns of the array are routed through the array to be output from the same first side edge of the array at a same relative timing at which the result values were produced.
COMPUTE TIME POINT PROCESSOR ARRAY FOR SOLVING PARTIAL DIFFERENTIAL EQUATIONS
Embodiments relate to a system for solving partial differential equations. The system receives a problem to be solved comprising a partial differential equation and a domain. A solver stores a plurality of nodes of the domain corresponding to a first time-step, and processes the nodes over a plurality of time-steps using an array of point processors. Each point processor comprises an ALU and a register file, and is configured to receive data corresponding to a respective node of a domain and generate a value for the node for a next time step, based upon instructions received over time via an instruction stream. Because all the data and computational requirements of the point processors are determined at compile time, the point processors do not need to perform any dynamic scheduling, allowing for a greater proportion of on-chip area to be allocated towards useful computation.
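As a concrete illustration of statically scheduled per-node updates, the sketch below advances a 1D heat equation with an explicit finite-difference stencil. The choice of equation, the fixed boundary values, the coefficient alpha, and the function names are all assumptions made for the example; the abstract does not name a particular PDE or numerical scheme.

```python
def next_value(left, center, right, alpha):
    """What one point processor computes for its node at the next time step."""
    return center + alpha * (left - 2.0 * center + right)


def advance(nodes, alpha, steps):
    """Advance all nodes by `steps` time steps; end nodes are held fixed."""
    nodes = list(nodes)
    if len(nodes) < 3:
        return nodes  # no interior nodes to update
    for _ in range(steps):
        # Every interior node runs the same fixed computation each step, so
        # the order of operations is known ahead of time (no dynamic
        # scheduling is needed).
        interior = [
            next_value(nodes[i - 1], nodes[i], nodes[i + 1], alpha)
            for i in range(1, len(nodes) - 1)
        ]
        nodes = [nodes[0]] + interior + [nodes[-1]]
    return nodes
```

Each call to next_value uses only a node and its two neighbors from the previous time step, which is why the data movement and arithmetic for every step can be fixed at compile time.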