Patent classifications
G06F9/383
Non-cached loads and stores in a system having a multi-threaded, self-scheduling processor
Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute instructions; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In a representative embodiment, the processor core is further adapted to execute a non-cached load instruction to designate a general purpose register rather than a data cache for storage of data received from a memory circuit. The core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, and to generate one or more work descriptor data packets to another circuit for execution of corresponding execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.
METHOD AND APPARATUS FOR VECTOR SORTING USING VECTOR PERMUTATION LOGIC
A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.
Streaming engine with multi dimensional circular addressing selectable at each dimension
A streaming engine employed in a digital data processor may specify a fixed read-only data stream defined by plural nested loops. An address generator produces address of data elements for the nested loops. A steam head register stores data elements next to be supplied to functional units for use as operands. A stream template register independently specifies a linear address or a circular address mode for each of the nested loops.
STREAM REFERENCE REGISTER WITH DOUBLE VECTOR AND DUAL SINGLE VECTOR OPERATING MODES
A streaming engine employed in a digital signal processor specifies a fixed read only data stream. Once fetched the data stream is stored in two head registers for presentation to functional units in the fixed order. Data use by the functional unit is preferably controlled using the input operand fields of the corresponding instruction. A first read only operand coding supplies data from the first head register. A first read/advance operand coding supplies data from the first head register and also advances the stream to the next sequential data elements. Corresponding second read only operand coding and second read/advance operand coding operate similarly with the second head register. A third read only operand coding supplies double width data from both head registers.
DATA PROCESSING APPARATUS HAVING STREAMING ENGINE WITH READ AND READ/ADVANCE OPERAND CODING
A streaming engine employed in a digital signal processor specified a fixed data stream. Once started the data stream is read only and cannot be written. Once fetched the data stream is stored in a first-in-first-out buffer for presentation to functional units in the fixed order. Data use by the functional unit is controlled using the input operand fields of the corresponding instruction. A read only operand coding supplies the data an input of the functional unit. A read/advance operand coding supplies the data and also advances the stream to the next sequential data elements. The read only operand coding permits reuse of data without requiring a register of the register file for temporary storage.
Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.
Streaming engine with early and late address and loop count registers to track architectural state
A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces address of data elements. A steam head register stores data elements next to be supplied to functional units for use as operands. The streaming engine stores an early address of next to be fetched data elements and a late address of a data element in the stream head register for each of the nested loops. The streaming engine stores an early loop counts of next to be fetched data elements and a late loop counts of a data element in the stream head register for each of the nested loops.
VARIABLE POSITION SHIFT FOR MATRIX PROCESSING
An apparatus has matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a 2D result matrix; operand storage circuitry to store information for forming the first and second input operands for the matrix processing circuitry; and position shifting circuitry to apply a variable position shift to vary which row/column of the result matrix is updated based on a given element of one of the first and second input operands stored in the operand storage circuitry during a given matrix processing operation. The variable position shift is based on one of a plurality of alternative shift amounts, each alternative hift amount corresponding to a position shift of the one of the first and second input operands relative to the result matrix by a different umber of rows/columns. This is useful for performing 2D convolution operations.
System, apparatus, and method for a transient load instruction within a VLIW operation
A transient load instruction for a processor may include a transient or temporary load instruction that is executed in parallel with a plurality of input operands. The temporary load instruction loads a memory value into a temporary location for use within the instruction packet. According to some examples, a VLIW based microprocessor architecture may include a temporary cache for use in writing/reading a temporary memory value during a single VLIW packet cycle. The temporary cache is different from the normal register bank that does not allow writing and then reading the value just written during the same VLIW packet cycle.
METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.