G06F9/3013

Circuits and methods for vector sorting in a microprocessor
11593106 · 2023-02-28

Vector sort circuits can be used to accelerate sorting operations in a vector processor. When a new data element is received, the vector sort circuit can read multiple existing data elements from a vector-sort database in parallel, compare metrics of the existing data elements to a metric of the new data element, and output updated data elements to the vector-sort database based on the metrics. Depending on implementation, the vector-sort database can be maintained in sorted order, or the data elements can have assigned ranks indicating the sort order, in which case the elements need not be stored in sorted order. A vector sort circuit can be incorporated into a vector sort functional unit of a microprocessor, and the instruction set of the microprocessor can include instructions that are executed by the vector sort functional unit using the vector sort circuit.
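
The rank-based variant described above can be illustrated in software. The following is a minimal sketch, not the patented circuit: the function name `insert_ranked`, the `(metric, rank)` pair layout, the ascending order, and the `capacity` limit are all assumptions made for illustration. The comparisons that hardware would perform in parallel are written here as a sequential pass.

```python
def insert_ranked(db, new_metric, capacity):
    """Insert a metric into a rank-tagged database (hypothetical layout).

    db is a list of (metric, rank) pairs in arbitrary storage order;
    rank 0 marks the smallest metric. Returns the updated list.
    """
    # Compare every stored metric against the new one. A hardware
    # circuit could do all of these comparisons in parallel.
    new_rank = sum(1 for metric, _ in db if metric <= new_metric)

    # Elements that sort after the new element move down by one rank;
    # their storage positions do not change.
    updated = [(m, r + 1 if m > new_metric else r) for m, r in db]
    updated.append((new_metric, new_rank))

    # Assumption: a full database drops the element ranked past the end.
    return [(m, r) for m, r in updated if r < capacity]


db = []
for metric in [5, 3, 4, 1]:
    db = insert_ranked(db, metric, capacity=3)
print(db)
```

Reading the elements back in rank order yields the sorted sequence even though the stored order reflects arrival order.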

Streaming engine with multi dimensional circular addressing selectable at each dimension
11709779 · 2023-07-25

A streaming engine employed in a digital data processor may specify a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements for the nested loops. A stream head register stores data elements next to be supplied to functional units for use as operands. A stream template register independently specifies a linear address or a circular address mode for each of the nested loops.
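
A software model can make the per-loop addressing mode concrete. The sketch below is an assumption-laden illustration, not the engine's actual address arithmetic: the function `stream_addresses`, its parameters, and the rule that circular-mode offsets wrap modulo a single `block_size` while linear-mode offsets do not are all hypothetical.

```python
from itertools import product


def stream_addresses(base, counts, strides, circular, block_size):
    """Return element addresses for nested loops (index 0 = innermost).

    circular[d] selects circular (True) or linear (False) addressing
    for loop d. Assumption: circular offsets wrap modulo block_size.
    """
    addresses = []
    # product() varies its last argument fastest, so pass the loops
    # outermost first and reverse each index tuple afterwards.
    for idx in product(*(range(c) for c in reversed(counts))):
        linear_offset = 0
        circular_offset = 0
        for i, stride, is_circular in zip(reversed(idx), strides, circular):
            if is_circular:
                circular_offset += i * stride
            else:
                linear_offset += i * stride
        addresses.append(base + linear_offset + (circular_offset % block_size))
    return addresses


# Inner loop: 4 elements, stride 1, circular within a 3-element block.
# Outer loop: 2 iterations, stride 2, linear.
print(stream_addresses(0, [4, 2], [1, 2], [True, False], 3))
```

With these inputs the inner loop wraps back to the block start on its fourth element, while the outer loop simply shifts the whole pattern by its stride.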

REGISTER PRESSURE TARGET FUNCTION SPLITTING
20230029183 · 2023-01-26

Provided are embodiments for a method of performing register pressure targeted function splitting. The method can include determining a candidate region of a function, the candidate region comprising variables, and determining a number of available registers in a computing system for allocating the variables of the function. The method can also include grouping the variables in the candidate region into first variables and second variables based at least in part on the number of available registers, and splitting the candidate region of the function into split functions based at least in part on the grouping of the variables. Also provided are embodiments for a computer program product and a system for performing register pressure targeted function splitting.

Vector computational unit receiving data elements in parallel from a last row of a computational array

A microprocessor system comprises a vector computational unit and a control unit. The vector computational unit includes a plurality of processing elements. The control unit is configured to provide at least a single processor instruction to the vector computational unit. The single processor instruction specifies a plurality of component instructions to be executed by the vector computational unit in response to the single processor instruction and each of the plurality of processing elements of the vector computational unit is configured to process different data elements in parallel with other processing elements in response to the single processor instruction.

Marking current context data to control a context-data-dependent processing operation to save current or default context data to a data location

A data processing system includes processing circuitry for executing context-data-dependent program instructions which are decoded by decoder circuitry. Such context-data-dependent program instructions perform processing which is dependent upon currently existing context data. As an example, the context-data-dependent program instructions may be floating point instructions and the context data may be rounding mode information. The decoder circuitry supports a context save instruction which saves context data when it is marked as having been used and saves default context data when the current context data is marked as not having been used. The decoder circuitry further supports a context restore instruction which restores context data when the current context data is marked as having been used and permits the current context data to continue for future use when it is marked as currently unused.
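
The save and restore behaviour described above can be sketched as follows. This is a simplified model under stated assumptions: the `FpContext` class, the `used` flag, the list used as the save location, and the default rounding mode name are hypothetical, and the abstract does not say how the marking itself is set or cleared, so the sketch leaves it untouched.

```python
from dataclasses import dataclass

DEFAULT_ROUNDING = "nearest-even"  # assumed default context data


@dataclass
class FpContext:
    rounding: str = DEFAULT_ROUNDING
    used: bool = False  # marks whether the current context has been used


def context_save(ctx, location):
    """Save the current context if marked used, else the default context."""
    location.append(ctx.rounding if ctx.used else DEFAULT_ROUNDING)


def context_restore(ctx, location):
    """Restore saved context if the current one is marked used.

    If the current context is marked unused, it is left in place so it
    can continue to be available for future use.
    """
    saved = location.pop()
    if ctx.used:
        ctx.rounding = saved


ctx = FpContext(rounding="toward-zero", used=True)
location = []
context_save(ctx, location)
ctx.rounding = "upward"          # e.g. changed by an intervening routine
context_restore(ctx, location)
print(ctx.rounding)
```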

VECTOR BIT TRANSPOSE

A method to transpose source data in a processor in response to a vector bit transpose instruction includes specifying, in respective fields of the vector bit transpose instruction, a source register containing the source data and a destination register to store transposed data. The method also includes executing the vector bit transpose instruction by interpreting N×N bits of the source data as a two-dimensional array having N rows and N columns, creating transposed source data by transposing the bits by reversing a row index and a column index for each bit, and storing the transposed source data in the destination register.
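
The transpose step can be modelled in a few lines. The sketch below assumes a row-major bit layout in which bit 0 of the source is row 0, column 0; that layout, and the name `bit_transpose`, are illustrative choices rather than details taken from the abstract.

```python
def bit_transpose(src, n):
    """Transpose the low n*n bits of src, viewed as an n-by-n bit matrix.

    Assumed layout: the bit at (row, col) sits at position row * n + col.
    """
    dst = 0
    for row in range(n):
        for col in range(n):
            bit = (src >> (row * n + col)) & 1
            # Swap the row and column indices when placing the bit.
            dst |= bit << (col * n + row)
    return dst


# Row 0 fully set becomes column 0 fully set in an 8x8 matrix.
print(hex(bit_transpose(0xFF, 8)))
```

Applying the function twice returns the original value, which is a convenient sanity check for any layout choice.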

System and method to control the number of active vector lanes in a processor

In one disclosed embodiment, a processor includes a first execution unit and a second execution unit, a register file, and a data path including a plurality of lanes. The data path and the register file are arranged so that writing to the register file by the first execution unit and by the second execution unit is allowed over the data path, reading from the register file by the first execution unit is allowed over the data path, and reading from the register file by the second execution unit is not allowed over the data path. The processor also includes a power control circuit configured to, when a transfer of data between the register file and either of the first and second execution units uses less than all of the lanes, power down the lanes of the data path not used for the transfer of the data.
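
The power control decision can be approximated as a per-lane enable mask. In this rough sketch the function name `lane_enables`, the fixed lane width, and the assumption that a transfer occupies the lowest-numbered lanes are hypothetical; the abstract only states that unused lanes are powered down.

```python
def lane_enables(transfer_bits, lane_bits, num_lanes):
    """Return one enable flag per lane; False means the lane is powered down.

    Assumption: a transfer fills lanes starting from lane 0.
    """
    needed = -(-transfer_bits // lane_bits)  # ceiling division
    if needed > num_lanes:
        raise ValueError("transfer is wider than the data path")
    return [lane < needed for lane in range(num_lanes)]


# A 128-bit transfer on an 8-lane data path with 64-bit lanes.
print(lane_enables(128, 64, 8))
```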

Program event recording storage alteration processing for a neural network accelerator instruction

Instruction processing is performed for an instruction. The instruction is configured to perform a plurality of functions, in which a function of the plurality of functions is to be performed in a plurality of processing phases. A processing phase is defined to store up to a select amount of data. The select amount of data is based on the function to be performed. At least one function of the plurality of functions has a different value for the select amount of data than at least one other function. A determination is made as to whether a store into a designated area occurred based on processing a select processing phase of a select function. Based on determining that the store into the designated area occurred, an interrupt is presented, and based on determining that the store into the designated area did not occur, instruction processing is continued.

Monolithic vector processor configured to operate on variable length vectors using a vector length register

A computer processor comprising a vector unit is disclosed. The vector unit may comprise a vector register file comprising at least one register to hold a varying number of elements. The vector unit may further comprise a vector length register file comprising at least one register to specify the number of operations of a vector instruction to be performed on the varying number of elements in the at least one register of the vector register file. The computer processor may be implemented as a monolithic integrated circuit.
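
A small simulation shows how a vector length value limits the number of element operations. This sketch is illustrative only: `MAX_ELEMS`, `vector_add`, and `add_arrays` are hypothetical names, and leaving elements beyond the vector length as zero is an assumption, since the abstract does not say what happens to them.

```python
MAX_ELEMS = 8  # assumed capacity of one vector register


def vector_add(va, vb, vl):
    """Add only the first vl elements, as selected by a vector length value."""
    if not 0 <= vl <= MAX_ELEMS:
        raise ValueError("vector length out of range")
    result = [0] * MAX_ELEMS  # assumption: untouched elements read as zero
    for i in range(vl):
        result[i] = va[i] + vb[i]
    return result


def add_arrays(a, b):
    """Process arrays of any length in register-sized chunks."""
    out = []
    for start in range(0, len(a), MAX_ELEMS):
        # The last chunk may be shorter, so the vector length shrinks.
        vl = min(MAX_ELEMS, len(a) - start)
        pad = [0] * (MAX_ELEMS - vl)
        va = a[start:start + vl] + pad
        vb = b[start:start + vl] + pad
        out.extend(vector_add(va, vb, vl)[:vl])
    return out


print(add_arrays(list(range(10)), [1] * 10))
```

Because the final chunk sets a shorter vector length, no separate scalar cleanup loop is needed for the leftover elements.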

Bit width reconfiguration using a shadow-latch configured register file

A processor includes a front-end with an instruction set that operates at a first bit width and a floating point unit coupled to receive the instruction set in the processor that operates at the first bit width. The floating point unit operates at a second bit width and, based upon a bit width assessment of the instruction set provided to the floating point unit, the floating point unit employs a shadow-latch configured floating point register file to perform bit width reconfiguration. The shadow-latch configured floating point register file includes a plurality of regular latches and a plurality of shadow latches for storing data that is to be either read from or written to the shadow latches. The bit width reconfiguration enables the floating point unit that operates at the second bit width to operate on the instruction set received at the first bit width.