G06F9/30032

Instruction execution that broadcasts and masks data values at different levels of granularity

An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and, to replicate the second data structure when executing the second data instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity.

VARIABLE POSITION SHIFT FOR MATRIX PROCESSING
20230229730 · 2023-07-20 ·

An apparatus has matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a 2D result matrix; operand storage circuitry to store information for forming the first and second input operands for the matrix processing circuitry; and position shifting circuitry to apply a variable position shift to vary which row/column of the result matrix is updated based on a given element of one of the first and second input operands stored in the operand storage circuitry during a given matrix processing operation. The variable position shift is based on one of a plurality of alternative shift amounts, each alternative hift amount corresponding to a position shift of the one of the first and second input operands relative to the result matrix by a different umber of rows/columns. This is useful for performing 2D convolution operations.

Vector computational unit receiving data elements in parallel from a last row of a computational array

A microprocessor system comprises a vector computational unit and a control unit. The vector computational unit includes a plurality of processing elements. The control unit is configured to provide at least a single processor instruction to the vector computational unit. The single processor instruction specifies a plurality of component instructions to be executed by the vector computational unit in response to the single processor instruction and each of the plurality of processing elements of the vector computational unit is configured to process different data elements in parallel with other processing elements in response to the single processor instruction.

METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
20230229448 · 2023-07-20 ·

A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.

Marking current context data to control a context-data-dependent processing operation to save current or default context data to a data location

A data processing system includes processing circuitry for executing context-data-dependent program instructions which are decoded by decoder circuitry. Such context-data-dependent program instructions perform processing which is dependent upon currently existing context data. As an example, the context-data-dependent program instructions may be floating point instructions and the context data may be rounding mode information. The decoder circuitry supports a context save instruction which saves context data when it is marked as having been used and saves default context data when the current context data is marked as not having been used. The decoder circuitry further supports a context restore instruction which restores context data when the current context data is marked as having been used and permits the current context data to continue for future use when it is marked as currently unused.

VECTOR BIT TRANSPOSE

A method to transpose source data in a processor in response to a vector bit transpose instruction includes specifying, in respective fields of the vector bit transpose instruction, a source register containing the source data and a destination register to store transposed data. The method also includes executing the vector bit transpose instruction by interpreting N×N bits of the source data as a two-dimensional array having N rows and N columns, creating transposed source data by transposing the bits by reversing a row index and a column index for each bit, and storing the transposed source data in the destination register.

MASKING ROW OR COLUMN POSITIONS FOR MATRIX PROCESSING
20230214236 · 2023-07-06 ·

An apparatus comprises matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a result matrix, where the result matrix is a two-dimensional matrix; operand storage circuitry to store information for forming the first and second input operands for the matrix processing circuitry; and masking circuitry to perform a masking operation to mask at least part of the matrix processing operation or the information stored to the operand storage circuitry based on masking state data indicative of one or more masked row or column positions to be treated as representing a masking value. This is useful for improving performance of two-dimensional convolution operations, as the masking can be used to mask out selected rows or columns when performing the 2D convolution as a series of 1×1 convolution operations applied to different kernel positions.

DATA MOVING METHOD, DIRECT MEMORY ACCESS APPARATUS AND COMPUTER SYSTEM
20230214338 · 2023-07-06 ·

A data moving method, a direct memory access apparatus, and a computer system are disclosed. The data moving method is used for a neural-network processor, the neural-network processor includes at least one processing unit array, and the method includes: receiving a first instruction, wherein the first instruction indicates address information of target data to be moved, and the address information of the target data is obtained based on a mapping relationship between the target data and at least one processing unit in the processing unit array; generating a data moving request according to the address information of the target data; and moving the target data for the neural-network processor, according to the data moving request.

Arithmetic processing device having multicore ring bus structure with turn-back bus for handling register file push/pull requests
11550576 · 2023-01-10 · ·

An arithmetic processing device includes arithmetic processing units, each having a calculator unit; a scheduler that controls a push instruction to write data to a register file in one of the arithmetic processing units and a pull instruction to read data from the register file; a pull request bus to which the scheduler outputs a pull request and which is connected to the arithmetic processing units; a push request bus to which the scheduler outputs a push request and which is connected to the arithmetic processing units; and a pull data bus that inputs, into the scheduler, pull data read from the register file in response to the pull request. Each of the arithmetic processing units includes a pull data turn-back bus that propagates pull data read from its register file to the pull data bus.

ADDITION MASK VALUE GENERATOR, ENCRYPTOR AND METHOD FOR GENERATING STREAM KEY
20230214216 · 2023-07-06 ·

An addition mask value generator is provided. A first operation circuit is configured to obtain first intermediate data according to first output mask value and fourth output mask value. A second operation circuit is configured to obtain the addition output mask value of a first mask group according to first intermediate data and fourth input mask value. A third operation circuit is configured to obtain second intermediate data according to second output mask value and third output mask value. A fourth operation circuit is configured to obtain the addition output mask value of a second mask group according to second intermediate data and second input mask value. The first and second addition input mask values of first mask group are first and second input mask values. The first and second addition input mask values of second mask group are third input mask value and first intermediate data.