G06F9/30018

VARIABLE POSITION SHIFT FOR MATRIX PROCESSING
20230229730 · 2023-07-20 ·

An apparatus has matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a 2D result matrix; operand storage circuitry to store information for forming the first and second input operands for the matrix processing circuitry; and position shifting circuitry to apply a variable position shift to vary which row/column of the result matrix is updated based on a given element of one of the first and second input operands stored in the operand storage circuitry during a given matrix processing operation. The variable position shift is based on one of a plurality of alternative shift amounts, each alternative shift amount corresponding to a position shift of the one of the first and second input operands relative to the result matrix by a different number of rows/columns. This is useful for performing 2D convolution operations.
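The role the variable row shift plays in building a 2D convolution can be illustrated in software. The sketch below (function names and the 3-tap vertical kernel are illustrative, not from the patent) accumulates one shifted copy of the input per kernel position, which is the per-position shift the abstract describes:

```python
def shift_rows(matrix, shift):
    """Return a copy of `matrix` with rows moved down by `shift`
    (negative = up); rows shifted in from outside are zero."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        src = r - shift
        if 0 <= src < rows:
            out[r] = list(matrix[src])
    return out

def conv_vertical(image, kernel):
    """3x1 vertical convolution expressed as three shifted accumulations,
    one per kernel position: each position uses a different shift amount
    relative to the result matrix (here -1, 0, +1)."""
    rows, cols = len(image), len(image[0])
    acc = [[0] * cols for _ in range(rows)]
    for pos, k in enumerate(kernel):          # pos 0, 1, 2 -> shift -1, 0, +1
        shifted = shift_rows(image, pos - 1)
        for r in range(rows):
            for c in range(cols):
                acc[r][c] += k * shifted[r][c]
    return acc
```

In hardware, the point of the position shifting circuitry is that the shifted copies never need to be materialised in operand storage; the shift amount redirects which result rows a stored operand updates.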

Vector computational unit receiving data elements in parallel from a last row of a computational array

A microprocessor system comprises a vector computational unit and a control unit. The vector computational unit includes a plurality of processing elements. The control unit is configured to provide at least a single processor instruction to the vector computational unit. The single processor instruction specifies a plurality of component instructions to be executed by the vector computational unit in response to the single processor instruction and each of the plurality of processing elements of the vector computational unit is configured to process different data elements in parallel with other processing elements in response to the single processor instruction.
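The decode pattern described above can be modelled briefly. In this sketch (all names hypothetical, and the parallel lanes are modelled sequentially), one "processor instruction" carries several component instructions, and every processing element applies the same component sequence to its own data element:

```python
# Hypothetical component-instruction table; real hardware would decode
# these from fields of the single processor instruction.
COMPONENTS = {
    "mul2": lambda x: x * 2,
    "add1": lambda x: x + 1,
}

def execute_vector_instruction(component_names, data_elements):
    """Apply each component instruction, in order, across all lanes.
    Each list element stands in for one processing element's datum."""
    lanes = list(data_elements)
    for name in component_names:
        op = COMPONENTS[name]
        lanes = [op(x) for x in lanes]   # every lane processes its own element
    return lanes
```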

METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
20230229448 · 2023-07-20 ·

A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.
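The two-order sort is the step that produces a bitonic sequence, which later merge stages of a bitonic sort network consume. A minimal software model of the instruction's visible effect (the function name and order encoding are illustrative):

```python
def vector_sort(vec, first_order="asc", second_order="desc"):
    """Model of the vector sort instruction's result: the low half of the
    lanes is sorted in one order and the high half in the other, as the
    instruction's order fields indicate. Sorting ascending then descending
    yields a bitonic sequence (rises then falls)."""
    half = len(vec) // 2
    lo = sorted(vec[:half], reverse=(first_order == "desc"))
    hi = sorted(vec[half:], reverse=(second_order == "desc"))
    return lo + hi
```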

Indexing external memory in a reconfigurable compute fabric
11704130 · 2023-07-18 ·

Various examples are directed to systems and methods in which a flow controller of a first synchronous flow may receive an instruction to execute a first loop using the first synchronous flow. The flow controller may determine a first iteration index for a first iteration of the first loop. The flow controller may send, to a first compute element of the first synchronous flow, a first synchronous message to initiate a first synchronous flow thread for executing the first iteration of the first loop. The first synchronous message may comprise the first iteration index. The first compute element may execute an input/output operation at a first location of a first compute element memory indicated by the first iteration index.
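The dispatch pattern can be reduced to a toy model (all names and the squaring workload are hypothetical): the flow controller assigns an iteration index, sends it in a synchronous message, and the compute element uses the index to address its local memory for the I/O operation:

```python
def run_loop(iterations, compute_memory):
    """Flow-controller side: one synchronous message per loop iteration,
    each carrying that iteration's index."""
    for iteration_index in range(iterations):
        message = {"index": iteration_index}        # synchronous message
        handle_message(message, compute_memory)     # initiate a flow thread
    return compute_memory

def handle_message(message, memory):
    """Compute-element side: the I/O operation lands at the memory
    location named by the iteration index in the message."""
    idx = message["index"]
    memory[idx] = idx * idx   # illustrative per-iteration result
```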

NON-CRYPTOGRAPHIC HASHING USING CARRY-LESS MULTIPLICATION
20230015000 · 2023-01-19 ·

Non-cryptographic hashing using carry-less multiplication and associated methods, software, and apparatus. Under one aspect, the disclosed hash solution expands on CRC technology (which updates a polynomial expansion and applies a final reduction) to use initialization (init), update, and finalize stages with extended seed values. The hash solutions operate on input data partitioned into multiple blocks comprising sequences of byte data, such as ASCII characters. During multiple rounds of an update stage, operations including carry-less multiplication and shuffle operations are performed on sub-blocks of a given block in parallel. During a finalize stage, multiple SHA or carry-less multiplication operations are performed on data output following a final round of the update stage.
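Carry-less multiplication is ordinary shift-and-add multiplication with the additions replaced by XOR, i.e. polynomial multiplication over GF(2); hardware provides it as a single instruction (e.g. x86 PCLMULQDQ). A minimal reference implementation of the primitive itself, not of the patented hash:

```python
def clmul(a, b):
    """Carry-less multiplication of two non-negative integers: partial
    products are combined with XOR instead of addition, so no carries
    propagate between bit positions."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # XOR in the shifted partial product
        a <<= 1
        b >>= 1
    return result
```

Note how the result differs from integer multiplication whenever partial products overlap: `clmul(3, 3)` is 5 (`0b11 ^ 0b110`), not 9.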

MASKING ROW OR COLUMN POSITIONS FOR MATRIX PROCESSING
20230214236 · 2023-07-06 ·

An apparatus comprises matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a result matrix, where the result matrix is a two-dimensional matrix; operand storage circuitry to store information for forming the first and second input operands for the matrix processing circuitry; and masking circuitry to perform a masking operation to mask at least part of the matrix processing operation or the information stored to the operand storage circuitry based on masking state data indicative of one or more masked row or column positions to be treated as representing a masking value. This is useful for improving performance of two-dimensional convolution operations, as the masking can be used to mask out selected rows or columns when performing the 2D convolution as a series of 1×1 convolution operations applied to different kernel positions.
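A software stand-in for the masking circuitry (function name and the choice of zero as the masking value are illustrative; the patent leaves the masking value general) simply forces the indicated row positions to the masking value before the matrix operation consumes the operand:

```python
def mask_rows(matrix, masked_rows, masking_value=0):
    """Replace every element of the rows listed in `masked_rows` with the
    masking value, so those positions contribute nothing to a subsequent
    accumulation, e.g. when stepping a 2D convolution through its
    1x1 kernel positions near an image edge."""
    return [
        [masking_value] * len(row) if r in masked_rows else list(row)
        for r, row in enumerate(matrix)
    ]
```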

System and method to control the number of active vector lanes in a processor

In one disclosed embodiment, a processor includes a first execution unit and a second execution unit, a register file, and a data path including a plurality of lanes. The data path and the register file are arranged so that writing to the register file by the first execution unit and by the second execution unit is allowed over the data path, reading from the register file by the first execution unit is allowed over the data path, and reading from the register file by the second execution unit is not allowed over the data path. The processor also includes a power control circuit configured to, when a transfer of data between the register file and either of the first and second execution units uses less than all of the lanes, power down the lanes of the data path not used for the transfer of the data.
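The power-control policy can be sketched as a pure function (the 8-byte lane width and 8-lane datapath are assumed for illustration; the patent does not fix them): only the lanes the transfer actually occupies stay powered.

```python
def lanes_to_power(transfer_bytes, lane_bytes=8, total_lanes=8):
    """Return a per-lane power mask for one register-file transfer:
    True = lane stays on, False = lane is power-gated because the
    transfer does not use it. Widths are illustrative assumptions."""
    active = min(total_lanes, -(-transfer_bytes // lane_bytes))  # ceil division
    return [lane < active for lane in range(total_lanes)]
```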

Generating a vector predicate summary
11550574 · 2023-01-10 ·

Apparatuses and methods of operating such apparatuses are disclosed. Vector processing circuitry performs data processing in multiple parallel processing lanes, wherein the data processing is performed in a subset of the multiple parallel processing lanes determined by bit values of a vector predicate which are set. Predicate monitoring circuitry is responsive to the vector predicate to generate a predicate summary value in dependence on the bit values of the vector predicate. A first value of the predicate summary value indicates that a sparse condition is true for the vector predicate, the sparse condition being true when the bit values of the vector predicate comprise a set bit corresponding to a vector element at a higher index immediately followed by a non-set bit corresponding to a vector element at a lower index. A second value of the predicate summary value indicates that the sparse condition is not true for the vector predicate. Improved predicate controlled vector processing is thus supported.
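Although the abstract does not specify the detection logic, the sparse condition as defined (some set bit immediately followed by a clear bit at the next lower index) holds exactly when the set bits are not a solid run starting at bit 0, which a classic trailing-ones bit trick can test (function name illustrative):

```python
def predicate_summary(predicate):
    """Return 1 (sparse condition true) when the predicate has a set bit
    at index i with the bit at index i-1 clear, else 0.  p & (p + 1) is
    nonzero exactly when p's set bits are not a contiguous run of low
    bits (i.e. p is neither 0 nor of the form 2**k - 1)."""
    return 1 if predicate & (predicate + 1) else 0
```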

ADDITION MASK VALUE GENERATOR, ENCRYPTOR AND METHOD FOR GENERATING STREAM KEY
20230214216 · 2023-07-06 ·

An addition mask value generator is provided. A first operation circuit is configured to obtain first intermediate data according to a first output mask value and a fourth output mask value. A second operation circuit is configured to obtain the addition output mask value of a first mask group according to the first intermediate data and a fourth input mask value. A third operation circuit is configured to obtain second intermediate data according to a second output mask value and a third output mask value. A fourth operation circuit is configured to obtain the addition output mask value of a second mask group according to the second intermediate data and a second input mask value. The first and second addition input mask values of the first mask group are the first and second input mask values. The first and second addition input mask values of the second mask group are the third input mask value and the first intermediate data.
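The dataflow among the four operation circuits can be made concrete with a schematic model. The combining operation is assumed here to be XOR, the usual choice for boolean mask values; the abstract does not fix it, and all function names are illustrative:

```python
def first_operation(out_mask_1, out_mask_4):
    """First operation circuit: derive the first intermediate data."""
    return out_mask_1 ^ out_mask_4

def second_operation(intermediate_1, in_mask_4):
    """Second operation circuit: addition output mask of the first group."""
    return intermediate_1 ^ in_mask_4

def third_operation(out_mask_2, out_mask_3):
    """Third operation circuit: derive the second intermediate data."""
    return out_mask_2 ^ out_mask_3

def fourth_operation(intermediate_2, in_mask_2):
    """Fourth operation circuit: addition output mask of the second group."""
    return intermediate_2 ^ in_mask_2
```

The point of the wiring is reuse: the first intermediate data feeds both the first group's output mask and, per the last sentence of the abstract, serves as one of the second group's addition input masks.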

Responding to branch misprediction for predicated-loop-terminating branch instruction

A predicated-loop-terminating branch instruction controls, based on whether a loop termination condition is satisfied, whether the processing circuitry should process a further iteration of a predicated loop body or process a following instruction. A mispredicted-non-termination branch misprediction arises when the loop termination condition is mispredicted as unsatisfied for a given iteration when it should have been satisfied. If at least one unnecessary iteration of the predicated loop body is processed following such a misprediction, processing of the at least one unnecessary iteration is predicated to suppress its effects. When the mispredicted-non-termination branch misprediction is detected for the given iteration of the predicated-loop-terminating branch instruction, and a flush suppressing condition is determined to be satisfied, flushing of the at least one unnecessary iteration of the predicated loop body is suppressed as a response to the misprediction.