Patent classifications
G06F9/3001
Systems, methods, and apparatuses for tile store
Embodiments detailed herein relate to matrix operations, in particular the storing of a matrix (tile) to memory. For example, support for a store instruction is described in at least a form of decode circuitry to decode an instruction having fields for an opcode, a source matrix operand identifier, and destination memory information, and execution circuitry to execute the decoded instruction to store each data element of configured rows of the identified source matrix operand to memory based on the destination memory information.
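As a rough illustration of the tile store behavior, the following Python sketch copies the configured rows of a tile into a flat memory array. The abstract does not say what the destination memory information contains; the `base` address, the row `stride`, and the `configured_cols` parameter are assumptions made for this example.

```python
def tile_store(tile, configured_rows, configured_cols, memory, base, stride):
    """Store each element of the configured rows of `tile` into flat `memory`.

    `base` and `stride` stand in for the "destination memory information";
    both are hypothetical names, not fields taken from the patent.
    """
    for r in range(configured_rows):
        row_start = base + r * stride  # start address of this row in memory
        for c in range(configured_cols):
            memory[row_start + c] = tile[r][c]
    return memory


# A 3x4 tile register in which only 2 rows and 3 columns are configured.
tile = [[1, 2, 3, 0],
        [4, 5, 6, 0],
        [0, 0, 0, 0]]
memory = [0] * 16
tile_store(tile, configured_rows=2, configured_cols=3,
           memory=memory, base=1, stride=4)
```

Rows beyond the configured count are never written, so the third tile row leaves memory untouched.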
ZERO OPERAND INSTRUCTION CONVERSION FOR ACCELERATING SPARSE COMPUTATIONS IN A CENTRAL PROCESSING UNIT PIPELINE
A processing device includes a zero detection circuit to determine that an operand of a first instruction is zero and instruction conversion logic coupled with the zero detection circuit to, in response to the zero detection circuit determining that the operand is zero, convert the first instruction to a register move instruction executable by the processing device.
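The conversion step can be sketched in software as follows. This is a minimal model, not the patented circuit: the `Instr` tuple, the `mul` and `add` opcodes, and the always-zero register named `zero` are all assumptions introduced for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str       # opcode, e.g. "mul", "add", "mov"
    dst: str      # destination register name
    srcs: tuple   # source register names


def convert_if_zero(instr, regs):
    """Return a register move when a zero operand makes the result known."""
    if instr.op == "mul" and any(regs[s] == 0 for s in instr.srcs):
        # x * 0 == 0, so move from the (assumed) zero register instead.
        return Instr("mov", instr.dst, ("zero",))
    if instr.op == "add":
        a, b = instr.srcs
        if regs[a] == 0:
            return Instr("mov", instr.dst, (b,))  # 0 + b == b
        if regs[b] == 0:
            return Instr("mov", instr.dst, (a,))  # a + 0 == a
    return instr  # no zero operand detected: leave the instruction as is


regs = {"zero": 0, "r1": 0, "r2": 7, "r3": 5}
converted_mul = convert_if_zero(Instr("mul", "r4", ("r1", "r2")), regs)
converted_add = convert_if_zero(Instr("add", "r4", ("r1", "r2")), regs)
untouched = convert_if_zero(Instr("mul", "r4", ("r2", "r3")), regs)
```

In sparse workloads many operands are zero, so replacing an arithmetic instruction with a move avoids occupying the arithmetic unit for a result that is already known.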
VARIABLE POSITION SHIFT FOR MATRIX PROCESSING
An apparatus has matrix processing circuitry to perform a matrix processing operation on first and second input operands to generate a 2D result matrix; operand storage circuitry to store information for forming the first and second input operands for the matrix processing circuitry; and position shifting circuitry to apply a variable position shift to vary which row/column of the result matrix is updated based on a given element of one of the first and second input operands stored in the operand storage circuitry during a given matrix processing operation. The variable position shift is based on one of a plurality of alternative shift amounts, each alternative shift amount corresponding to a position shift of the one of the first and second input operands relative to the result matrix by a different number of rows/columns. This is useful for performing 2D convolution operations.
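One way to picture the variable position shift is an outer-product accumulation in which the shift amount decides which result row each input element updates. The abstract does not name the matrix processing operation; the outer product, the `shift` parameter, and the choice to drop elements shifted outside the result matrix are assumptions of this sketch.

```python
def outer_product_accumulate(result, a, b, shift=0):
    """Accumulate the outer product of `a` and `b` into `result`.

    Element a[i] updates result row i + shift instead of row i.
    Rows shifted outside the result matrix are dropped (an assumption).
    """
    rows = len(result)
    for i, a_val in enumerate(a):
        target = i + shift
        if 0 <= target < rows:
            for j, b_val in enumerate(b):
                result[target][j] += a_val * b_val
    return result


a = [1, 2, 3]
b = [10, 20]
shifted_down = outer_product_accumulate([[0, 0] for _ in range(3)], a, b, shift=1)
shifted_up = outer_product_accumulate([[0, 0] for _ in range(3)], a, b, shift=-1)
```

For a 2D convolution, the same stored operand can be reused with several shift amounts, one per kernel row, rather than being reloaded in shifted form each time.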
Execution unit
An execution unit comprising a processing pipeline configured to perform calculations to evaluate a plurality of mathematical functions. The processing pipeline comprises a plurality of stages through which each calculation for evaluating a mathematical function progresses to an end result. Each of a plurality of processing circuits in the pipeline is configured to perform an operation on input values during at least one stage of the plurality of stages. The plurality of processing circuits include multiplier circuits. A first multiplier circuit and a second multiplier circuit are configured to operate in parallel, such that at the same stage in the processing pipeline, the first multiplier circuit and the second multiplier circuit perform their processing. A third multiplier circuit is arranged in series with the first multiplier circuit and the second multiplier circuit and processes outputs from the first multiplier circuit and the second multiplier circuit.
Vector computational unit receiving data elements in parallel from a last row of a computational array
A microprocessor system comprises a vector computational unit and a control unit. The vector computational unit includes a plurality of processing elements. The control unit is configured to provide at least a single processor instruction to the vector computational unit. The single processor instruction specifies a plurality of component instructions to be executed by the vector computational unit in response to the single processor instruction and each of the plurality of processing elements of the vector computational unit is configured to process different data elements in parallel with other processing elements in response to the single processor instruction.
METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.
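The effect of such a vector sort instruction can be sketched as below. The abstract only says that two portions of the lanes are sorted in orders indicated by the instruction; splitting the vector into equal halves and encoding the orders as two boolean flags are assumptions made here.

```python
def vector_sort(lanes, first_descending=False, second_descending=True):
    """Sort the first and second halves of `lanes` in independent orders."""
    half = len(lanes) // 2
    first = sorted(lanes[:half], reverse=first_descending)
    second = sorted(lanes[half:], reverse=second_descending)
    return first + second


sorted_vector = vector_sort([5, 1, 7, 3, 2, 8, 6, 4])
```

With the first half ascending and the second half descending, the result rises and then falls, which is the bitonic sequence that a bitonic merge step takes as input.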
COMPUTATIONAL MEMORY
An example device includes a plurality of computational memory banks. Each computational memory bank of the plurality of computational memory banks includes an array of memory units and a plurality of processing elements connected to the array of memory units. The device further includes a plurality of single instruction, multiple data (SIMD) controllers. Each SIMD controller of the plurality of SIMD controllers is contained within at least one computational memory bank of the plurality of computational memory banks. Each SIMD controller is to provide instructions to the at least one computational memory bank.
Fracturable Data Path in a Reconfigurable Data Processor
A coarse-grained reconfigurable (CGR) processor includes a configurable unit comprising a fracturable data path with a plurality of sub-paths. The fracturable data path includes multiple stages that each include an arithmetic logic unit (ALU), selection logic to select two or more inputs for the ALU, and sub-path pipeline registers. The fracturable data path also includes a first output configurable to provide first data selected from any one of the sub-path pipeline registers and a second output configurable to provide second data selected from any one of the sub-path pipeline registers. The configurable unit includes a configuration store to store configuration data to provide two or more immediate data fields for each stage of the fracturable data path and configuration information for the ALUs, the selection logic, and to select the first data and the second data for the first output and the second output.
Accelerator for dense and sparse matrix computations
A method of increasing computer hardware efficiency of a matrix computation. The method comprises receiving, at a computer processing machine, digital signals encoding one or more operations of the matrix computation, each operation including one or more operands. The method further comprises, responsive to determining, by a sparse data check device of the computer processing machine, that an operation of the matrix computation includes all dense operands, forwarding the operation to a dense computation device of the computer processing machine configured to perform the operation of the matrix computation based on the dense operands. The method further comprises, responsive to determining, by the sparse data check device, that an operation of the matrix computation includes one or more sparse operands, forwarding the operation to a sparse computation device configured to perform the operation of the matrix computation.
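The dispatch logic can be sketched in software as follows. The abstract does not define when an operand counts as sparse; the zero-fraction `threshold`, the choice of matrix multiplication as the operation, and all function names are assumptions of this example. Operands are assumed to be non-empty lists of rows.

```python
def is_sparse(operand, threshold=0.5):
    """Sparse data check: treat an operand as sparse if enough entries are zero."""
    flat = [x for row in operand for x in row]
    return flat.count(0) / len(flat) >= threshold


def dense_matmul(a, b):
    """Dense unit: multiply every pair of elements."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sparse_matmul(a, b):
    """Sparse unit: skip work for zero entries of `a`."""
    out = [[0] * len(b[0]) for _ in range(len(a))]
    for i, row in enumerate(a):
        for k, a_val in enumerate(row):
            if a_val == 0:
                continue  # a zero entry contributes nothing to the result
            for j in range(len(b[0])):
                out[i][j] += a_val * b[k][j]
    return out


def dispatch(a, b):
    """Forward the operation to the sparse unit if any operand is sparse."""
    if is_sparse(a) or is_sparse(b):
        return "sparse", sparse_matmul(a, b)
    return "dense", dense_matmul(a, b)


dense_case = dispatch([[1, 2], [3, 4]], [[1, 2], [3, 4]])
sparse_case = dispatch([[0, 0], [0, 3]], [[1, 2], [3, 4]])
```

Both paths return the same product; the routing only changes how much work is done to reach it.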
Apparatuses and methods for approximating nonlinear function
The present disclosure relates to a method and an apparatus for approximating a non-linear function. In some embodiments, an exemplary processing unit includes: one or more registers for storing a lookup table (LUT) and one or more operation elements communicatively coupled with the one or more registers. The LUT includes a control state and a plurality of data entries. The one or more operation elements are configured to: receive an input operand; select one or more bits from the input operand; select a data entry from the plurality of data entries using the one or more bits; and determine an approximation value of a non-linear activation function for the input operand using the data entry.
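The lookup steps listed above can be sketched as a piecewise-linear approximation. Everything specific here is an assumption for illustration: the sigmoid as the activation function, an 8-bit unsigned fixed-point operand with 5 fractional bits, a control state holding a shift and a mask, and data entries holding a base value and a slope.

```python
import math

FRAC_BITS = 5    # assumed fixed-point format: x = operand / 32
INDEX_BITS = 3   # the top 3 bits of the operand select one of 8 entries


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_lut():
    """Build the LUT: a control state plus one (base, slope) entry per segment."""
    entries = []
    for k in range(1 << INDEX_BITS):
        base = sigmoid(k)
        slope = sigmoid(k + 1) - base  # rise across the unit-wide segment
        entries.append((base, slope))
    control = {"shift": FRAC_BITS, "mask": (1 << INDEX_BITS) - 1}
    return {"control": control, "entries": entries}


def approximate(lut, operand):
    """Select bits from the operand, pick an entry, and interpolate."""
    ctrl = lut["control"]
    index = (operand >> ctrl["shift"]) & ctrl["mask"]    # selected bits
    frac = (operand & ((1 << ctrl["shift"]) - 1)) / (1 << ctrl["shift"])
    base, slope = lut["entries"][index]                  # selected data entry
    return base + slope * frac


lut = build_lut()
approx = approximate(lut, 48)  # operand 48 represents x = 1.5
```

Storing the shift and mask in the control state means the same operation elements could serve other functions or input ranges by loading a different table.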