G06F9/3893

METHOD AND APPARATUS FOR A LOGIC-BASED FILTER ENGINE

A cross-domain guard is disclosed that includes a field programmable gate array (FPGA). The FPGA includes a rule database containing one or more rules, a memory interconnect configured to send control data or rule processing data, media access control logic, and a plurality of filter engines configured to receive an incoming message and generate a processed message. Each of the plurality of filter engines may contain a message processing allocation element configured to receive and distribute the incoming message, and a plurality of rule processor kernels. Each of the plurality of rule processor kernels includes a rule processor kernel control element, a plurality of data operator kernels configured to perform a data comparison operation, a ternary lookup table processor configured to perform a logic operation based upon a result of the data comparison operation, and a processed message arbiter. A method for filtering incoming messages is also disclosed.

Modulo operation unit
11507813 · 2022-11-22

The present disclosure advantageously provides a modulo operation unit that includes a first input configured to receive operand data, a second input configured to receive modulus data, an initial modulo stage, a sequence of intermediate modulo stages, and a final modulo stage.
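The staged structure described above can be illustrated with a bit-serial conditional-subtract scheme. This is only a sketch: the abstract does not say what each stage computes, so the per-stage operation, the function name `staged_modulo`, and the `width` parameter are assumptions made for illustration. Under that reading, the first loop pass plays the role of the initial stage, the middle passes the intermediate stages, and the last pass the final stage.

```python
def staged_modulo(operand: int, modulus: int, width: int = 8) -> int:
    """Return operand % modulus using one conditional-subtract stage per bit.

    Hypothetical model: each stage shifts in the next operand bit
    (most significant first) and subtracts the modulus once if needed.
    """
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if not 0 <= operand < (1 << width):
        raise ValueError("operand does not fit in the given width")

    remainder = 0
    for bit in range(width - 1, -1, -1):
        # Shift in the next operand bit.
        remainder = (remainder << 1) | ((operand >> bit) & 1)
        # remainder < modulus held before the shift, so it is now below
        # 2 * modulus and a single subtraction restores the invariant.
        if remainder >= modulus:
            remainder -= modulus
    return remainder
```

For example, `staged_modulo(200, 7)` walks through eight stages and returns 4, matching `200 % 7`.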

Modular gated multiplier circuitry and multiplication technique

Various implementations described herein are related to a device having multiplier circuitry with an array of summation result cells that holds summation bit values for shifted arrays added together. The device may include latch circuitry having one or more gated elements disposed between the summation result cells, and the gated elements may be adapted to provide a portion of the summation bit values based on a gating signal.

Systolic array-friendly data placement and control based on masked write

The present disclosure relates to an accelerator for systolic array-friendly data placement. The accelerator may include: a systolic array comprising a plurality of operation units, wherein the systolic array is configured to receive staged input data and perform operations using the staged input to generate staged output data, the staged output data comprising a number of segments; a controller configured to execute one or more instructions to generate a pattern generation signal; a data mask generator; and a memory configured to store the staged output data using the generated masks. The data mask generator may include circuitry configured to: receive the pattern generation signal from the controller, and, based on the received signal, generate a mask corresponding to each segment of the staged output data.
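A masked write of segmented output can be sketched as follows. The abstract does not define the pattern generation signal, so this example assumes a hypothetical encoding in which the signal is simply the number of valid lanes per segment (`valid_lanes`), and segment `i` is allowed to write lanes `i * valid_lanes` through `(i + 1) * valid_lanes - 1`. All function names here are illustrative.

```python
def generate_masks(valid_lanes: int, num_segments: int, row_width: int) -> list[list[bool]]:
    """Build one write mask per segment (hypothetical pattern encoding)."""
    masks = []
    for seg in range(num_segments):
        start = seg * valid_lanes
        masks.append([start <= lane < start + valid_lanes for lane in range(row_width)])
    return masks


def masked_write(row: list[int], data: list[int], mask: list[bool]) -> None:
    """Write data into row only on lanes where the mask is set."""
    for lane, enabled in enumerate(mask):
        if enabled:
            row[lane] = data[lane]


def store_segments(segments: list[list[int]], valid_lanes: int, row_width: int) -> list[int]:
    """Store every segment into one memory row using its generated mask."""
    row = [0] * row_width
    masks = generate_masks(valid_lanes, len(segments), row_width)
    for data, mask in zip(segments, masks):
        masked_write(row, data, mask)
    return row
```

With four full-width segments filled with 1, 2, 3, and 4, a row width of 8, and `valid_lanes=2`, `store_segments` produces `[1, 1, 2, 2, 3, 3, 4, 4]`: each segment lands only in its own lanes, and lanes written by earlier segments are left untouched.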

ADAPTIVE MATRIX MULTIPLICATION ACCELERATOR FOR MACHINE LEARNING AND DEEP LEARNING APPLICATIONS
20230041850 · 2023-02-09

An adaptive matrix multiplier. In some embodiments, the matrix multiplier includes a first multiplying unit, a second multiplying unit, a memory load circuit, and an outer buffer circuit. The first multiplying unit includes a first inner buffer circuit and a second inner buffer circuit, and the second multiplying unit includes a first inner buffer circuit and a second inner buffer circuit. The memory load circuit is configured to load data from memory, in a single burst of a burst memory access mode, into the first inner buffer circuit of the first multiplying unit; and into the first inner buffer circuit of the second multiplying unit.

Systems, apparatuses, and methods for chained fused multiply add

Embodiments of systems, apparatuses, and methods for chained fused multiply add are described. In some embodiments, an apparatus includes a decoder to decode a single instruction having an opcode, a destination field representing a destination operand, a first source field representing a plurality of packed data source operands of a first type that have packed data elements of a first size, a second source field representing a plurality of packed data source operands that have packed data elements of a second size, and a field for a memory location that stores a scalar value. A register file having a plurality of packed data registers includes registers for the plurality of packed data source operands that have packed data elements of a first size, the source operands that have packed data elements of a second size, and the destination operand. Execution circuitry executes the decoded single instruction to perform iterations of packed fused multiply accumulate operations by multiplying packed data elements of the sources of the first type by sub-elements of the scalar value, and adding results of these multiplications to an initial value in a first iteration and a result from a previous iteration in subsequent iterations.
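The iteration described in the last sentence of the abstract can be modeled in a few lines. This is a behavioral sketch, not the instruction's definition: the function name `chained_fma`, the list-based operand layout, and the assumption of one scalar sub-element per packed source are all illustrative. It also does not model the single rounding of a hardware fused multiply-add, so the example uses integers.

```python
def chained_fma(dest: list[int], sources: list[list[int]], scalar_subelements: list[int]) -> list[int]:
    """Accumulate sources[i] * scalar_subelements[i] onto dest, one iteration per source.

    The first iteration adds to the initial destination value; each later
    iteration adds to the result of the previous one.
    """
    if len(sources) != len(scalar_subelements):
        raise ValueError("need one scalar sub-element per packed source")

    acc = list(dest)
    for src, sub in zip(sources, scalar_subelements):
        if len(src) != len(acc):
            raise ValueError("packed source and destination differ in length")
        # Element-wise multiply-accumulate across the packed lanes.
        acc = [a + x * sub for a, x in zip(acc, src)]
    return acc
```

With `dest=[1, 1]`, `sources=[[1, 2], [3, 4]]`, and `scalar_subelements=[10, 100]`, the first iteration gives `[11, 21]` and the second gives `[311, 421]`.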

Reducing operations of sum-of-multiply-accumulate (SOMAC) instructions

Methods, systems and apparatuses for reducing operations of Sum-Of-Multiply-Accumulate (SOMAC) instructions are disclosed. One method includes scheduling, by a scheduler, a thread for execution, executing, by a processor of a plurality of processors, the thread, fetching, by the processor, a plurality of instructions for the thread from a memory, selecting, by a thread arbiter of the processor, an instruction of the plurality of instructions for execution in an arithmetic logic unit (ALU) pipeline of the processor, and reading the instruction, and determining, by a macro-instruction iterator of the processor, whether the instruction is a Sum-Of-Multiply-Accumulate (SOMAC) instruction with an instruction size, wherein the instruction size indicates a number of iterations that the SOMAC instruction is to be executed.

MULTIPLY-ACCUMULATE WITH VARIABLE FLOATING POINT PRECISION

An integrated circuit including a multiplier-accumulator execution pipeline including a plurality of multiplier-accumulator circuits to, in operation, perform multiply and accumulate operations, wherein each multiplier-accumulator circuit includes: (i) a multiplier to multiply first input data, having a first floating point data format, by filter weight data, having the first floating point data format, and generate and output product data having a second floating point data format, and (ii) an accumulator, coupled to the multiplier of the associated MAC circuit, to add second input data and the product data output by the associated multiplier to generate sum data. The plurality of multiplier-accumulator circuits of the multiplier-accumulator execution pipeline may be connected in series and, in operation, perform a plurality of concatenated multiply and accumulate operations.
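A serially connected multiply-accumulate chain with two floating point formats can be sketched as below. The abstract does not name the formats, so this example assumes half precision (FP16) for the inputs and filter weights and single precision (FP32) for the product and the running sum. The helper names are hypothetical, and rounding is emulated with Python's `struct` module.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fp32(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def mac_stage(acc_in: float, x: float, weight: float) -> float:
    """One MAC circuit: FP16 inputs, FP32 product, FP32 accumulation."""
    product = to_fp32(to_fp16(x) * to_fp16(weight))
    return to_fp32(acc_in + product)


def mac_pipeline(inputs: list[float], weights: list[float], initial: float = 0.0) -> float:
    """Chain MAC stages in series; each stage's sum feeds the next stage."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    acc = to_fp32(initial)
    for x, weight in zip(inputs, weights):
        acc = mac_stage(acc, x, weight)
    return acc
```

For inputs `[1.0, 2.0, 3.0]` and weights `[0.5, 0.25, 2.0]`, every intermediate value is exactly representable, and the pipeline returns `7.0`.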

Apparatus and method for controlling complex multiply-accumulate circuitry
11474825 · 2022-10-18

An apparatus and method for performing multiply-accumulate (MAC) operations on complex numbers to generate real results. For example, one embodiment of a processor comprises: a decoder to decode instructions including multiply-accumulate instructions; first and second source registers to store a first plurality of complex values and a second plurality of complex values, respectively, each complex value comprising a real value and an imaginary value; multiply-accumulate (MAC) execution circuitry coupled to the first and second source registers comprising multiplier circuitry, adder circuitry, and accumulator circuitry; mode selection circuitry to select between at least two execution modes for the MAC execution circuitry including a first mode in which the MAC execution circuitry is to perform complex multiply-accumulate operations using real and imaginary values from the first plurality of complex values and the second plurality of complex values and a second mode in which the MAC execution circuitry is to replace one or more of the real or imaginary values from the first and second plurality of complex values with one or more real or imaginary values specified in a set of scalar complex numbers or with zeroes.
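The two execution modes can be illustrated with a small behavioral model. This is a loose sketch under stated assumptions: the mode names `"complex"` and `"substitute"`, the function name `complex_mac`, and the choice to replace every value of the second operand with a single scalar (or zero) are hypothetical simplifications. The abstract allows replacing only some real or imaginary values, which this model does not cover.

```python
def complex_mac(acc: complex, a_vals: list[complex], b_vals: list[complex],
                mode: str = "complex", scalar: complex = 0j) -> complex:
    """Accumulate a_vals[i] * b_vals[i] onto acc under a selected mode.

    mode="complex":    use both operand vectors as given.
    mode="substitute": replace each b value with `scalar` (0j gives zeroes).
    """
    if len(a_vals) != len(b_vals):
        raise ValueError("operand vectors must have the same length")
    if mode == "complex":
        operands = b_vals
    elif mode == "substitute":
        operands = [scalar] * len(b_vals)
    else:
        raise ValueError(f"unknown mode: {mode}")

    acc_re, acc_im = acc.real, acc.imag
    for a, b in zip(a_vals, operands):
        # (a_re + j a_im)(b_re + j b_im), expanded into real multiplies and adds.
        acc_re += a.real * b.real - a.imag * b.imag
        acc_im += a.real * b.imag + a.imag * b.real
    return complex(acc_re, acc_im)
```

In the first mode, `complex_mac(0j, [1+2j, 1j], [3+4j, 2+0j])` returns `-5+12j`. In the substitute mode with `scalar=1+0j`, the same call reduces to a plain sum of the first operand vector, `1+3j`.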