G06F9/30105

COALESCING ADJACENT GATHER/SCATTER OPERATIONS

According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read contiguous a first and a second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location.

Systems, Apparatuses, And Methods For Fused Multiply Add

Embodiments of systems, apparatuses, and methods for fused multiple add. In some embodiments, a decoder decodes a single instruction having an opcode, a destination field representing a destination operand, and fields for a first, second, and third packed data source operand, wherein packed data elements of the first and second packed data source operand are of a first, different size than a second size of packed data elements of the third packed data operand. Execution circuitry then executes the decoded single instruction to perform, for each packed data element position of the destination operand, a multiplication of a M N-sized packed data elements from the first and second packed data sources that correspond to a packed data element position of the third packed data source, add of results from these multiplications to a full-sized packed data element of a packed data element position of the third packed data source, and storage of the addition result in a packed data element position destination corresponding to the packed data element position of the third packed data source, wherein M is equal to the full-sized packed data element divided by N.

Resource allocation in a multi-processor system

A system includes a memory-mapped register (MMR) associated with a claim logic circuit, a claim field for the MMR, a first firewall for a first address region, and a second firewall for a second address region. The MMR is associated with an address in the first address region and an address in the second address region. The first firewall is configured to pass a first write request for an address in the first address region to the claim logic circuit associated with the MMR. The claim logic circuit associated with the MMR is configured to grant or deny the first write request based on the claim field for the MMR. Further, the second firewall is configured to receive a second write request for an address in the second address region and grant or deny the second write request based on a permission level associated with the second write request.

Arithmetic processing device and control method implemented by arithmetic processing device

An arithmetic processing device includes: a decoding circuit configured to decode a command; a command execution circuit configured to execute the command decoded by the decoding circuit; a register circuit configured to include a plurality of registers for holding data used by the command execution circuit; an identification information holding circuit configured to store identification information for identifying a register for writing a specific value when the command is a register writing command; a setting circuit configured to hold the specific value; and an operation control circuit configured to execute inhibiting processing when the command is a register reading command, the inhibiting processing including inhibiting an access of the register by the register reading command and selecting the specific value held in the setting circuit.

Microprocessor having self-resetting register scoreboard
11204770 · 2021-12-21 · ·

A microprocessor using a counter in a scoreboard is introduced to handle data dependency. The microprocessor includes a register file having a plurality of registers mapped to entries of the scoreboard. Each entry of the scoreboard has a counter that tracks the data dependency of each of the registers. The counter decrements for every clock cycle until the counter resets itself when it counts down to 0. With the implementation of the counter in the scoreboard, the instruction pipeline may be managed according to the number of clock cycles of a previous issued instruction takes to access the register which is recorded in the counter of the scoreboard.

HARDWARE ENFORCEMENT OF BOUNDARIES ON THE CONTROL, SPACE, TIME, MODULARITY, REFERENCE, INITIALIZATION, AND MUTABILITY ASPECTS OF SOFTWARE
20210389946 · 2021-12-16 ·

Modifications to existing computer hardware, compiler changes or source-to-source transforms performed during the software build process, and a collection of libraries and modifications to existing standard system software and libraries. The invention allows a program author to enforce various kinds of locality of causality in software to provide enforcement of boundaries for the following aspects of a computer program: control, space, time, modularity, reference, initialization, and mutability. Where these properties do not suffice to guarantee a property at static time, dynamic checks may be added and the constraints on control flow prevent such dynamic checks from being avoided by the program.

Hardware Accelerator For IM2COL Operation

The present disclosure advantageously provides a matrix expansion unit that includes an input data selector, a first register set, a second register set, and an output data selector. The input data selector is configured to receive first matrix data in a columnwise format. The first register set is coupled to the input data selector, and includes a plurality of data selectors and a plurality of registers arranged in a first shift loop. The second register set is coupled to the data selector, and includes a plurality of data selectors and a plurality of registers arranged in a second shift loop. The output data selector is coupled to the first register set and the second register set, and is configured to output second matrix data in a rowwise format.

Method of Using Multidimensional Blockification To Optimize Computer Program and Device Thereof

Disclosed embodiments relate to a method and device for optimizing compilation of source code. The proposed method receives a first intermediate representation code of a source code and analyses each basic block instruction of the plurality of basic block instructions contained in the first intermediate representation code for blockification. In order to blockify the identical instructions, the one or more groups of basic block instructions are assessed for eligibility of blockification. Upon determining as eligible, the group of basic block instructions are blockified using one of one dimensional SIMD vectorization and two-dimensional SIMD vectorization. The method further generates a second intermediate representation of the source code which is translated to executable target code with more efficient processing capacity.

APPARATUSES, METHODS, AND SYSTEMS FOR INSTRUCTIONS FOR LOADING A TILE OF A MATRIX OPERATIONS ACCELERATOR
20220206989 · 2022-06-30 ·

Systems, methods, and apparatuses relating to one or more instructions for loading a tile of a matrix operations accelerator are described. In one embodiment, a system includes a matrix operations accelerator circuit comprising a two-dimensional grid of processing elements, a plurality of registers that represents a two-dimensional matrix coupled to the two-dimensional grid of processing elements, and a coupling to a cache; and a hardware processor core coupled to the matrix operations accelerator circuit and comprising a vector register, a decoder circuit to decode a single instruction into a decoded instruction, the single instruction including a first field that identifies the two-dimensional matrix, a second field that identifies a location in the cache, and a third field that identifies the vector register, and an opcode that indicates an execution circuit of the hardware processor core is to load elements into the plurality of registers that represents the two-dimensional matrix from the location in the cache by the coupling to the cache, and load one or more elements from the vector register into the plurality of registers that represents the two-dimensional matrix by a coupling of the hardware processor core to the matrix operations accelerator circuit that is separate from the coupling to the cache, and the execution circuit of the hardware processor core to execute the decoded instruction according to the opcode.

Systems and methods for performing instructions to convert to 16-bit floating-point format

Disclosed embodiments relate to systems and methods for performing instructions to convert to 16-bit floating-point format. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of a first source vector comprising N single-precision elements, and a destination vector comprising at least N 16-bit floating-point elements, the opcode to indicate execution circuitry is to convert each of the elements of the specified source vector to 16-bit floating-point, the conversion to include truncation and rounding, as necessary, and to store each converted element into a corresponding location of the specified destination vector, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction as specified by the opcode.