G06F9/30032

ROTATING ACCUMULATOR
20230116419 · 2023-04-13

A processing unit for generating an output vector is provided. The processing unit comprises an output vector register and a vector unit and is configured to execute machine code instructions, each instruction being an instance of a predefined set of instruction types in an instruction set of the processing unit. The instruction set includes a vector processing instruction, defined by a corresponding opcode, which causes the processing unit to: i) process, using the vector unit, at least two input vectors to generate a result value; and ii) perform a rotation operation on the plurality of elements of the output vector register, in which the result value, or a value based on the result value, is placed in a first end element of the output vector register.
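
The two steps of the instruction can be sketched in software as follows. This is a minimal model, not the claimed hardware: the function name `vrot_accumulate`, the choice of a dot product as the vector operation, and the choice to add the result to the element that wraps around are all assumptions made for illustration.

```python
def vrot_accumulate(out_reg, vec_a, vec_b):
    """Model one execution of a hypothetical rotate-and-accumulate instruction."""
    # Step i (assumed to be a dot product): reduce two input vectors to one value.
    result = sum(a * b for a, b in zip(vec_a, vec_b))
    # Step ii: rotate the register by one element. The element leaving the
    # far end wraps around to the first end, where the result is added to it
    # (an assumed form of "a value based on the result value").
    wrapped = out_reg[-1]
    return [wrapped + result] + out_reg[:-1]


out = [0, 0, 0, 0]
out = vrot_accumulate(out, [1, 2], [3, 4])   # dot product is 11
out = vrot_accumulate(out, [1, 1], [1, 1])   # dot product is 2
print(out)
```

After as many executions as the register has elements, the first partial result wraps back to the first end, so a second pass accumulates onto it.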

Vector table load instruction with address generation field to access table offset value

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.
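
A table lookup loop of this kind might be modeled as below. The flat `memory` list, the `table_bases` offsets, and the function name are hypothetical; the abstract does not specify how table addresses are generated, so a simple base-plus-index scheme is assumed.

```python
def table_lookup_loop(memory, table_bases, indices):
    """Fetch one value per table for each index (assumed base + index addressing)."""
    results = []
    for idx in indices:  # one loop iteration per lookup index
        # Read the same offset from every table stored in memory.
        results.append([memory[base + idx] for base in table_bases])
    return results


memory = list(range(100, 120))  # two 10-entry tables laid out back to back
print(table_lookup_loop(memory, table_bases=[0, 10], indices=[1, 3]))
```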

Fastpath microcode sequencer

Systems, apparatuses, and methods for implementing a fastpath microcode sequencer are disclosed. A processor includes at least an instruction decode unit and first and second microcode units. For each received instruction, the instruction decode unit forwards the instruction to the first microcode unit if the instruction satisfies at least a first condition. In one implementation, the first condition is the instruction being classified as a frequently executed instruction. If a received instruction satisfies at least a second condition, the instruction decode unit forwards the received instruction to a second microcode unit. In one implementation, the first microcode unit is a smaller, faster structure than the second microcode unit. In one implementation, the second condition is the instruction being classified as an infrequently executed instruction. In other implementations, the instruction decode unit forwards the instruction to another microcode unit responsive to determining the instruction satisfies one or more other conditions.
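
The routing decision made by the decode unit can be sketched as a lookup across two microcode stores. The opcode names, micro-op names, and the use of dictionary membership to stand in for the "frequently executed" classification are assumptions for illustration only.

```python
# Hypothetical microcode contents: the fast store is kept small on purpose.
FAST_ROM = {"op_hot": ["uop_a", "uop_b"]}
SLOW_ROM = {"op_rare": ["uop_c", "uop_d", "uop_e"]}


def dispatch(instruction):
    """Return which microcode unit handles the instruction and its micro-ops."""
    if instruction in FAST_ROM:      # first condition: frequently executed
        return "fast", FAST_ROM[instruction]
    if instruction in SLOW_ROM:      # second condition: infrequently executed
        return "slow", SLOW_ROM[instruction]
    raise KeyError(f"no microcode entry for {instruction!r}")


print(dispatch("op_hot"))
print(dispatch("op_rare"))
```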

Vector bit transpose

A method to transpose source data in a processor in response to a vector bit transpose instruction includes specifying, in respective fields of the vector bit transpose instruction, a source register containing the source data and a destination register to store transposed data. The method also includes executing the vector bit transpose instruction by interpreting N×N bits of the source data as a two-dimensional array having N rows and N columns, creating transposed source data by transposing the bits by reversing a row index and a column index for each bit, and storing the transposed source data in the destination register.
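
The transpose step can be illustrated with a short software model. The bit layout is an assumption: the source is taken to be an integer whose bit at position `row * n + col` holds array element (row, col); the abstract does not fix a particular layout.

```python
def vector_bit_transpose(src, n):
    """Transpose an n x n bit matrix packed row-major into an integer."""
    dst = 0
    for row in range(n):
        for col in range(n):
            bit = (src >> (row * n + col)) & 1
            # Swap the row and column indices when writing the bit back.
            dst |= bit << (col * n + row)
    return dst


# Element (row 0, col 1) moves to (row 1, col 0) in a 2 x 2 array.
print(bin(vector_bit_transpose(0b0010, 2)))
```

Applying the function twice with the same `n` returns the original value, which is a convenient sanity check for any layout choice.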

Systolic array including fused multiply accumulate with efficient prenormalization and extended dynamic range
11467806 · 2022-10-11

Systems and methods are provided to perform multiply-accumulate operations of normalized numbers in a systolic array to enable greater computational density, reduce the size of systolic arrays required to perform multiply-accumulate operations of normalized numbers, and/or enable higher throughput operation. The systolic array can be provided normalized numbers by a column of normalizers and can lack support for denormal numbers. Each normalizer can normalize the inputs to each processing element in the systolic array. The systolic array can include a multiplier and an adder. The multiplier can have multiple data paths that correspond to the data type of the input. The multiplier and adder can employ expanded exponent range to operate on normalized floating-point numbers and can lack support for denormal numbers.

WRITE CACHE CIRCUIT, DATA WRITE METHOD, AND MEMORY
20230141139 · 2023-05-11

The present disclosure provides a write cache circuit, a data write method, and a memory. The write cache circuit includes: a control circuit configured to generate, on the basis of a mask write instruction, a first write pointer and a pointer to be positioned, generate a second write pointer on the basis of a write command, generate a first output pointer on the basis of a mask write shift instruction, and generate a second output pointer on the basis of a write shift instruction; a first cache circuit configured to cache, on the basis of the first write pointer, the pointer to be positioned and output a positioned pointer on the basis of the first output pointer, the positioned pointer being configured to instruct a second cache circuit to output a write address written by the second write pointer generated according to the mask write instruction.

Systems and methods to zero a tile register pair

Embodiments detailed herein relate to systems and methods to zero a tile register pair. In one example, a processor includes decode circuitry to decode a matrix pair zeroing instruction having fields for an opcode and an identifier to identify a destination matrix having a PAIR parameter equal to TRUE; and execution circuitry to execute the decoded matrix pair zeroing instruction to zero every element of a left matrix and a right matrix of the identified destination matrix.
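
The effect of the instruction on architectural state can be sketched as follows. The `TilePair` class and `tile_zero_pair` function are hypothetical names, and the check on the PAIR parameter is modeled as a simple exception rather than any specific fault behavior.

```python
from dataclasses import dataclass


@dataclass
class TilePair:
    left: list          # left matrix, as a list of rows
    right: list         # right matrix, as a list of rows
    pair: bool = True   # models the PAIR parameter of the destination


def tile_zero_pair(dst):
    """Zero every element of both matrices of a paired destination."""
    if not dst.pair:
        raise ValueError("destination is not configured as a pair")
    for tile in (dst.left, dst.right):
        for row in tile:
            for i in range(len(row)):
                row[i] = 0


t = TilePair(left=[[1, 2], [3, 4]], right=[[5, 6], [7, 8]])
tile_zero_pair(t)
print(t.left, t.right)
```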

Hardware co-ordination of resource management in distributed systems

Methods and apparatus are disclosed for transferring ownership of a common resource from a source entity, which owns the resource, to a destination entity, which will own the resource, in a distributed system. The method includes the source entity receiving a command to change ownership (the MOVE command), and then marking the source entity as no longer owning the common resource. The source entity then sends a MOVE command to the destination entity, which then updates its common resource ownership table to reflect that ownership of the common resource has been transferred from the source entity to the destination entity. It is advantageous for the update of ownership in the source entity to occur simultaneously with the dispatch of the MOVE command to the destination entity.
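
The handoff sequence can be sketched with two cooperating objects. The `Entity` class and its method names are hypothetical, a Python set stands in for the ownership table, and the sequential calls only approximate the simultaneous update and dispatch described in the abstract.

```python
class Entity:
    def __init__(self, name, owned=()):
        self.name = name
        self.owned = set(owned)  # stands in for the ownership table

    def handle_move(self, resource, destination):
        """Source side: relinquish the resource and forward the MOVE command."""
        if resource not in self.owned:
            raise ValueError(f"{self.name} does not own {resource}")
        self.owned.discard(resource)         # mark as no longer owned
        destination.receive_move(resource)   # dispatch MOVE to the destination

    def receive_move(self, resource):
        """Destination side: record the transferred ownership."""
        self.owned.add(resource)


src = Entity("source", owned={"buffer0"})
dst = Entity("destination")
src.handle_move("buffer0", dst)
print(src.owned, dst.owned)
```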

Processors, methods, systems, and instructions to generate sequences of integers in numerical order that differ by a constant stride

A method of an aspect includes receiving an instruction indicating a destination storage location. A result is stored in the destination storage location in response to the instruction. The result includes a sequence of at least four non-negative integers in numerical order with all integers in consecutive positions differing by a constant stride of at least two. In an aspect, storing the result including the sequence of the at least four integers is performed without calculating the at least four integers using a result of a preceding instruction. Other methods, apparatus, systems, and instructions are disclosed.
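
The stored result can be illustrated with a small generator function. The parameter names `num_elements`, `stride`, and `start` are assumptions; the point shown is that the sequence depends only on the instruction's own parameters, not on the output of an earlier instruction.

```python
def stride_sequence(num_elements, stride, start=0):
    """Return non-negative integers in ascending order separated by a constant stride."""
    if num_elements < 4 or stride < 2 or start < 0:
        raise ValueError("expects at least four elements, stride >= 2, start >= 0")
    return [start + i * stride for i in range(num_elements)]


print(stride_sequence(4, 2))      # [0, 2, 4, 6]
print(stride_sequence(8, 4, 1))   # [1, 5, 9, 13, 17, 21, 25, 29]
```

A sequence like this is typically useful as an index vector for strided memory access.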

COALESCING ADJACENT GATHER/SCATTER OPERATIONS

According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read contiguous first and second data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and the second data element in a second entry of a second storage location corresponding to the first entry of the first storage location.
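
The coalesced read can be sketched as a gather that fetches two adjacent elements per address and splits them across two destinations. The function name, the flat `memory` list, and the list-based destinations are assumptions for illustration.

```python
def coalesced_gather(memory, addresses):
    """For each address, read two contiguous elements and split them."""
    dest_first, dest_second = [], []
    for addr in addresses:
        # One contiguous read covers both elements at addr and addr + 1.
        first, second = memory[addr], memory[addr + 1]
        dest_first.append(first)    # entry i of the first storage location
        dest_second.append(second)  # corresponding entry i of the second
    return dest_first, dest_second


memory = [10, 11, 20, 21, 30, 31]  # interleaved pairs
print(coalesced_gather(memory, [4, 0]))
```

This matches the common case of converting interleaved pairs, such as (x, y) coordinates, into two separate vectors with half as many memory accesses as two independent gathers.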