G06F15/8007

MASKING FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURE
20230058355 · 2023-02-23 ·

A reconfigurable compute fabric can include multiple nodes, and each node can include multiple tiles with respective processing and storage elements. The tiles can be arranged in an array or grid and can be communicatively coupled. In an example, a first node can include a tile cluster of N memory-compute tiles, and the N memory-compute tiles can be coupled using a first portion of a synchronous compute fabric. Operations performed by the respective processing and storage elements of the N memory-compute tiles can be selectively enabled or disabled based on information in a mask field of data propagated through the first portion of the synchronous compute fabric.

PROCESSOR WITH TABLE LOOKUP UNIT

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.

PROCESSOR HAVING MULTIPLE CORES, SHARED CORE EXTENSION LOGIC, AND SHARED CORE EXTENSION UTILIZATION INSTRUCTIONS
20230052630 · 2023-02-16 ·

An apparatus of an aspect includes a plurality of cores and shared core extension logic coupled with each of the plurality of cores. The shared core extension logic has shared data processing logic that is shared by each of the plurality of cores. Instruction execution logic, for each of the cores, in response to a shared core extension call instruction, is to call the shared core extension logic. The call is to have data processing performed by the shared data processing logic on behalf of a corresponding core. Other apparatus, methods, and systems are also disclosed.

Message based general register file assembly

In an example, an apparatus comprises a plurality of execution units, and logic, at least partially including hardware logic, to assemble a general register file (GRF) message and hold the GRF message in storage in a data port until all data for the GRF message is received. Other embodiments are also disclosed and claimed.

Random matrix hardware for machine learning

A computer-implemented method for training a random matrix network is presented. The method includes initializing a random matrix, inputting a plurality of first vectors into the random matrix, and outputting a plurality of second vectors from the random matrix to be fed back into the random matrix for training. The random matrix can include a plurality of two-terminal devices or a plurality of three-terminal devices or a film-based device.

SYSTEM AND METHODS FOR EFFICIENT EXECUTION OF A COLLABORATIVE TASK IN A SHADER SYSTEM

Methods and systems are disclosed for executing a collaborative task in a shader system. Techniques disclosed include receiving, by the system, input data and computing instructions associated with the collaborative task, as well as a configuration setting, causing the system to operate in a takeover mode. The system then launches, exclusively in one workgroup processor, a workgroup including wavefronts configured to execute the collaborative task.

Streaming engine with flexible streaming engine template supporting differing number of nested loops with corresponding loop counts and loop offsets
11481327 · 2022-10-25 · ·

A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces address of data elements for the nested loops. A steam head register stores data elements next to be supplied to functional units for use as operands. A stream template specifies loop count and loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently enabling trade off between the number of loops supported and the size of the loop counts and loop dimensions.

Micro-processor circuit and method of performing neural network operation

A micro-processor circuit and a method of performing neural network operation are provided. The micro-processor circuit is suitable for performing neural network operation. The micro-processor circuit includes a parameter generation module, a compute module and a truncation logic. The parameter generation module receives in parallel a plurality of input parameters and a plurality of weight parameters of the neural network operation. The parameter generation module generates in parallel a plurality of sub-output parameters according to the input parameters and the weight parameters. The compute module receives in parallel the sub-output parameters. The compute module sums the sub-output parameters to generate a summed parameter. The truncation logic receives the summed parameter. The truncation logic performs a truncation operation based on the summed parameter to generate a plurality of output parameters of the neural network operation.

Systems and methods for systolic array design from a high-level program
11604758 · 2023-03-14 · ·

Systems and methods for automated systolic array design from a high-level program are disclosed. One implementation of a systolic array design supporting a convolutional neural network includes a two-dimensional array of reconfigurable processing elements arranged in rows and columns. Each processing element has an associated SIMD vector and is connected through a local connection to at least one other processing element. An input feature map buffer having a double buffer is configured to store input feature maps, and an interconnect system is configured to pass data to neighboring processing elements in accordance with a processing element scheduler. A CNN computation is mapped onto the two-dimensional array of reconfigurable processing elements using an automated system configured to determine suitable reconfigurable processing element parameters.

Coalescing adjacent gather/scatter operations

According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read contiguous a first and a second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location.