G06F9/3887

RECONFIGURABLE SIMD ENGINE
20230214351 · 2023-07-06

An exemplary SIMD computing system comprises a SIMD processing element (SPE) configured to perform a selected operation on a portion of a processor input data word, with the operation selected by control signals read from a control memory location addressed by a decoded instruction. The SPE may comprise one or more adders, multipliers, or multiplexers coupled to the control signals. The control signals may comprise one or more bits read from the control memory. The control memory may be an MxN (M rows by N columns) memory holding M possible SIMD operations of N control signals each. Each decoded instruction may select an SPE operation from among the M rows. A plurality of SPEs may receive the same control signals. The control memory may be rewritable, advantageously permitting customizable SIMD operations that are reconfigurable by storing, in the control memory locations, control signals designed to cause the SPE to perform selected operations.
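
As a rough illustration of the control-memory mechanism, the C sketch below models a small control memory whose rows are addressed by a decoded opcode and whose bits select among an adder, a multiplier, and a multiplexer inside each SPE, with the same control signals fanned out to several lanes. The row count, bit-field layout, and operation encodings are illustrative assumptions, not details from the patent.

```c
#include <stdint.h>
#include <stdio.h>

#define M_ROWS 8   /* storable SIMD operations (assumed)   */
#define N_BITS 4   /* control signals per operation (assumed) */

static uint8_t control_mem[M_ROWS]; /* each row holds N_BITS control signals */

/* One SIMD processing element: control bits pick among an adder,
 * a multiplier, and a pass-through multiplexer path. */
static uint16_t spe_execute(uint8_t ctrl, uint16_t a, uint16_t b)
{
    switch (ctrl & 0x3) {                 /* bits [1:0]: function select */
    case 0:  return a + b;                /* adder                        */
    case 1:  return a * b;                /* multiplier                   */
    case 2:  return (ctrl & 0x4) ? b : a; /* mux, bit 2 picks the input   */
    default: return a;
    }
}

int main(void)
{
    /* Reconfigure: rewrite the control memory to define custom ops. */
    control_mem[0] = 0x0;  /* opcode 0 -> add      */
    control_mem[1] = 0x1;  /* opcode 1 -> multiply */

    /* A decoded opcode addresses one control-memory row; the same
     * control signals drive every SPE in the lane group. */
    uint8_t  ctrl = control_mem[1];
    uint16_t lanes_a[4] = {1, 2, 3, 4}, lanes_b[4] = {10, 20, 30, 40};

    for (int i = 0; i < 4; i++)
        printf("lane %d: %u\n", i,
               (unsigned)spe_execute(ctrl, lanes_a[i], lanes_b[i]));
    return 0;
}
```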

Technologies for assigning workloads to balance multiple resource allocation objectives

Technologies for allocating resources of managed nodes to workloads to balance multiple resource allocation objectives include an orchestrator server to receive resource allocation objective data indicative of multiple resource allocation objectives to be satisfied. The orchestrator server is additionally to determine an initial assignment of a set of workloads among the managed nodes and receive telemetry data from the managed nodes. The orchestrator server is further to determine, as a function of the telemetry data and the resource allocation objective data, an adjustment to the assignment of the workloads to increase an achievement of at least one of the resource allocation objectives without decreasing an achievement of another of the resource allocation objectives, and apply the adjustment to the assignment of the workloads among the managed nodes as the workloads are performed. Other embodiments are also described and claimed.
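
The adjustment criterion reads as a Pareto-improvement test: accept a candidate reassignment only if at least one objective's achievement increases and none decreases. A minimal C sketch of that acceptance test follows; the three-objective split and the scoring struct are assumptions for illustration, not the patent's data model.

```c
#include <stdbool.h>
#include <stdio.h>

#define N_OBJECTIVES 3

/* Hypothetical per-assignment achievement scores, derived from telemetry. */
typedef struct { double score[N_OBJECTIVES]; } Achievement;

/* Accept an adjustment only if it is a Pareto improvement:
 * at least one objective improves and none regresses. */
static bool accept_adjustment(const Achievement *cur, const Achievement *cand)
{
    bool improved = false;
    for (int i = 0; i < N_OBJECTIVES; i++) {
        if (cand->score[i] < cur->score[i]) return false; /* regression */
        if (cand->score[i] > cur->score[i]) improved = true;
    }
    return improved;
}

int main(void)
{
    Achievement cur  = {{0.70, 0.80, 0.60}}; /* e.g. utilization, power, latency */
    Achievement cand = {{0.75, 0.80, 0.60}}; /* one objective up, none down      */
    printf("apply adjustment: %s\n", accept_adjustment(&cur, &cand) ? "yes" : "no");
    return 0;
}
```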

Compute optimizations for neural networks using ternary weight

One embodiment provides for a compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction that specifies multiple operands, including a multi-bit input value and a ternary weight associated with a neural network, and an arithmetic logic unit including a multiplier, an adder, and an accumulator register. To execute the decoded instruction, the multiplier is to perform a multiplication operation on the multi-bit input based on the ternary weight to generate an intermediate product, and the adder is to add the intermediate product to a value stored in the accumulator register and update the value stored in the accumulator register.
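
Because a ternary weight takes only the values -1, 0, or +1, the multiply step degenerates to a subtract, a skip, or an add of the multi-bit input. The C sketch below shows this reduced multiply-accumulate; the weight representation and operand widths are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

/* With w in {-1, 0, +1}, the "multiplication" collapses to
 * subtract / skip / add, so no full multiplier array is exercised. */
static int32_t ternary_mac(int32_t acc, int16_t x, int8_t w)
{
    if (w > 0)      acc += x;  /* w = +1: intermediate product is  x */
    else if (w < 0) acc -= x;  /* w = -1: intermediate product is -x */
    /* w = 0: intermediate product is zero, accumulator unchanged */
    return acc;
}

int main(void)
{
    int16_t inputs[4]  = {5, -3, 7, 2};
    int8_t  weights[4] = {1,  0, -1, 1};
    int32_t acc = 0;                     /* accumulator register */
    for (int i = 0; i < 4; i++)
        acc = ternary_mac(acc, inputs[i], weights[i]);
    printf("dot product = %d\n", acc);   /* 5 - 7 + 2 = 0 */
    return 0;
}
```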

ACCELERATED PROCESSING DEVICE AND METHOD OF SHARING DATA FOR MACHINE LEARNING
20230004385 · 2023-01-05

A processing device is provided which comprises a plurality of compute units configured to process data; a plurality of arithmetic logic units, instantiated separate from the plurality of compute units, configured to store the data at the arithmetic logic units and perform calculations using the data; and an interconnect network connecting the arithmetic logic units and configured to provide the arithmetic logic units with shared access to the data for communication between the arithmetic logic units. The interconnect network is also configured to provide the compute units with shared access to the data for communication between the compute units.
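
A very simplified C model of the shared-data idea follows: operand data is staged at the ALUs themselves, and an interconnect read path lets any ALU (or compute unit) fetch another ALU's stored data directly rather than routing through compute-unit local memory. The ALU count and the access function are assumptions for illustration.

```c
#include <stdio.h>

#define N_ALUS 4

/* Each ALU holds operand data locally, separate from the compute units. */
typedef struct { float stored; } Alu;

static Alu alus[N_ALUS];

/* Interconnect read: any requester (ALU or compute unit) can fetch
 * the data stored at any ALU over the shared network. */
static float interconnect_read(int alu_id) { return alus[alu_id].stored; }

int main(void)
{
    for (int i = 0; i < N_ALUS; i++)
        alus[i].stored = (float)i * 1.5f;  /* data staged at the ALUs */

    /* ALU 0 combines its own data with ALU 3's via the interconnect. */
    float result = alus[0].stored + interconnect_read(3);
    printf("result = %.1f\n", result);
    return 0;
}
```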

MACHINE LEARNING CLUSTER PIPELINE FUSION

Methods, systems, and devices for pipeline fusion of a plurality of kernels. In some implementations, a first batch of a first kernel is executed on a first processing device to generate a first output of the first kernel based on an input. A first batch of a second kernel is executed on a second processing device to generate a first output of the second kernel based on the first output of the first kernel. A second batch of the first kernel is executed on the first processing device to generate a second output of the first kernel based on the input. The execution of the second batch of the first kernel overlaps at least partially in time with executing the first batch of the second kernel.
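
The overlap can be pictured as a two-stage software pipeline: while the second device runs the second kernel on batch i, the first device is already running the first kernel on batch i+1. The C sketch below just prints that schedule to make the overlap visible; a real implementation would dispatch the batches to two accelerators concurrently rather than print.

```c
#include <stdio.h>

#define N_BATCHES 4

int main(void)
{
    /* At time step t, device A runs kernel1 on batch t while device B
     * runs kernel2 on batch t-1 (on hardware, these execute in parallel). */
    for (int t = 0; t <= N_BATCHES; t++) {
        if (t < N_BATCHES)
            printf("t=%d  devA: kernel1(batch %d)\n", t, t);
        if (t > 0)
            printf("t=%d  devB: kernel2(batch %d, input = kernel1 output %d)\n",
                   t, t - 1, t - 1);
    }
    return 0;
}
```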

STREAMING ENGINE WITH STREAM METADATA SAVING FOR CONTEXT SWITCHING
20230004391 · 2023-01-05

A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements. A stream head register stores the data elements next to be supplied to functional units for use as operands. Stream metadata is stored in response to a stream store instruction. Stored stream metadata is restored to the streaming engine in response to a stream restore instruction. An interrupt changes an open stream to a frozen state, discarding stored stream data. A return from interrupt changes a frozen stream to an active state.
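
A compact C sketch of the save/freeze/restore life cycle follows. The metadata fields (base address, per-level loop counts and strides) and the state encoding are illustrative assumptions, not the patent's register layout; only the transitions mirror the abstract.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stream metadata: what must survive a context switch so
 * the stream can be reopened at the same element. */
typedef struct {
    uint64_t base_addr;
    uint32_t loop_count[4];  /* remaining iterations per nesting level */
    uint32_t loop_stride[4]; /* address stride per nesting level       */
    int      state;          /* 0 = inactive, 1 = active, 2 = frozen   */
} StreamMeta;

static StreamMeta engine;    /* the open stream                 */
static StreamMeta saved;     /* memory image of stored metadata */

static void stream_store(void)   { saved = engine; }  /* stream store instruction   */
static void stream_restore(void) { engine = saved; engine.state = 1; }

/* Interrupt: stream becomes frozen; buffered stream data is discarded,
 * but the metadata allows fetching to resume where it stopped. */
static void on_interrupt(void)       { engine.state = 2; }
static void on_return_from_int(void) { engine.state = 1; } /* frozen -> active */

int main(void)
{
    engine = (StreamMeta){0x1000, {8, 4, 0, 0}, {4, 64, 0, 0}, 1};

    stream_store();            /* context switch out               */
    engine.state = 0;          /* stream torn down by new context  */
    stream_restore();          /* context switch back in           */

    on_interrupt();
    on_return_from_int();
    printf("state=%d base=%#llx\n", engine.state,
           (unsigned long long)engine.base_addr);
    return 0;
}
```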

Monolithic vector processor configured to operate on variable length vectors using a vector length register

A computer processor comprising a vector unit is disclosed. The vector unit may comprise a vector register file comprising at least one register to hold a varying number of elements. The vector unit may further comprise a vector length register file comprising at least one register to specify the number of operations of a vector instruction to be performed on the varying number of elements in the at least one register of the vector register file. The computer processor may be implemented as a monolithic integrated circuit.
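
This is the classic strip-mining pattern: a vector length register absorbs the tail of a loop whose trip count is not a multiple of the hardware vector width, so no scalar cleanup loop is needed. A C model follows; MAX_VL and the set_vl helper are illustrative stand-ins for the hardware maximum and the instruction that writes the vector length register.

```c
#include <stdio.h>

#define MAX_VL 8  /* max elements per vector register (assumed) */

/* Model of writing the vector length register for one loop strip. */
static int set_vl(int remaining) { return remaining < MAX_VL ? remaining : MAX_VL; }

/* One vector instruction performs vl operations on the register's elements. */
static void vadd(const float *a, const float *b, float *c, int vl)
{
    for (int i = 0; i < vl; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    enum { N = 11 };  /* deliberately not a multiple of MAX_VL */
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 10.0f * i; }

    /* Strip-mine: each pass sets the vector length, then issues one
     * vector add covering vl elements; the last pass handles the tail. */
    for (int done = 0; done < N; ) {
        int vl = set_vl(N - done);
        vadd(a + done, b + done, c + done, vl);
        done += vl;
    }
    printf("c[10] = %.0f\n", c[10]);  /* 110 */
    return 0;
}
```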

64-BIT TWO-DIMENSIONAL BLOCK LOAD WITH TRANSPOSE

An apparatus to facilitate 64-bit two-dimensional (2D) block load with transpose is disclosed. The apparatus includes a processor comprising processing resources; and load store pipeline hardware circuitry coupled to the processing resources, the load store pipeline hardware circuitry to receive a 64-bit two-dimensional (2D) block load message with transpose from the processing resources. The load store pipeline hardware circuitry comprises a load store pipeline sequencer to map rows of a block of memory corresponding to the 64-bit 2D block load message with transpose to 64-bit standard load messages, and load store pipeline return circuitry to: sequentially number general register files (GRFs) used for returning elements of the block of memory accessed by the 64-bit standard load messages to the processing resources; and return, to the processing resources, the sequentially numbered GRFs in response to the 64-bit 2D block load message with transpose.
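
The C sketch below models the sequencer and return path on a small block: each memory row is fetched as it would be by ordinary 64-bit loads, and the return path writes elements into sequentially numbered register slots so that the block lands transposed. The 4x2 block shape and the flat GRF array are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define ROWS 4
#define COLS 2

int main(void)
{
    /* Row-major block of 64-bit elements in memory. */
    uint64_t mem[ROWS][COLS] = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};
    uint64_t grf[ROWS * COLS];  /* sequentially numbered destination registers */

    /* Sequencer maps each block row to standard 64-bit loads; the return
     * path places element (r, c) into GRF slot c*ROWS + r, which is the
     * transposed (column-major) layout. */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            grf[c * ROWS + r] = mem[r][c];

    for (int i = 0; i < ROWS * COLS; i++)
        printf("GRF%d = %llu\n", i, (unsigned long long)grf[i]);
    return 0;
}
```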

FUSED INSTRUCTION TO ACCELERATE PERFORMANCE OF SECURE HASH ALGORITHM 2 (SHA-2) WORKLOADS IN A GRAPHICS ENVIRONMENT

An apparatus to facilitate a fused instruction to accelerate performance of secure hash algorithm 2 (SHA-2) in a graphics environment is disclosed. The apparatus includes a processor comprising processing resources, the processing resources comprising execution circuitry to: receive a fused SHA instruction identifying a length corresponding to a data size of the fused SHA instruction and a functional control identifying an operation type of the fused SHA instruction; based on decoding the fused SHA instruction, cause a sub-function identified by the length and the functional control to be scheduled to an integer pipeline of the execution circuitry; and execute the sub-function of the fused SHA instruction in the integer pipeline, the sub-function to perform merged operations on a source operand of the fused SHA instruction, the merged operations comprising a rotate operation, a shift operation, and an XOR operation.
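
The merged rotate/shift/XOR pattern is exactly the shape of the SHA-256 message-schedule sigma functions (per FIPS 180-4), shown below in C; computing one of these as a single fused operation replaces roughly five dependent ALU instructions. How these map onto the patent's length and functional-control encodings is not shown and is left as an assumption.

```c
#include <stdint.h>
#include <stdio.h>

/* 32-bit rotate right (n in 1..31). */
static uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

/* SHA-256 sigma0: two rotates, one shift, two XORs, merged. */
static uint32_t sha256_sigma0(uint32_t x)
{
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3);
}

/* SHA-256 sigma1: same merged shape with different amounts. */
static uint32_t sha256_sigma1(uint32_t x)
{
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10);
}

int main(void)
{
    uint32_t w = 0x12345678u;  /* a message-schedule word */
    printf("sigma0 = %08x, sigma1 = %08x\n",
           sha256_sigma0(w), sha256_sigma1(w));
    return 0;
}
```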

LARGE INTEGER MULTIPLICATION ENHANCEMENTS FOR GRAPHICS ENVIRONMENT

An apparatus to facilitate large integer multiplication enhancements in a graphics environment is disclosed. The apparatus includes a processor comprising processing resources, the processing resources comprising multiplier circuitry to: receive operands for a multiplication operation, wherein the multiplication operation is part of a chain of multiplication operations for a large integer multiplication; and issue a multiply and add (MAD) instruction for the multiplication operation utilizing at least one of a double precision multiplier or a 48-bit output, wherein the MAD instruction is to generate an output in a single clock cycle of the processor.
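
To see why a wide-output MAD helps, consider schoolbook multi-limb multiplication: every inner step is multiply-plus-partial-plus-carry, which a single MAD with a wide result captures without separate add and carry instructions. The C sketch below uses 32-bit limbs with a 64-bit MAD result as a stand-in for the 24-bit-limb, 48-bit-output case; limb width and the schoolbook loop are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define LIMBS 2

int main(void)
{
    uint32_t a[LIMBS] = {0xFFFFFFFFu, 0x00000001u};  /* a = 0x1FFFFFFFF */
    uint32_t b[LIMBS] = {0x00000002u, 0x00000000u};  /* b = 2           */
    uint32_t r[2 * LIMBS] = {0};

    for (int i = 0; i < LIMBS; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < LIMBS; j++) {
            /* One chained MAD: 32x32 multiply + previous partial + carry,
             * all absorbed by the wide 64-bit result. */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry    = t >> 32;
        }
        r[i + LIMBS] += (uint32_t)carry;
    }
    /* Expected: 0x1FFFFFFFF * 2 = 0x3FFFFFFFE */
    printf("r = %08x%08x%08x%08x\n", r[3], r[2], r[1], r[0]);
    return 0;
}
```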