G06F9/3877

Matrix operation optimization mechanism

An apparatus to facilitate machine learning matrix processing is disclosed. The apparatus comprises a memory to store matrix data, a register file including a plurality of registers, and one or more processors to execute an instruction to: examine a message descriptor included in the instruction to determine a type of matrix layout manipulation operation that is to be executed; examine a message header included in the instruction having a plurality of parameters that define a two-dimensional (2D) memory surface that is to be retrieved; and retrieve one or more blocks of the matrix data from the memory based on the plurality of parameters, wherein the one or more blocks of the matrix data are stored within a first set of the plurality of registers.
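The flow the abstract describes can be pictured in software. The sketch below is a hypothetical model, not the patented hardware: `MessageHeader` and `load_2d_block` are invented names, the flat list stands in for the 2D memory surface, and each row of the returned block stands in for one destination register; a transpose serves as one example of a layout manipulation selected by the descriptor.

```python
from dataclasses import dataclass

@dataclass
class MessageHeader:
    base: int      # surface base offset into memory
    pitch: int     # row pitch of the 2D surface, in elements
    block_x: int   # top-left column of the block to retrieve
    block_y: int   # top-left row of the block to retrieve
    block_w: int   # block width in elements
    block_h: int   # block height in elements

def load_2d_block(memory, header, op="load"):
    """Gather one block of matrix data from a 2D surface; the 'op'
    stands in for the layout manipulation named by the descriptor."""
    block = [
        [memory[header.base + (header.block_y + r) * header.pitch
                + header.block_x + c]
         for c in range(header.block_w)]
        for r in range(header.block_h)
    ]
    if op == "transpose":  # one example layout manipulation
        block = [list(col) for col in zip(*block)]
    return block  # each row models one destination register
```

For a 4x4 surface, a header selecting the interior 2x2 block yields that block either as laid out in memory or transposed, depending on the operation the descriptor selects.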

Offloading for gradient computation

Techniques for selectively offloading data that is computed by a first processing unit during training of an artificial neural network onto memory associated with a second processing unit and transferring the data back to the first processing unit when the data is needed for further processing are described herein. For example, the first processing unit may compute activations for operations associated with forward propagation. During the forward propagation, one or more of the activations may be transferred to a second processing unit for storage. Then, during backpropagation for the artificial neural network, the activations may be transferred back to the first processing unit as needed to compute gradients.
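A minimal sketch of the offloading policy, under stated assumptions: `ActivationOffloader` and its dictionaries are hypothetical stand-ins for the two processing units' memories, and the "transfer" is a plain assignment rather than a device copy. The selection rule (every other layer) is only an example of "one or more of the activations" being offloaded.

```python
class ActivationOffloader:
    """Park selected forward-pass activations on a second unit's
    memory and fetch them back when backpropagation needs them."""

    def __init__(self, offload_every=2):
        self.offload_every = offload_every
        self.local = {}   # stands in for first-processing-unit memory
        self.remote = {}  # stands in for second-processing-unit memory

    def save_forward(self, layer, activation):
        # During forward propagation, selectively transfer some
        # activations to the second processing unit for storage.
        if layer % self.offload_every == 0:
            self.remote[layer] = activation
        else:
            self.local[layer] = activation

    def fetch_for_backward(self, layer):
        # During backpropagation, transfer the activation back to the
        # first processing unit as needed to compute gradients.
        if layer in self.local:
            return self.local.pop(layer)
        return self.remote.pop(layer)
```

The point of the pattern is that the first unit's memory only ever holds the activations it is actively using, at the cost of transfer traffic during the backward pass.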

APPARATUS AND METHOD FOR PROPAGATING CONDITIONALLY EVALUATED VALUES IN SIMD/VECTOR EXECUTION USING AN INPUT MASK REGISTER

An apparatus and method for propagating conditionally evaluated values are disclosed. For example, a method according to one embodiment comprises: reading each value contained in an input mask register, each value being a true value or a false value and having a bit position associated therewith; for each true value read from the input mask register, generating a first result containing the bit position of the true value; for each false value read from the input mask register following the first true value, adding the vector length of the input mask register to a bit position of the last true value read from the input mask register to generate a second result; and storing each of the first results and second results in bit positions of an output register corresponding to the bit positions read from the input mask register.
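The rule the method states can be modeled directly. This is a hypothetical scalar model of the vector instruction, not the patented implementation; the abstract does not say what happens to false lanes before the first true, so this sketch leaves them as 0, which is an assumption.

```python
def propagate_mask_positions(mask):
    """For each true lane, emit its bit position; for each false lane
    after the first true, emit (last true position + vector length).
    False lanes before any true are left as 0 (an assumption)."""
    vl = len(mask)  # vector length of the input mask register
    out = [0] * vl
    last_true = None
    for i, bit in enumerate(mask):
        if bit:
            out[i] = i        # first result: the true lane's position
            last_true = i
        elif last_true is not None:
            out[i] = last_true + vl  # second result: propagated value
    return out
```

For a five-lane mask with trues in lanes 1 and 3, lanes 2 and 4 receive the preceding true position plus the vector length (5), so the output encodes both the true positions and how far back each false lane's source lies.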

Streaming engine with multi-dimensional circular addressing selectable at each dimension
11709779 · 2023-07-25

A streaming engine employed in a digital data processor may specify a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements for the nested loops. A stream head register stores the data elements next to be supplied to functional units for use as operands. A stream template register independently specifies a linear or circular addressing mode for each of the nested loops.
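The per-dimension mode selection can be sketched as a nested-loop address generator. This is a simplified hypothetical model: `stream_addresses` is an invented name, and circular mode is modeled as a plain modulo on each dimension's offset, which abstracts away the aligned power-of-two buffers real hardware would use.

```python
def stream_addresses(base, counts, strides, modes, circ_sizes):
    """Generate element addresses for nested loops, where each
    dimension is independently linear or circular (modulo its
    circular buffer size)."""
    def gen(dim, addr):
        if dim == len(counts):
            yield addr
            return
        for i in range(counts[dim]):
            offset = i * strides[dim]
            if modes[dim] == "circular":
                offset %= circ_sizes[dim]  # wrap within this dimension
            yield from gen(dim + 1, addr + offset)
    return list(gen(0, base))
```

With a stride of 2 over 4 iterations, a circular buffer size of 4 makes the addresses wrap back to the start halfway through, while linear mode walks straight ahead.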

Computing device and method

The present disclosure provides a computation device. The computation device is configured to perform a machine learning computation, and includes an operation unit, a controller unit, and a storage unit. The storage unit is configured to obtain input data and a computation instruction. The controller unit is configured to extract and parse the computation instruction from the storage unit to obtain one or more operation instructions, and to send the one or more operation instructions and the input data to the operation unit. The operation unit is configured to perform operations on the input data according to the one or more operation instructions to obtain a computation result of the computation instruction. In the examples of the present disclosure, the input data involved in machine learning computations is represented by fixed-point data, thereby improving the processing speed and efficiency of training operations.
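The fixed-point representation the abstract relies on can be illustrated with a small sketch. The function names and the 8 fractional bits are assumptions for illustration, not the disclosure's format; the point is that the multiply-accumulate runs entirely on integers, with the product carrying twice the fractional bits.

```python
def to_fixed(x, frac_bits=8):
    """Quantize a real value to an integer fixed-point code."""
    return round(x * (1 << frac_bits))

def from_fixed(q, frac_bits=8):
    """Recover the real value from a fixed-point code."""
    return q / (1 << frac_bits)

def fixed_dot(a, b, frac_bits=8):
    """Dot product carried out entirely with integer MACs."""
    qa = [to_fixed(x, frac_bits) for x in a]
    qb = [to_fixed(x, frac_bits) for x in b]
    acc = sum(x * y for x, y in zip(qa, qb))  # integer arithmetic only
    return from_fixed(acc, 2 * frac_bits)     # product has 2x frac bits
```

Integer MACs of this kind are cheaper in silicon than floating-point ones, which is the source of the claimed speed and efficiency gains, at the cost of quantization error bounded by the fractional precision.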

Methods and systems for graphics rendering assistance by a multi-access server

An illustrative multi-access server receives a request from a client system, the request indicating a requested rendering operation. The multi-access server also accesses input data from an asset data source. The multi-access server performs a rendering pass on the input data, the rendering pass performed in accordance with the requested rendering operation to generate a render pass output dataset. The render pass output dataset is representative of a renderable image depicting image content in a first form having limited quality or detail. The render pass output dataset is also configured for use in generating fully-rendered image data that depicts the image content in a second form having additional quality or detail beyond the limited quality or detail of the first form. Corresponding methods and systems are also disclosed.
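The two-stage idea (a limited-detail render pass whose output seeds a full render) can be sketched in miniature. Everything here is a hypothetical stand-in: `render_pass` shades only a sparse grid of pixels (the limited-quality first form), and `full_render` replicates each coarse sample to fill the image, standing in for whatever refinement produces the second form.

```python
def render_pass(shade, width, height, step=2):
    """Coarse pass: shade only every `step`-th pixel, producing a
    render pass output dataset of limited quality or detail."""
    return {(x, y): shade(x, y)
            for y in range(0, height, step)
            for x in range(0, width, step)}

def full_render(coarse, width, height, step=2):
    """Generate the fully-rendered image from the coarse output by
    replicating each sample (a stand-in for real refinement)."""
    return [[coarse[(x - x % step, y - y % step)]
             for x in range(width)]
            for y in range(height)]
```

The split matters because the expensive, asset-heavy coarse pass can run on the multi-access server while the cheap fill-in runs wherever the final image is needed.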

DISTRIBUTED ACCELERATOR

Systems, methods, and devices are described for coordinating a distributed accelerator. A command that includes instructions for performing a task is received. One or more sub-tasks of the task are determined to generate a set of sub-tasks. For each sub-task of the set of sub-tasks, an accelerator slice of a plurality of accelerator slices of the distributed accelerator is allocated, and sub-task instructions for performing the sub-task are determined. The sub-task instructions are transmitted to the allocated accelerator slice for each sub-task. Each allocated accelerator slice is configured to generate a corresponding response indicative of the allocated accelerator slice having completed its respective sub-task. In a further example aspect, the corresponding responses are received from each allocated accelerator slice and a coordinated response indicative of the corresponding responses is generated.
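The split/dispatch/coordinate flow can be sketched in a few lines. This is a hypothetical model: each "slice" is just a callable, the round-robin split and the summed coordinated response are illustrative choices, not the patented mechanism.

```python
def run_on_distributed_accelerator(task_items, slices):
    """Determine sub-tasks of a task, dispatch one sub-task to each
    allocated accelerator slice, and coordinate the responses."""
    n = len(slices)
    # Determine the set of sub-tasks (round-robin split, for example).
    sub_tasks = [task_items[i::n] for i in range(n)]
    # Each slice completes its sub-task and generates a response.
    responses = [slice_fn(chunk)
                 for slice_fn, chunk in zip(slices, sub_tasks)]
    # Generate a coordinated response indicative of all responses.
    return sum(responses)
```

With two slices that each sum their share of the items, the coordinated response equals the sum over the whole task, regardless of how the items were partitioned.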

Communicating an event to a remote entity
11568045 · 2023-01-31

An example method includes detecting an event in an electronic system. The electronic system includes an electronic component and a switched mode power supply. The electronic component draws an amount of power from the switched mode power supply during operation. In response to detecting the event, the electronic component is operated to cause the electronic component to change the amount of power that the electronic component draws from the switched mode power supply. The change in the amount of power that the electronic component draws causes the switched mode power supply to output a signal that is evidence of the event.
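A toy simulation makes the side-channel idea concrete. All names, the draw values, and the threshold detector are invented for illustration; the real system signals through the analog behavior of the switched mode power supply, which this sketch reduces to a list of load samples.

```python
class SwitchedModePSU:
    """Stand-in supply whose output signal tracks the load it feeds."""
    def __init__(self):
        self.samples = []
    def supply(self, draw):
        self.samples.append(draw)

class Component:
    """Electronic component that changes its draw to encode an event."""
    def __init__(self, psu, idle_draw=10):
        self.psu, self.draw = psu, idle_draw
    def tick(self, event=False):
        if event:
            self.draw += 5  # deliberately change the power drawn
        self.psu.supply(self.draw)

def detect_event(samples, threshold=3):
    """The draw step appears as a signature in the supply's output."""
    return any(b - a > threshold for a, b in zip(samples, samples[1:]))
```

A remote entity observing only the supply's output can thus infer the event, with no dedicated signaling path from the component.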

Feature map and weight selection method and accelerating device

The present disclosure provides a processing device including: a coarse-grained pruning unit configured to perform coarse-grained pruning on the weights of a neural network to obtain pruned weights, and an operation unit configured to train the neural network according to the pruned weights. The coarse-grained pruning unit is specifically configured to select M weights from the weights of the neural network through a sliding window, and when the M weights meet a preset condition, all or part of the M weights may be set to 0. The processing device can reduce memory accesses while reducing the amount of computation, thereby obtaining an acceleration ratio and reducing energy consumption.
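The sliding-window selection can be sketched as follows. The window stride, the magnitude threshold, and zeroing the whole group are example choices for the unspecified "preset condition"; the disclosure covers other conditions and partial zeroing as well.

```python
def coarse_grained_prune(weights, m=4, threshold=0.1):
    """Slide a window of M weights over the weight list and zero a
    whole group when it meets the condition (here: all magnitudes
    below `threshold`, an illustrative choice)."""
    pruned = list(weights)
    for start in range(0, len(pruned) - m + 1, m):
        group = pruned[start:start + m]
        if all(abs(w) < threshold for w in group):
            pruned[start:start + m] = [0.0] * m
    return pruned
```

Zeroing whole groups rather than individual weights is what makes the pruning "coarse-grained": the zeros are contiguous, so entire memory accesses and multiply operations can be skipped, not just individual multiplies.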

Thread group scheduling for graphics processing

Embodiments are generally directed to thread group scheduling for graphics processing. An embodiment of an apparatus includes a plurality of processors including a plurality of graphics processors to process data; a memory; and one or more caches for storage of data for the plurality of graphics processors, wherein the plurality of processors are to schedule a plurality of groups of threads for processing by the plurality of graphics processors, the scheduling including applying a bias for scheduling the plurality of groups of threads according to cache locality for the one or more caches.
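One way to read the locality bias is as a scoring rule in the scheduler. The sketch below is a hypothetical greedy model, not the patented scheduler: caches are modeled as sets of resident data IDs, and the bias is simply "prefer the processor whose cache already holds the most of the group's working set."

```python
def schedule_thread_groups(groups, caches):
    """Assign each thread group to the graphics processor whose cache
    already holds the most of the group's working set (the bias)."""
    assignment = {}
    for gid, working_set in groups.items():
        best = max(caches, key=lambda p: len(working_set & caches[p]))
        assignment[gid] = best
        caches[best] |= working_set  # the group's data becomes resident
    return assignment
```

Groups that share data thus tend to land on the same processor, so later groups hit in cache instead of refetching from memory, which is the whole benefit the bias is after.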