G06F9/3555

Heavy-weight/light-weight GPU shader core pair architecture

A shader core includes a first processing element (PE), a second PE, a register file, and a warp sequencer unit (WSQ). The first PE includes a first predetermined number of execution units, and the second PE includes a second predetermined number of execution units that is less than the first predetermined number. The register file is shared by the first PE and the second PE. The WSQ is coupled to the first PE and to the second PE and schedules an instruction trace to execute on the first PE or the second PE based on information contained in a trace header of the instruction trace. The information contained in the trace header indicates whether the instruction trace is executable on the second PE.
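
The header-driven dispatch described above can be sketched in a few lines. This is only an illustration: the `light_capable` field, the PE names, and the fallback to the larger PE when the smaller one is busy are assumptions, not details taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    name: str
    # Hypothetical header bit: True if the trace may execute on the second (smaller) PE.
    light_capable: bool


def schedule(trace: Trace, light_pe_busy: bool = False) -> str:
    """Choose a PE for a trace using only its header information.

    Assumed policy: traces flagged as light-capable go to the second PE
    when it is free; everything else runs on the first (larger) PE.
    """
    if trace.light_capable and not light_pe_busy:
        return "second_pe"
    return "first_pe"
```

Under this assumed policy, a trace whose header does not mark it as executable on the second PE is always routed to the first PE.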

ZERO KNOWLEDGE PROOF HARDWARE ACCELERATOR AND THE METHOD THEREOF
20210266168 · 2021-08-26 ·

A hardware accelerator for accelerating the zk-SNARK protocol by reducing the computation time of the cryptographic verification is disclosed. The accelerator includes a zk-SNARK engine having one or more processing units running in parallel. Each processing unit can include one or more multiply-accumulate operation (MAC) units, one or more fast Fourier transform (FFT) units, and one or more elliptic curve processor (ECP) units. The one or more ECP units are configured to reduce a bit-length of a scalar cl, in an ECP algorithm used for generating a proof, so that the cryptographic verification requires less computation power.

Systems and methods for efficient scaling of quantized integers

The disclosed computer-implemented method may include receiving an input value and a floating-point scaling factor and determining (1) an integer scaling factor based on the floating-point scaling factor, (2) a pre-scaling adjustment value representative of a number of places by which to shift a binary representation of the input value prior to a scaling operation, and (3) a post-scaling adjustment value representative of a number of places by which to shift the binary representation of the input value following the scaling operation. The method may further include calculating a scaled result value by (1) shifting rightwards the binary representation of the input value by the pre-scaling adjustment value, (2) scaling the shifted binary representation of the input value by the integer scaling factor, and (3) shifting rightwards the shifted and scaled binary value by the post-scaling adjustment value. Various other methods, systems, and computer-readable media are also disclosed.
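
A minimal sketch of the shift, multiply, shift sequence described above, in Python. The bit budgets (`input_bits`, `factor_bits`, `acc_bits`) and the way the two shift amounts are derived from them are assumptions chosen for illustration; the abstract does not say how the adjustment values are determined. Inputs are assumed to be non-negative integers.

```python
import math


def plan_scaling(scale, input_bits=24, factor_bits=16, acc_bits=32):
    """Return (int_factor, pre_shift, post_shift) approximating x * scale.

    Assumed policy: drop just enough low bits before the multiply that the
    product fits in acc_bits, then give the integer factor about
    factor_bits of precision.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    # Pre-scaling shift: keeps (x >> pre_shift) * int_factor within acc_bits.
    pre_shift = max(0, input_bits + factor_bits - acc_bits)
    # scale = mantissa * 2**exponent with 0.5 <= mantissa < 1.
    _, exponent = math.frexp(scale)
    total_shift = factor_bits - exponent
    post_shift = total_shift - pre_shift
    if post_shift < 0:
        raise ValueError("scale too large for this bit budget")
    int_factor = round(math.ldexp(scale, total_shift))
    return int_factor, pre_shift, post_shift


def apply_scaling(x, int_factor, pre_shift, post_shift):
    """Scale a non-negative integer using only shifts and one integer multiply."""
    return ((x >> pre_shift) * int_factor) >> post_shift
```

For example, `apply_scaling(1_000_000, *plan_scaling(0.3))` lands within a fraction of a percent of 300000; the small shortfall comes from the bits discarded by the two right shifts.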

Vector data transfer instruction
11003450 · 2021-05-11 ·

A vector data transfer instruction is provided for triggering a data transfer between storage locations corresponding to a contiguous block of addresses and multiple data elements of at least one vector register. The instruction specifies a start address of the contiguous block using a base register and an immediate offset value specified as a multiple of the size of the contiguous block of addresses. This is useful for loop unrolling, which can help to improve performance of vectorised code by combining multiple iterations of a loop into a single iteration of an unrolled loop, reducing the loop control overhead.
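
The addressing scheme above can be illustrated with a short Python sketch. The function and parameter names are hypothetical, and the block size is assumed to be the vector register size in bytes times the number of registers transferred.

```python
def block_start_address(base, imm, vector_bytes, num_registers=1):
    """Start address = base register value + immediate * block size.

    Assumption: the contiguous block covers num_registers vector
    registers of vector_bytes each.
    """
    block_bytes = vector_bytes * num_registers
    return base + imm * block_bytes


def unrolled_addresses(base, vector_bytes, unroll):
    """Start addresses used by one iteration of a loop unrolled `unroll` times.

    Each unrolled transfer reuses the same base register and differs only
    in its immediate, so the base needs a single update per iteration.
    """
    return [block_start_address(base, i, vector_bytes) for i in range(unroll)]
```

Because the immediate is scaled by the block size, the same instruction encoding stays valid regardless of the vector length, which is what makes it convenient for unrolled loops.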

Sparse convolutional neural network accelerator

A method, computer program product, and system perform computations using a processor. A first instruction including a first index vector operand and a second index vector operand is received and the first index vector operand is decoded to produce first coordinate sets for a first array, each first coordinate set including at least a first coordinate and a second coordinate of a position of a non-zero element in the first array. The second index vector operand is decoded to produce second coordinate sets for a second array, each second coordinate set including at least a third coordinate and a fourth coordinate of a position of a non-zero element in the second array. The first coordinate sets are summed with the second coordinate sets to produce output coordinate sets and the output coordinate sets are converted into a set of linear indices.
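
The decode, sum, and linearize steps above can be sketched as follows. The encoding is an assumption: each index vector entry is taken to be the number of zeros preceding a non-zero element in row-major order, and the output width is taken to be the sum of the two array widths minus one. Neither detail is stated in the abstract.

```python
def decode(index_vector, width):
    """Turn zero-run counts into (row, col) coordinates of non-zero elements.

    Assumed encoding: entry k is the number of zeros between the previous
    non-zero element and the k-th one, scanning the array in row-major order.
    """
    coords = []
    pos = -1
    for zeros in index_vector:
        pos += zeros + 1
        coords.append(divmod(pos, width))  # (row, col)
    return coords


def output_linear_indices(first_iv, first_width, second_iv, second_width):
    """Sum every first coordinate set with every second one, then linearize."""
    out_width = first_width + second_width - 1  # assumed output array width
    indices = []
    for r1, c1 in decode(first_iv, first_width):
        for r2, c2 in decode(second_iv, second_width):
            row, col = r1 + r2, c1 + c2
            indices.append(row * out_width + col)
    return indices
```

Only positions of non-zero elements are ever enumerated, so the work scales with the number of non-zero pairs rather than with the full array sizes.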

SPARSE CONVOLUTIONAL NEURAL NETWORK ACCELERATOR

A method, computer program product, and system perform computations using a processor. A first instruction including a first index vector operand and a second index vector operand is received and the first index vector operand is decoded to produce first coordinate sets for a first array, each first coordinate set including at least a first coordinate and a second coordinate of a position of a non-zero element in the first array. The second index vector operand is decoded to produce second coordinate sets for a second array, each second coordinate set including at least a third coordinate and a fourth coordinate of a position of a non-zero element in the second array. The first coordinate sets are summed with the second coordinate sets to produce output coordinate sets and the output coordinate sets are converted into a set of linear indices.

CACHE LINE DEMOTE INFRASTRUCTURE FOR MULTI-PROCESSOR PIPELINES

Examples described herein relate to a manner of demoting multiple cache lines to shared memory. In some examples, a shared cache is accessible by at least two processor cores and a region of the cache is larger than a cache line and is designated for demotion from the cache to the shared cache. In some examples, the cache line corresponds to a memory address in a region of memory. In some examples, an indication that the region of memory is associated with a cache line demote operation is provided in an indicator in a page table entry (PTE). In some examples, the indication that the region of memory is associated with a cache line demote operation is based on a command in an application executed by a processor. In some examples, the cache is a level 1 (L1) or level 2 (L2) cache.

METHODOLOGIES, SYSTEMS, AND COMPONENTS FOR INCREMENTAL AND CONTINUAL LEARNING FOR SCALABLE IMPROVEMENT OF AUTONOMOUS SYSTEMS
20210206387 · 2021-07-08 ·

Autonomous driving systems may be provided with one or more sensors configured to capture perception data, a model configured to be continually trained in the transportation vehicle and a scalable subset memory configured to store a subset of a dataset previously used to train a model. A processor may be provided for continually training the model in the transportation vehicle using captured perception data previously unseen and the subset and for generating a new subset of data to be stored so that the model avoids catastrophic forgetting.

EFFICIENT LOOK-UP TABLE BASED FUNCTIONS FOR ARTIFICIAL INTELLIGENCE (AI) ACCELERATOR
20240005138 · 2024-01-04 ·

A method for approximating an activation function, the method including: receiving an input value of the activation function; determining that the input value is within a range, the range includes a set of non-uniform intervals; determining a selected interval from among the set of non-uniform intervals including the input value; retrieving, by a hardware accelerator, from a look-up table (LUT) associated with a type of the activation function, values of one or more quadratic interpolation parameters associated with the selected interval; performing a quadratic interpolation on the input value to approximate the input value using the values of the one or more quadratic interpolation parameters; and determining a first approximated output of the activation function based on a result of the quadratic interpolation performed on the input value.
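
A software sketch of the look-up scheme above, using tanh as the activation function. The breakpoints, the choice of fitting each parabola through the interval endpoints and midpoint, and the clamping of out-of-range inputs are all assumptions made for illustration; the abstract does not specify them, and a hardware accelerator would hold the table in dedicated storage.

```python
import bisect
import math

# Hypothetical non-uniform breakpoints: narrower intervals where tanh bends most.
BREAKPOINTS = [-4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
               0.5, 1.0, 1.5, 2.0, 3.0, 4.0]


def fit_quadratic(f, lo, hi):
    """Return (a, b, c) of the parabola through f at lo, the midpoint, and hi."""
    mid = 0.5 * (lo + hi)
    y0, y1, y2 = f(lo), f(mid), f(hi)
    # Newton divided differences, expanded into a*x^2 + b*x + c.
    d01 = (y1 - y0) / (mid - lo)
    d12 = (y2 - y1) / (hi - mid)
    d012 = (d12 - d01) / (hi - lo)
    a = d012
    b = d01 - d012 * (lo + mid)
    c = y0 - d01 * lo + d012 * lo * mid
    return a, b, c


# One (a, b, c) entry per interval: the look-up table for this activation type.
LUT = [fit_quadratic(math.tanh, lo, hi)
       for lo, hi in zip(BREAKPOINTS, BREAKPOINTS[1:])]


def approx_tanh(x):
    """Approximate tanh(x) from the LUT; inputs outside the range are clamped."""
    x = min(max(x, BREAKPOINTS[0]), BREAKPOINTS[-1])
    # Select the interval containing x.
    i = min(bisect.bisect_right(BREAKPOINTS, x) - 1, len(LUT) - 1)
    a, b, c = LUT[i]
    return (a * x + b) * x + c  # Horner form: two multiplies, two adds
```

Non-uniform intervals let the table spend its entries where the function curves most, so a short table can reach the same accuracy as a much longer uniform one.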

SYSTEM AND METHOD FOR MANAGING EXECUTION OF PROCESSES ON A CLUSTER OF PROCESSING DEVICES
20200409718 · 2020-12-31 ·

Disclosed is a method and system for managing execution of processes on a cluster of processing devices by a supervising device. The method comprises receiving memory consumption information from each of a plurality of processing devices executing a plurality of processes. The method further comprises receiving information related to swapping of a new process from at least one processing device of the processing devices while memory available on the at least one processing device is insufficient to execute the new process. The method further comprises terminating either the new process being swapped or a process executing on the at least one processing device. The method further comprises instructing another processing device, having sufficient memory available, to execute the new process being swapped or the process executing on the at least one processing device, whichever is terminated on the at least one processing device.
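
The supervising logic above might look roughly like the following Python sketch. The `Device` record, the megabyte units, and the first-fit choice of target device are assumptions. For brevity the sketch always terminates the process being swapped, whereas the abstract also allows terminating a process already executing on the device.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    free_mb: int  # most recent memory report received by the supervisor
    processes: dict = field(default_factory=dict)  # process name -> memory need (MB)


def relocate_on_swap(devices, swapping_device, process, needed_mb):
    """Handle a swap report: terminate `process` on the reporting device and
    restart it on the first other device with enough free memory.

    Returns the name of the chosen device, or None if no device fits.
    """
    # Terminate the process on the device that reported swapping.
    devices[swapping_device].processes.pop(process, None)
    for dev in devices.values():
        if dev.name != swapping_device and dev.free_mb >= needed_mb:
            dev.processes[process] = needed_mb
            dev.free_mb -= needed_mb
            return dev.name
    return None
```

A supervisor would call `relocate_on_swap` whenever a device reports that it has started swapping a newly launched process.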