Patent classifications
G06F9/3887
CONVOLUTIONAL NEURAL NETWORK ON PROGRAMMABLE TWO DIMENSIONAL IMAGE PROCESSOR
A method is described that includes executing a convolutional neural network layer on an image processor having an array of execution lanes and a two-dimensional shift register. The executing of the convolutional neural network includes loading a plane of image data of a three-dimensional block of image data into the two-dimensional shift register. The executing of the convolutional neural network also includes performing a two-dimensional convolution of the plane of image data with an array of coefficient values by sequentially: concurrently multiplying within the execution lanes respective pixel and coefficient values to produce an array of partial products; concurrently summing within the execution lanes the partial products, with respective accumulations of partial products being kept within the two-dimensional shift register for different stencils within the image data; and effecting alignment of values for the two-dimensional convolution within the execution lanes by shifting content within the two-dimensional shift register array.
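A minimal NumPy sketch of this shift-and-accumulate dataflow, under the assumption that each "execution lane" owns one output pixel and that alignment is achieved by shifting the whole plane rather than gathering pixels per lane (function names are illustrative, not the patented implementation):

    # Hypothetical sketch of the shift-register convolution dataflow.
    import numpy as np

    def shift_plane(plane, dy, dx):
        # Shift the 2D plane so every lane sees the neighbour it needs.
        return np.roll(plane, shift=(-dy, -dx), axis=(0, 1))

    def conv2d_by_shifting(plane, coeffs):
        acc = np.zeros_like(plane, dtype=np.float32)   # per-lane accumulators
        kh, kw = coeffs.shape
        for dy in range(kh):
            for dx in range(kw):
                shifted = shift_plane(plane, dy, dx)   # align values in lanes
                acc += coeffs[dy, dx] * shifted        # concurrent MAC per lane
        return acc                                     # border lanes wrap here

    plane = np.arange(36, dtype=np.float32).reshape(6, 6)
    coeffs = np.ones((3, 3), dtype=np.float32) / 9.0   # 3x3 box filter
    print(conv2d_by_shifting(plane, coeffs))

Note that np.roll wraps at the image borders; a real image processor handles halo pixels differently, so the sketch only shows the ordering of shifts, multiplies, and per-stencil accumulation.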
Streaming engine with multi dimensional circular addressing selectable at each dimension
A streaming engine employed in a digital data processor may specify a fixed read-only data stream defined by plural nested loops. An address generator produces the addresses of data elements for the nested loops. A stream head register stores the data elements next to be supplied to the functional units for use as operands. A stream template register independently specifies a linear or a circular addressing mode for each of the nested loops.
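A hypothetical Python sketch of nested-loop address generation in which each loop dimension may independently use linear or circular addressing; the parameter names and the circular-wrap convention are assumptions, not the patented stream template format:

    def generate_addresses(base, counts, strides, circular_sizes):
        """counts[i]/strides[i] describe loop i (innermost first);
        circular_sizes[i] is None for linear mode, or a buffer size in
        bytes for circular mode (that dimension's offset wraps modulo it)."""
        def walk(dim, offset):
            if dim < 0:
                yield base + offset
                return
            for i in range(counts[dim]):
                step = i * strides[dim]
                if circular_sizes[dim] is not None:
                    step %= circular_sizes[dim]        # wrap this dimension
                yield from walk(dim - 1, offset + step)
        yield from walk(len(counts) - 1, 0)

    # Inner loop of 8 elements (linear), outer loop of 4 iterations that
    # circulates within an assumed 64-byte buffer.
    addrs = list(generate_addresses(0x1000, [8, 4], [4, 32], [None, 64]))
    print([hex(a) for a in addrs[:12]])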
Implementing 128-bit SIMD operations on a 64-bit datapath
A method implements a processor architecture and corresponding system that include operands of a first size and a datapath of a second size, the second size being different from the first size. Given a first array of registers and a second array of registers, each register of the first and second arrays being of the second size, the method selects a first register and a corresponding second register from the first array and the second array, respectively, to perform operations of the first size. This allows a user, who interfaces with the hardware processor through software, to provide data of the datapath bit-width instead of the register bit-width. Advantageously, the user is agnostic to the size of the registers.
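An illustrative Python sketch, under the assumption that register i of a first 64-bit register array holds the low half and register i of a second 64-bit array holds the high half of one 128-bit operand, so a 128-bit packed add is issued as two 64-bit passes (the packed-add operation itself is an example, not the patent's instruction set):

    MASK64 = (1 << 64) - 1

    def packed_add64(a, b, lane_bits=32):
        # Lane-wise add within one 64-bit word, discarding carries that
        # would cross a lane boundary.
        out = 0
        lane_mask = (1 << lane_bits) - 1
        for shift in range(0, 64, lane_bits):
            lane = ((a >> shift) + (b >> shift)) & lane_mask
            out |= lane << shift
        return out

    def packed_add128(lo_regs, hi_regs, dst, src_a, src_b):
        # The 128-bit operation runs as two 64-bit datapath passes.
        lo_regs[dst] = packed_add64(lo_regs[src_a], lo_regs[src_b])
        hi_regs[dst] = packed_add64(hi_regs[src_a], hi_regs[src_b])

    lo = [0x00000001_00000002, 0x00000010_00000020, 0]
    hi = [0x00000003_00000004, 0x00000030_00000040, 0]
    packed_add128(lo, hi, 2, 0, 1)
    print(hex(hi[2]), hex(lo[2]))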
APPARATUS TO OPTIMIZE GPU THREAD SHARED LOCAL MEMORY ACCESS
One embodiment provides for a graphics processor comprising first logic coupled with a first execution unit, the first logic to receive a first single instruction multiple data (SIMD) message from the first execution unit; second logic coupled with a second execution unit, the second logic to receive a second SIMD message from the second execution unit; and third logic coupled with a bank of shared local memory (SLM), the third logic to receive a first request to access the bank of SLM from the first logic, a second request to access the bank of SLM from the second logic, and in a single access cycle, schedule a read access to a read port for the first request and a write access to a write port for the second request.
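A toy Python model of the bank arbitration described above, with class and method names assumed: because the shared-local-memory bank exposes separate read and write ports, one read request and one write request can be scheduled in the same access cycle.

    from collections import deque

    class SLMBank:
        def __init__(self, words=16):
            self.mem = [0] * words
            self.read_q = deque()    # (requester, address)
            self.write_q = deque()   # (requester, address, data)

        def request(self, requester, op, addr, data=None):
            if op == "read":
                self.read_q.append((requester, addr))
            else:
                self.write_q.append((requester, addr, data))

        def cycle(self):
            # Schedule at most one read and one write in the same cycle.
            done = []
            if self.read_q:
                who, addr = self.read_q.popleft()
                done.append((who, "read", addr, self.mem[addr]))
            if self.write_q:
                who, addr, data = self.write_q.popleft()
                self.mem[addr] = data
                done.append((who, "write", addr, data))
            return done

    bank = SLMBank()
    bank.request("EU0", "read", 3)          # SIMD message from first EU
    bank.request("EU1", "write", 7, 42)     # SIMD message from second EU
    print(bank.cycle())                     # both serviced in one cycle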
Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating-point operations, the floating-point unit being configured to process an instruction using a bfloat16 (BF16) format, with a multiplier to multiply second and third source operands while an accumulator adds a first source operand to the output of the multiplier.
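A rough numeric sketch of the dst = src1 + bf16(src2) * bf16(src3) behaviour, assuming truncation to bfloat16 and a higher-precision accumulator; real hardware typically rounds to nearest even, and these helper names are not from the patent:

    import numpy as np

    def to_bf16(x):
        # Truncate float32 to bfloat16 by zeroing the low 16 mantissa bits
        # (round-to-nearest-even is more common in hardware).
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFF0000)).view(np.float32)

    def bf16_mul_accumulate(acc, a, b):
        # Multiply two bfloat16 operands, accumulate in float32.
        return np.float32(acc) + to_bf16(a) * to_bf16(b)

    print(bf16_mul_accumulate(1.0, 1.2345678, 2.3456789))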
INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND RECORDING MEDIUM
An information processing device performs decision-tree processing based on a decision tree which has condition determination nodes and leaf nodes. In the information processing device, an instruction unification means generates a unified instruction by unifying the instruction that each of the condition determination nodes included in the decision tree executes, so that the unified instruction is suitable for parallel processing. An acquisition means acquires a plurality of pieces of input data. A condition determination means performs, by the parallel processing, a condition determination on the plurality of pieces of input data for each of the condition determination nodes.
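A simplified NumPy sketch of the SIMD-friendly form such unification aims at: every node's test is cast into the same comparison shape, so one vectorized operation evaluates a node for a whole batch of input records at once. The node layout and field names here are assumptions.

    import numpy as np

    # Every node's test is unified to "feature[index] < threshold".
    nodes = [
        {"feature": 0, "threshold": 0.5, "left": 1, "right": 2},
        {"feature": 1, "threshold": 0.2, "left": -1, "right": -2},  # leaves < 0
        {"feature": 1, "threshold": 0.8, "left": -3, "right": -4},
    ]

    def evaluate_node(node, batch):
        # One parallel comparison for the whole batch of input records.
        goes_left = batch[:, node["feature"]] < node["threshold"]
        return np.where(goes_left, node["left"], node["right"])

    batch = np.random.rand(8, 2)             # 8 input records, 2 features
    print(evaluate_node(nodes[0], batch))    # next node index per record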
COMPUTATIONAL MEMORY WITH COOPERATION AMONG ROWS OF PROCESSING ELEMENTS AND MEMORY THEREOF
A computing device includes an array of processing elements mutually connected to perform single instruction multiple data (SIMD) operations, memory cells connected to each processing element to store data related to the SIMD operations, and a cache connected to each processing element to cache data related to the SIMD operations. Caches of adjacent processing elements are connected. The same or another computing device includes rows of mutually connected processing elements to share data. The computing device further includes a row arithmetic logic unit (ALU) at each row of processing elements. The row ALU of a respective row is configured to perform an operation with processing elements of the respective row.
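A minimal Python model of the row organization, with the structure assumed: each processing element applies the broadcast SIMD operation to its local data, and a per-row ALU then combines results across its row (a sum reduction is used here purely as an example of a row-level operation).

    import numpy as np

    class PERow:
        def __init__(self, local_data):
            self.local = np.asarray(local_data)    # per-PE memory cells

        def simd_op(self, f):
            self.local = f(self.local)             # same op in every PE

        def row_alu(self):
            return self.local.sum()                # cooperate across the row

    rows = [PERow([1, 2, 3, 4]), PERow([5, 6, 7, 8])]
    for row in rows:
        row.simd_op(lambda x: x * 2)               # broadcast SIMD multiply
    print([row.row_alu() for row in rows])         # per-row results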
Apparatus and method for ray tracing instruction processing and execution
An apparatus and method to execute ray tracing instructions. For example, one embodiment of an apparatus comprises execution circuitry to execute a dequantize instruction to convert a plurality of quantized data values to a plurality of dequantized data values, the dequantize instruction including a first source operand to identify a plurality of packed quantized data values in a source register and a destination operand to identify a destination register in which to store a plurality of packed dequantized data values, wherein the execution circuitry is to convert each packed quantized data value in the source register to a floating point value, to multiply the floating point value by a first value to generate a first product and to add the first product to a second value to generate a dequantized data value, and to store the dequantized data value in a packed data element location in the destination register.
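A short sketch of the dequantize semantics described above, where the packing width (8-bit here) and the scale/offset handling are assumptions: each quantized value is converted to floating point, multiplied by a first value, and added to a second value.

    import numpy as np

    def dequantize(packed_u8, scale, offset):
        # value -> float, * scale, + offset, one lane per packed element
        return packed_u8.astype(np.float32) * np.float32(scale) + np.float32(offset)

    src = np.array([0, 64, 128, 255], dtype=np.uint8)   # packed quantized data
    print(dequantize(src, scale=1.0 / 255.0, offset=-0.5))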
Compute cluster preemption within a general-purpose graphics processing unit
Embodiments described herein provide techniques that enable a graphics processor to continue processing operations during the reset of a compute unit that has experienced a hardware fault. Threads and associated context state for a faulted compute unit can be migrated to another compute unit of the graphics processor, and the faulted compute unit can be reset while processing operations continue.
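A toy control-flow sketch of that fault handling, with the data structures assumed: threads and their saved context are moved from the faulted compute unit to a healthy one, and the faulted unit is then reset while the rest of the processor keeps working.

    class ComputeUnit:
        def __init__(self, name):
            self.name = name
            self.threads = []        # (thread id, saved context state)
            self.faulted = False

    def handle_fault(units, faulted):
        target = next(u for u in units if u is not faulted and not u.faulted)
        target.threads.extend(faulted.threads)   # migrate threads + context
        faulted.threads = []
        faulted.faulted = False                  # reset proceeds separately
        return target

    cu0, cu1 = ComputeUnit("cu0"), ComputeUnit("cu1")
    cu0.threads = [("t0", {"pc": 0x40}), ("t1", {"pc": 0x80})]
    cu0.faulted = True
    moved_to = handle_fault([cu0, cu1], cu0)
    print(moved_to.name, moved_to.threads)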
SINGLE INSTRUCTION MULTIPLE DATA (SIMD) INSTRUCTION GENERATION AND PROCESSING METHOD AND RELATED DEVICE
An SIMD instruction generation and processing method and a related device are provided. The method may include: obtaining a length of each loop dimension of a first tensor formula; selecting, from a plurality of groups of information about a first SIMD instruction model and based on the length of each loop dimension of the first tensor formula, information about a second SIMD instruction model matching the first tensor formula; and generating, based on the length of at least one loop dimension of the first tensor formula and on the second SIMD instruction model, a first SIMD instruction into which the first tensor formula is converted. The information about the second SIMD instruction model is thus selected from the plurality of groups of information about the first SIMD instruction model based on the length of each loop dimension of the first tensor formula.
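A simplified Python sketch of that selection-and-generation step; the instruction-model fields, the matching rule (innermost loop length divisible by the model's lane count), and the emitted text are all assumptions made for illustration.

    instruction_models = [
        {"name": "vadd.128", "lanes": 4, "needs_multiple_of": 4},
        {"name": "vadd.256", "lanes": 8, "needs_multiple_of": 8},
    ]

    def select_model(loop_lengths):
        inner = loop_lengths[-1]                   # vectorize innermost loop
        for model in sorted(instruction_models, key=lambda m: -m["lanes"]):
            if inner % model["needs_multiple_of"] == 0:
                return model
        return None

    def generate_simd(formula_name, loop_lengths):
        model = select_model(loop_lengths)
        trips = loop_lengths[-1] // model["lanes"]
        return f"{model['name']} {formula_name}  ; {trips} vector iterations"

    print(generate_simd("C[i,j] = A[i,j] + B[i,j]", [32, 16]))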