Patent classifications
G06F9/3887
GATHERING PAYLOAD FROM ARBITRARY REGISTERS FOR SEND MESSAGES IN A GRAPHICS ENVIRONMENT
An apparatus to facilitate gathering payload from arbitrary registers for send messages in a graphics environment is disclosed. The apparatus includes processing resources comprising execution circuitry to receive a send gather message instruction identifying a number of registers to access for a send message and identifying IDs of a plurality of individual registers corresponding to the number of registers; decode a first phase of the send gather message instruction; based on decoding the first phase, cause a second phase of the send gather message instruction to bypass an instruction decode stage; and dispatch the first phase, followed by the second phase, to a send pipeline. The apparatus can also perform an immediate move of the IDs of the plurality of individual registers to an architectural register of the execution circuitry and include a pointer to the architectural register in the send gather message instruction.
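As a rough illustration of moving a list of register IDs into a single architectural register, the sketch below packs the IDs into one integer and reads them back by count. The 8-bit field width, the slot order, and the function names `pack_register_ids` and `unpack_register_ids` are assumptions made for illustration; the abstract does not specify an encoding.

```python
def pack_register_ids(register_ids, bits_per_id=8):
    """Pack register IDs into one integer, slot 0 in the lowest bits.

    The 8-bit field width is a hypothetical choice, not taken from the abstract.
    """
    packed = 0
    for slot, reg_id in enumerate(register_ids):
        if not 0 <= reg_id < (1 << bits_per_id):
            raise ValueError(f"register ID {reg_id} does not fit in {bits_per_id} bits")
        packed |= reg_id << (slot * bits_per_id)
    return packed


def unpack_register_ids(packed, count, bits_per_id=8):
    """Recover `count` register IDs from the packed value."""
    mask = (1 << bits_per_id) - 1
    return [(packed >> (slot * bits_per_id)) & mask for slot in range(count)]


# A send message that gathers its payload from four non-contiguous registers.
ids = [5, 17, 42, 3]
packed = pack_register_ids(ids)
gathered = unpack_register_ids(packed, len(ids))
```

The instruction would then only need to carry the register count and a pointer to the register holding `packed`, rather than every ID inline.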
ADVANCED WAVELET FILTERING FOR ACCELERATED DEEP LEARNING
Techniques in wavelet filtering for advanced deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements comprising a portion of a neural network accelerator performs flow-based computations on wavelets of data. Each processing element comprises a compute element to execute programmed instructions using the data and a router to route the wavelets in accordance with virtual channel specifiers. Each processing element is enabled to perform local filtering of wavelets received at the processing element, selectively, conditionally, and/or optionally discarding zero or more of the received wavelets, thereby preventing further processing of the discarded wavelets. The wavelet filtering is performed by one or more configurable wavelet filters operable in various modes, such as counter, sparse, and range modes.
Hierarchical register file device based on spin transfer torque-random access memory
The embodiments provide a register file device that increases energy efficiency by using a spin transfer torque-random access memory (STT-RAM) as the register file for computation in a general-purpose graphics processing device. The device hierarchically combines a register cache and a buffer with the STT-RAM to minimize leakage current, reduce write power, and address write delay.
Matrix data broadcast architecture
Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. The processor core executes a software application with matrix operations. The processor core supports the broadcast of shared data to multiple compute units of the processor core. A compiler or other code assigns thread groups to compute units based on detecting shared data among the compute units. Rather than send multiple read accesses to a memory subsystem for the shared data, the processor core generates a single access request. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcasted by the processor core.
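The idea of replacing several reads of the same shared data with one request that names all of its recipients can be sketched as follows. This is a minimal software model, assuming requests arrive as `(compute_unit_id, address)` pairs and that recipients are encoded as a bitmask; both the representation and the name `coalesce_reads` are hypothetical.

```python
from collections import defaultdict


def coalesce_reads(requests):
    """Merge reads of the same address into one request with a recipient mask.

    requests: iterable of (compute_unit_id, address) pairs.
    Returns a list of (address, compute_unit_mask) sorted by address, where
    bit i of the mask is set if compute unit i should receive the data.
    """
    masks = defaultdict(int)
    for cu_id, address in requests:
        masks[address] |= 1 << cu_id
    return sorted(masks.items())


# Compute units 0-2 share the data at 0x100; unit 3 reads its own data.
requests = [(0, 0x100), (1, 0x100), (2, 0x100), (3, 0x200)]
broadcasts = coalesce_reads(requests)
```

Here four reads become two memory accesses, and the first one carries the mask `0b0111` so the data can be broadcast to all three sharing units.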
TECHNIQUES FOR RECOVERING FROM ERRORS WHEN EXECUTING SOFTWARE APPLICATIONS ON PARALLEL PROCESSORS
In various embodiments, a software program uses hardware features of a parallel processor to checkpoint a context associated with an execution of a software application on the parallel processor. The software program uses a preemption feature of the parallel processor to cause the parallel processor to stop executing instructions in accordance with the context. The software program then causes the parallel processor to collect state data associated with the context. After generating a checkpoint based on the state data, the software program causes the parallel processor to resume executing instructions in accordance with the context.
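The preempt, collect, resume sequence described above can be modelled with a toy processor object. Everything in this sketch is hypothetical, including the class `ToyParallelProcessor`, its state layout, and the `restore` method used to demonstrate recovery; a real parallel processor exposes these steps through hardware features and driver interfaces that the abstract does not detail.

```python
import copy


class ToyParallelProcessor:
    """Toy stand-in for a parallel processor running one context."""

    def __init__(self):
        self.running = True
        self.state = {"pc": 0, "registers": [0, 0, 0, 0]}

    def step(self):
        # Execute one "instruction" if the context is not preempted.
        if self.running:
            self.state["pc"] += 1
            self.state["registers"][0] += 2

    def preempt(self):
        self.running = False

    def collect_state(self):
        return copy.deepcopy(self.state)

    def resume(self):
        self.running = True

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot)


def checkpoint(processor):
    """Stop the context, capture its state, then let it continue."""
    processor.preempt()
    snapshot = processor.collect_state()
    processor.resume()
    return snapshot


proc = ToyParallelProcessor()
for _ in range(3):
    proc.step()
saved = checkpoint(proc)   # checkpoint taken at pc == 3
for _ in range(2):
    proc.step()            # execution continues to pc == 5
proc.restore(saved)        # recover from an error by rolling back
```

The deep copy matters: a checkpoint that shares mutable state with the running context would be silently overwritten as execution continues.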
METHOD AND APPARATUS FOR IMPLIED BIT HANDLING IN FLOATING POINT MULTIPLICATION
A method is provided that includes performing, by a processor in response to a floating point multiply instruction, multiplication of floating point numbers, wherein determination of values of implied bits of leading bit encoded mantissas of the floating point numbers is performed in parallel with multiplication of the encoded mantissas, and storing, by the processor, a result of the floating point multiply instruction in a storage location indicated by the floating point multiply instruction.
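One way to see why the implied bits need not be known before the mantissa multiplication starts is the identity (i_a·2^n + m_a)(i_b·2^n + m_b) = m_a·m_b + (i_a·m_b + i_b·m_a)·2^n + i_a·i_b·2^(2n), where m is the encoded mantissa and i the implied bit. The sketch below is an illustration of that identity, not the patented circuit. It assumes an IEEE 754 style encoding in which the implied bit is 1 when the exponent field is nonzero and 0 otherwise, and the function names are hypothetical.

```python
def significand_product(frac_a, frac_b, exp_a, exp_b, frac_bits=23):
    """Multiply two significands without first attaching the implied bits."""
    # Step 1: product of the encoded mantissas; independent of the implied bits.
    encoded_product = frac_a * frac_b

    # Step 2: implied bits, derived only from the exponent fields. Steps 1 and 2
    # do not depend on each other, so hardware could evaluate them side by side.
    implied_a = int(exp_a != 0)
    implied_b = int(exp_b != 0)

    # Step 3: fold the implied bits in as correction terms.
    cross_terms = (implied_a * frac_b + implied_b * frac_a) << frac_bits
    leading_term = (implied_a & implied_b) << (2 * frac_bits)
    return encoded_product + cross_terms + leading_term


def reference_product(frac_a, frac_b, exp_a, exp_b, frac_bits=23):
    """Conventional order: attach the implied bits first, then multiply."""
    sig_a = (int(exp_a != 0) << frac_bits) | frac_a
    sig_b = (int(exp_b != 0) << frac_bits) | frac_b
    return sig_a * sig_b
```

Both functions return the same integer for any combination of normal and subnormal operands; only the order in which the implied bits enter the computation differs.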
Streaming engine with flexible streaming engine template supporting differing number of nested loops with corresponding loop counts and loop offsets
A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements for the nested loops. A stream head register stores data elements next to be supplied to functional units for use as operands. A stream template specifies loop count and loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently, enabling a trade-off between the number of loops supported and the size of the loop counts and loop dimensions.
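Address generation for a stream defined by nested loops can be sketched as below. This is a simplified model, assuming each loop contributes its index times its loop dimension to the address and that loops are listed innermost first; the function name `stream_addresses` and the argument layout are hypothetical, and the bit-level template format is not modelled.

```python
from itertools import product


def stream_addresses(base, loop_counts, loop_dims):
    """Yield element addresses for nested loops, listed innermost first.

    loop_counts[k] is the iteration count of loop k and loop_dims[k] is the
    address step applied each time loop k advances.
    """
    if len(loop_counts) != len(loop_dims):
        raise ValueError("each loop needs both a count and a dimension")
    # product() varies its last argument fastest, so feed the loops
    # outermost first and flip each index tuple back to innermost first.
    for indices in product(*(range(count) for count in reversed(loop_counts))):
        innermost_first = reversed(indices)
        yield base + sum(i * dim for i, dim in zip(innermost_first, loop_dims))


# Two loops: 4 consecutive elements per row, 2 rows spaced 16 apart.
addresses = list(stream_addresses(0, loop_counts=[4, 2], loop_dims=[1, 16]))
```

Because the loop count is just the length of the two lists, the same generator serves templates with few large loops or many small ones, which mirrors the trade-off the format definition field is said to enable.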
Reducing operations of sum-of-multiply-accumulate (SOMAC) instructions
Methods, systems and apparatuses for reducing operations of Sum-Of-Multiply-Accumulate (SOMAC) instructions are disclosed. One method includes scheduling, by a scheduler, a thread for execution; executing, by a processor of a plurality of processors, the thread; fetching, by the processor, a plurality of instructions for the thread from a memory; selecting, by a thread arbiter of the processor, an instruction of the plurality of instructions for execution in an arithmetic logic unit (ALU) pipeline of the processor; reading the instruction; and determining, by a macro-instruction iterator of the processor, whether the instruction is a SOMAC instruction with an instruction size, wherein the instruction size indicates a number of iterations that the SOMAC instruction is to be executed.
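The role of the instruction size can be illustrated with a small model in which one SOMAC instruction expands into that many multiply-accumulate iterations. The `Instruction` class, the `"SOMAC"` opcode string, and the operand layout are assumptions for illustration, and the sketch does not attempt the operation-reducing technique itself, which the abstract does not describe.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    opcode: str
    size: int = 1  # for SOMAC: number of iterations to execute


def iterate_macro_instruction(instruction, a, b):
    """Expand one SOMAC instruction into `size` multiply-accumulate steps."""
    if instruction.opcode != "SOMAC":
        raise ValueError(f"not a SOMAC instruction: {instruction.opcode}")
    if instruction.size > min(len(a), len(b)):
        raise ValueError("instruction size exceeds the available operands")
    accumulator = 0
    for i in range(instruction.size):
        accumulator += a[i] * b[i]
    return accumulator


result = iterate_macro_instruction(Instruction("SOMAC", size=3), [1, 2, 3], [4, 5, 6])
```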
HARDWARE ARCHITECTURE TO ACCELERATE GENERATIVE ADVERSARIAL NETWORKS WITH OPTIMIZED SIMD-MIMD PROCESSING ELEMENTS
Systems, apparatuses and methods may provide for technology that includes transformation hardware to convert input data from a time domain into a frequency domain, a generative model, and a discriminative model coupled to the transformation hardware and the generative model, wherein the generative model and the discriminative model are to operate in the frequency domain.
Micro-processor circuit and method of performing neural network operation
A micro-processor circuit and a method of performing a neural network operation are provided. The micro-processor circuit includes a parameter generation module, a compute module, and truncation logic. The parameter generation module receives, in parallel, a plurality of input parameters and a plurality of weight parameters of the neural network operation, and generates, in parallel, a plurality of sub-output parameters according to the input parameters and the weight parameters. The compute module receives the sub-output parameters in parallel and sums them to generate a summed parameter. The truncation logic receives the summed parameter and performs a truncation operation based on it to generate a plurality of output parameters of the neural network operation.
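The three stages named above (parameter generation, summation, truncation) can be sketched for a single output as follows. The sketch assumes that each sub-output parameter is the product of one input and one weight and that truncation is a saturating clamp to a signed 8-bit range; both are illustrative guesses, since the abstract defines neither, and the name `neuron_output` is hypothetical.

```python
def neuron_output(inputs, weights, low=-128, high=127):
    """Model one output: generate sub-outputs, sum them, then truncate."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Parameter generation: one sub-output per (input, weight) pair.
    sub_outputs = [x * w for x, w in zip(inputs, weights)]
    # Compute module: reduce the sub-outputs to a single summed parameter.
    summed = sum(sub_outputs)
    # Truncation logic (assumed): clamp the sum into the representable range.
    return max(low, min(high, summed))


outputs = [
    neuron_output([1, 2, 3], [10, 20, 30]),  # sum 140 saturates at the upper bound
    neuron_output([1, -1], [5, 5]),          # sum 0 passes through unchanged
    neuron_output([-10, -10], [10, 10]),     # sum -200 saturates at the lower bound
]
```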