Patent classifications
G06F9/3867
Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format
Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating-point operations, the floating-point unit being configured to process an instruction using a bfloat16 (BF16) format, with a multiplier to multiply second and third source operands while an accumulator adds a first source operand to the output of the multiplier.
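As an illustration only (not taken from the patent), the arithmetic of such a dot-product-accumulate step can be sketched in Python; the `to_bf16` helper is hypothetical and uses simple truncation rather than the rounding a real floating-point unit would implement:

```python
import struct

def to_bf16(x: float) -> float:
    # Keep only the top 16 bits of the float32 encoding (sign, 8-bit exponent,
    # 7-bit mantissa), then widen back to float32. Real hardware would round
    # rather than truncate; this is a simplification.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def bf16_multiply_accumulate(src0: float, src1: float, src2: float) -> float:
    # The multiplier forms the product of the second and third source operands
    # in BF16 precision; the accumulator adds the first source operand.
    return src0 + to_bf16(src1) * to_bf16(src2)

print(bf16_multiply_accumulate(1.0, 1.1, 2.2))
```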
TRACKING EXACT CONVERGENCE TO GUIDE THE RECOVERY PROCESS IN RESPONSE TO A MISPREDICTED BRANCH
Processors and methods related to tracking exact convergence to guide the recovery process in response to a mispredicted branch are provided. An example processor includes a pipeline having a frontend and a backend. The processor further includes a state table for maintaining information related to at least a subset of branches corresponding to instructions being processed by the processor. The processor further includes state logic configured to access the state table and track locations of any exact convergence points associated with branches corresponding to the instructions being processed by the processor. The state logic is further configured to identify a first recovery method for recovering from a misprediction associated with a branch if the location of the exact convergence point associated with the branch is determined to be in the frontend of the pipeline, and otherwise to identify a second recovery method for recovering from the misprediction associated with the branch.
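A rough software analogue of the selection step, assuming a hypothetical state-table entry that records where a branch's exact convergence point currently sits:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BranchStateEntry:
    branch_pc: int
    convergence_pc: Optional[int]   # None if no exact convergence point is tracked
    convergence_in_frontend: bool   # True if that point is still in the frontend

def pick_recovery_method(entry: BranchStateEntry) -> str:
    # If the exact convergence point is known to be in the frontend, a cheaper
    # frontend-local recovery can be chosen; otherwise fall back to the
    # conventional recovery path.
    if entry.convergence_pc is not None and entry.convergence_in_frontend:
        return "first_recovery_method"
    return "second_recovery_method"

print(pick_recovery_method(BranchStateEntry(0x400, 0x420, True)))   # first_recovery_method
```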
Clock mesh-based power conservation in a coprocessor based on in-flight instruction characteristics
A pipeline includes a first portion configured to process a first subset of bits of an instruction and a second portion configured to process a second subset of the bits of the instruction. A first clock mesh is configured to provide a first clock signal to the first portion of the pipeline. A second clock mesh is configured to provide a second clock signal to the second portion of the pipeline. The first and second clock meshes selectively provide the first and second clock signals based on characteristics of in-flight instructions that have been dispatched to the pipeline but not yet retired. In some cases, a physical register file is configured to store values of bits representative of instructions. Only the first subset is stored in the physical register file in response to a zero-high bit indicating that the second subset is equal to zero.
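The gating decision could be modelled as follows (illustrative only; the `low_bits_only` flag stands in for the zero-high-bit information tracked per in-flight instruction):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class InFlightInstruction:
    low_bits_only: bool   # zero-high bit set: the second subset of bits is all zeros

def mesh_enables(in_flight: List[InFlightInstruction]) -> Tuple[bool, bool]:
    # The first mesh clocks the low half whenever work is in flight; the second
    # mesh is enabled only if some in-flight instruction actually uses the high half.
    enable_first = bool(in_flight)
    enable_second = any(not insn.low_bits_only for insn in in_flight)
    return enable_first, enable_second

print(mesh_enables([InFlightInstruction(True), InFlightInstruction(True)]))   # (True, False)
```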
Differential pipeline delays in a coprocessor
A coprocessor such as a floating-point unit includes a pipeline that is partitioned into a first portion and a second portion. A controller is configured to provide control signals to the first portion and the second portion of the pipeline. A first physical distance traversed by control signals propagating from the controller to the first portion of the pipeline is shorter than a second physical distance traversed by control signals propagating from the controller to the second portion of the pipeline. A scheduler is configured to cause a physical register file to provide a first subset of bits of an instruction to the first portion at a first time. The physical register file provides a second subset of the bits of the instruction to the second portion at a second time subsequent to the first time.
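A toy model of the staggered operand delivery, assuming a one-cycle difference in control-signal flight time (the figure is hypothetical):

```python
def operand_delivery_cycles(issue_cycle: int, extra_control_delay: int = 1) -> dict:
    # The first (nearer) portion receives its operand bits at the issue cycle;
    # the second (farther) portion receives its bits later, so data and the
    # slower-arriving control signals line up.
    return {
        "first_portion": issue_cycle,
        "second_portion": issue_cycle + extra_control_delay,
    }

print(operand_delivery_cycles(10))   # {'first_portion': 10, 'second_portion': 11}
```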
Execution unit
An execution unit comprising a processing pipeline configured to perform calculations to evaluate a plurality of mathematical functions. The processing pipeline comprises a plurality of stages through which each calculation for evaluating a mathematical function progresses to an end result. Each of a plurality of processing circuits in the pipeline is configured to perform an operation on input values during at least one stage of the plurality of stages. The plurality of processing circuits include multiplier circuits. A first multiplier circuit and a second multiplier circuit are configured to operate in parallel, performing their processing at the same stage of the processing pipeline. A third multiplier circuit is arranged in series with the first and second multiplier circuits and processes their outputs.
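One way the three multipliers could compose, sketched in Python (the operand pairing is assumed for illustration):

```python
def multiply_stage(a: float, b: float, c: float, d: float) -> float:
    # Stage N: the first and second multipliers work in parallel on
    # independent operand pairs.
    p1 = a * b
    p2 = c * d
    # Stage N+1: the third multiplier, in series, consumes both partial products.
    return p1 * p2

print(multiply_stage(2.0, 3.0, 4.0, 5.0))   # 120.0
```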
Digital Messages in a Load Control System
A load control system may comprise load control devices for controlling respective electrical loads, and a system controller operable to transmit digital messages including different commands to the load control devices in response to a selection of a preset. The different commands may include a preset command configured to identify preset data in a device database stored at the load control device and/or a multi-output command configured to define the preset data for being stored in the device database. The system controller may decide which of the commands to transmit to the load control devices in response to the selection of the preset.
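A sketch of how the controller's decision might look, with hypothetical message fields:

```python
def command_for_preset(preset_id: int, device_has_preset_data: bool, levels: dict) -> dict:
    # If the load control device already stores the preset data in its device
    # database, a compact preset command identifying the preset is enough;
    # otherwise a multi-output command carries the per-output preset data.
    if device_has_preset_data:
        return {"type": "preset", "preset": preset_id}
    return {"type": "multi_output", "preset": preset_id, "levels": levels}

# Example: a "movie" preset dimming three outputs to 75%, 50% and 0%.
print(command_for_preset(4, False, {1: 75, 2: 50, 3: 0}))
```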
METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of a vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.
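A software model of the instruction's effect, assuming the two portions are the lower and upper halves of the lanes and that the two orders are ascending and descending (the result is a bitonic sequence ready for the merge steps of a bitonic sort):

```python
def vector_sort(lanes, first_descending=False, second_descending=True):
    # Sort the lower half of the lanes in one order and the upper half in the
    # other, as indicated by the (hypothetical) instruction operands.
    half = len(lanes) // 2
    lo = sorted(lanes[:half], reverse=first_descending)
    hi = sorted(lanes[half:], reverse=second_descending)
    return lo + hi

print(vector_sort([7, 2, 9, 4, 1, 8, 3, 6]))
# [2, 4, 7, 9, 8, 6, 3, 1] -- ascending half followed by descending half
```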
LOCK FREE HIGH THROUGHPUT RESOURCE STREAMING
Methods, systems and apparatuses may provide for technology that conducts, via a plurality of concurrent threads, transfers of graphics resources into and out of graphics memory, wherein the transfers bypass lock operations between the plurality of concurrent threads, generates frames based on the graphics resources in the graphics memory, and streams the frames to a display. In one example, the transfers also bypass explicit wait operations for the graphics resources to be fully resident in the graphics memory.
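A simplified sketch of the idea using Python threads (CPython's per-operation atomicity stands in for the lock-free publication a real implementation would need): streaming threads publish resources without taking locks, and frame generation uses whatever is resident instead of waiting:

```python
import threading

resident = {}   # resource_id -> data, published without locks

def transfer_in(resource_id, data):
    # Each streaming thread publishes its resource when the copy completes;
    # no lock is taken and no other thread blocks on the transfer.
    resident[resource_id] = data

def generate_frame(needed):
    # Build the frame from whatever is resident right now; resources that are
    # not yet resident are simply skipped rather than waited for.
    return [resident[r] for r in needed if r in resident]

threads = [threading.Thread(target=transfer_in, args=(i, f"texture_{i}")) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(generate_frame([0, 1, 2, 3, 9]))   # resource 9 was never streamed in; it is skipped
```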
REMOTE FRONT-DROP FOR RECOVERY AFTER PIPELINE STALL
This disclosure describes techniques for performing a remote front-drop of data for recovery after a pipeline stall. The techniques include using a receiver-side dropping strategy that is driven from the sender side. Components of a pipeline determine whether the pipeline is operating within specified latency constraints (e.g., whether it is experiencing a pipeline stall). Upon detecting a pipeline stall, the sending device is notified of the stall. Once notified, the sending device can determine what action(s) to perform to address the pipeline stall. For example, the sending device may instruct one or more components of the pipeline to discard already-sent data that has not been processed. This allows the older data to be dropped from the stalled pipeline while keeping the more recently sent data.
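A minimal sketch of the protocol, with hypothetical thresholds: the receiver-side component detects that it is outside its latency constraint and notifies the sender, which then drops the older unprocessed items remotely:

```python
from collections import deque

MAX_UNPROCESSED = 4   # hypothetical latency constraint for one pipeline component

def receiver_check(queue: deque, notify_sender) -> None:
    # The component reports a stall when too much sent-but-unprocessed data piles up.
    if len(queue) > MAX_UNPROCESSED:
        notify_sender("pipeline_stall")

def sender_on_stall(queue: deque, keep_latest: int = 1) -> None:
    # The sender instructs the component to discard already-sent data that has
    # not been processed, keeping only the most recently sent item(s).
    while len(queue) > keep_latest:
        queue.popleft()

q = deque(range(8))                               # eight items sent, none processed yet
receiver_check(q, lambda _msg: sender_on_stall(q))
print(list(q))                                    # [7] -- older data dropped, newest kept
```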
Queues for inter-pipeline data hazard avoidance
Methods and parallel processing units for avoiding inter-pipeline data hazards identified at compile time. For each identified inter-pipeline data hazard, the primary instruction and secondary instruction(s) thereof are identified as such and are linked by a counter which is used to track that inter-pipeline data hazard. When a primary instruction is output by the instruction decoder for execution, the value of the counter associated therewith is adjusted to indicate that there is a hazard related to the primary instruction, and when the primary instruction has been resolved by one of multiple parallel processing pipelines, the value of the counter associated therewith is adjusted to indicate that the hazard related to the primary instruction has been resolved. When a secondary instruction is output by the decoder for execution, the secondary instruction is stalled in a queue associated with the appropriate instruction pipeline if at least one counter associated with the primary instructions on which it depends indicates that there is still a hazard related to the primary instruction.
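The counter mechanism can be illustrated with a small Python sketch (hazard identifiers and function names are hypothetical):

```python
from collections import defaultdict

hazard_counters = defaultdict(int)   # one counter per inter-pipeline hazard found at compile time

def primary_issued(hazard_id):
    # The decoder outputs a primary instruction: mark its hazard as outstanding.
    hazard_counters[hazard_id] += 1

def primary_resolved(hazard_id):
    # The pipeline that executed the primary instruction resolves the hazard.
    hazard_counters[hazard_id] -= 1

def secondary_may_issue(hazard_ids):
    # A secondary instruction stays stalled in its pipeline's queue while any
    # counter it depends on still indicates an unresolved hazard.
    return all(hazard_counters[h] == 0 for h in hazard_ids)

primary_issued("h0")
print(secondary_may_issue(["h0"]))   # False -- hazard outstanding, secondary stalls
primary_resolved("h0")
print(secondary_may_issue(["h0"]))   # True -- safe to issue the secondary instruction
```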