G06F2212/302

Adaptive multilevel binning to improve hierarchical caching

A device driver calculates a tile size for a plurality of cache memories in a cache hierarchy. The device driver calculates a storage capacity of a first cache memory. The device driver calculates a first tile size based on the storage capacity of the first cache memory and one or more additional characteristics. The device driver calculates a storage capacity of a second cache memory. The device driver calculates a second tile size based on the storage capacity of the second cache memory and one or more additional characteristics, where the second tile size is different from the first tile size. The device driver transmits the second tile size to a second coalescing binning unit. One advantage of the disclosed techniques is that data locality and cache memory hit rates are improved because the tile size is optimized for each cache level in the cache hierarchy.
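
As a rough illustration of the idea, the sketch below derives a per-level tile size from each cache's capacity, treating a bytes-per-pixel figure and an allowed occupancy fraction as stand-ins for the "additional characteristics"; the names (CacheLevel, computeTileEdge) and the power-of-two sizing rule are assumptions for illustration, not details taken from the abstract.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical sketch: derive a square tile edge (in pixels) for each cache
// level so that one tile's working set fits in a fraction of that cache.
struct CacheLevel {
    const char* name;
    uint64_t capacityBytes;   // storage capacity of this cache memory
};

// bytesPerPixel and occupancyFraction stand in for the "additional
// characteristics" (render-target format, allowed cache occupancy, etc.).
static uint32_t computeTileEdge(uint64_t capacityBytes,
                                uint32_t bytesPerPixel,
                                double occupancyFraction) {
    uint64_t budget = static_cast<uint64_t>(capacityBytes * occupancyFraction);
    uint64_t pixels = budget / bytesPerPixel;
    // Largest power-of-two edge whose square stays within the pixel budget.
    uint32_t edge = 1;
    while (static_cast<uint64_t>(edge) * edge * 4 <= pixels) edge *= 2;
    return edge;
}

int main() {
    CacheLevel levels[] = {
        {"L1", 128ull * 1024},        // first cache memory
        {"L2", 4ull * 1024 * 1024},   // second cache memory
    };
    for (const CacheLevel& c : levels) {
        uint32_t edge = computeTileEdge(c.capacityBytes, /*bytesPerPixel=*/8,
                                        /*occupancyFraction=*/0.5);
        // Each per-level tile size would be sent to that level's binning unit.
        std::printf("%s: %ux%u pixel tiles\n", c.name, edge, edge);
    }
    return 0;
}
```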

ATOMIC MEMORY UPDATE UNIT AND METHODS
20220230378 · 2022-07-21

In an aspect, an update unit can evaluate condition(s) in an update request and update one or more memory locations based on the condition evaluation. The update unit can operate atomically to determine whether to effect the update and to make the update. Updates can include one or more of incrementing and swapping values. An update request may specify one of a pre-determined set of update types. Some update types may be conditional and others unconditional. The update unit can be coupled to receive update requests from a plurality of computation units. The computation units may not have privileges to directly generate write requests to be effected on at least some of the locations in memory. The computation units can be fixed function circuitry operating on inputs received from programmable computation elements. The update unit may include a buffer to hold received update requests.
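
The following is a minimal software sketch of such an update unit: computation units submit requests into a buffer rather than writing memory directly, and the unit applies each request atomically, evaluating any condition and making the update in one step. The update types (Increment, ConditionalSwap) and all names are illustrative assumptions, not the patented set.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <queue>
#include <vector>

enum class UpdateType { Increment, ConditionalSwap };

struct UpdateRequest {
    UpdateType type;
    size_t     address;   // index into the protected memory
    uint32_t   expected;  // condition operand (ConditionalSwap only)
    uint32_t   value;     // increment amount or swap-in value
};

class UpdateUnit {
public:
    explicit UpdateUnit(size_t words) : memory_(words, 0) {}

    // Computation units without direct write privileges call this instead of
    // issuing write requests; requests are buffered, then drained.
    void submit(const UpdateRequest& req) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffer_.push(req);
    }

    void drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        while (!buffer_.empty()) {
            const UpdateRequest req = buffer_.front();
            buffer_.pop();
            uint32_t& word = memory_[req.address];
            switch (req.type) {
            case UpdateType::Increment:       // unconditional update
                word += req.value;
                break;
            case UpdateType::ConditionalSwap: // condition evaluated and update
                if (word == req.expected)     // applied as one atomic step
                    word = req.value;
                break;
            }
        }
    }

    uint32_t read(size_t address) {
        std::lock_guard<std::mutex> lock(mutex_);
        return memory_[address];
    }

private:
    std::mutex mutex_;
    std::queue<UpdateRequest> buffer_;  // buffer holding received requests
    std::vector<uint32_t> memory_;
};

int main() {
    UpdateUnit unit(4);
    unit.submit({UpdateType::Increment, 0, 0, 5});
    unit.submit({UpdateType::ConditionalSwap, 0, 5, 42});
    unit.drain();
    std::printf("word 0 = %u\n", unit.read(0)); // prints 42
    return 0;
}
```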

Method and apparatus for scheduling thread order to improve cache efficiency
11204801 · 2021-12-21

Systems and methods for scheduling thread order to improve cache efficiency are disclosed. In one embodiment, a graphics processor includes processing resources and schedule and dispatch logic to schedule and dispatch threads to the processing resources. The schedule and dispatch logic is configured to receive threads, to schedule and dispatch the threads based on a forward thread dispatch having a forward thread order, and to determine whether to disable reversing of the thread order upon completion of at least a portion of the forward thread dispatch, including a completion or ending of a draw call or a dispatch.
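
A small sketch of the reordering idea, under the assumption that "reversing the thread order" means dispatching the next draw call's threads in the opposite order so the first threads of the new draw start on data the last threads of the previous draw left warm in the cache; buildDispatchOrder and both flags are hypothetical.

```cpp
#include <cstdio>
#include <vector>

static std::vector<int> buildDispatchOrder(int threadCount,
                                           bool reversePass,
                                           bool reversalDisabled) {
    std::vector<int> order;
    order.reserve(threadCount);
    if (reversePass && !reversalDisabled) {
        for (int t = threadCount - 1; t >= 0; --t) order.push_back(t);
    } else {
        for (int t = 0; t < threadCount; ++t) order.push_back(t);
    }
    return order;
}

int main() {
    bool reversalDisabled = false;  // the scheduler may decide to disable this
    bool reversePass = false;       // toggled at the end of each draw call
    for (int draw = 0; draw < 3; ++draw) {
        std::vector<int> order = buildDispatchOrder(8, reversePass, reversalDisabled);
        std::printf("draw %d:", draw);
        for (int t : order) std::printf(" %d", t);
        std::printf("\n");
        reversePass = !reversePass;  // flip order on draw-call completion
    }
    return 0;
}
```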

DEVICE AND METHOD FOR IMPROVING ROUTE PLANNING COMPUTING DEVICES
20220187086 · 2022-06-16

A route generator and method of operating the same, including: calculating route traversal values for a plurality of blocks in a first group simultaneously, each block including a plurality of cells, the traversal values being values that consider terrain movement cost data and data indicating progress towards a route endpoint on a per-cell basis, wherein the plurality of blocks are chosen such that the blocks in the first group do not share any edges with other blocks in the first group.
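
One way to read the block-grouping constraint is as a checkerboard over block coordinates: blocks whose coordinates have the same parity never share an edge, so their cells can be updated simultaneously without conflicts. The sketch below assumes that reading, plus a purely illustrative traversal-value formula (terrain cost plus straight-line progress toward the endpoint).

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

constexpr int kBlock = 4;   // cells per block edge
constexpr int kBlocks = 4;  // blocks per grid edge

// Illustrative traversal value: terrain movement cost plus a term indicating
// progress toward the route endpoint.
static double traversalValue(int x, int y, double terrainCost,
                             int goalX, int goalY) {
    double remaining = std::hypot(double(goalX - x), double(goalY - y));
    return terrainCost + remaining;
}

int main() {
    const int goalX = 15, goalY = 15;
    std::vector<std::vector<double>> value(kBlock * kBlocks,
                                           std::vector<double>(kBlock * kBlocks, 0.0));
    // Two groups: blocks whose (bx + by) parity is 0, then parity 1.
    for (int group = 0; group < 2; ++group) {
        for (int by = 0; by < kBlocks; ++by) {
            for (int bx = 0; bx < kBlocks; ++bx) {
                if ((bx + by) % 2 != group) continue;  // not in this group
                // Blocks within one group never share an edge, so all of
                // their cells could be processed in parallel.
                for (int cy = 0; cy < kBlock; ++cy) {
                    for (int cx = 0; cx < kBlock; ++cx) {
                        int x = bx * kBlock + cx, y = by * kBlock + cy;
                        value[y][x] = traversalValue(x, y, /*terrainCost=*/1.0,
                                                     goalX, goalY);
                    }
                }
            }
        }
    }
    std::printf("value at (0,0) = %.2f\n", value[0][0]);
    return 0;
}
```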

WARPING DATA
20220188970 · 2022-06-16

A method of warping data includes the steps of providing a set of target coordinates x ∈ ℝ^N, calculating, by a warping engine, source coordinates x′ ∈ ℝ^N for the target coordinates x ∈ ℝ^N, requesting, by the warping engine, data values for a plurality of source coordinates from a cache, and computing, by the warping engine, interpolated data values for each x in a neighborhood of x′ from the data values of the source coordinates returned from the cache. Requesting data values from the cache includes notifying the cache that data values for a particular group of source points will be needed for computing interpolated data values for a particular target point, and fetching the data values for the particular group of source points when they are needed for computing interpolated data values for the particular target point.
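
A two-dimensional sketch of the flow, assuming a simple scaling warp and bilinear interpolation; the SampleCache class with willNeed/fetch methods is a hypothetical stand-in for the notify-then-fetch cache interaction described above.

```cpp
#include <cmath>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct SourceImage {
    int width, height;
    float at(int x, int y) const { return float(x + y * width); } // dummy data
};

class SampleCache {
public:
    explicit SampleCache(const SourceImage& img) : img_(img) {}
    // Notification that a group of source points will be needed soon.
    void willNeed(const std::vector<std::pair<int,int>>& pts) {
        for (auto& p : pts) samples_[p] = img_.at(p.first, p.second);
    }
    // Fetch when the value is actually needed for interpolation.
    float fetch(int x, int y) { return samples_.at({x, y}); }
private:
    const SourceImage& img_;
    std::map<std::pair<int,int>, float> samples_;
};

int main() {
    SourceImage src{8, 8};
    SampleCache cache(src);
    const float scale = 0.75f;  // x' = scale * x (an illustrative warp)
    for (int ty = 0; ty < 2; ++ty) {
        for (int tx = 0; tx < 2; ++tx) {
            float sx = tx * scale, sy = ty * scale;           // source coords
            int x0 = int(std::floor(sx)), y0 = int(std::floor(sy));
            // Notify the cache which source points this target point needs.
            cache.willNeed({{x0, y0}, {x0 + 1, y0}, {x0, y0 + 1}, {x0 + 1, y0 + 1}});
            float fx = sx - x0, fy = sy - y0;
            float v = (1 - fx) * (1 - fy) * cache.fetch(x0, y0)
                    + fx       * (1 - fy) * cache.fetch(x0 + 1, y0)
                    + (1 - fx) * fy       * cache.fetch(x0, y0 + 1)
                    + fx       * fy       * cache.fetch(x0 + 1, y0 + 1);
            std::printf("target (%d,%d) -> %.3f\n", tx, ty, v);
        }
    }
    return 0;
}
```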

Apparatus and method for memory management in a graphics processing environment

An apparatus and method are described for implementing memory management in a graphics processing system. For example, one embodiment of an apparatus comprises: a first plurality of graphics processing resources to execute graphics commands and process graphics data; a first memory management unit (MMU) to communicatively couple the first plurality of graphics processing resources to a system-level MMU to access a system memory; a second plurality of graphics processing resources to execute graphics commands and process graphics data; a second MMU to communicatively couple the second plurality of graphics processing resources to the first MMU; wherein the first MMU is configured as a master MMU having a direct connection to the system-level MMU and the second MMU comprises a slave MMU configured to send memory transactions to the first MMU, the first MMU either servicing a memory transaction or sending the memory transaction to the system-level MMU on behalf of the second MMU.
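
A minimal sketch of the master/slave arrangement, modelling each MMU as a class: the slave forwards every transaction to the master, which either services it from its own cached translations or passes it to the system-level MMU on the slave's behalf. The single-level lookup and all names are assumptions for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

class SystemMMU {
public:
    uint64_t translate(uint64_t va) {
        std::printf("  system-level MMU walks tables for VA 0x%llx\n",
                    (unsigned long long)va);
        return va + 0x100000;  // pretend physical address
    }
};

class MasterMMU {
public:
    explicit MasterMMU(SystemMMU& sys) : sys_(sys) {}
    // Services the transaction if it can, otherwise forwards it upward.
    uint64_t handle(uint64_t va) {
        auto it = tlb_.find(va);
        if (it != tlb_.end()) {
            std::printf("  master MMU services VA 0x%llx\n", (unsigned long long)va);
            return it->second;
        }
        uint64_t pa = sys_.translate(va);  // sent on behalf of the requester
        tlb_[va] = pa;
        return pa;
    }
private:
    SystemMMU& sys_;
    std::unordered_map<uint64_t, uint64_t> tlb_;
};

class SlaveMMU {
public:
    explicit SlaveMMU(MasterMMU& master) : master_(master) {}
    // The slave MMU has no direct connection to the system-level MMU.
    uint64_t handle(uint64_t va) { return master_.handle(va); }
private:
    MasterMMU& master_;
};

int main() {
    SystemMMU sys;
    MasterMMU master(sys);
    SlaveMMU slave(master);
    std::printf("first access (miss in master):\n");
    slave.handle(0x2000);
    std::printf("second access (hit in master):\n");
    slave.handle(0x2000);
    return 0;
}
```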

Power aware translation lookaside buffer invalidation optimization

One disclosed embodiment includes a method for memory management. The method includes receiving a first request to clear one or more entries of a translation lookaside buffer (TLB), receiving a second request to clear one or more entries of the TLB, bundling the first request with the second request, determining that a processor associated with the TLB transitioned to an inactive mode, and dropping the bundled first and second requests based on the determination.
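
A short sketch of the bundling-and-dropping behaviour, assuming the rationale is that a TLB belonging to an inactive (powered-down) processor holds no entries worth invalidating; the queue class and names are hypothetical.

```cpp
#include <cstdio>
#include <vector>

struct TlbInvalidateRequest { unsigned long address; };

class TlbShootdownQueue {
public:
    // Requests are bundled rather than sent to the TLB one by one.
    void submit(const TlbInvalidateRequest& req) { bundle_.push_back(req); }

    void flush(bool processorInactive) {
        if (processorInactive) {
            // The core's TLB will not survive the inactive state, so the
            // bundled invalidations can simply be dropped.
            std::printf("dropping %zu bundled invalidations (core inactive)\n",
                        bundle_.size());
        } else {
            std::printf("sending %zu bundled invalidations to the TLB\n",
                        bundle_.size());
        }
        bundle_.clear();
    }
private:
    std::vector<TlbInvalidateRequest> bundle_;
};

int main() {
    TlbShootdownQueue queue;
    queue.submit({0x1000});  // first request
    queue.submit({0x2000});  // second request, bundled with the first
    queue.flush(/*processorInactive=*/true);
    return 0;
}
```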

GENERAL-PURPOSE COMPUTING ACCELERATOR AND OPERATION METHOD THEREOF

Disclosed is a general-purpose computing accelerator which includes a memory including an instruction cache, a first executing unit performing a first computation operation, a second executing unit performing a second computation operation, an instruction fetching unit fetching an instruction stored in the instruction cache, a decoding unit that decodes the instruction, and a state control unit controlling a path of the instruction depending on an operation state of the second executing unit. The decoding unit provides the instruction to the first executing unit when the instruction is of a first type and provides the instruction to the state control unit when the instruction is of a second type. Depending on the operation state of the second executing unit, the state control unit provides the instruction of the second type to the second executing unit or stores the instruction of the second type in a register file in the memory.
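
The routing described above might look roughly like the following sketch, where first-type instructions go straight to the first unit and second-type instructions go through a state controller that either issues them or parks them while the second unit is busy; the parked queue standing in for instructions stored back to memory is an assumption.

```cpp
#include <cstdio>
#include <queue>
#include <string>

struct Instruction { int type; std::string text; };

class Accelerator {
public:
    void decodeAndRoute(const Instruction& inst) {
        if (inst.type == 1) {
            std::printf("unit 1 executes: %s\n", inst.text.c_str());
        } else {
            routeThroughStateControl(inst);  // second-type path
        }
    }

    void setSecondUnitBusy(bool busy) {
        secondUnitBusy_ = busy;
        if (!busy) drainParked();  // replay instructions stored earlier
    }

private:
    void routeThroughStateControl(const Instruction& inst) {
        if (secondUnitBusy_) {
            parked_.push(inst);    // stored for later, unit 2 occupied
            std::printf("parked (unit 2 busy): %s\n", inst.text.c_str());
        } else {
            std::printf("unit 2 executes: %s\n", inst.text.c_str());
        }
    }

    void drainParked() {
        while (!parked_.empty()) {
            std::printf("unit 2 executes (replayed): %s\n",
                        parked_.front().text.c_str());
            parked_.pop();
        }
    }

    bool secondUnitBusy_ = false;
    std::queue<Instruction> parked_;
};

int main() {
    Accelerator acc;
    acc.setSecondUnitBusy(true);
    acc.decodeAndRoute({1, "add"});
    acc.decodeAndRoute({2, "matmul"});   // parked while unit 2 is busy
    acc.setSecondUnitBusy(false);        // unit 2 frees up, parked work runs
    return 0;
}
```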

GRAPHICS PROCESSING UNIT PROCESSING AND CACHING IMPROVEMENTS

Embodiments described herein are generally directed to improvements addressing power, latency, bandwidth, and/or performance issues relating to GPU processing and caching. According to one embodiment, a state of multiple intellectual property (IP) cores that have access to a common cache via a central fabric is observed. Responsive to the observed state being indicative of performance of a standalone workload by a first IP core of the multiple IP cores, the common cache is treated as a local cache of the first IP core by powering off the central fabric and causing the first IP core to access the common cache via a low power access path between the first IP core and the common cache that is outside of the central fabric.
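
A toy sketch of the path-selection decision, assuming the "standalone workload" condition is detected simply as exactly one IP core being active; the enum, function names, and detection rule are illustrative assumptions.

```cpp
#include <cstdio>
#include <vector>

struct IpCoreState { const char* name; bool active; };

enum class CachePath { ViaCentralFabric, LowPowerDirect };

static CachePath selectCachePath(const std::vector<IpCoreState>& cores,
                                 bool& fabricPoweredOn) {
    int activeCount = 0;
    for (const auto& c : cores) activeCount += c.active ? 1 : 0;
    if (activeCount == 1) {          // standalone workload observed
        fabricPoweredOn = false;     // power off the central fabric
        return CachePath::LowPowerDirect;
    }
    fabricPoweredOn = true;
    return CachePath::ViaCentralFabric;
}

int main() {
    std::vector<IpCoreState> cores = {
        {"media", true}, {"render", false}, {"display", false},
    };
    bool fabricPoweredOn = true;
    CachePath path = selectCachePath(cores, fabricPoweredOn);
    std::printf("fabric powered: %s, path: %s\n",
                fabricPoweredOn ? "yes" : "no",
                path == CachePath::LowPowerDirect ? "low-power direct"
                                                  : "central fabric");
    return 0;
}
```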

Modifying processing of commands in a command queue based on subsequently received data
11321808 · 2022-05-03

Processing of commands at a graphics processor is controlled by receiving input data and generating a command for processing at the graphics processor from the input data, wherein the command will cause the graphics processor to write out at least one buffer of data to an external memory, and submitting the command to a queue for later processing at the graphics processor. Subsequent to submitting the command, but before the write to external memory has been completed, further input data is received and it is determined that the buffer of data does not need to be written to external memory. The graphics processor is then signalled to prevent at least a portion of the write to external memory from being performed for the command.
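
One way to realise this in a driver is to give every queued command a flag the driver can still clear after submission, as in the hedged sketch below; the shared-flag mechanism and all names are assumptions rather than details from the patent.

```cpp
#include <cstdio>
#include <deque>
#include <memory>

struct Command {
    int id;
    std::shared_ptr<bool> writeBufferToMemory;  // can be cleared after queueing
};

class CommandQueue {
public:
    std::shared_ptr<bool> submit(int id) {
        auto flag = std::make_shared<bool>(true);
        queue_.push_back({id, flag});
        return flag;  // the driver keeps this to modify the command later
    }
    void execute() {
        for (const Command& cmd : queue_) {
            std::printf("command %d: render%s\n", cmd.id,
                        *cmd.writeBufferToMemory
                            ? ", write buffer to external memory"
                            : ", skip external-memory write");
        }
        queue_.clear();
    }
private:
    std::deque<Command> queue_;
};

int main() {
    CommandQueue gpuQueue;
    auto writeFlag = gpuQueue.submit(/*id=*/1);
    // Further input arrives before the write completes and shows the buffer
    // contents are not needed, so the write-out is signalled to be skipped.
    *writeFlag = false;
    gpuQueue.execute();
    return 0;
}
```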