G06F9/30079

Tag update bus for updated coherence state

An apparatus includes a CPU core and a L1 cache subsystem including a L1 main cache, a L1 victim cache, and a L1 controller. The apparatus includes a L2 cache subsystem coupled to the L1 cache subsystem by a transaction bus and a tag update bus. The L2 cache subsystem includes a L2 main cache, a shadow L1 main cache, a shadow L1 victim cache, and a L2 controller. The L2 controller receives a message from the L1 controller over the tag update bus, including a valid signal, an address, and a coherence state. In response to the valid signal being asserted, the L2 controller identifies an entry in the shadow L1 main cache or the shadow L1 victim cache having an address corresponding to the address of the message and updates a coherence state of the identified entry to be the coherence state of the message.

Global coherence operations

A method includes receiving, by a L2 controller, a request to perform a global operation on a L2 cache and preventing new blocking transactions from entering a pipeline coupled to the L2 cache while permitting new non-blocking transactions to enter the pipeline. Blocking transactions include read transactions and non-victim write transactions. Non-blocking transactions include response transactions, snoop transactions, and victim transactions. The method further includes, in response to an indication that the pipeline does not contain any pending blocking transactions, preventing new snoop transactions from entering the pipeline while permitting new response transactions and victim transactions to enter the pipeline; in response to an indication that the pipeline does not contain any pending snoop transactions, preventing, all new transactions from entering the pipeline; and, in response to an indication that the pipeline does not contain any pending transactions, performing the global operation on the L2 cache.

MEMORY DEVICE FOR PERFORMING IN-MEMORY PROCESSING

A memory device configured to perform in-memory processing includes a plurality of in-memory arithmetic units each configured to perform in-memory processing of a pipelined arithmetic operation, and a plurality of memory banks allocated to the in-memory arithmetic units such that a set of n memory banks is allocated to each of the in-memory operation units, each memory bank configured to perform an access operation of data requested from the in-memory arithmetic units while the pipelined arithmetic operation is performed. Each of the in-memory arithmetic units is configured to operate at a first operating frequency that is less than or equal to a product of n and a second operating frequency of each of the memory banks.

SYSTEMS AND METHODS FOR UPDATING MEMORY SIDE CACHES IN A MULTI-GPU CONFIGURATION

Systems and methods for updating remote memory side caches in a multi-GPU configuration are disclosed herein. In one embodiment, a graphics processor for a multi-tile architecture includes a first graphics processing unit (GPU) having a first memory, a first memory side cache memory, a first communication fabric, and a first memory management unit (MMU). The graphics processor includes a second graphics processing unit (GPU) having a second memory, a second memory side cache memory, a second memory management unit (MMU), and a second communication fabric that is communicatively coupled to the first communication fabric. The first MMU is configured to control memory requests for the first memory, to update content in the first memory, to update content in the first memory side cache memory, and to determine whether to update the content in the second memory side cache memory.

IC including logic tile, having reconfigurable MAC pipeline, and reconfigurable memory
11288076 · 2022-03-29 · ·

An integrated circuit including configurable multiplier-accumulator circuitry, wherein, during processing operations, a plurality of the multiplier-accumulator circuits are serially connected into pipelines to perform concatenated multiply and accumulate operations. The integrated circuit includes a first memory and a second memory, and a switch interconnect network, including configurable multiplexers arranged in a plurality of switch matrices. The first and second memories are configurable as either a dedicated read memory or a dedicated write memory and connected to a given pipeline, via the switch interconnect network, during a processing operation performed thereby; wherein, during a first processing operations, the first memory is dedicated to write data to a first pipeline and the second memory is dedicated to read data therefrom and, during a second processing operation, the first memory is dedicated to read data from a second pipeline and the second memory is dedicated to write data thereto.

DYNAMIC MEMORY RECONFIGURATION

Embodiments described herein provide techniques to enable the dynamic reconfiguration of memory on a general-purpose graphics processing unit. One embodiment described herein enables dynamic reconfiguration of cache memory bank assignments based on hardware statistics. One embodiment enables for virtual memory address translation using mixed four kilobyte and sixty-four kilobyte pages within the same page table hierarchy and under the same page directory. One embodiment provides for a graphics processor and associated heterogenous processing system having near and far regions of the same level of a cache hierarchy.

TANH AND SIGMOID FUNCTION EXECUTION
20220066737 · 2022-03-03 ·

Examples described herein relate to instructions to request performance of tanh and sigmoid instructions. For example, a compiler can generate native tanh instructions to perform tanh. In some examples, a tanh function can be compiled into instructions that include an instruction to perform either tanh(input) or tanh(input)/input depending on a value of the input to generate an intermediate output; an instruction to cause a performance of generation of scale factor based on the input; and an instruction to cause performance of a multiplication operation on the intermediate result with the scale factor. For example, a sigmoid function can be compiled to cause a math pipeline to perform a range check and performs operations based on a range.

HARDWARE COHERENCE SIGNALING PROTOCOL

An apparatus includes a CPU core and a L1 cache subsystem including a L1 main cache, a L1 victim cache, and a L1 controller. The apparatus includes a L2 cache subsystem including a L2 main cache, a shadow L1 main cache, a shadow L1 victim cache, and a L2 controller configured to receive a read request from the L1 controller as a single transaction. Read request includes a read address, a first indication of an address and a coherence state of a cache line A to be moved from the L1 main cache to the L1 victim cache to allocate space for data returned in response to the read request, and a second indication of an address and a coherence state of a cache line B to be removed from the L1 victim cache in response to the cache line A being moved to the L1 victim cache.

ALTERNATE PATH FOR BRANCH PREDICTION REDIRECT

Branch prediction circuitry predicts an outcome of a branch instruction. A pipeline circuitry processes instructions along a first path from a predicted branch of the branch instruction. The instructions along the first path are processed concurrently with processing instructions along a second path from an unpredicted branch of the branch instruction. Information representing the state of the second portion while processing the second path is stored in one or more buffers. The instructions are processed along the second path using the information stored in the buffers in response to a misprediction of the outcome of the branch instruction. In some cases, the branch prediction circuitry determines a confidence level for the predicted outcome and the instructions along the second path from the unpredicted branch are processed in response to the confidence level being below a threshold confidence.

System and method for executing a number of NOP instructions after a repeated instruction
11269633 · 2022-03-08 · ·

A method is provided for executing instructions in a pipelined processor. The method includes receiving a plurality of instructions in the pipelined processor. A first instruction of the plurality of instructions has a first bit field for holding a value for indicating how many times execution of the first instruction is repeated. Also, the value is for indicating how many no operation (NOP) instructions follow a last iteration of the repeated first instruction. The number of repeated instructions plus the number of NOP instructions is equal to the number of pipeline stages in the pipelined processor. In another embodiment, a pipelined data processor is provided for executing the repeating instruction.