G06F12/0888

Memory pipeline control in a hierarchical memory system

In described examples, a processor system includes a processor core generating memory transactions, a lower level cache memory with a lower memory controller, and a higher level cache memory with a higher memory controller having a memory pipeline. The higher memory controller is connected to the lower memory controller by a bypass path that skips the memory pipeline. The higher memory controller: determines whether a memory transaction is a bypass write, which is a memory write request indicated not to result in a corresponding write being directed to the higher level cache memory; if the memory transaction is determined a bypass write, determines whether a memory transaction that prevents passing is in the memory pipeline; and if no transaction that prevents passing is determined to be in the memory pipeline, sends the memory transaction to the lower memory controller using the bypass path.

Memory pipeline control in a hierarchical memory system

In described examples, a processor system includes a processor core generating memory transactions, a lower level cache memory with a lower memory controller, and a higher level cache memory with a higher memory controller having a memory pipeline. The higher memory controller is connected to the lower memory controller by a bypass path that skips the memory pipeline. The higher memory controller: determines whether a memory transaction is a bypass write, which is a memory write request indicated not to result in a corresponding write being directed to the higher level cache memory; if the memory transaction is determined a bypass write, determines whether a memory transaction that prevents passing is in the memory pipeline; and if no transaction that prevents passing is determined to be in the memory pipeline, sends the memory transaction to the lower memory controller using the bypass path.

TARGETING OF LATERAL CASTOUTS IN A DATA PROCESSING SYSTEM

A data processing system includes system memory and a plurality of processor cores each supported by a respective one of a plurality of vertical cache hierarchies. A first vertical cache hierarchy records information indicating communication of cache lines between the first vertical cache hierarchy and others of the plurality of vertical cache hierarchies. Based on selection of a victim cache line for eviction, the first vertical cache hierarchy determines, based on the recorded information, whether to perform a lateral castout of the victim cache line to another of the plurality of vertical cache hierarchies rather than to system memory and selects, based on the recorded information, a second vertical cache hierarchy among the plurality of vertical cache hierarchies as a recipient of the victim cache line via a lateral castout. Based on the determination, the first vertical cache hierarchy performs a castout of the victim cache line.

TARGETING OF LATERAL CASTOUTS IN A DATA PROCESSING SYSTEM

A data processing system includes system memory and a plurality of processor cores each supported by a respective one of a plurality of vertical cache hierarchies. A first vertical cache hierarchy records information indicating communication of cache lines between the first vertical cache hierarchy and others of the plurality of vertical cache hierarchies. Based on selection of a victim cache line for eviction, the first vertical cache hierarchy determines, based on the recorded information, whether to perform a lateral castout of the victim cache line to another of the plurality of vertical cache hierarchies rather than to system memory and selects, based on the recorded information, a second vertical cache hierarchy among the plurality of vertical cache hierarchies as a recipient of the victim cache line via a lateral castout. Based on the determination, the first vertical cache hierarchy performs a castout of the victim cache line.

Region mismatch prediction for memory access control circuitry

Memory access control circuitry controls handling of a memory access request based on at least one memory access control attribute associated with a region of address space including the target address. The memory access control circuitry comprises: lookup circuitry comprising a plurality of sets of comparison circuitry, each set of comparison circuitry to detect, based on at least one address-region-indicating parameter associated with a corresponding region of address space, whether the target address is within the corresponding region of address space; region mismatch prediction circuitry to provide a region mismatch prediction indicative of which of the sets of comparison circuitry is predicted to detect a region mismatch condition; and comparison disabling circuitry to disable at least one of the sets of comparison circuitry that is predicted by the region mismatch prediction circuitry to detect the region mismatch condition for the target address.

ACCESSING PHYSICAL MEMORY FROM A CPU OR PROCESSING ELEMENT IN A HIGH PERFOMANCE MANNER
20180004671 · 2018-01-04 ·

A method and apparatus is described herein for accessing a physical memory location referenced by a physical address with a processor. The processor fetches/receives instructions with references to virtual memory addresses and/or references to physical addresses. Translation logic translates the virtual memory addresses to physical addresses and provides the physical addresses to a common interface. Physical addressing logic decodes references to physical addresses and provides the physical addresses to a common interface based on a memory type stored by the physical addressing logic.

METHOD AND SYSTEM FOR EFFICIENT COMMUNICATION AND COMMAND SYSTEM FOR DEFERRED OPERATION
20180011794 · 2018-01-11 ·

A method and system for efficiently executing a delegate of a program by a processor coupled to an external memory. A payload including state data or command data is bound with a program delegate. The payload is mapped with the delegate via the payload identifier. The payload is pushed to a repository buffer in the external memory. The payload is flushed by reading the payload identifier and loading the payload from the repository buffer. The delegate is executed using the loaded payload.

METHOD AND SYSTEM FOR EFFICIENT COMMUNICATION AND COMMAND SYSTEM FOR DEFERRED OPERATION
20180011794 · 2018-01-11 ·

A method and system for efficiently executing a delegate of a program by a processor coupled to an external memory. A payload including state data or command data is bound with a program delegate. The payload is mapped with the delegate via the payload identifier. The payload is pushed to a repository buffer in the external memory. The payload is flushed by reading the payload identifier and loading the payload from the repository buffer. The delegate is executed using the loaded payload.

Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format

Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.

Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format

Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.