Patent classifications
G06F12/0837
Multi-tile memory management mechanism
Graphics processors for implementing multi-tile memory management are disclosed. In one embodiment, a graphics processor includes a first graphics device having a local memory, a second graphics device having a local memory, and a graphics driver to provide a single virtual allocation with a common virtual address range to mirror a resource to each local memory of the first and second graphics devices.
Multi-tile memory management mechanism
Graphics processors for implementing multi-tile memory management are disclosed. In one embodiment, a graphics processor includes a first graphics device having a local memory, a second graphics device having a local memory, and a graphics driver to provide a single virtual allocation with a common virtual address range to mirror a resource to each local memory of the first and second graphics devices.
Distributed index for fault tolerant object memory fabric
Embodiments of the invention provide systems and methods for managing processing, memory, storage, network, and cloud computing to significantly improve the efficiency and performance of processing nodes. Embodiments can implement an object memory fabric including object memory modules storing memory objects created natively within the object memory module and may be a managed at a memory layer. The memory module object directory may index all memory objects within the object memory module. A hierarchy of object routers communicatively coupling the object memory modules may each include a router object directory that indexes all memory objects and portions contained in object memory modules below the object router in the hierarchy. The hierarchy of object routers may behave in aggregate as a single object directory communicatively coupled to all object memory modules and to process requests based on the router object directories.
Distributed index for fault tolerant object memory fabric
Embodiments of the invention provide systems and methods for managing processing, memory, storage, network, and cloud computing to significantly improve the efficiency and performance of processing nodes. Embodiments can implement an object memory fabric including object memory modules storing memory objects created natively within the object memory module and may be a managed at a memory layer. The memory module object directory may index all memory objects within the object memory module. A hierarchy of object routers communicatively coupling the object memory modules may each include a router object directory that indexes all memory objects and portions contained in object memory modules below the object router in the hierarchy. The hierarchy of object routers may behave in aggregate as a single object directory communicatively coupled to all object memory modules and to process requests based on the router object directories.
DATA CACHE WITH HYBRID WRITEBACK AND WRITETHROUGH
Described is a data cache implementing hybrid writebacks and writethroughs. A processing system includes a memory, a memory controller, and a processor. The processor includes a data cache including cache lines, a write buffer, and a store queue. The store queue writes data to a hit cache line and an allocated entry in the write buffer when the hit cache line is initially in at least a shared coherence state, resulting in the hit cache line being in a shared coherence state with data and the allocated entry being in a modified coherence state with data. The write buffer requests and the memory controller upgrades the hit cache line to a modified coherence state with data based on tracked coherence states. The write buffer retires the data upon upgrade. The data cache writebacks the data to memory for a defined event.
SYSTEM AND METHOD FOR IMPLEMENTING A NETWORK-INTERFACE-BASED ALLREDUCE OPERATION
An apparatus is provided that includes a network interface to transmit and receive data packets over a network; a memory including one or more buffers; an arithmetic logic unit to perform arithmetic operations for organizing and combining the data packets; and a circuitry to receive, via the network interface, data packets from the network; aggregate, via the arithmetic logic unit, the received data packets in the one or more buffers at a network rate; and transmit, via the network interface, the aggregated data packets to one or more compute nodes in the network, thereby optimizing latency incurred in combining the received data packets and transmitting the aggregated data packets, and hence accelerating a bulk data allreduce operation. One embodiment provides a system and method for performing the allreduce operation. During operation, the system performs the allreduce operation by pacing network operations for enhancing performance of the allreduce operation.
Extend GPU/CPU coherency to multi-GPU cores
- Chandrasekaran Sakthivel ,
- Prasoonkumar Surti ,
- John C. Weast ,
- Sara S. Baghsorkhi ,
- Justin E. Gottschlich ,
- Abhishek R. Appu ,
- Nicolas C. Galoppo Von Borries ,
- Joydeep Ray ,
- Narayan Srinivasa ,
- Feng Chen ,
- Ben J. Ashbaugh ,
- Rajkishore Barik ,
- Tsung-Han Lin ,
- Kamal Sinha ,
- Eriko Nurvitadhi ,
- Balaji Vembu ,
- Altug Koker
In an example, an apparatus comprises a plurality of processing unit cores, a plurality of cache memory modules associated with the plurality of processing unit cores, and a machine learning model communicatively coupled to the plurality of processing unit cores, wherein the plurality of cache memory modules share cache coherency data with the machine learning model. Other embodiments are also disclosed and claimed.
Extend GPU/CPU coherency to multi-GPU cores
- Chandrasekaran Sakthivel ,
- Prasoonkumar Surti ,
- John C. Weast ,
- Sara S. Baghsorkhi ,
- Justin E. Gottschlich ,
- Abhishek R. Appu ,
- Nicolas C. Galoppo Von Borries ,
- Joydeep Ray ,
- Narayan Srinivasa ,
- Feng Chen ,
- Ben J. Ashbaugh ,
- Rajkishore Barik ,
- Tsung-Han Lin ,
- Kamal Sinha ,
- Eriko Nurvitadhi ,
- Balaji Vembu ,
- Altug Koker
In an example, an apparatus comprises a plurality of processing unit cores, a plurality of cache memory modules associated with the plurality of processing unit cores, and a machine learning model communicatively coupled to the plurality of processing unit cores, wherein the plurality of cache memory modules share cache coherency data with the machine learning model. Other embodiments are also disclosed and claimed.
Methods and systems for invalidating memory ranges in fabric-based architectures
Embodiments of the invention include a machine-readable medium having stored thereon at least one instruction, which if performed by a machine causes the machine to perform a method that includes decoding, with a node, an invalidate instruction; and executing, with the node, the invalidate instruction for invalidating a memory range specified across a fabric interconnect.
Cache for storing coherent and non-coherent data
The present disclosure advantageously provides a system cache and a method for storing coherent data and non-coherent data in a system cache. A transaction is received from a source in a system, the transaction including at least a memory address, the source having a location in a coherent domain or a non-coherent domain of the system, the coherent domain including shareable data and the non-coherent domain including non-shareable data. Whether the memory address is stored in a cache line is determined, and, when the memory address is not determined to be stored in a cache line, a cache line is allocated to the transaction including setting a state bit of the allocated cache line based on the source location to indicate whether shareable or non-shareable data is stored in the allocated cache line, and the transaction is processed.