Patent classifications
G06F9/30079
Multi-tile Memory Management for Detecting Cross Tile Access Providing Multi-Tile Inference Scaling and Providing Page Migration
- Lakshminarayanan Striramassarma ,
- Prasoonkumar Surti ,
- Varghese George ,
- Ben Ashbaugh ,
- Aravindh Anantaraman ,
- Valentin Andrei ,
- Abhishek Appu ,
- Nicolas Galoppo von Borries ,
- Altug Koker ,
- Mike Macpherson ,
- Subramaniam Maiyuran ,
- Nilay Mistry ,
- Elmoustapha Ould-Ahmed-Vall ,
- Selvakumar Panneer ,
- Vasanth Ranganathan ,
- Joydeep Ray ,
- Ankur Shah ,
- Saurabh Tangri
Multi-tile Memory Management for Detecting Cross Tile Access, Providing Multi-Tile Inference Scaling with multicasting of data via copy operation, and Providing Page Migration are disclosed herein. In one embodiment, a graphics processor for a multi-tile architecture includes a first graphics processing unit (GPU) having a memory and a memory controller, a second graphics processing unit (GPU) having a memory and a cross-GPU fabric to communicatively couple the first and second GPUs. The memory controller is configured to determine whether frequent cross tile memory accesses occur from the first GPU to the memory of the second GPU in the multi-GPU configuration and to send a message to initiate a data transfer mechanism when frequent cross tile memory accesses occur from the first GPU to the memory of the second GPU.
MULTI-TILE MEMORY MANAGEMENT
- Abhishek R. Appu ,
- Altug Koker ,
- Aravindh Anantaraman ,
- Elmoustapha Ould-Ahmed-Vall ,
- Valentin Andrei ,
- Nicolas Galoppo von Borries ,
- Varghese George ,
- Mike Macpherson ,
- Subramaniam Maiyuran ,
- Joydeep Ray ,
- Lakshminarayana Striramassarma ,
- Scott Janus ,
- Brent Insko ,
- Vasanth Ranganathan ,
- Kamal Sinha ,
- Arthur Hunter ,
- Prasoonkumar Surti ,
- David Puffer ,
- James Valerio ,
- Ankur N. Shah
Methods and apparatus relating to techniques for multi-tile memory management. In an example, an apparatus comprises a cache memory, a high-bandwidth memory, a shader core communicatively coupled to the cache memory and comprising a processing element to decompress a first data element extracted from an in-memory database in the cache memory and having a first bit length to generate a second data element having a second bit length, greater than the first bit length, and an arithmetic logic unit (ALU) to compare the data element to a target value provided in a query of the in-memory database. Other embodiments are also disclosed and claimed.
GRAPHICS PROCESSOR OPERATION SCHEDULING FOR DETERMINISTIC LATENCY
Embodiments described herein include software, firmware, and hardware that provides techniques to enable deterministic scheduling across multiple general-purpose graphics processing units. One embodiment provides a multi-GPU architecture with uniform latency. One embodiment provides techniques to distribute memory output based on memory chip thermals. One embodiment provides techniques to enable thermally aware workload scheduling. One embodiment provides techniques to enable end to end contracts for workload scheduling on multiple GPUs.
COMPRESSION TECHNIQUES
Methods and apparatus relating to techniques for data compression. In an example, an apparatus comprises a processor receive a data compression instruction for a memory segment; and in response to the data compression instruction, compress a sequence of identical memory values in response to a determination that the sequence of identical memory values has a length which exceeds a threshold. Other embodiments are also disclosed and claimed.
GRAPHICS PROCESSORS AND GRAPHICS PROCESSING UNITS HAVING DOT PRODUCT ACCUMULATE INSTRUCTION FOR HYBRID FLOATING POINT FORMAT
Graphics processors and graphics processing units having dot product accumulate instructions for a hybrid floating point format are disclosed. In one embodiment, a graphics multiprocessor comprises an instruction unit to dispatch instructions and
a processing resource coupled to the instruction unit. The processing resource is configured to receive a dot product accumulate instruction from the instruction unit and to process the dot product accumulate instruction using a bfloat16 number (BF16) format.
DATA INITIALIZATION TECHNIQUES
Methods and apparatus relating to data initialization techniques. In an example, an apparatus comprises a processor to read one or more metadata codes which map to one or more cache lines in a cache memory and invoke a random number generator to generate random numerical data for the one or more cache lines in response to a determination that the one more metadata codes indicate that the cache lines are to contain random numerical data. Other embodiments are also disclosed and claimed.
SYSTOLIC DISAGGREGATION WITHIN A MATRIX ACCELERATOR ARCHITECTURE
Embodiments described herein include software, firmware, and hardware logic that provides techniques to perform arithmetic on sparse data via a systolic processing unit. One embodiment provides techniques to optimize training and inference on a systolic array when using sparse data. One embodiment provides techniques to use decompression information when performing sparse compute operations. One embodiment enables the disaggregation of special function compute arrays via a shared reg file. One embodiment enables packed data compress and expand operations on a GPGPU. One embodiment provides techniques to exploit block sparsity within the cache hierarchy of a GPGPU.
Fully pipelined binary conversion hardware operator logic circuit
A universal floating-point Instruction Set Architecture (ISA) implemented entirely in hardware. Using a single instruction, the universal floating-point ISA has the ability, in hardware, to compute directly with dual decimal character sequences up to IEEE 754-2008 “H=20” in length, without first having to explicitly perform a conversion-to-binary-format process in software before computing with these human-readable floating-point or integer representations. The ISA does not employ opcodes, but rather pushes and pulls “gobs” of data without the encumbering opcode fetch, decode, and execute bottleneck. Instead, the ISA employs stand-alone, memory-mapped operators, complete with their own pipeline that is completely decoupled from the processor's primary push-pull pipeline. The ISA employs special three-port, 1024-bit wide SRAMS; a special dual asymmetric system stack; memory-mapped stand-alone hardware operators with private result buffers having simultaneously readable side-A and side-B read ports; and dual hardware H=20 convertFromDecimalCharacter conversion operators.
Universal floating-point instruction set architecture for computing directly with decimal character sequences and binary formats in any combination
A universal floating-point Instruction Set Architecture (ISA) implemented entirely in hardware. Using a single instruction, the universal floating-point ISA has the ability, in hardware, to compute directly with dual decimal character sequences up to IEEE 754-2008 “H=20” in length, without first having to explicitly perform a conversion-to-binary-format process in software before computing with these human-readable floating-point or integer representations. The ISA does not employ opcodes, but rather pushes and pulls “gobs” of data without the encumbering opcode fetch, decode, and execute bottleneck. Instead, the ISA employs stand-alone, memory-mapped operators, complete with their own pipeline that is completely decoupled from the processor's primary push-pull pipeline. The ISA employs special three-port, 1024-bit wide SRAMS; a special dual asymmetric system stack; memory-mapped stand-alone hardware operators with private result buffers having simultaneously readable side-A and side-B read ports; and dual hardware H=20 convertFromDecimalCharacter conversion operators.
Cache size change
A method includes determining, by a level one (L1) controller, to change a size of a L1 main cache; servicing, by the L1 controller, pending read requests and pending write requests from a central processing unit (CPU) core; stalling, by the L1 controller, new read requests and new write requests from the CPU core; writing back and invalidating, by the L1 controller, the L1 main cache. The method also includes receiving, by a level two (L2) controller, an indication that the L1 main cache has been invalidated and, in response, flushing a pipeline of the L2 controller; in response to the pipeline being flushed, stalling, by the L2 controller, requests received from any master; reinitializing, by the L2 controller, a shadow L1 main cache. Reinitializing includes clearing previous contents of the shadow L1 main cache and changing the size of the shadow L1 main cache.