Patent classifications
G06F2212/401
COMPUTE OPTIMIZATIONS FOR LOW PRECISION MACHINE LEARNING OPERATIONS
One embodiment provides an apparatus comprising a memory stack including multiple memory dies and a parallel processor including a plurality of multiprocessors. Each multiprocessor has a single-instruction, multiple-thread (SIMT) architecture, and the parallel processor is coupled to the memory stack via one or more memory interfaces. At least one multiprocessor comprises a multiply-accumulate circuit to perform multiply-accumulate operations on matrix data in a stage of a neural network implementation to produce a result matrix comprising a plurality of matrix data elements at a first precision; precision tracking logic to evaluate metrics associated with the matrix data elements and indicate whether an optimization is to be performed for representing data at a second stage of the neural network implementation; and a numerical transform unit to dynamically perform a numerical transform operation on the matrix data elements, based on the indication, to produce transformed matrix data elements at a second precision.
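As a rough illustration of the idea, the following sketch (plain NumPy; the function names and the max-magnitude metric are illustrative assumptions, not the patented circuit) performs a multiply-accumulate stage at one precision, evaluates a metric on the result, and dynamically transforms the elements to a lower precision when the metric indicates it is safe:

```python
# Minimal sketch: track result statistics after a matmul stage and
# downconvert when the values fit the lower precision's range.
import numpy as np

FP16_MAX = np.finfo(np.float16).max  # largest magnitude float16 can hold

def matmul_stage(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply-accumulate at a first precision (float32)."""
    return a.astype(np.float32) @ b.astype(np.float32)

def precision_tracking(result: np.ndarray) -> bool:
    """Indicate whether a lower-precision representation is safe.
    Here the metric is simply the maximum element magnitude."""
    return float(np.max(np.abs(result))) < FP16_MAX

def numerical_transform(result: np.ndarray) -> np.ndarray:
    """Dynamically transform the matrix elements to a second precision."""
    return result.astype(np.float16)

a = np.random.randn(64, 64)
b = np.random.randn(64, 64)
out = matmul_stage(a, b)
if precision_tracking(out):
    out = numerical_transform(out)   # next stage consumes float16
print(out.dtype)
```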
SYSTEMS AND METHODS FOR TRANSFORMING DATA IN-LINE WITH READS AND WRITES TO COHERENT HOST-MANAGED DEVICE MEMORY
The disclosed computer-implemented method may include (1) receiving, from an external host processor via a cache-coherent interconnect, a request to access a host address of a coherent memory space of the external host processor, (2) when the request is to write data to the host address, (a) performing an in-line transformation on the data to generate second data and (b) writing the second data to the physical address of the device-attached physical memory mapped to the host address, and (3) when the request is to read data from the host address, (a) reading the data from the physical address of the device-attached physical memory mapped to the host address, (b) performing a reversing in-line transformation on the data to generate second data, and (c) returning the second data to the external host processor via the cache-coherent interconnect. Various other methods, systems, and computer-readable media are also disclosed.
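The mechanism can be modeled in software roughly as follows; here zlib compression stands in for the unspecified in-line transformation, and the class and method names are assumptions:

```python
# Toy model of a device that transforms data as it crosses the
# cache-coherent interconnect: writes are transformed into device-attached
# memory, reads are reverse-transformed before returning to the host.
import zlib

class CoherentDeviceMemory:
    def __init__(self):
        self._phys = {}              # physical address -> transformed bytes
        self._map = {}               # host address -> physical address

    def map(self, host_addr: int, phys_addr: int) -> None:
        self._map[host_addr] = phys_addr

    def write(self, host_addr: int, data: bytes) -> None:
        # In-line transformation on the write path.
        self._phys[self._map[host_addr]] = zlib.compress(data)

    def read(self, host_addr: int) -> bytes:
        # Reversing in-line transformation on the read path.
        return zlib.decompress(self._phys[self._map[host_addr]])

mem = CoherentDeviceMemory()
mem.map(host_addr=0x1000, phys_addr=0x0)
mem.write(0x1000, b"hello " * 100)
assert mem.read(0x1000) == b"hello " * 100
```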
Systems and methods for reading and writing sparse data in a neural network accelerator
Disclosed herein are a system, a method, and a device for reading and writing sparse data in a neural network accelerator. A plurality of slices can be established to access a memory having an access size of a data word. A first slice can be configured to access a first side of the data word in memory. Circuitry can access a mask identifying byte positions within the data word having non-zero values. The circuitry can modify the data word so that the non-zero byte values are stored starting at an end of the first side, and any zero byte values are stored in the remainder of the data word. A determination can be made whether the number of non-zero byte values is less than or equal to a first access size of the first slice. The circuitry can write the modified data word to the memory via at least the first slice.
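A rough sketch of the packing step (the word and slice sizes are illustrative assumptions):

```python
# Pack the non-zero bytes of a data word to one side, guided by a mask of
# non-zero byte positions, so a single half-word slice can service the
# write when few bytes are non-zero.
WORD = 16          # bytes per data word
SLICE = WORD // 2  # access size of each of the two slices

def pack_word(word: bytes):
    assert len(word) == WORD
    mask = [i for i, b in enumerate(word) if b != 0]   # non-zero positions
    nonzero = bytes(word[i] for i in mask)
    packed = nonzero + b"\x00" * (WORD - len(nonzero)) # zeros fill remainder
    fits_first_slice = len(nonzero) <= SLICE
    return packed, mask, fits_first_slice

packed, mask, fits = pack_word(b"\x00A\x00\x00B\x00C" + b"\x00" * 9)
print(packed[:4], mask, fits)   # b'ABC\x00' [1, 4, 6] True
```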
Metadata management in a storage system
A system and method for efficiently maintaining metadata stored among a plurality of solid-state storage devices. A data storage subsystem supports multiple mapping tables. Records within a mapping table are arranged in multiple levels, each of which stores pairs of a key value and a physical pointer value. The levels are sorted by time: new records are inserted into a newly created highest (youngest) level, and no edits are performed in place. A data storage controller determines both that the cost of searching a given table exceeds a threshold and that the amount of memory used to flatten levels exceeds a threshold. In response, the controller incrementally flattens selected levels within the table based on key ranges. After flattening the records in the selected levels within a key range, those records may be removed from the selected levels. The process then repeats with a different key range.
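The level structure and incremental flattening can be sketched as follows (a simplified in-memory model; the names are assumptions, and a real subsystem adds persistence, thresholds, and concurrency control):

```python
# Time-ordered, append-only mapping levels with incremental flattening
# over one key range at a time.
class MappingTable:
    def __init__(self):
        self.levels = []                       # levels[-1] is youngest

    def insert_batch(self, records: dict):
        # New records always land in a newly created youngest level;
        # older levels are never edited in place.
        self.levels.append(dict(records))

    def lookup(self, key):
        # Search youngest to oldest; the first hit is the current mapping.
        for level in reversed(self.levels):
            if key in level:
                return level[key]
        return None

    def flatten_range(self, lo, hi):
        # Merge only keys in [lo, hi), oldest first so younger records
        # overwrite, then remove them from the source levels.
        merged = {}
        for level in self.levels:
            for k in [k for k in level if lo <= k < hi]:
                merged[k] = level.pop(k)
        self.levels.insert(0, merged)          # flattened records become oldest
        self.levels = [lv for lv in self.levels if lv]

t = MappingTable()
t.insert_batch({1: "p1", 5: "p5"})
t.insert_batch({1: "p1b"})
t.flatten_range(0, 10)
assert t.lookup(1) == "p1b" and t.lookup(5) == "p5"
```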
Select decompression headers and symbol start indicators used in writing decompressed data
One or more units of decompressed data, of a plurality of units of decompressed data, are written to a target location for subsequent writing to memory. The plurality of units of decompressed data includes a plurality of symbol outputs and has associated therewith a plurality of decompression headers. A determination is made that the subsequent writing to memory of at least a portion of another unit of decompressed data to be written to the target location is to be stalled. A symbol start position of the other unit of decompressed data, and a decompression header of a selected unit of the one or more units of decompressed data written to the target location, are provided to a component of the computing environment. The decompression header is then used for the subsequent writing of the other unit of decompressed data to memory.
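One loose software analogy (the state captured here, and all names, are assumptions about what such a checkpoint might contain):

```python
# When the write of the next decompressed unit must stall, save a header
# from an already-written unit plus the stalled unit's symbol start
# position, so the write can resume without redoing the decompression.
from dataclasses import dataclass

@dataclass
class DecompressionHeader:
    target_offset: int      # where in the target the unit's data begins
    length: int             # bytes of decompressed data in the unit

@dataclass
class ResumeState:
    header: DecompressionHeader   # header of a selected written unit
    symbol_start: int             # symbol start position of the stalled unit

def stall(written_headers, stalled_symbol_start) -> ResumeState:
    # Select the last written unit's header as the reference point.
    return ResumeState(written_headers[-1], stalled_symbol_start)

def resume(state: ResumeState) -> int:
    # The stalled unit's data continues right after the selected unit.
    return state.header.target_offset + state.header.length

hdrs = [DecompressionHeader(0, 256), DecompressionHeader(256, 128)]
state = stall(hdrs, stalled_symbol_start=3)
print(resume(state))   # 384: where the stalled unit's write picks up
```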
Quantum modulation-based data search
An efficient search includes: inputting data comprising a vector that requires a first amount of memory; compressing the vector into a compressed representation while preserving the information content of the vector, including: encoding, using one or more non-quantum processors, at least a portion of the vector to implement a quantum gate matrix; and modulating a reference vector using the quantum gate matrix to generate the compressed representation; searching a database using the compressed representation; and outputting a search result to be displayed, stored, and/or further processed.
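A heavily simplified classical simulation of the mechanism (single-qubit angle encoding; the patent's actual encoding is not specified here, and this toy version is lossy): the vector parameterizes a gate matrix, the gate modulates a fixed reference state, and the resulting two-amplitude state is compared by overlap.

```python
# Encode a vector as composed rotations, modulate a reference state with
# the resulting gate, and search by state overlap.
import numpy as np

def ry(theta: float) -> np.ndarray:
    """Single-qubit Y-rotation gate."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def compress(vector: np.ndarray) -> np.ndarray:
    # Build a gate matrix from the vector's entries (angle encoding),
    # then modulate the |0> reference state with it: two amplitudes
    # now stand in for the full vector.
    gate = np.eye(2)
    for theta in vector:
        gate = ry(theta) @ gate
    reference = np.array([1.0, 0.0])
    return gate @ reference

def search(query: np.ndarray, database: list) -> int:
    # Nearest neighbor by overlap between compressed representations.
    q = compress(query)
    return int(np.argmax([abs(q @ compress(v)) for v in database]))

db = [np.array([0.1, 0.2, 0.3]), np.array([1.0, 1.1, 0.9])]
print(search(np.array([0.9, 1.2, 1.0]), db))   # 1
```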
FLEXIBLE DICTIONARY SHARING FOR COMPRESSED CACHES
Systems, apparatuses, and methods for implementing flexible dictionary sharing techniques for caches are disclosed. A set-associative cache includes a dictionary for each data array set. When a cache line is to be allocated in the cache, a cache controller determines to which set a base index of the cache line address maps. Then, a selector unit determines which dictionary of a group of dictionaries stored by those sets neighboring this set would achieve the most compression for the cache line. This dictionary is then selected to compress the cache line. An offset is added to the base index of the cache line to generate a full index in order to map the cache line to the set corresponding to this chosen dictionary. The compressed cache line is stored in this set with the chosen dictionary, and the offset is stored in the corresponding tag array entry.
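A behavioral sketch of the selector (the set count, neighbor window, dictionary format, and compression metric are all illustrative assumptions):

```python
# On allocation, try the dictionaries of the neighboring sets and place
# the line, plus the winning offset, in the set whose dictionary
# compresses it best; the offset is kept in the tag array entry.
NUM_SETS = 64
NEIGHBORS = (0, 1, 2, 3)        # offsets from the base index to consider

def compressed_size(line: bytes, dictionary: set) -> int:
    # Toy metric: 4-byte chunks found in the dictionary cost 1 byte (a
    # reference); all other chunks are stored verbatim.
    chunks = [line[i:i + 4] for i in range(0, len(line), 4)]
    return sum(1 if c in dictionary else 4 for c in chunks)

def allocate(address: int, line: bytes, dictionaries: list):
    base = (address // 64) % NUM_SETS          # base index from the address
    # Selector unit: pick the neighboring dictionary with most compression.
    offset = min(NEIGHBORS, key=lambda o:
                 compressed_size(line, dictionaries[(base + o) % NUM_SETS]))
    full_index = (base + offset) % NUM_SETS    # set actually holding the line
    return full_index, offset

dicts = [set() for _ in range(NUM_SETS)]
dicts[5].add(b"\x00\x00\x00\x00")
print(allocate(0x10C0, b"\x00" * 64, dicts))   # (5, 2): set 5, offset 2
```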
RESOURCE-AWARE COMPRESSION
Systems, apparatuses, and methods for implementing a multi-tiered approach to cache compression are disclosed. A cache includes a cache controller, a light compressor, and a heavy compressor. The decision on which compressor to use for a given cache line is made based on resource availability, such as cache capacity or memory bandwidth. This allows the cache to opportunistically use complex compression algorithms while limiting the adverse effects of high decompression latency on system performance. In particular, the heavy compressor can effectively reduce memory bandwidth over high-bandwidth memory (HBM) interfaces as long as it does not degrade system performance. Accordingly, the cache combines light and heavy compressors with a decision-making unit to reduce off-chip memory traffic without sacrificing system performance.
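A minimal sketch of the decision-making unit, with zlib levels standing in for the light and heavy compressors and the thresholds chosen arbitrarily:

```python
# Pick the heavy compressor only when bandwidth pressure or capacity
# pressure makes its higher latency worth paying.
import zlib

def light_compress(data: bytes) -> bytes:
    return zlib.compress(data, level=1)    # fast, modest ratio

def heavy_compress(data: bytes) -> bytes:
    return zlib.compress(data, level=9)    # slow, better ratio

def choose_compressor(bandwidth_util: float, capacity_util: float):
    # Opportunistic policy: under high bandwidth utilization, shrinking
    # traffic is worth the latency; otherwise stay light.
    if bandwidth_util > 0.8 or capacity_util > 0.9:
        return heavy_compress
    return light_compress

data = b"cache line contents " * 8
print(len(choose_compressor(0.95, 0.5)(data)),   # heavy: smaller output
      len(choose_compressor(0.20, 0.5)(data)))   # light: faster path
```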
Cache miss handling for read operations in data processing systems
In a data processing system comprising a cache system configured to transfer data stored in a memory system to a processor and vice-versa, a processing unit operable to read data from a cache of the cache system can send a read request for data to the cache. The cache system, in response to the read request, determines whether the requested data is present in the cache. When the requested data is present in the cache, the cache system returns the data from the cache to the processing unit and invalidates the entry for the data in the cache. When the requested data is not present in the cache, the cache system returns an indication of that to the processing unit, without the cache system sending a request for the data towards the memory system.
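The described hit/miss behavior can be modeled as follows (a software sketch, not the hardware; the class name is an assumption):

```python
# A hit returns the data and invalidates the entry; a miss returns an
# indication instead of forwarding a request toward the memory system.
class ReadOnceCache:
    def __init__(self):
        self._lines = {}

    def write(self, addr, data):
        self._lines[addr] = data

    def read(self, addr):
        if addr in self._lines:
            return self._lines.pop(addr)       # entry invalidated on read
        return None                            # indication of a miss

cache = ReadOnceCache()
cache.write(0x40, b"payload")
assert cache.read(0x40) == b"payload"          # hit consumes the entry
assert cache.read(0x40) is None                # subsequent read misses
```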
Systems and methods for improving cache efficiency and utilization
- Altug Koker
- Joydeep Ray
- Ben Ashbaugh
- Jonathan Pearce
- Abhishek Appu
- Vasanth Ranganathan
- Lakshminarayanan Striramassarma
- Elmoustapha Ould-Ahmed-Vall
- Aravindh Anantaraman
- Valentin Andrei
- Nicolas Galoppo von Borries
- Varghese George
- Yoav Harel
- Arthur Hunter, Jr.
- Brent Insko
- Scott Janus
- Pattabhiraman K
- Mike Macpherson
- Subramaniam Maiyuran
- Marian Alin Petre
- Murali Ramadoss
- Shailesh Shah
- Kamal Sinha
- Prasoonkumar Surti
- Vikranth Vemulapalli
Systems and methods for improving cache efficiency and utilization are disclosed. In one embodiment, a graphics processor includes processing resources to perform graphics operations and a cache controller of a cache coupled to the processing resources. The cache controller is configured to control cache priority by determining whether default settings or an instruction will control cache operations for the cache.
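A toy rendering of that choice (the field names and policy strings are assumptions):

```python
# An instruction may carry a cache-control override; otherwise the
# cache's default settings govern the operation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instruction:
    opcode: str
    cache_policy: Optional[str] = None   # e.g. "no-allocate", "high-priority"

DEFAULT_POLICY = "normal-priority"

def effective_policy(inst: Instruction) -> str:
    # Instruction-level control wins over the default settings.
    return inst.cache_policy if inst.cache_policy else DEFAULT_POLICY

print(effective_policy(Instruction("load")))                   # normal-priority
print(effective_policy(Instruction("load", "high-priority")))  # high-priority
```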