G06F2212/454

USING DYNAMIC DATA STRUCTURES FOR STORING DATA OBJECTS
20230034198 · 2023-02-02 ·

A technique for dynamic data structure usage for storing data objects is described. In one example of the present disclosure, a system can receive a data object and properties associated with the data object. The system can determine, based on at least one of the properties and pre-defined rules for data objects and corresponding object types, an object type of the data object and a first data structure for storing the data object that is different from a second data structure currently storing data objects in the memory. The system can output a command for causing the first data structure to store the data object in the memory.
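
As a rough illustration of the rule-based selection described in this abstract, the following minimal Python sketch picks an object type and a container from a data object's properties. The rule table `RULES`, the property names (`access`, `unique`, `key`), and the helpers `choose_structure` and `store` are all hypothetical; the abstract does not specify any of them.

```python
from collections import deque

# Hypothetical pre-defined rules: (predicate on properties, object type, container factory).
RULES = [
    (lambda p: p.get("access") == "fifo", "queue_item", deque),
    (lambda p: p.get("unique", False), "unique_item", set),
    (lambda p: "key" in p, "keyed_record", dict),
]

def choose_structure(properties):
    """Return (object_type, factory) for the first matching rule; default to a list."""
    for predicate, object_type, factory in RULES:
        if predicate(properties):
            return object_type, factory
    return "generic", list

def store(memory, obj, properties):
    """Store obj in the structure chosen for its object type, creating it on first use."""
    object_type, factory = choose_structure(properties)
    structure = memory.setdefault(object_type, factory())
    if isinstance(structure, dict):
        structure[properties["key"]] = obj
    elif isinstance(structure, set):
        structure.add(obj)
    else:
        structure.append(obj)  # list and deque both support append
    return object_type

# Memory currently holds objects in a list; a FIFO-style object gets a deque instead.
memory = {"generic": [1, 2]}
store(memory, "job-1", {"access": "fifo"})
```

Under these assumptions, the new object lands in a structure that differs from the one already in use, which mirrors the "first" versus "second" data structure distinction in the abstract.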

COMPUTING ARCHITECTURE
20220350745 · 2022-11-03 ·

A computing architecture comprises an off-chip memory, an on-chip cache unit, a prefetching unit, a global scheduler, a transmitting unit, a pre-recombination network, a post-recombination network, a main computing array, a write-back cache unit, a data dependence controller and an auxiliary computing array. The architecture reads data tiles into an on-chip cache in a prefetching mode and performs computing on the data tiles. While the tiles are being computed, a tile exchange network recombines the data structure, and a data dependence module handles any data dependence relationships that may exist between different tiles. The computing architecture increases the data utilization rate and improves data processing flexibility, thereby reducing cache misses and memory bandwidth pressure.

METHOD AND APPARATUS FOR DATA CACHING

The present invention provides a method and apparatus for data caching. In the method, output matrices are acquired one by one and written alternately into two queue sets of a first cache unit in the order in which they are acquired. The output matrices stored line by line in the first cache unit are then written into a second cache unit one by one. In the order in which the output matrices are written into the second cache unit, the valid data of each output matrix is determined one by one according to preset parameters and written into a third cache unit. The valid data stored in the third cache unit is configured to be written sequentially into a memory in the order in which it was written into the third cache unit. In the present solution, the output matrices are cached in cache units whose writing speed matches the computing speed of a processor, and the output matrices are completely written into the memory one by one in order of generation time. The present invention may therefore solve the problem that the computing speed of the processor does not match the writing speed of the memory.
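
The three-stage flow in this abstract can be sketched in software as follows. This is only an illustrative model: the function name `cache_pipeline` is hypothetical, and the "preset parameters" are assumed here to be a count of valid rows and columns in the top-left corner of each output matrix, which the abstract does not state.

```python
from collections import deque

def cache_pipeline(matrices, valid_rows, valid_cols):
    """Model the three cache units; returns the data in the order it reaches memory."""
    # First cache unit: two queue sets filled alternately in arrival order.
    queue_sets = (deque(), deque())
    for i, matrix in enumerate(matrices):
        queue_sets[i % 2].append(matrix)

    # Second cache unit: matrices are drained one by one, alternating between
    # the queue sets, which preserves the original arrival order.
    second = deque()
    while queue_sets[0] or queue_sets[1]:
        for queue in queue_sets:
            if queue:
                second.append(queue.popleft())

    # Third cache unit: keep only the (assumed) valid top-left block of each matrix.
    third = deque(
        [row[:valid_cols] for row in matrix[:valid_rows]] for matrix in second
    )

    # Memory: written sequentially in the order the data entered the third unit.
    return list(third)
```

The alternating queue sets stand in for a ping-pong buffer: one set can be filled while the other is drained, which is one plausible reading of how the cache units keep pace with the processor.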

Systems and methods for energy-efficient data processing

An energy-efficient sequencer comprising inline multipliers and adders causes a read source that contains matching values to output an enable signal to enable a data item prior to using a multiplier to multiply the data item with a weight to obtain a product for use in a matrix-multiplication in hardware. A second enable signal causes the output to be written to the data item.

COMBINED ON-PACKAGE AND OFF-PACKAGE MEMORY SYSTEM

A combined on-package and off-package memory system uses a custom base-layer within which are fabricated one or more dedicated interfaces to off-package memories. An on-package processor and on-package memories are also directly coupled to the custom base-layer. The custom base-layer includes memory management logic between the processor and memories (both off and on package) to steer requests. The memories are exposed as a combined memory space having greater bandwidth and capacity compared with either the off-package memories or the on-package memories alone. The memory management logic services requests while maintaining quality of service (QoS) to satisfy bandwidth requirements for each allocation. An allocation may include any combination of the on and/or off package memories. The memory management logic also manages data migration between the on and off package memories.
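
To make the idea of a combined memory space concrete, the sketch below steers a flat address to either the on-package or the off-package memory. The capacities, the simple "on-package first" address split, and the function name `steer` are assumptions for illustration only; the abstract does not describe the actual mapping or the QoS logic.

```python
# Hypothetical capacities of the two memory pools.
ON_PACKAGE_BYTES = 4 * 2**30
OFF_PACKAGE_BYTES = 12 * 2**30

def steer(address):
    """Map an address in the combined space to (pool, offset within that pool)."""
    if not 0 <= address < ON_PACKAGE_BYTES + OFF_PACKAGE_BYTES:
        raise ValueError("address outside the combined memory space")
    if address < ON_PACKAGE_BYTES:
        return ("on_package", address)
    return ("off_package", address - ON_PACKAGE_BYTES)
```

With this split, the combined space exposes the sum of both capacities, which matches the abstract's claim of greater capacity than either pool alone.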

Streaming engine with flexible streaming engine template supporting differing number of nested loops with corresponding loop counts and loop offsets
11481327 · 2022-10-25 ·

A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements for the nested loops. A stream head register stores data elements next to be supplied to functional units for use as operands. A stream template specifies a loop count and a loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently, enabling a trade-off between the number of loops supported and the size of the loop counts and loop dimensions.
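
The trade-off between loop count and field size can be illustrated with a small Python sketch. The 64-bit template word, the two layouts in `FORMATS`, and the field ordering are hypothetical choices made for this example; the abstract does not give the actual bit layout. Loop dimensions are treated here as address strides.

```python
from itertools import product

# Hypothetical layouts: format code -> (number of loops, bits per field).
# Each loop uses two fields (count, then dimension); the innermost loop
# occupies the lowest bits. Both layouts fill the same 64 template bits.
FORMATS = {0: (2, 16), 1: (4, 8)}

def decode_template(word, fmt):
    """Split a template word into [(count, dim), ...], innermost loop first."""
    num_loops, bits = FORMATS[fmt]
    mask = (1 << bits) - 1
    loops = []
    for i in range(num_loops):
        count = (word >> (2 * i * bits)) & mask
        dim = (word >> ((2 * i + 1) * bits)) & mask
        loops.append((count, dim))
    return loops

def addresses(base, loops):
    """Yield element addresses; the innermost loop (loops[0]) varies fastest."""
    ranges = [range(count) for count, _ in reversed(loops)]
    dims = [dim for _, dim in reversed(loops)]
    for index in product(*ranges):
        yield base + sum(i * d for i, d in zip(index, dims))

# Two loops: inner count 3 with stride 1, outer count 2 with stride 10.
word = 3 | (1 << 16) | (2 << 32) | (10 << 48)
stream = list(addresses(100, decode_template(word, 0)))
```

Decoding the same `word` with format code 1 yields four loops with 8-bit fields instead, which is the reinterpretation of identical template bits that the abstract describes.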

PROCESSING METHOD AND ACCELERATING DEVICE
20220335299 · 2022-10-20 ·

The present disclosure provides a processing device including: a coarse-grained pruning unit configured to perform coarse-grained pruning on the weights of a neural network to obtain pruned weights, and an operation unit configured to train the neural network according to the pruned weights. The coarse-grained pruning unit is specifically configured to select M weights from the weights of the neural network through a sliding window and, when the M weights meet a preset condition, to set all or part of the M weights to 0. The processing device can reduce memory access while reducing the amount of computation, thereby achieving a speedup and reducing energy consumption.
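
A minimal sketch of sliding-window coarse-grained pruning over a one-dimensional weight list is shown below. The preset condition is assumed here to be "mean absolute value below a threshold", and the function name `coarse_prune` and its parameters are hypothetical; the abstract leaves the condition and the window shape open.

```python
def coarse_prune(weights, window, stride, threshold):
    """Zero each window of `window` consecutive weights whose mean magnitude
    is below `threshold`. Windows are evaluated on the original weights."""
    pruned = list(weights)
    for start in range(0, len(weights) - window + 1, stride):
        group = weights[start:start + window]
        if sum(abs(w) for w in group) / window < threshold:
            pruned[start:start + window] = [0.0] * window
    return pruned

# Windows of M = 2 weights; the first and last windows fall below the threshold.
result = coarse_prune([0.01, -0.02, 0.9, 0.8, 0.03, 0.01], window=2, stride=2, threshold=0.1)
# result == [0.0, 0.0, 0.9, 0.8, 0.0, 0.0]
```

Pruning whole groups rather than individual weights keeps the zeros contiguous, which is what lets hardware skip both the memory accesses and the multiplications for a pruned group.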

ACCELERATION SYSTEM, METHOD AND STORAGE MEDIUM BASED ON CONVOLUTIONAL NEURAL NETWORK
20230128529 · 2023-04-27 ·

An acceleration system includes: a direct memory accessor configured to store a computation graph, a first data stream lake buffer and a second data stream lake buffer, the first data stream lake buffer being configured to cache the computation graph; an arithmetic unit configured to obtain an i-th layer of computing nodes of the computation graph to obtain an (i+1)-th layer of computing nodes; and a first fan-out device configured to replicate the (i+1)-th layer of computing nodes and store the same in the direct memory accessor and the second data stream lake buffer, respectively. The arithmetic unit extracts the (i+1)-th layer of computing nodes from the second data stream lake buffer to obtain an (i+2)-th layer of computing nodes, and the above steps are repeated until the n-th layer of computing nodes is obtained, where 1≤i≤n-3, n≥4, i is a positive integer, and n is a positive integer.

PROCESSING UNIT ARCHITECTURES AND TECHNIQUES FOR REUSABLE INSTRUCTIONS AND DATA

A computing system can include an off-chip memory and processing unit integrated circuitry (IC). The processing unit IC can include on-chip compute circuitry, a first on-chip memory and a second on-chip memory. The off-chip memory can be configured to store instructions and data. The first on-chip memory can be configured to store reusable portions of the instructions and/or data for use by the on-chip compute circuitry. The second on-chip memory can be configured to cache portions of the instructions and data for current use by the on-chip compute circuitry.

SYSTEM AND METHOD FOR MEMORY COMPRESSION FOR DEEP LEARNING NETWORKS

A system and method for memory compression for deep learning networks. The method includes: compacting an input data stream by identifying a bit width necessary to accommodate the value from the input data stream with the highest magnitude; storing the least significant bits of each value of the input data stream in a first memory store, the number of bits equal to the bit width, wherein if the value requires more bits than those currently left unused in the first memory store, the remaining bits are written into a second memory store; outputting the value of the first memory store, as a consecutive part of a compressed data stream, with an associated width of the data in the first memory store when the first memory store becomes full, and copying the value of the second memory store to the first memory store; and decompressing the compressed data stream.
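
The packing scheme can be modelled in a few lines of Python. This sketch assumes non-negative integers, a 16-bit memory store (`WORD`), and a bit width no larger than `WORD`; the names `bit_width`, `pack`, and `unpack` are hypothetical, and sign handling is omitted. A single Python integer stands in for both stores: its low 16 bits play the role of the first memory store and any higher bits play the role of the second.

```python
WORD = 16  # hypothetical width in bits of each memory store

def bit_width(values):
    """Bit width needed for the largest value (at least 1)."""
    return max(max(values).bit_length(), 1)

def pack(values, width):
    """Pack the low `width` bits of each value into WORD-bit words."""
    mask = (1 << width) - 1
    words, stores, used = [], 0, 0
    for value in values:
        stores |= (value & mask) << used  # bits past WORD spill into the second store
        used += width
        if used >= WORD:
            words.append(stores & ((1 << WORD) - 1))  # emit the full first store
            stores >>= WORD                           # copy second store into first
            used -= WORD
    if used:
        words.append(stores)  # flush the partially filled first store
    return words

def unpack(words, width, count):
    """Rebuild `count` values of `width` bits from the packed words."""
    stream = 0
    for i, word in enumerate(words):
        stream |= word << (i * WORD)
    mask = (1 << width) - 1
    return [(stream >> (i * width)) & mask for i in range(count)]

values = [5, 3, 7, 1, 6, 2]
width = bit_width(values)     # 3 bits per value
packed = pack(values, width)  # 18 bits of payload in two 16-bit words
assert unpack(packed, width, len(values)) == values
```

Six values that would occupy six 16-bit words uncompressed fit in two words here, at the cost of storing the width alongside the compressed data.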