G06F2212/454

COMPUTER-READABLE RECORDING MEDIUM STORING DATA PLACEMENT PROGRAM, PROCESSOR, AND DATA PLACEMENT METHOD
20220405204 · 2022-12-22

A data placement program causes a computer to execute a process of data placement in a main memory and a cache memory. When an operation is performed using a plurality of first data groups and a plurality of second data groups to generate a plurality of pieces of operation result data representing results of the operation, the process includes: determining, based on a size of one piece of the operation result data and a size of an operation result area that stores some of the pieces of operation result data in the cache memory, a number of the first data groups and a number of the second data groups that correspond to those pieces of operation result data; and placing the plurality of first data groups and the plurality of second data groups in the main memory based on the determined number of the first data groups and the determined number of the second data groups.
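The sizing step can be illustrated with a minimal sketch. The abstract does not say how the cache capacity is divided between the two kinds of data groups, so the near-square split below is an assumption, and the names `tile_counts` and `place_in_blocks` are hypothetical.

```python
from math import isqrt


def tile_counts(result_bytes, result_area_bytes):
    """Return (n_first, n_second) such that n_first * n_second results fit.

    Assumption: one result is produced per (first group, second group) pair,
    and the counts are chosen close to a square split.
    """
    capacity = result_area_bytes // result_bytes  # results that fit in the area
    if capacity < 1:
        raise ValueError("operation result area is smaller than one result")
    n_first = isqrt(capacity)
    n_second = capacity // n_first
    return n_first, n_second


def place_in_blocks(groups, block):
    """Lay data groups out in main memory as contiguous blocks of `block` groups."""
    return [groups[i:i + block] for i in range(0, len(groups), block)]
```

With a 64-byte result area and 4-byte results, `tile_counts(4, 64)` gives four first data groups and four second data groups per block.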

STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
20230053842 · 2023-02-23

A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements for the nested loops. A stream head register stores the data elements next to be supplied to functional units for use as operands. A stream template specifies a loop count and a loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently, enabling a trade-off between the number of loops supported and the size of the loop counts and loop dimensions.
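A rough sketch of the two ideas in this abstract follows: nested-loop address generation, and a format field that changes how the same template bits are read. The bit widths and format codes in `decode_counts` are invented for illustration and are not taken from the publication.

```python
from itertools import product


def stream_addresses(base, loops):
    """Yield element addresses for nested loops.

    loops: list of (count, dimension_in_bytes) pairs, innermost loop first.
    """
    # product() varies its last argument fastest, so reverse the list to
    # make loops[0] the innermost (fastest-changing) loop.
    ranges = [range(count) for count, _ in reversed(loops)]
    dims = [dim for _, dim in reversed(loops)]
    for idx in product(*ranges):
        yield base + sum(i * d for i, d in zip(idx, dims))


def decode_counts(bits, fmt):
    """Hypothetical format field: read one 32-bit word as two 16-bit loop
    counts (fmt 0) or four 8-bit loop counts (fmt 1), innermost first."""
    width, n = {0: (16, 2), 1: (8, 4)}[fmt]
    mask = (1 << width) - 1
    return [(bits >> (i * width)) & mask for i in range(n)]
```

For example, `list(stream_addresses(0x100, [(3, 4), (2, 32)]))` walks three 4-byte elements in each of two rows that are 32 bytes apart.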

SYSTEMS, METHODS, AND APPARATUS FOR TRANSFERRING DATA BETWEEN INTERCONNECTED DEVICES

A method for transferring data may include writing, from a producing device, data to a storage device through an interconnect, determining a consumer device for the data, prefetching the data from the storage device, and transferring, based on the determining, the data to the consumer device through the interconnect. The method may further comprise receiving, at a prefetcher for the storage device, an indication of a relationship between the producing device and the consumer device, and determining the consumer device based on the indication. The method may further comprise placing the data in a stream at the storage device based on the relationship between the producing device and the consumer device. The indication may be provided by an application associated with the consumer device. Receiving the indication may include receiving the indication through a coherent memory protocol for the interconnect.
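The flow described above can be modeled in a few lines. This is a toy software model only: the class and method names are hypothetical, and the interconnect and coherent memory protocol are not modeled.

```python
class StorageWithPrefetcher:
    """Toy model of a storage device whose prefetcher knows producer/consumer pairs."""

    def __init__(self):
        self.consumer_of = {}  # producer -> consumer, learned from indications
        self.streams = {}      # consumer (or None) -> data placed in that stream
        self.delivered = {}    # consumer -> data transferred to it

    def hint(self, producer, consumer):
        """Receive an indication of a producer/consumer relationship."""
        self.consumer_of[producer] = consumer

    def write(self, producer, data):
        """Store data from a producer and forward it if a consumer is known."""
        consumer = self.consumer_of.get(producer)
        # Place the data in a stream chosen by the relationship.
        self.streams.setdefault(consumer, []).append(data)
        if consumer is not None:
            # "Prefetch" and transfer without waiting for a request.
            self.delivered.setdefault(consumer, []).append(data)
```

Data written by a producer with no known consumer stays in the default stream and is not forwarded.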

Method, device and storage medium for processing overhead of memory access

A method for processing overhead of memory access includes: applying for a memory configured to perform value padding on at least one convolution operation in a deep learning model; determining input data of the deep learning model; performing deep learning processing on the input data by using the deep learning model; and releasing the memory after performing the deep learning processing.
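One way to read this abstract is that the padding memory is obtained once, reused during processing, and released afterwards. The sketch below follows that reading with plain Python lists; the function names and the 2D layout are assumptions rather than details from the source.

```python
from contextlib import contextmanager


@contextmanager
def padding_buffer(height, width, pad, value=0.0):
    """Apply for a padded buffer once and release it when processing ends."""
    buf = [[value] * (width + 2 * pad) for _ in range(height + 2 * pad)]
    try:
        yield buf
    finally:
        buf.clear()  # release the memory after deep learning processing


def pad_into(buf, image, pad):
    """Copy an image into the interior of the buffer; the border keeps the padding value."""
    for r, row in enumerate(image):
        buf[r + pad][pad:pad + len(row)] = row
    return buf
```

A convolution inside the `with` block can then read the padded input directly from `buf` instead of allocating a new padded copy for each call.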

Pseudo-first in, first out (FIFO) tag line replacement

A method is provided that includes searching tags in a tag group comprised in a tagged memory system for an available tag line during a clock cycle, wherein the tagged memory system includes a plurality of tag lines having respective tags and wherein the tags are divided into a plurality of non-overlapping tag groups, and searching tags in a next tag group of the plurality of tag groups for an available tag line during a next clock cycle when the searching in the tag group does not find an available tag line.
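The group-by-group search can be sketched as follows, with one call to `clock()` standing in for one clock cycle. Wrapping back to the first group and staying on a group after a hit are assumptions, and the class name is hypothetical.

```python
class TagSearch:
    """Search one tag group per clock cycle for an available tag line."""

    def __init__(self, in_use, group_size):
        if len(in_use) % group_size:
            raise ValueError("tag lines must divide evenly into groups")
        self.in_use = in_use  # one bool per tag line, True = occupied
        self.group_size = group_size
        self.num_groups = len(in_use) // group_size
        self.group = 0        # group searched on the next cycle

    def clock(self):
        """Search the current group; return an available line index or None."""
        start = self.group * self.group_size
        for line in range(start, start + self.group_size):
            if not self.in_use[line]:
                return line
        # Nothing available here: search the next group on the next cycle.
        self.group = (self.group + 1) % self.num_groups
        return None
```

Searching only one group per cycle limits the number of tags compared at once, at the cost of possibly needing several cycles to find a free line.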

Adaptive Address Tracking
20230052043 · 2023-02-16

Described apparatuses and methods track access metadata pertaining to activity within respective address ranges. The access metadata can be used to inform prefetch operations within the respective address ranges. The prefetch operations may involve deriving access patterns from access metadata covering the respective ranges. Suitable address range sizes for accurate pattern detection, however, can vary significantly from region to region of the address space based on, inter alia, workloads produced by programs utilizing the regions. Advantageously, the described apparatuses and methods can adapt the address ranges covered by the access metadata for improved prefetch performance. A data structure may be used to manage the address ranges in which access metadata are tracked. The address ranges can be adapted to improve prefetch performance through low-overhead operations implemented within the data structure. The data structure can encode hierarchical relationships that ensure the resulting address ranges are distinct.
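A minimal sketch of adapting tracked ranges is shown below. It assumes a binary split of half-open ranges, which keeps the ranges distinct by construction. The abstract does not specify the data structure, so `RangeTracker`, its plain access counter, and the reset of metadata on a split are all illustrative assumptions.

```python
class RangeTracker:
    """Track access metadata per address range and refine ranges by splitting."""

    def __init__(self, start, end):
        self.ranges = {(start, end): 0}  # half-open range -> access count

    def find(self, addr):
        """Return the tracked range that covers addr."""
        for start, end in self.ranges:
            if start <= addr < end:
                return (start, end)
        raise KeyError(addr)

    def record(self, addr):
        """Update the access metadata of the range covering addr."""
        self.ranges[self.find(addr)] += 1

    def split(self, addr):
        """Replace the range covering addr with its two halves."""
        start, end = self.find(addr)
        mid = (start + end) // 2
        if mid == start:
            return  # a single-address range cannot be split further
        del self.ranges[(start, end)]
        self.ranges[(start, mid)] = 0  # assumption: metadata restarts after a split
        self.ranges[(mid, end)] = 0
```

Because every range comes from halving a parent range, no two tracked ranges overlap.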

Memory shapes

A user definition of a memory shape can be received and a multidimensional, contiguous, physical portion of a memory array can be allocated according to the memory shape. The user definition of the memory shape can include a quantity of contiguous columns of the memory array, a quantity of contiguous rows of the memory array, and a major dimension of the memory shape. The major dimension can correspond to a dimension by which to initially stride data stored in the memory shape.
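As an illustration, a memory shape could be represented as below. The field names, and the reading of "major dimension" as the dimension along which consecutive elements are placed first, are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryShape:
    rows: int   # quantity of contiguous rows of the memory array
    cols: int   # quantity of contiguous columns of the memory array
    major: str  # "row" or "col": dimension strided first

    def locate(self, origin_row, origin_col, i):
        """Map the i-th stored element to a (row, column) of the memory array."""
        if not 0 <= i < self.rows * self.cols:
            raise IndexError(i)
        if self.major == "row":
            r, c = divmod(i, self.cols)  # fill along a row, then move down
        else:
            c, r = divmod(i, self.rows)  # fill down a column, then move right
        return origin_row + r, origin_col + c
```

For a 2 x 3 shape allocated at row 10, column 20, element 4 lands at `(11, 21)` when the major dimension is `"row"` and at `(10, 22)` when it is `"col"`.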

METHODS AND APPARATUS FOR ALLOCATION IN A VICTIM CACHE SYSTEM

Methods, apparatus, systems and articles of manufacture are disclosed for allocation in a victim cache system. An example apparatus includes a first cache storage, a second cache storage, and a cache controller coupled to the first cache storage and the second cache storage. The cache controller is operable to receive a memory operation that specifies an address, determine, based on the address, that the memory operation evicts a first set of data from the first cache storage, determine that the first set of data is unmodified relative to an extended memory, and cause the first set of data to be stored in the second cache storage.
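The allocation rule can be sketched with a direct-mapped first cache and a small second (victim) cache. The handling of modified lines (write-back to extended memory) and the oldest-first replacement in the victim cache are assumptions that the abstract does not state, and the class is hypothetical.

```python
from collections import OrderedDict


class VictimCacheSystem:
    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.victim_entries = victim_entries
        self.main = {}               # set index -> (address, data, dirty)
        self.victim = OrderedDict()  # address -> data, oldest entry first
        self.memory = {}             # stand-in for the extended memory

    def allocate(self, address, data, dirty=False):
        """Place a line in the first cache and handle any line it evicts."""
        index = address % self.num_sets
        old = self.main.get(index)
        if old is not None and old[0] != address:
            old_addr, old_data, old_dirty = old
            if old_dirty:
                self.memory[old_addr] = old_data  # assumption: write back
            else:
                # Unmodified relative to extended memory: keep it in the victim cache.
                self.victim[old_addr] = old_data
                if len(self.victim) > self.victim_entries:
                    self.victim.popitem(last=False)
        self.main[index] = (address, data, dirty)
```

A later access that misses in the first cache could then check `victim` before going to extended memory.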

Deep-learning-based image processing method and system

An image processing method is provided, which is applied to a deep learning model. A cache queue is provided in front of each layer of the deep learning model. A plurality of computation tasks are preset for each layer in advance; the tasks compute, in parallel, the weight parameters and corresponding to-be-processed data in a plurality of channels of the layer, and store the computation result in the cache queue behind that layer. As soon as the cache queue in front of a layer contains a computation result stored by the previous layer, the layer can obtain its to-be-processed data from that result and perform its own computation, so that the layers also form a parallel pipeline. In this way, the throughput during image processing is markedly improved, as are the parallelism and speed of image processing and the computation performance of the deep learning model. Further provided are an image processing device and system, which have the same beneficial effects as the image processing method.
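The queue-per-layer pipeline can be sketched with standard-library threads. In this sketch each layer is one worker thread, channels are items in a list, and per-channel work is sent to a shared thread pool; the function names and the `STOP` sentinel are illustrative choices, not part of the described method.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

STOP = object()  # sentinel marking the end of the input stream


def layer_worker(fn, q_in, q_out, pool):
    """Run one layer: take data from the queue in front, put results behind."""
    while True:
        item = q_in.get()
        if item is STOP:
            q_out.put(STOP)
            return
        # One computation task per channel, executed in parallel.
        q_out.put(list(pool.map(fn, item)))


def run_pipeline(layer_fns, inputs):
    """Push inputs through the layers; layers overlap on different inputs."""
    queues = [queue.Queue() for _ in range(len(layer_fns) + 1)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        threads = [
            threading.Thread(target=layer_worker,
                             args=(fn, queues[i], queues[i + 1], pool))
            for i, fn in enumerate(layer_fns)
        ]
        for t in threads:
            t.start()
        for x in inputs:
            queues[0].put(x)
        queues[0].put(STOP)
        for t in threads:
            t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)
```

While the second layer processes the first input, the first layer can already work on the second input, which is where the throughput gain comes from.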

SYSTEM AND METHOD FOR IMPLEMENTING A NETWORK-INTERFACE-BASED ALLREDUCE OPERATION

An apparatus is provided that includes a network interface to transmit and receive data packets over a network; a memory including one or more buffers; an arithmetic logic unit to perform arithmetic operations for organizing and combining the data packets; and circuitry to receive, via the network interface, data packets from the network; aggregate, via the arithmetic logic unit, the received data packets in the one or more buffers at a network rate; and transmit, via the network interface, the aggregated data packets to one or more compute nodes in the network. This reduces the latency incurred in combining the received data packets and transmitting the aggregated data packets, and hence accelerates a bulk data allreduce operation. One embodiment provides a system and method for performing the allreduce operation. During operation, the system performs the allreduce operation by pacing network operations to enhance performance of the allreduce operation.
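The data flow of the allreduce can be shown with a small software model. It assumes a sum reduction over equal-length payloads; the hardware aspects (network-rate aggregation, pacing) are not modeled, and the function names are hypothetical.

```python
def aggregate(packets):
    """Combine packet payloads element by element into one buffer."""
    it = iter(packets)
    buffer = list(next(it))  # the first payload initializes the buffer
    for payload in it:
        if len(payload) != len(buffer):
            raise ValueError("payload lengths differ")
        for i, value in enumerate(payload):
            buffer[i] += value
    return buffer


def allreduce(contributions):
    """contributions: node -> payload. Every node receives the combined result."""
    reduced = aggregate(contributions.values())
    return {node: list(reduced) for node in contributions}
```

For example, `allreduce({"a": [1, 2], "b": [3, 4], "c": [5, 6]})` returns `[9, 12]` for each of the three nodes.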