Patent classifications
G06F12/0284
DETERMINISTIC MEMORY FOR TENSOR STREAMING PROCESSORS
Embodiments are directed to a deterministic streaming system with one or more deterministic streaming processors each having an array of processing elements and a first deterministic memory coupled to the processing elements. The deterministic streaming system further includes a second deterministic memory with multiple data banks having a global memory address space, and a controller. The controller initiates retrieval of first data from the data banks of the second deterministic memory as a first plurality of streams, each stream of the first plurality of streams streaming toward a respective group of processing elements of the array of processing elements. The controller further initiates writing of second data to the data banks of the second deterministic memory as a second plurality of streams, each stream of the second plurality of streams streaming from the respective group of processing elements toward a respective data bank of the second deterministic memory.
DYNAMICALLY TUNING HOST PERFORMANCE BOOSTER THRESHOLDS
Methods, systems, and devices for dynamically tuning host performance booster thresholds are described. A memory system may include a set of memory devices and an interface configured to communicate commands with a host system coupled with the memory system. The interface may communicate commands to the memory system according to a first command mode associated with a logical address space including a plurality of regions and communicate commands according to a second command mode associated with physical memory address. The memory system may further include a controller that may determine a region activated for the second command mode, receive a first plurality of commands, determine, upon deactivating the region, a first threshold based on a first quantity of read commands serviced according to the second command mode. The controller may activate the region for the second command based on a second quantity of read commands received exceeding the first threshold.
Methods and arrangements to manage memory in cascaded neural networks
Logic may reduce the size of runtime memory for deep neural network inference computations. Logic may determine, for two or more stages of a neural network, a count of shared block allocations, or shared memory block allocations, that concurrently exist during execution of the two or more stages. Logic may compare counts of the shared block allocations to determine a maximum count of the counts. Logic may reduce inference computation time for deep neural network inference computations. Logic may determine a size for each of the shared block allocations of the count of shared memory block allocations, to accommodate data to store in a shared memory during execution of the two or more stages of the cascaded neural network. Logic may determine a batch size per stage of the two or more stages of a cascaded neural network based on a lack interdependencies between input data.
COMPUTATIONAL MEMORY
An example device includes a plurality of computational memory banks. Each computational memory bank of the plurality of computational memory banks includes an array of memory units and a plurality of processing elements connected to the array of memory units. The device further includes a plurality of single instruction, multiple data (SIMD) controllers. Each SIMD controller of the plurality of SIMD controllers is contained within at least one computational memory bank of the plurality of computational memory banks. Each SIMD controller is to provide instructions to the at least one computational memory bank.
SYSTEMS AND METHODS FOR READING AND WRITING SPARSE DATA IN A NEURAL NETWORK ACCELERATOR
Disclosed herein includes a system, a method, and a device for reading and writing sparse data in a neural network accelerator. A mask identifying byte positions within a data word having non-zero values in memory can be accessed. Each bit of the mask can have a first value or a second value, the first value indicating that a byte of the data word corresponds to a non-zero byte value, the second value indicating that the byte of the data word corresponds to a zero byte value. The data word can be modified to have non-zero byte values stored at an end of a first side of the data word in the memory, and any zero byte values stored in a remainder of the data word. The modified data word can be written to the memory via at least a first slice of a plurality of slices that is configured to access the first side of the data word in the memory.
Banked memory architecture for multiple parallel datapath channels in an accelerator
The present disclosure relates to devices and methods for using a banked memory structure with accelerators. The devices and methods may segment and isolate dataflows in datapath and memory of the accelerator. The devices and methods may provide each data channel with its own register memory bank. The devices and methods may use a memory address decoder to place the local variables in the proper memory bank.
Method for transferring packets of a communication protocol
A method for transferring packets of a communication protocol via a memory-based interface between two processing units. The method includes providing, in each of the processing units, a send area including a read index section, a write index section, and a send buffer, and a receive area including a read index section, a write index section and a receive buffer. Each processing unit repeats as sending steps: reading a read index from the receive area; writing at least one send packet into the send buffer (from a starting write address to an ending write address, the ending write address maximally corresponding to a buffer address assigned to the read read index, and writing a changed write index into the send area.
MEMORY TIERING TECHNIQUES IN COMPUTING SYSTEMS
Techniques of memory tiering in computing devices are disclosed herein. One example technique includes retrieving, from a first tier in a first memory, data from a data portion and metadata from a metadata portion of the first tier upon receiving a request to read data corresponding to a system memory section. The method can then include analyzing the data location information to determine whether the first tier currently contains data corresponding to the system memory section in the received request. In response to determining that the first tier currently contains data corresponding to the system memory section in the received request, transmitting the retrieved data from the data portion of the first memory to the processor in response to the received request. Otherwise, the method can include identifying a memory location in the first or second memory that contains data corresponding to the system memory section and retrieving the data from the identified memory location.
ARRAY ACCESS WITH RECEIVER MASKING
Methods, systems, and devices for array access with receiver masking are described. A first device may issue to a second device a first sequence of write commands for a set of data. The first sequence of write commands may indicate different memory addresses in an order. After issuing the first sequence of write commands, the first device may issue to the second device a second sequence of read commands for the set of data. The second sequence of read commands may indicate the different memory addresses in the same order as the first sequence of write commands. Based on issuing the second sequence of read commands, the first device may receive the set of data from the second device.
Performing multiple point table lookups in a single cycle in a system on chip
In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.