Memory devices and methods which may facilitate tensor memory access with memory maps based on memory operations

Examples described herein include systems and methods which include an apparatus comprising a memory array including a plurality of memory cells and a memory controller coupled to the memory array. The memory controller comprises a memory mapper configured to generate a memory map based on a memory command associated with a memory access operation. The memory map comprises a specific sequence of memory access instructions to access at least one memory cell of the memory array. For example, the specific sequence of memory access instructions for a diagonal memory command comprises a sequence of memory access instructions that each access a memory cell along a diagonal of the memory array.
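
A minimal sketch of the diagonal-access idea, assuming a row-major array layout; the function names and address arithmetic below are illustrative, not the patent's implementation:

```python
# Hypothetical sketch: expanding a "diagonal" memory command into an
# ordered sequence of cell accesses for a rows x cols memory array.
# Names and layout are assumptions for illustration only.

def diagonal_access_sequence(rows: int, cols: int) -> list[tuple[int, int]]:
    """Return (row, col) coordinates along the main diagonal."""
    return [(i, i) for i in range(min(rows, cols))]

def map_to_addresses(seq, cols: int, base: int = 0) -> list[int]:
    """Translate (row, col) coordinates to flat row-major addresses."""
    return [base + r * cols + c for r, c in seq]

if __name__ == "__main__":
    seq = diagonal_access_sequence(4, 4)
    print(seq)                       # [(0, 0), (1, 1), (2, 2), (3, 3)]
    print(map_to_addresses(seq, 4))  # [0, 5, 10, 15]
```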

PROCESSING-IN-MEMORY CONCURRENT PROCESSING SYSTEM AND METHOD
20230099163 · 2023-03-30

A processing system includes a processing unit and a memory device. The memory device includes a processing-in-memory (PIM) module that performs processing operations on behalf of the processing unit. An instruction set architecture (ISA) of the PIM module has fewer instructions than an ISA of the processing unit. Instructions received from the processing unit are translated such that processing resources of the PIM module are virtualized. As a result, the PIM module concurrently performs processing operations for multiple threads or applications of the processing unit.
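
A rough sketch of the translation and virtualization idea, assuming host instructions are lowered to a smaller set of PIM opcodes and tagged per thread so one PIM module can be time-shared; all opcode, table, and function names below are invented for illustration:

```python
# Hypothetical sketch: lowering host instructions into a smaller PIM
# instruction set and tagging them by thread so a single PIM module can be
# shared (virtualized) across threads. Opcode names are illustrative.

HOST_TO_PIM = {            # host ISA (larger) -> PIM ISA (smaller)
    "FMA": ["PIM_MUL", "PIM_ADD"],
    "ADD": ["PIM_ADD"],
    "MUL": ["PIM_MUL"],
}

def translate(thread_id: int, host_ops: list[str]) -> list[tuple[int, str]]:
    """Lower host instructions to PIM instructions, tagged with the issuing thread."""
    pim_ops = []
    for op in host_ops:
        for pim_op in HOST_TO_PIM.get(op, []):
            pim_ops.append((thread_id, pim_op))
    return pim_ops

# Interleave translated streams from two threads onto one PIM module.
queue = translate(0, ["FMA", "ADD"]) + translate(1, ["MUL"])
for tid, op in queue:
    print(f"thread {tid}: issue {op}")
```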

METHOD, SYSTEM AND DEVICE FOR PARALLEL PROCESSING OF DATA, AND STORAGE MEDIUM
20230035910 · 2023-02-02

Provided are a method for the parallel processing of data, a device, and a storage medium. The method includes: identifying, from multiple first computing nodes, at least three first computing nodes which have a logical relationship, and defining the at least three first computing nodes which have the logical relationship as a first parallel node group, where the first parallel node group includes a first preceding node and at least two first subsequent nodes; acquiring a first input data model of the first preceding node and generating a first input tensor of the first preceding node; computing a first output tensor of the first preceding node according to the first input data model and the first input tensor; and acquiring a second input data model of the at least two first subsequent nodes and using the first output tensor as a second input tensor.
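
A minimal sketch of forming a parallel node group and propagating the tensor, assuming simple callables stand in for the input data models; the Node class and helper function are hypothetical, not the patent's API:

```python
# Hypothetical sketch: forming a "first parallel node group" (one preceding
# node with at least two subsequent nodes) and feeding the preceding node's
# output tensor to the subsequent nodes as their input tensor.
import numpy as np

class Node:
    def __init__(self, name, model):
        self.name = name
        self.model = model          # a callable standing in for the input data model
        self.successors = []

def find_parallel_group(nodes):
    """Return (preceding node, subsequent nodes) for the first node with >= 2 successors."""
    for node in nodes:
        if len(node.successors) >= 2:
            return node, node.successors
    return None

a = Node("A", lambda x: x * 2.0)
b = Node("B", lambda x: x + 1.0)
c = Node("C", lambda x: x - 1.0)
a.successors = [b, c]              # A precedes B and C -> a parallel node group

preceding, subsequents = find_parallel_group([a, b, c])
first_input = np.ones((2, 2))                  # first input tensor of A
first_output = preceding.model(first_input)    # first output tensor of A
# The subsequent nodes can now run in parallel on the shared output tensor.
results = [n.model(first_output) for n in subsequents]
print(results[0])
```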

CONTENT-ADDRESSABLE PROCESSING ENGINE

A content-addressable processing engine, also referred to herein as CAPE, is provided. Processing-in-memory (PIM) architectures attempt to overcome the von Neumann bottleneck by combining computation and storage logic into a single component. CAPE provides a general-purpose PIM microarchitecture that provides acceleration of vector operations while being programmable with standard reduced instruction set computing (RISC) instructions, such as RISC-V instructions with standard vector extensions. CAPE can be implemented as a standalone core that specializes in associative computing, and that can be integrated in a tiled multicore chip alongside other types of compute engines. Certain embodiments of CAPE achieve average speedups of 14× (up to 254×) over an area-equivalent out-of-order processor core tile with three levels of caches across a diverse set of representative applications.
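
A tiny sketch of the associative-computing pattern (parallel search, then masked update) that such an engine accelerates; this illustrates the computing model only, not CAPE's microarchitecture or its RISC-V vector interface:

```python
# Hypothetical sketch of associative (content-addressable) computing:
# compare all rows against a key in parallel, then update only the
# matching rows. Values are illustrative.
import numpy as np

memory = np.array([3, 7, 3, 9, 3], dtype=np.int32)   # contents of the array
match = (memory == 3)                                 # parallel compare (search)
memory[match] = 5                                     # masked parallel write (update)
print(memory)   # [5 7 5 9 5]
```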

Memory device for performing in-memory processing

A memory device includes: a plurality of in-memory operation units to perform in-memory processing of an operation pipelined in multiple pipeline stages; memory banks assigned to the plurality of in-memory operation units such that a set of n memory banks is assigned to each of the in-memory operation units, where n is a natural number, each memory bank performing an access operation on data requested by the in-memory operation units while the pipelined operation is performed; and a memory die in which the in-memory operation units, the memory banks, and command pads configured to receive a command signal from an external source are arranged. Each set of the n memory banks includes a first memory bank having a first data transmission distance to the command pads and a second memory bank having a second data transmission distance to the command pads that is larger than the first data transmission distance.
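
A small sketch of the bank-assignment constraint, using made-up distances; it only checks that each unit's set of n banks mixes a nearer and a farther bank relative to the command pads:

```python
# Hypothetical sketch: each in-memory operation unit is assigned n banks
# whose data-transmission distances to the command pads differ, so every
# set contains a "near" bank and a "far" bank. Distances are illustrative.

bank_distance = {0: 1.0, 1: 2.5, 2: 1.2, 3: 2.7}   # bank id -> distance to pads
units = {"unit0": [0, 3], "unit1": [2, 1]}          # each unit gets n = 2 banks

for unit, banks in units.items():
    dists = sorted(bank_distance[b] for b in banks)
    assert dists[0] < dists[-1]        # first (near) bank closer than second (far) bank
    print(unit, "near:", dists[0], "far:", dists[-1])
```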

Computational memory

An example device includes a plurality of computational memory banks. Each computational memory bank of the plurality of computational memory banks includes an array of memory units and a plurality of processing elements connected to the array of memory units. The device further includes a plurality of single instruction, multiple data (SIMD) controllers. Each SIMD controller of the plurality of SIMD controllers is contained within at least one computational memory bank of the plurality of computational memory banks. Each SIMD controller is to provide instructions to the at least one computational memory bank.
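
A minimal sketch of the SIMD idea, assuming each processing element owns one slice of its bank's memory; the class and method names are illustrative:

```python
# Hypothetical sketch: one SIMD controller per computational memory bank
# broadcasts a single instruction that every processing element applies to
# its own slice of the bank's memory.
import numpy as np

class ComputationalBank:
    def __init__(self, num_pes: int, words_per_pe: int):
        # Each row of `memory` is the slice owned by one processing element.
        self.memory = np.zeros((num_pes, words_per_pe), dtype=np.int32)

    def broadcast(self, op, operand):
        """SIMD controller: every PE executes the same op on its own slice."""
        self.memory = op(self.memory, operand)

bank = ComputationalBank(num_pes=4, words_per_pe=8)
bank.broadcast(np.add, 3)        # single instruction, applied by all PEs
bank.broadcast(np.multiply, 2)
print(bank.memory[0])            # every element is (0 + 3) * 2 = 6
```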

Packet identification (ID) assignment for routing network
11615052 · 2023-03-28

Some examples described herein relate to packet identification (ID) assignment for a routing network in a programmable integrated circuit (IC). In an example, a design system includes a processor and a memory coupled to the processor. The memory stores instruction code. The processor is configured to execute the instruction code to construct an interference graph based on routes of logical nets through switches in a routing network, and to assign identifications to the routes by performing vertex coloring of vertices of the interference graph. The interference graph includes the vertices and interference edges. Each vertex represents one of the logical nets having a route. Each interference edge connects two vertices that represent corresponding two logical nets that have routes that share at least one port of a switch. The identifications correspond to values assigned to the vertices by the vertex coloring.
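
A minimal sketch of ID assignment via vertex coloring, assuming a simple greedy coloring (the abstract does not prescribe a particular coloring algorithm) and made-up routes:

```python
# Hypothetical sketch: build an interference graph from routes that share
# switch ports, then assign packet IDs by greedy vertex coloring.
from itertools import combinations

# route name -> set of (switch, port) resources used by that route
routes = {
    "netA": {("sw0", 1), ("sw1", 0)},
    "netB": {("sw0", 1), ("sw2", 3)},   # shares sw0 port 1 with netA
    "netC": {("sw2", 2)},
}

# Interference edge between any two routes that share at least one port.
edges = {frozenset((u, v)) for u, v in combinations(routes, 2)
         if routes[u] & routes[v]}

def greedy_color(vertices, edges):
    """Assign the smallest ID not already used by a colored neighbor."""
    ids = {}
    for v in vertices:
        used = {ids[u] for e in edges if v in e for u in e if u in ids}
        ids[v] = next(i for i in range(len(vertices)) if i not in used)
    return ids

print(greedy_color(list(routes), edges))   # e.g. {'netA': 0, 'netB': 1, 'netC': 0}
```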

Processing of universal number bit strings accumulated in memory array periphery
11487699 · 2022-11-01

Systems, apparatuses, and methods related to bit string accumulation in memory array periphery are described. Control circuitry (e.g., a processing device) may be utilized to control performance of operations using bit strings within a memory device. Results of the operations may be accumulated in circuitry peripheral to a memory array of the memory device. For instance, a method for bit string accumulation in memory array periphery can include performing a first operation using a first bit string and a second bit string and retrieving a third bit string from a memory array or a storage location located in the periphery of the memory array. The method can further include performing a second operation using the result of the first operation and the third bit string and storing the result of the second operation in the storage location located in the periphery of the memory array.
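
A minimal sketch of the two-step accumulation flow, with plain integers standing in for universal-number (posit) bit strings and a Python attribute standing in for the storage location in the array periphery:

```python
# Hypothetical sketch of the accumulation flow: a first operation on two bit
# strings, a second operation combining that result with a third bit string
# retrieved from peripheral storage, and the result written back to that
# peripheral storage. Integers stand in for posit bit strings.

class PeripheryAccumulator:
    def __init__(self):
        self.storage = 0                         # storage in the array periphery

    def accumulate(self, a: int, b: int) -> None:
        product = a * b                          # first operation on two bit strings
        self.storage = self.storage + product    # second operation with retrieved third bit string

acc = PeripheryAccumulator()
for a, b in [(2, 3), (4, 5), (1, 7)]:
    acc.accumulate(a, b)
print(acc.storage)   # 2*3 + 4*5 + 1*7 = 33
```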

Discrete Three-Dimensional Processor

A discrete three-dimensional (3-D) processor comprises communicatively coupled first and second dice. The first die comprises 3-D memory (3D-M) arrays, whereas the second die comprises at least one non-memory circuit and at least one off-die peripheral-circuit component of the 3D-M arrays. The first die does not comprise said off-die peripheral-circuit component. The non-memory circuit on the second die is not part of a memory.

ELIMINATING MEMORY BOTTLENECKS FOR DEPTHWISE CONVOLUTIONS

Certain aspects of the present disclosure provide techniques for efficient depthwise convolution. A convolution is performed with a compute-in-memory (CIM) array to generate CIM output, and at least a portion of the CIM output corresponding to a first output data channel, of a plurality of output data channels in the CIM output, is written to a digital multiply-accumulate (DMAC) activation buffer. A patch of the CIM output is read from the DMAC activation buffer, and weight data is read from a DMAC weight buffer. Multiply-accumulate (MAC) operations are performed with the patch of CIM output and the weight data to generate a DMAC output.
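
A rough sketch of the described dataflow using NumPy, with a random array standing in for the CIM output and illustrative buffer names; the shapes and the 3×3 kernel are assumptions:

```python
# Hypothetical sketch: stage one channel of (simulated) CIM convolution
# output in a DMAC activation buffer, read a patch of it plus a per-channel
# weight kernel, and perform the multiply-accumulate (MAC) to get DMAC output.
import numpy as np

cim_output = np.random.rand(8, 8, 4)                  # H x W x C output of the CIM array
channel = 0
dmac_activation_buffer = cim_output[:, :, channel]    # write one output channel to the buffer

dmac_weight_buffer = np.random.rand(3, 3)             # depthwise 3x3 kernel for that channel

patch = dmac_activation_buffer[0:3, 0:3]              # read a patch from the activation buffer
dmac_output = np.sum(patch * dmac_weight_buffer)      # MAC over patch and weights
print(dmac_output)
```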