Patent classifications
G06F15/785
Configurable storage circuits with embedded processing and control circuitry
An integrated circuit may have configurable storage blocks. A configurable storage block may include a memory array, a processing circuit, and a configurable control circuit. The configurable storage block may receive an instruction which may be decoded in the control block to identify a command. The command may be associated with a pre-defined sequence of operations that the control block executes by directing the memory array to perform memory access operations and the processing circuit to execute data processing operations. These data processing operations may be executed on data retrieved during memory access operations, data received subsequent to receiving the instruction, or previously computed data. The processed data may be provided for further processing outside the configurable storage block or stored in the memory array. The configurable storage block may further have delay blocks to allow for delayed memory access to the memory array.
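The decode-then-sequence behavior described in this abstract can be illustrated with a minimal software sketch. Everything named here (the `StorageBlock` class, the `READ`, `WRITE`, and `ACCUMULATE` commands, and the step names) is hypothetical and chosen only for illustration; the abstract does not specify an instruction format or a command set, and the delay blocks are not modeled.

```python
class StorageBlock:
    """Toy model: a memory array plus a table-driven control block."""

    def __init__(self, size):
        self.memory = [0] * size
        # Hypothetical, reconfigurable mapping from a command to its
        # pre-defined sequence of memory and processing steps.
        self.sequences = {
            "READ": ("load",),
            "WRITE": ("take_input", "store"),
            "ACCUMULATE": ("load", "add_input", "store"),
        }

    def execute(self, command, address, data=0):
        """Run the step sequence for one decoded command."""
        value = 0
        for step in self.sequences[command]:
            if step == "load":            # memory access: read
                value = self.memory[address]
            elif step == "take_input":    # data received with the instruction
                value = data
            elif step == "add_input":     # processing circuit operation
                value = value + data
            elif step == "store":         # memory access: write back
                self.memory[address] = value
        return value


block = StorageBlock(4)
block.execute("WRITE", 1, 10)
total = block.execute("ACCUMULATE", 1, 5)
```

In this sketch, reconfiguring the block amounts to replacing entries in `sequences`, so the same memory array and processing steps can serve different commands.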
Near-memory data reorganization engine
A memory subsystem package is provided that has processing logic for data reorganization within the memory subsystem package. The processing logic is adapted to reorganize data stored within the memory subsystem package. In some embodiments, the memory subsystem package includes memory units, a memory interconnect, and a data reorganization engine (DRE). The data reorganization engine includes a stream interconnect and DRE units including a control processor and a load-store unit. The control processor is adapted to execute instructions to control a data reorganization. The load-store unit is adapted to process data move commands received from the control processor via the stream interconnect for loading data from a load memory address of a memory unit and storing data to a store memory address of a memory unit.
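As a rough illustration of the split between the control processor and the load-store unit, the sketch below models data move commands as (load address, store address) pairs. The function names, the flat-list memory model, and the choice of a matrix transpose as the reorganization are all assumptions made for this example; a Python generator stands in for the stream interconnect.

```python
def control_processor(rows, cols, src_base, dst_base):
    """Yield (load_addr, store_addr) move commands that transpose a
    rows x cols row-major matrix. Hypothetical command format."""
    for r in range(rows):
        for c in range(cols):
            yield (src_base + r * cols + c, dst_base + c * rows + r)


def load_store_unit(memory, commands):
    """Process each move command: load from one address, store to another."""
    for load_addr, store_addr in commands:
        memory[store_addr] = memory[load_addr]


# A 2 x 3 matrix at addresses 0..5, with a destination region at 6..11.
memory = [1, 2, 3, 4, 5, 6] + [0] * 6
load_store_unit(memory, control_processor(2, 3, 0, 6))
transposed = memory[6:]
```

The source region is left untouched, and the destination region holds the 3 x 2 transpose in row-major order.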
Processor-guided execution of offloaded instructions using fixed function operations
Processor-guided execution of offloaded instructions using fixed function operations is disclosed. Instructions designated for remote execution by a target device are received by a processor. Each instruction includes, as an operand, a target register in the target device. The target register may be an architected virtual register. For each of the plurality of instructions, the processor transmits an offload request in the order that the instructions are received. The offload request includes the instruction designated for remote execution. The target device may be, for example, a processing-in-memory device or an accelerator coupled to a memory.
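The in-order offload flow can be sketched as follows. The `OffloadRequest` fields, the three opcodes (`load`, `add`, `store`), and the `TargetDevice` class are hypothetical; the abstract states only that each instruction names a target register and that requests are transmitted in the order the instructions are received.

```python
from dataclasses import dataclass


@dataclass
class OffloadRequest:
    seq: int          # position in the order the processor received it
    opcode: str       # hypothetical fixed-function operation name
    target_reg: int   # register in the target device
    address: int      # memory operand


def issue_offloads(instructions):
    """Wrap each instruction in a request, preserving the received order."""
    return [OffloadRequest(seq, op, reg, addr)
            for seq, (op, reg, addr) in enumerate(instructions)]


class TargetDevice:
    """Toy processing-in-memory device with a small register file."""

    def __init__(self, num_regs, memory):
        self.regs = [0] * num_regs
        self.memory = memory

    def handle(self, req):
        if req.opcode == "load":
            self.regs[req.target_reg] = self.memory[req.address]
        elif req.opcode == "add":
            self.regs[req.target_reg] += self.memory[req.address]
        elif req.opcode == "store":
            self.memory[req.address] = self.regs[req.target_reg]
        else:
            raise ValueError(f"unknown opcode: {req.opcode}")


memory = [2, 3, 0]
device = TargetDevice(num_regs=2, memory=memory)
requests = issue_offloads([("load", 0, 0), ("add", 0, 1), ("store", 0, 2)])
for req in requests:
    device.handle(req)
```

Because the second and third requests depend on the register written by the first, the result is only correct when the requests are handled in the order they were issued.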
Comparison operations in memory
Examples of the present disclosure provide apparatuses and methods related to performing comparison operations in a memory. An example apparatus might include a first group of memory cells coupled to a first access line and configured to store a first element. An example apparatus might also include a second group of memory cells coupled to a second access line and configured to store a second element. An example apparatus might also include sensing circuitry configured to compare the first element with the second element by performing a number of AND operations, OR operations, SHIFT operations, and INVERT operations without transferring data via an input/output (I/O) line.
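To show that a comparison can be built from only the four operation types named above, here is a minimal sketch on two unsigned integers. It is not the sequence used by the disclosed sensing circuitry, which operates on bit-vectors stored in memory cells; the function name, the `width` parameter, and the 1 / 0 / -1 return convention are assumptions for this example.

```python
def compare_bitwise(a, b, width=8):
    """Compare two unsigned `width`-bit values using only AND, OR,
    SHIFT, and INVERT. Returns 1 if a > b, -1 if a < b, 0 if equal."""
    mask = (1 << width) - 1

    def invert(x):
        return ~x & mask  # INVERT, limited to `width` bits

    # Bits where a and b differ (XOR built from AND, OR, INVERT).
    diff = (a & invert(b)) | (invert(a) & b)

    # Smear the highest differing bit down into every lower position.
    smear = diff
    shift = 1
    while shift < width:
        smear |= smear >> shift
        shift <<= 1

    # Isolate the most significant differing bit.
    msb = smear & invert(smear >> 1)

    # Whichever operand has a 1 at that bit is the larger one.
    if a & msb:
        return 1
    if b & msb:
        return -1
    return 0
```

The result is decided by the most significant bit at which the two operands differ, which is why the sketch first isolates that bit and then tests it in each operand.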
Smallest or largest value element determination
Examples of the present disclosure provide apparatuses and methods for smallest value element or largest value element determination in memory. An example method comprises: storing an elements vector comprising a plurality of elements in a group of memory cells coupled to an access line of an array; performing, using sensing circuitry coupled to the array, a logical operation using a first vector and a second vector as inputs, with a result of the logical operation being stored in the array as a result vector; updating the result vector responsive to performing a plurality of subsequent logical operations using the sensing circuitry; and providing an indication of which of the plurality of elements have one of a smallest value and a largest value.
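One well-known way to obtain such an indication with repeated logical operations is a bit-serial scan from the most significant bit downward, narrowing a candidate mask at each step. The sketch below uses that technique as a stand-in; it is not claimed to be the sequence of operations in the disclosure, and the function name and parameters are hypothetical.

```python
def extreme_mask(elements, width=8, largest=False):
    """Return a list of booleans marking which unsigned `width`-bit
    elements hold the smallest (or largest) value."""
    wanted = 1 if largest else 0
    candidates = [True] * len(elements)  # plays the role of the result vector
    for bit in range(width - 1, -1, -1):
        # Which elements have the preferred value at this bit position?
        has_wanted = [((e >> bit) & 1) == wanted for e in elements]
        narrowed = [c and h for c, h in zip(candidates, has_wanted)]
        # Narrow only if at least one candidate survives; otherwise all
        # remaining candidates agree at this bit and stay in play.
        if any(narrowed):
            candidates = narrowed
    return candidates


smallest = extreme_mask([5, 3, 7, 3])
largest = extreme_mask([5, 3, 7, 3], largest=True)
```

Ties are preserved: every element equal to the extreme value is marked, which matches the abstract's wording about indicating "which of the plurality of elements" hold it.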
3D-stacked memory with reconfigurable compute logic
A 3D-stacked memory device including: a base die including a plurality of switches to direct data flow and a plurality of arithmetic logic units (ALUs) to compute data; a plurality of memory dies stacked on the base die; and an interface to transfer signals to control the base die.
Tree-Based Network Architecture for Accelerating Machine Learning Collective Operations
Aspects of the disclosure are directed to a tree-based network architecture for serving and/or training machine learning models. The architecture includes one or more multi-chip packages having a plurality of compute-memory stacks connected via an input/output (I/O) die. The I/O die includes an aggregator to aggregate computations from the compute-memory stacks. The architecture can further include a plurality of the multi-chip packages connected on a server via a server level aggregator and a plurality of the servers connected on a rack via a rack level aggregator for further aggregation of the computations from the compute-memory stacks. The tree-based network architecture allows for fewer hops, resulting in lower latency and savings in bandwidth when serving and/or training machine learning models.
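The three aggregation levels can be sketched as nested element-wise sums. The nested-list layout (rack, then servers, then packages, then per-stack result vectors), the function names, and the use of summation as the aggregation are assumptions for illustration; the abstract does not state which reduction the aggregators perform.

```python
def elementwise_sum(vectors):
    """Sum equally sized vectors position by position."""
    return [sum(values) for values in zip(*vectors)]


def aggregate_package(stacks):
    """I/O die aggregator: combine results from compute-memory stacks."""
    return elementwise_sum(stacks)


def aggregate_server(packages):
    """Server-level aggregator: combine per-package results."""
    return elementwise_sum([aggregate_package(p) for p in packages])


def aggregate_rack(servers):
    """Rack-level aggregator: combine per-server results."""
    return elementwise_sum([aggregate_server(s) for s in servers])


# Hypothetical rack: server 0 has two packages of two stacks each,
# server 1 has one package of two stacks.
rack = [
    [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
    [[[1, 1], [1, 1]]],
]
result = aggregate_rack(rack)
```

In this model each partial result crosses at most three aggregation levels on its way to the root, regardless of how many stacks sit at the leaves.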