G06F15/80

Control barrier network for reconfigurable data processors

A processing system comprises a control bus and a plurality of logic units. The control bus is configurable by configuration data to form signal routes in a control barrier network coupled to processing units in an array of processing units. The plurality of logic units has inputs and outputs connected to the control bus and to the array of processing units. A logic unit in the plurality of logic units is operatively coupled to a processing unit in the array of processing units and is configurable by the configuration data to consume source tokens and a status signal from the processing unit on the inputs and to produce barrier tokens and an enable signal on the outputs based on the source tokens and the status signal on the inputs.

Multi-port memory architecture for a systolic array

A memory architecture and a processing unit that incorporates the memory architecture and a systolic array. The memory architecture includes: memory array(s) with multi-port (MP) memory cells; first wordlines connected to the cells in each row; and, depending upon the embodiment, second wordlines connected to diagonals of cells or diagonals of sets of cells. Data from a data input matrix is written to the memory cells during first port write operations using the first wordlines and read out from the memory cells during second port read operations using the second wordlines. Due to the diagonal orientation of the second wordlines and due to additional features (e.g., additional rows of memory cells that store static zero data values or read data mask generators that generate read data masks), data read from the memory architecture and input directly into a systolic array is in the proper order, as specified by a data setup matrix.

System having a hybrid threading processor, a hybrid threading fabric having configurable computing elements, and a hybrid interconnection network
11579887 · 2023-02-14 · ·

Representative apparatus, method, and system embodiments are disclosed for configurable computing. In a representative embodiment, a system includes an interconnection network, a processor, a host interface, and a configurable circuit cluster. The configurable circuit cluster may include a plurality of configurable circuits arranged in an array; an asynchronous packet network and a synchronous network coupled to each configurable circuit of the array; and a memory interface circuit and a dispatch interface circuit coupled to the asynchronous packet network and to the interconnection network. Each configurable circuit includes instruction or configuration memories for selection of a current data path configuration, a master synchronous network input, and a data path configuration for a next configurable circuit.

System having a hybrid threading processor, a hybrid threading fabric having configurable computing elements, and a hybrid interconnection network
11579887 · 2023-02-14 · ·

Representative apparatus, method, and system embodiments are disclosed for configurable computing. In a representative embodiment, a system includes an interconnection network, a processor, a host interface, and a configurable circuit cluster. The configurable circuit cluster may include a plurality of configurable circuits arranged in an array; an asynchronous packet network and a synchronous network coupled to each configurable circuit of the array; and a memory interface circuit and a dispatch interface circuit coupled to the asynchronous packet network and to the interconnection network. Each configurable circuit includes instruction or configuration memories for selection of a current data path configuration, a master synchronous network input, and a data path configuration for a next configurable circuit.

Technologies for providing shared memory for accelerator sleds

Technologies for providing shared memory for accelerator sleds includes an accelerator sled to receive, with a memory controller, a memory access request from an accelerator device to access a region of memory. The request is to identify the region of memory with a logical address. Additionally, the accelerator sled is to determine from a map of logical addresses and associated physical address, the physical address associated with the region of memory. In addition, the accelerator sled is to route the memory access request to a memory device associated with the determined physical address.

Discrete Three-Dimensional Processor

A discrete three-dimensional (3-D) processor comprises stacked first and second dice. The first die comprises 3-D memory (3D-M) arrays, whereas the second die comprises logic circuits and at least an off-die peripheral-circuit component of the 3D-M array(s). In one preferred embodiment, the first and second dice are face-to-face bonded. In another preferred embodiment, the first and second dice have a same die size.

Discrete Three-Dimensional Processor

A discrete three-dimensional (3-D) processor comprises first and second dice. The first die comprises 3-D memory (3D-M) arrays, whereas the second die comprises logic circuits and at least an off-die peripheral-circuit component of the 3D-M array(s). Typical off-die peripheral-circuit component could be an address decoder, a sense amplifier, a programming circuit, a read-voltage generator, a write-voltage generator, a data buffer, or a portion thereof.

Discrete Three-Dimensional Processor

A discrete three-dimensional (3-D) processor comprises stacked first and second dice. The first die comprises three-dimensional memory (3D-M) arrays, whereas the second die comprises at least a portion of a logic/processing circuit and an off-die peripheral-circuit component of the 3D-M array(s). The preferred 3-D processor can be used to compute non-arithmetic function/model. In other applications, the preferred 3-D processor may also be a 3-D configurable computing array, a 3-D pattern processor, or a 3-D neuro-processor.

OFFLOADING PROCESSING TASKS TO DECOUPLED ACCELERATORS FOR INCREASING PERFORMANCE IN A SYSTEM ON A CHIP

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.

BUILT-IN SELF-TEST FOR A PROGRAMMABLE VISION ACCELERATOR OF A SYSTEM ON A CHIP

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.