G06F9/30

INDEXING EXTERNAL MEMORY IN A RECONFIGURABLE COMPUTE FABRIC
20230053062 · 2023-02-16

Various examples are directed to systems and methods in which a flow controller of a first synchronous flow may receive an instruction to execute a first loop using the first synchronous flow. The flow controller may determine a first iteration index for a first iteration of the first loop. The flow controller may send, to a first compute element of the first synchronous flow, a first synchronous message to initiate a first synchronous flow thread for executing the first iteration of the first loop. The first synchronous message may comprise the first iteration index. The first compute element may execute an input/output operation at a first location of a first compute element memory indicated by the first iteration index.
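The message flow in this abstract can be illustrated with a minimal Python sketch. The class names (`SyncMessage`, `ComputeElement`, `FlowController`), the eight-word memory, and the doubling write-back are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
from dataclasses import dataclass, field


@dataclass
class SyncMessage:
    """Hypothetical synchronous message carrying one iteration index."""
    iteration_index: int


@dataclass
class ComputeElement:
    """Hypothetical compute element with a small local memory."""
    memory: list = field(default_factory=lambda: [0] * 8)

    def on_message(self, msg: SyncMessage) -> int:
        # The iteration index selects the memory location for the I/O operation.
        value = self.memory[msg.iteration_index]           # input (read)
        self.memory[msg.iteration_index] = value * 2       # output (illustrative write-back)
        return value


class FlowController:
    """Hypothetical flow controller that starts one thread per loop iteration."""

    def __init__(self, element: ComputeElement):
        self.element = element

    def run_loop(self, start: int, stop: int, step: int = 1) -> list:
        results = []
        for index in range(start, stop, step):
            # One synchronous message per iteration, carrying that iteration's index.
            results.append(self.element.on_message(SyncMessage(index)))
        return results


element = ComputeElement(memory=[10, 11, 12, 13, 14, 15, 16, 17])
controller = FlowController(element)
read_values = controller.run_loop(0, 4)
```

Here the loop body never computes an address itself: the location touched in each iteration follows directly from the index delivered in the message.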

SYSTEMS, METHODS, AND APPARATUS FOR ASSOCIATING COMPUTATIONAL DEVICE FUNCTIONS WITH COMPUTE ENGINES
20230052076 · 2023-02-16

A method may include creating an association identifier based on an association between a computational device function and a compute engine of a computational device, and invoking an execute command to perform an execution of the computational device function using the compute engine, wherein the execute command uses the association identifier. The compute engine may be a first compute engine, and the association may be further between the computational device function and a second compute engine of the computational device. The execute command may perform an execution of the computational device function using the second compute engine. The execution of the computational device function using the first compute engine and the execution of the computational device function using the second compute engine may overlap. The execute command may include the association identifier. The creating the association identifier may include invoking a create association command.
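One way to picture the association identifier is as a handle returned by a create-association command and later passed to an execute command. The sketch below is an assumption-laden model, not a real device API: `ComputationalDevice`, `create_association`, `execute`, and `run_on_engine` are hypothetical names, and the engines run one after another, so the overlapping execution mentioned in the abstract is not modeled.

```python
import itertools


def run_on_engine(function, *args):
    """Hypothetical compute engine: simply calls the function it is given."""
    return function(*args)


class ComputationalDevice:
    """Hypothetical model; command names mirror the abstract, not a real API."""

    def __init__(self, engines):
        self.engines = dict(engines)      # engine name -> callable that runs a function
        self.associations = {}            # association identifier -> (function, engine names)
        self._ids = itertools.count(1)

    def create_association(self, function, engine_names):
        # Record which engines may run the function and hand back an identifier.
        assoc_id = next(self._ids)
        self.associations[assoc_id] = (function, list(engine_names))
        return assoc_id

    def execute(self, assoc_id, *args):
        # The execute command names only the identifier, not the function or engines.
        function, engine_names = self.associations[assoc_id]
        return {name: self.engines[name](function, *args) for name in engine_names}


device = ComputationalDevice({"engine0": run_on_engine, "engine1": run_on_engine})
checksum_id = device.create_association(sum, ["engine0", "engine1"])
results = device.execute(checksum_id, [1, 2, 3])
```

The caller resolves the function-to-engine binding once and reuses the identifier on every subsequent execute command.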

DISCRETE THREE-DIMENSIONAL PROCESSOR

A discrete three-dimensional (3-D) processor comprises first and second dice. The first die comprises 3-D random-access memory (3D-RAM) arrays, whereas the second die comprises logic circuits and at least an off-die peripheral-circuit component of the 3D-RAM arrays. The first die does not comprise the off-die peripheral-circuit component of the 3D-RAM arrays.

USING A VECTOR PROCESSOR TO CONFIGURE A DIRECT MEMORY ACCESS SYSTEM FOR FEATURE TRACKING OPERATIONS IN A SYSTEM ON A CHIP

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.

PRE-STAGED INSTRUCTION REGISTERS FOR VARIABLE LENGTH INSTRUCTION SET MACHINE

Methods and systems relating to improved processing architectures with pre-staged instructions are disclosed herein. A disclosed processor includes an instruction memory, at least one functional processing unit, a bus, a set of instruction registers configured to be loaded, using the bus, with a set of pre-staged instructions from the instruction memory, and a logic circuit configured to provide the set of pre-staged instructions from the set of instruction registers to the at least one functional processing unit in response to receiving an instruction from the instruction memory.
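A rough behavioral sketch of pre-staging follows. It assumes a hypothetical `ISSUE_STAGED` opcode as the instruction that triggers the logic circuit, and uses a Python list in place of the functional processing unit; neither detail comes from the abstract.

```python
class PreStagedProcessor:
    """Hypothetical behavioral model of pre-staged instruction registers."""

    def __init__(self, instruction_memory):
        self.instruction_memory = list(instruction_memory)
        self.instruction_registers = []   # holds the pre-staged set
        self.executed = []                # stands in for the functional processing unit

    def pre_stage(self, start, count):
        # Load a set of instructions from instruction memory into the
        # instruction registers (this models the transfer over the bus).
        self.instruction_registers = self.instruction_memory[start:start + count]

    def step(self, address):
        instruction = self.instruction_memory[address]
        if instruction == "ISSUE_STAGED":
            # A single fetched instruction releases the whole pre-staged set.
            self.executed.extend(self.instruction_registers)
        else:
            self.executed.append(instruction)


cpu = PreStagedProcessor(["ADD", "MUL", "SUB", "ISSUE_STAGED", "NOP"])
cpu.pre_stage(0, 2)        # stage ADD and MUL
for address in (3, 4, 3):  # trigger, ordinary instruction, trigger again
    cpu.step(address)
```

In this model the staged pair is fetched from instruction memory once but issued twice, which is the kind of reuse pre-staging is meant to enable.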

BITWISE PRODUCT-SUM ACCUMULATIONS WITH SKIP LOGIC
20230053294 · 2023-02-16

A method, device, and system perform a partial sum accumulation of a product of input vectors and weight vectors in a wordwise-input and bitwise-weight manner, resulting in a partial accumulated product sum. The partial accumulated product sum is compared with a threshold condition after each weight bit, and when the partial accumulated product sum meets the threshold condition, a skip indicator is asserted to indicate that remaining computations of a sum accumulation are skipped.
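The wordwise-input, bitwise-weight accumulation can be sketched as below. Several points are assumptions rather than details from the abstract: weights are unsigned, inputs are non-negative, weight bits are processed most significant first, and the threshold condition is "partial sum is at least `threshold`". Under those assumptions the partial sum never decreases, so once it reaches the threshold the remaining bits cannot change the comparison result.

```python
def product_sum_with_skip(inputs, weights, weight_bits, threshold):
    """Return (partial_sum, skip_asserted, weight_bits_processed).

    Hypothetical helper: accumulates sum(x * w) one weight bit at a time,
    most significant bit first, and stops early when the threshold is met.
    """
    partial = 0
    for bit in range(weight_bits - 1, -1, -1):
        # Whole input words are added for every weight whose current bit is 1.
        bit_sum = sum(x for x, w in zip(inputs, weights) if (w >> bit) & 1)
        partial += bit_sum << bit
        if partial >= threshold:
            # Skip indicator asserted: remaining weight bits are not processed.
            return partial, True, weight_bits - bit
    return partial, False, weight_bits


early = product_sum_with_skip([3, 1, 2], [5, 2, 7], weight_bits=3, threshold=18)
full = product_sum_with_skip([3, 1, 2], [5, 2, 7], weight_bits=3, threshold=100)
```

With a threshold of 18 the first weight bit already contributes 20, so the loop stops after one of three bits; with a threshold of 100 all bits are processed and the result equals the full dot product, 31.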

SIMD DATA PATH ORGANIZATION TO INCREASE PROCESSING THROUGHPUT IN A SYSTEM ON A CHIP

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.

REDUCED MEMORY WRITE REQUIREMENTS IN A SYSTEM ON A CHIP USING AUTOMATIC STORE PREDICATION

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.
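The abstract names automatic store predication without describing its mechanism. As a loose illustration only, the sketch below assumes a per-lane predicate mask (here derived from a comparison) that suppresses the write for any lane whose predicate is false; `predicated_store` and the mask derivation are hypothetical.

```python
def predicated_store(memory, base, lanes, predicate):
    """Write only the lanes whose predicate is true; return the number of writes.

    Hypothetical helper modeling a vector store with a per-lane mask.
    """
    writes = 0
    for offset, (value, keep) in enumerate(zip(lanes, predicate)):
        if keep:
            memory[base + offset] = value
            writes += 1
    return writes


memory = [0] * 8
lanes = [5, -1, 7, -3]
predicate = [value > 0 for value in lanes]   # assumed source of the mask
writes = predicated_store(memory, 2, lanes, predicate)
```

Only two of the four lanes reach memory in this example, which is the sense in which predication reduces memory writes.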

PACKING CONDITIONAL BRANCH OPERATIONS
20230052450 · 2023-02-16

Disclosed in some examples are systems, methods, devices, and machine-readable media that use improved dynamic programming algorithms to pack conditional branch instructions. Conditional code branches may be modeled as directed acyclic graphs (DAGs), which have a topological ordering. These DAGs may be used to construct a dynamic programming table to find a partial mapping of one path onto the other path using dynamic programming algorithms.
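As a simplified illustration of building a dynamic programming table over two branch paths, the sketch below treats each path as a list of opcodes already in topological order and uses a longest-common-subsequence table to find a partial mapping. This swaps in a textbook technique: matching on opcode equality and the function name `map_paths` are assumptions, and the abstract does not state the actual scoring or table layout.

```python
def map_paths(then_path, else_path):
    """Return index pairs (i, j) mapping then_path[i] onto else_path[j].

    Hypothetical helper: a longest-common-subsequence table over two
    topologically ordered opcode lists, followed by a traceback.
    """
    n, m = len(then_path), len(else_path)
    # table[i][j] = best number of matches using then_path[i:] and else_path[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if then_path[i] == else_path[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table to recover which operations are mapped onto each other.
    mapping = []
    i = j = 0
    while i < n and j < m:
        if then_path[i] == else_path[j]:
            mapping.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return mapping


mapping = map_paths(["load", "add", "store"], ["load", "mul", "add", "store"])
```

The mapping is partial: the `mul` in the second path has no counterpart, so it would need its own slot, while the three matched operations are candidates for sharing one.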

DATA INPUT/OUTPUT OPERATIONS DURING LOOP EXECUTION IN A RECONFIGURABLE COMPUTE FABRIC
20230050687 · 2023-02-16

Various examples are directed to systems and methods in which a first flow controller of a first synchronous flow may receive an instruction to execute a first loop using the first synchronous flow. The first flow controller may determine a first iteration index for a first iteration of the first loop. The first flow controller may send, to a first compute element of the first synchronous flow, a first synchronous message to initiate a first synchronous flow thread for executing the first iteration of the first loop. The first synchronous message may comprise the first iteration index. The first compute element may execute an input/output operation at a first location of a first compute element memory indicated by the first iteration index.