G06F9/30065

Universal peripheral extender for communicatively connecting peripheral I/O devices and smart host devices
11593158 · 2023-02-28

A universal peripheral extender architecture, system, and method are disclosed that address the need to communicatively connect peripheral I/O devices and smart host devices in legacy, medical, and industrial applications. As disclosed, a universal peripheral extender includes an I/O device translation & management module that has a device-side utility, a host-side I/O device translation & management utility, and a host/device translation & management scheduler utility.
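
Below is a minimal sketch of how the three utilities named in this abstract might fit together; the class and method names are hypothetical, invented only to illustrate the device-side/host-side split with a scheduler mediating between them.

```python
# Hypothetical sketch of the I/O device translation & management module:
# a device-side utility, a host-side utility, and a scheduler that shuttles
# translated traffic between them. Names are illustrative, not from the patent.

class DeviceSideUtility:
    """Speaks the legacy peripheral's native protocol."""
    def read_from_peripheral(self) -> bytes:
        return b"\x01\x02"          # stand-in for raw legacy I/O

class HostSideUtility:
    """Presents the peripheral to the smart host as a standard device."""
    def forward_to_host(self, payload: bytes) -> None:
        print(f"host receives {payload!r}")

class TranslationScheduler:
    """Decides when each side runs and translates between the two."""
    def __init__(self, device: DeviceSideUtility, host: HostSideUtility):
        self.device, self.host = device, host

    def service_once(self) -> None:
        raw = self.device.read_from_peripheral()
        translated = bytes(reversed(raw))   # placeholder "translation"
        self.host.forward_to_host(translated)

TranslationScheduler(DeviceSideUtility(), HostSideUtility()).service_once()
```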

STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
20230053842 · 2023-02-23

A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements for the nested loops. A stream head register stores the data elements next to be supplied to functional units for use as operands. A stream template specifies a loop count and a loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently, enabling a trade-off between the number of loops supported and the size of the loop counts and loop dimensions.
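
A rough software model of the format-dependent template decode described above. The format table and field widths below are invented for illustration; the point is that a single format field selects how the same template bits are carved into loop counts and loop dimensions, and how the resulting loops drive address generation.

```python
# Illustrative decode of a flexible stream template: the format field selects
# how many nested loops the template describes and how many bits each
# loop-count and loop-dimension field gets. Widths below are made up.
FORMATS = {
    0: (6, 16, 16),   # 6 loops, 16-bit counts and dimensions
    1: (4, 24, 24),   # 4 loops, wider fields
    2: (3, 32, 32),   # 3 loops, widest fields
}

def decode_template(template: int, fmt: int):
    nloops, cbits, dbits = FORMATS[fmt]
    counts, dims = [], []
    for _ in range(nloops):                  # peel fields off the low bits
        counts.append(template & ((1 << cbits) - 1))
        template >>= cbits
        dims.append(template & ((1 << dbits) - 1))
        template >>= dbits
    return counts, dims

def addresses(base: int, counts, dims):
    """Address generator: walk the nested loops, innermost varying fastest."""
    def walk(level: int, addr: int):
        if level < 0:
            yield addr
            return
        for i in range(counts[level]):
            yield from walk(level - 1, addr + i * dims[level])
    yield from walk(len(counts) - 1, base)

# e.g. a 2-level stream: inner loop of 4 elements stride 1, outer loop of 2 stride 16
print(list(addresses(0, counts=[4, 2], dims=[1, 16])))   # [0..3, 16..19]
```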

MASKING FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURE
20230058355 · 2023-02-23 ·

A reconfigurable compute fabric can include multiple nodes, and each node can include multiple tiles with respective processing and storage elements. The tiles can be arranged in an array or grid and can be communicatively coupled. In an example, a first node can include a tile cluster of N memory-compute tiles, and the N memory-compute tiles can be coupled using a first portion of a synchronous compute fabric. Operations performed by the respective processing and storage elements of the N memory-compute tiles can be selectively enabled or disabled based on information in a mask field of data propagated through the first portion of the synchronous compute fabric.
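
A toy model of the mask-field idea, assuming for simplicity that data flows through a linear pipeline of tiles: each datum carries a mask word alongside its payload, and tile k performs its operation only when bit k of the mask is set.

```python
# Toy model: data propagating through a synchronous fabric carries a mask
# field; tile k applies its operation only if bit k of the mask is set,
# otherwise it passes the payload through unchanged.
N = 4  # tiles in the cluster

def make_tiles():
    # Each tile's "processing element" is just a stand-in function here.
    return [lambda x, k=k: x + (10 ** k) for k in range(N)]

def propagate(payload: int, mask: int, tiles) -> int:
    for k, tile in enumerate(tiles):
        if (mask >> k) & 1:          # selectively enable this tile
            payload = tile(payload)
    return payload

tiles = make_tiles()
print(propagate(0, 0b1111, tiles))   # all tiles enabled -> 1111
print(propagate(0, 0b0101, tiles))   # tiles 0 and 2 only -> 101
```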

SYSTEMS AND METHODS FOR EXECUTING A PROGRAMMABLE FINITE STATE MACHINE THAT ACCELERATES FETCHLESS COMPUTATIONS AND OPERATIONS OF AN ARRAY OF PROCESSING CORES OF AN INTEGRATED CIRCUIT

Systems and methods for fetchless acceleration of convolutional loops on an integrated circuit include identifying, by a compiler, finite state machine (FSM) initialization parameters based on computational requirements of a computational loop; initializing a programmable FSM based on the FSM initialization parameters, wherein the FSM initialization parameters include a loop iteration parameter identifying a number of computation cycles of the computational loop; and executing the programmable FSM to enable fetchless computations by: generating a plurality of computational loop control signals, including a distinct computational loop control signal for each of the computation cycles of the computational loop, based on the loop iteration parameter; and controlling an execution of a plurality of computation cycles of a computational circuit performing the computational loop by transmitting the plurality of computational loop control signals until the number of computation cycles of the computational loop has been completed.
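
A sketch of the fetchless pattern under simple assumptions: the compiler-derived parameters program the FSM once, and the FSM then emits one control signal per computation cycle, so the loop body runs with no instruction fetches. All names and signal fields are hypothetical.

```python
# Sketch: a compiler derives FSM initialization parameters from the loop,
# the FSM is programmed once, and it then drives the compute circuit for
# exactly `iterations` cycles without any instruction fetches.
from dataclasses import dataclass

@dataclass
class FSMParams:
    iterations: int        # loop iteration parameter from the compiler

def programmable_fsm(params: FSMParams):
    """Yield one distinct control signal per computation cycle."""
    for cycle in range(params.iterations):
        yield {"cycle": cycle, "enable_mac": True,
               "last": cycle == params.iterations - 1}

def compute_circuit(weights, activations, params: FSMParams) -> int:
    acc = 0
    for ctrl in programmable_fsm(params):   # driven by control, not fetched code
        if ctrl["enable_mac"]:
            acc += weights[ctrl["cycle"]] * activations[ctrl["cycle"]]
    return acc

params = FSMParams(iterations=3)
print(compute_circuit([1, 2, 3], [4, 5, 6], params))   # 1*4 + 2*5 + 3*6 = 32
```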

PROCESSOR WITH TABLE LOOKUP UNIT

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop.
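
A rough software analogue of the table lookup loop; the table contents, lane width, and function names are illustrative only. Each loop iteration resolves one vector of indices against a table held in memory, while the parallel execution units would independently consume the register file.

```python
# Rough analogue of a table lookup vector instruction executed in a loop:
# each iteration resolves one vector of indices against a table in memory.
TABLE = [x * x for x in range(256)]          # a lookup table in "memory"
LANES = 8                                    # execution-unit width

def table_lookup_vector(indices):
    """One table lookup vector instruction: gather TABLE[i] per lane."""
    return [TABLE[i] for i in indices]

def table_lookup_loop(all_indices):
    """Process indices LANES at a time, as the table lookup loop would."""
    out = []
    for start in range(0, len(all_indices), LANES):
        out.extend(table_lookup_vector(all_indices[start:start + LANES]))
    return out

print(table_lookup_loop([0, 1, 2, 3, 15, 16, 31, 32, 100]))
```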

System and method for decoupling operations to accelerate processing of loop structures

An apparatus for hardware acceleration for use in operating a computational network is configured to determine that a loop structure including one or more loops is to be executed by a first processor. Each of the one or more loops includes a set of operations. The loop structure may be configured as a nested loop, a cascaded loop, or a combination of the two. A second processor may be configured to decouple overhead operations of the loop structure from its compute operations. The apparatus accelerates processing of the loop structure by processing the overhead operations on the second processor simultaneously with, and separately from, the compute operations, based on the configuration to operate the computational network.
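
One way to picture the decoupling, assuming a queue links the two processors: the second processor runs the loop bookkeeping and operand address generation, while the first performs only the arithmetic. This is a sketch of the concept, not the patented mechanism.

```python
# Sketch: the second processor runs loop overhead (index/address generation)
# concurrently with the first processor's compute operations, linked by a queue.
import threading, queue

work = queue.Queue(maxsize=16)
DONE = object()

def overhead_processor(n_outer, n_inner):
    """Nested-loop bookkeeping only: emit operand addresses, no arithmetic."""
    for i in range(n_outer):
        for j in range(n_inner):
            work.put(i * n_inner + j)      # "address" of the next operand
    work.put(DONE)

def compute_processor(data, result):
    """Pure compute: no loop-control overhead, just consume and accumulate."""
    while (item := work.get()) is not DONE:
        result[0] += data[item]

data = list(range(12))
result = [0]
t = threading.Thread(target=overhead_processor, args=(3, 4))
t.start()
compute_processor(data, result)
t.join()
print(result[0])    # sum(range(12)) == 66
```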

Offloading execution of a multi-task parameter-dependent operation to a network device
20230089099 · 2023-03-23

A processing device includes an interface and one or more processing circuits. The interface is to connect to a host processor. The one or more processing circuits are to: receive from the host processor, via the interface, a notification specifying an operation for execution by the processing device, the operation including (i) multiple tasks that are executable by the processing device and (ii) execution dependencies among the tasks; in response to the notification, determine a schedule for executing the tasks, the schedule complying with the execution dependencies; and execute the operation by executing the tasks of the operation in accordance with the schedule.
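
A minimal sketch of the scheduling step using Python's standard graphlib module (in the stdlib since 3.9): the notification supplies the tasks and their execution dependencies, and a topological sort yields batches of dependency-satisfied tasks the device could run in parallel. The task names are hypothetical.

```python
# Sketch: derive a dependency-respecting schedule for the offloaded operation.
from graphlib import TopologicalSorter

# Hypothetical operation: task -> set of tasks it depends on.
operation = {
    "gather":   set(),
    "reduce":   {"gather"},
    "compress": {"reduce"},
    "send_a":   {"compress"},
    "send_b":   {"compress"},
}

ts = TopologicalSorter(operation)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())      # tasks whose dependencies are satisfied
    for task in ready:                # these could run in parallel on the device
        print("executing", task)
        ts.done(task)
```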

METHOD AND APPARATUS FOR IMPLIED BIT HANDLING IN FLOATING POINT MULTIPLICATION
20230085048 · 2023-03-16

A method is provided that includes performing, by a processor in response to a floating point multiply instruction, multiplication of floating point numbers, wherein determination of values of implied bits of leading bit encoded mantissas of the floating point numbers is performed in parallel with multiplication of the encoded mantissas, and storing, by the processor, a result of the floating point multiply instruction in a storage location indicated by the floating point multiply instruction.
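
A bit-level illustration, assuming IEEE-754 binary32 (the abstract does not name a format): the implied leading bit of each encoded mantissa is 1 exactly when the biased exponent is nonzero, a check cheap enough to resolve while the multiply of the encoded mantissas is already underway.

```python
import struct

def f32_fields(x: float):
    """Sign, biased exponent, and stored (encoded) mantissa of a binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7F_FFFF

def significand_product(a: float, b: float) -> int:
    _, ea, fa = f32_fields(a)
    _, eb, fb = f32_fields(b)
    # Implied-bit determination is just "exponent != 0" (normal vs. subnormal),
    # so hardware can settle it in parallel with starting the array multiplier
    # on the encoded mantissas rather than serializing the two steps.
    ia = int(ea != 0)
    ib = int(eb != 0)
    return ((ia << 23) | fa) * ((ib << 23) | fb)   # 48-bit raw product

print(hex(significand_product(1.5, 2.0)))   # 1.5 -> 0xC00000, 2.0 -> 0x800000
```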

Streaming engine with flexible streaming engine template supporting differing number of nested loops with corresponding loop counts and loop offsets
11481327 · 2022-10-25

A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements for the nested loops. A stream head register stores the data elements next to be supplied to functional units for use as operands. A stream template specifies a loop count and a loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently, enabling a trade-off between the number of loops supported and the size of the loop counts and loop dimensions.

Command processor based multi dispatch scheduler
11481953 · 2022-10-25

Described herein are techniques for performing ray tracing operations. A command processor executes custom instructions for orchestrating a ray tracing pipeline. The custom instructions cause the command processor to perform a series of loop iterations, each at a particular recursion depth. In a first loop iteration, a ray generation shader is executed that triggers execution of a trace ray operation. In any other iteration, zero or more shaders are executed based on the contents of a shader queue. Any shader may trigger execution of a trace ray operation. The trace ray operation determines whether a ray specified by the shader intersects a triangle, and places shader entries into the shader queue at the current recursion depth plus 1. The command processor updates the current recursion depth based on whether a trace ray operation is executed. The loop ends when the recursion depth is less than a threshold.
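
A schematic of that dispatch loop with all shader behavior stubbed out; the queue layout, hit rule, and depth updates below are purely illustrative of the control flow the abstract describes, not the patented scheduler.

```python
# Schematic dispatch loop run by the command processor; shaders are stubs.
from collections import defaultdict

shader_queue = defaultdict(list)     # recursion depth -> pending shader entries
THRESHOLD = 1                        # loop exits once depth drops below this

def trace_ray(depth, ray):
    """Stub: intersection test; hit shaders are enqueued at depth + 1."""
    if ray < 2:                                  # pretend shallow rays hit
        shader_queue[depth + 1].append(ray + 1)  # hit shader spawns a new ray
        return True
    return False

depth, first = THRESHOLD, True
while depth >= THRESHOLD:
    traced = False
    if first:                                    # first iteration: ray generation
        traced = trace_ray(depth, ray=0)
        first = False
    else:                                        # later iterations: drain queue
        for ray in shader_queue.pop(depth, []):
            traced |= trace_ray(depth, ray)
    # Depth rises while trace ray operations keep spawning work, falls otherwise.
    depth = depth + 1 if traced else depth - 1
print("pipeline drained")
```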