G06F9/3897

METHOD AND SYSTEM FOR OPTIMIZING DATA TRANSFER FROM ONE MEMORY TO ANOTHER MEMORY
20230120354 · 2023-04-20

A method and system for moving data from a source memory to a destination memory by a processor are disclosed. The processor has a plurality of registers and the source memory stores a sequence of instructions that include one or more load instructions and one or more store instructions. The processor moves the load instructions from the source memory to the destination memory. Then, the processor initiates execution of the load instructions from the destination memory in order to load the data from the source memory to one or more registers in the processor. Execution then returns to the sequence of instructions stored in the source memory, and the processor stores the data from the registers to the destination memory.

High bandwidth memory system with dynamically programmable distribution scheme

A system comprises a processor coupled to a plurality of memory units. Each of the plurality of memory units includes a request processing unit and a plurality of memory banks. The processor includes a plurality of processing elements and a communication network communicatively connecting the plurality of processing elements to the plurality of memory units. At least a first processing element of the plurality of processing elements includes a control logic unit and a matrix compute engine. The control logic unit is configured to access data from the plurality of memory units using a dynamically programmable distribution scheme.

SYSTEM AND METHOD OF EARLY TERMINATION OF LAYER PROCESSING IN AN ARTIFICIAL NEURAL NETWORK
20230161997 · 2023-05-25

A novel and useful system and method of early termination for use in an artificial neural network (ANN). An NN processor incorporates the early termination mechanism that provides the capability of terminating a compute graph in a data flow architecture, e.g., an ANN, earlier than its predefined planned execution. This reduces power consumption and, in some cases, latency, because the remaining operations are not performed when the network is terminated early. The early termination mechanism is implemented partly in the SDK/compiler offline at compile time and partly at runtime in the NN processor. During compile time, the weights of the neural network are sorted first by output function and then by input function. In operation, the LCU receives feedback from the multiply-accumulate (MAC) units in the processing elements (PEs). If saturation in the MAC outputs is detected and crosses a threshold, the calculations performed up to that point are deemed sufficient, and additional calculations are not likely to change the results significantly. Thus, early termination for that layer can be triggered, thereby saving power and improving latency.
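
The runtime half of this idea can be illustrated with a minimal sketch. It is not the patented implementation: the function name, the saturation limit of 127, and the rule "stop once a given fraction of outputs is saturated" are all assumptions made for illustration, and the compile-time weight sorting is not modeled.

```python
def layer_with_early_termination(inputs, weights, sat_limit=127, threshold=0.75):
    """Accumulate one layer input by input and stop early on saturation.

    weights[i][j] is the (hypothetical) weight from input i to output j.
    Returns (saturating accumulators, number of inputs actually processed).
    """
    n_out = len(weights[0])
    acc = [0] * n_out
    processed = 0
    for x, row in zip(inputs, weights):
        for j, w in enumerate(row):
            # Saturating multiply-accumulate: clamp to [-sat_limit, sat_limit].
            acc[j] = max(-sat_limit, min(sat_limit, acc[j] + x * w))
        processed += 1
        # Feedback step: count how many outputs sit at the saturation limit.
        saturated = sum(1 for a in acc if abs(a) == sat_limit)
        if saturated / n_out >= threshold:
            break  # assumed trigger: enough outputs are saturated
    return acc, processed


acc, processed = layer_with_early_termination(
    [10, 10, 10, 10], [[5, 5], [5, 5], [5, 5], [5, 5]]
)
```

In this example both accumulators clamp at the limit on the third input, so the fourth input is never processed.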

Methods and systems for computing in memory

A method of computing in memory, the method including inputting a packet including data into a computing memory unit having a control unit, loading the data into at least one computing in memory micro-unit, processing the data in the computing in memory micro-unit, and outputting the processed data. Also, a computing in memory system including a computing in memory unit having a control unit, wherein the computing in memory unit is configured to receive a packet having data and a computing in memory micro-unit disposed in the computing in memory unit, the computing in memory micro-unit having at least one of a memory matrix and a logic elements matrix.

Neural processing accelerator

A system for calculating. A scratch memory is connected to a plurality of configurable processing elements by a communication fabric including a plurality of configurable nodes. The scratch memory sends out a plurality of streams of data words. Each data word is either a configuration word used to set the configuration of a node or of a processing element, or a data word carrying an operand or a result of a calculation. Each processing element performs operations according to its current configuration and returns the results to the communication fabric, which conveys them back to the scratch memory.
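
The split between configuration words and data words can be sketched as follows. This is only an illustrative model of a single processing element: the tuple encoding of words, the two-operand operations, and the names `run_processing_element`, `"config"`, and `"data"` are assumptions, not details from the abstract.

```python
import operator

# Hypothetical set of configurations a processing element can be put into.
OPERATIONS = {"add": operator.add, "mul": operator.mul}


def run_processing_element(words):
    """Consume a stream of (kind, payload) words and return the results.

    A "config" word selects the current operation; "data" words carry
    operands, and every two operands produce one result.
    """
    current_op = None
    pending = []
    results = []
    for kind, payload in words:
        if kind == "config":
            current_op = OPERATIONS[payload]
            pending.clear()  # assumed: reconfiguring discards partial operands
        elif kind == "data":
            if current_op is None:
                raise ValueError("data word received before any configuration")
            pending.append(payload)
            if len(pending) == 2:
                results.append(current_op(pending[0], pending[1]))
                pending.clear()
        else:
            raise ValueError(f"unknown word kind: {kind!r}")
    return results


stream = [("config", "add"), ("data", 2), ("data", 3),
          ("config", "mul"), ("data", 4), ("data", 5)]
results = run_processing_element(stream)
```

Here the returned list stands in for the results that the abstract describes as being conveyed back to the scratch memory.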

Out-of-order block-based processors and instruction schedulers using ready state data indexed by instruction position identifiers

Apparatus and methods are disclosed for implementing block-based processors including field programmable gate-array implementations. In one example of the disclosed technology, a block-based processor includes an instruction decoder configured to generate decoded ready dependencies for a transactional block of instructions, where each of the instructions is associated with a different instruction identifier encoded in the transactional block. The processor further includes an instruction scheduler configured to issue an instruction from a set of instructions of the transactional block of instructions. The instruction is issued based on determining that decoded ready state dependencies for an instruction are satisfied. The determining includes accessing storage with the decoded ready dependencies indexed with a respective instruction identifier that is encoded in the transactional block of instructions.
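
A rough software analogy of ready-state storage indexed by instruction identifier is shown below. It assumes each instruction's dependencies are given as a set of producer identifiers and that one instruction issues and completes per step; the function name and the lowest-identifier-first tie-break are illustrative choices, not taken from the abstract.

```python
def issue_order(block):
    """Return the order in which instructions of one block would issue.

    block maps an instruction identifier to the set of identifiers whose
    results it waits on (its decoded ready dependencies).
    """
    # Ready-state storage, indexed by instruction identifier.
    ready_state = {iid: set(deps) for iid, deps in block.items()}
    issued = []
    while len(issued) < len(block):
        candidates = [iid for iid in sorted(ready_state)
                      if iid not in issued and not ready_state[iid]]
        if not candidates:
            raise ValueError("no instruction is ready; dependencies are cyclic")
        iid = candidates[0]  # assumed tie-break: lowest identifier first
        issued.append(iid)
        # Completing the instruction satisfies that dependency everywhere.
        for deps in ready_state.values():
            deps.discard(iid)
    return issued


order = issue_order({0: set(), 1: {2}, 2: {0}, 3: {1, 2}})
```

Instruction 2 issues before instruction 1 because its only dependency is satisfied first, which is the out-of-order behavior the title refers to.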

Virtual network pre-arbitration for deadlock avoidance and enhanced performance

A device includes a data path, a first interface configured to receive a first memory access request from a first peripheral device, and a second interface configured to receive a second memory access request from a second peripheral device. The device further includes an arbiter circuit configured to, in a first clock cycle, select a pre-arbitration winner between the first memory access request and the second memory access request based on a first number of credits allocated to a first destination device and a second number of credits allocated to a second destination device. The arbiter circuit is further configured to, in a second clock cycle, select a final arbitration winner from among the pre-arbitration winner and a subsequent memory access request based on a comparison of a priority of the pre-arbitration winner and a priority of the subsequent memory access request.
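
The two-cycle selection can be sketched as two small functions. The abstract does not say how the credit counts decide the pre-arbitration winner, so the rule used here (the request whose destination has more credits wins) is an assumption, as are the `Request` type, the tie-breaks, and the convention that a larger number means higher priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    dest: str       # destination device the request targets
    priority: int   # assumed: larger value means higher priority


def pre_arbitrate(req_a, req_b, credits):
    """Cycle 1: pick between two requests using per-destination credits.

    Assumed rule: the request whose destination has more credits wins,
    with ties going to req_a.
    """
    return req_a if credits[req_a.dest] >= credits[req_b.dest] else req_b


def final_arbitrate(pre_winner, later_req):
    """Cycle 2: compare priorities; ties keep the pre-arbitration winner."""
    return later_req if later_req.priority > pre_winner.priority else pre_winner


credits = {"mem0": 1, "mem1": 4}
a = Request("a", "mem0", priority=1)
b = Request("b", "mem1", priority=1)
pre = pre_arbitrate(a, b, credits)                       # cycle 1
final = final_arbitrate(pre, Request("c", "mem0", 3))    # cycle 2
```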

MODIFICATIONS TO A STREAM PROCESSING TOPOLOGY DURING PROCESSING OF A DATA STREAM
20170351633 · 2017-12-07

A method, a computing system, and a non-transitory machine readable storage medium containing instructions for managing a stream processing topology are provided. In an example, the method includes receiving a first topology that communicatively couples a plurality of processing elements via a first arrangement of interconnections to perform an operation on a stream of data. A second topology is defined that communicatively couples the plurality of processing elements via a second arrangement of interconnections that is different from the first arrangement. The second topology assigns the plurality of processing elements a first set of operations. The second topology is provided to a stream processing manager and is modified during processing of the stream of data by assigning a second set of operations to the plurality of processing elements that is different from the first set of operations.

ADAPTIVE CREDIT-BASED REPLENISHMENT THRESHOLD USED FOR TRANSACTION ARBITRATION IN A SYSTEM THAT SUPPORTS MULTIPLE LEVELS OF CREDIT EXPENDITURE
20220374358 · 2022-11-24

A device includes an arbiter circuit configured to receive a first request for a resource. The first request is associated with a first credit cost. The arbiter circuit is further configured to receive a second request for the resource. The second request is associated with a second credit cost. The arbiter circuit is further configured to select the first request for the resource as an arbitration winner. The arbiter circuit is further configured to decrement a number of available credits associated with the resource by the first credit cost. The arbiter circuit is further configured to, in response to the number of available credits associated with the resource falling to a lower credit threshold, wait until the number of available credits associated with the resource reaches an upper credit threshold to select an additional arbitration winner for the resource.
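
The lower and upper thresholds form a hysteresis band, which a minimal sketch can make concrete. The class and method names are hypothetical, the `replenish` method stands in for whatever mechanism returns credits, and the check that a request must be affordable is an added assumption not stated in the abstract.

```python
class HysteresisArbiter:
    """Grants requests against a credit pool with lower/upper thresholds."""

    def __init__(self, credits, lower, upper):
        self.credits = credits
        self.lower = lower
        self.upper = upper
        self.blocked = False  # True while waiting for credits to recover

    def grant(self, cost):
        """Try to select a request with the given credit cost as winner."""
        if self.blocked:
            if self.credits < self.upper:
                return False  # still waiting for the upper threshold
            self.blocked = False
        if cost > self.credits:
            return False  # assumption: never spend more than is available
        self.credits -= cost
        if self.credits <= self.lower:
            self.blocked = True  # fell to the lower threshold: start waiting
        return True

    def replenish(self, amount):
        """Return credits to the pool (hypothetical replenishment path)."""
        self.credits += amount


arb = HysteresisArbiter(credits=8, lower=2, upper=6)
log = [arb.grant(2), arb.grant(4), arb.grant(1)]  # third is refused
arb.replenish(3)                                  # 5 credits: below upper
log.append(arb.grant(1))
arb.replenish(1)                                  # 6 credits: reaches upper
log.append(arb.grant(1))
```

Requests with different costs share one pool here, which mirrors the multiple levels of credit expenditure named in the title.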

HYBRID BLOCK-BASED PROCESSOR AND CUSTOM FUNCTION BLOCKS

Apparatus and methods are disclosed for implementing block-based processors having custom function blocks, including field-programmable gate array (FPGA) implementations. In some examples of the disclosed technology, a dynamically configurable scheduler is configured to issue at least one block-based processor instruction. A custom function block is configured to receive input operands for the instruction and generate ready state data indicating completion of a computation performed for the instruction by the respective custom function block.