G06F9/3802

Adaptive matrix multiplication accelerator for machine learning and deep learning applications

An adaptive matrix multiplier. In some embodiments, the matrix multiplier includes a first multiplying unit, a second multiplying unit, a memory load circuit, and an outer buffer circuit. The first multiplying unit includes a first inner buffer circuit and a second inner buffer circuit, and the second multiplying unit includes a first inner buffer circuit and a second inner buffer circuit. The memory load circuit is configured to load data from memory, in a single burst of a burst memory access mode, into the first inner buffer circuit of the first multiplying unit and into the first inner buffer circuit of the second multiplying unit.

DECOUPLED ACCESS-EXECUTE PROCESSING AND PREFETCHING CONTROL
20230120783 · 2023-04-20

Apparatuses and methods are provided, relating to the control of data processing in devices which comprise both decoupled access-execute processing circuitry and prefetch circuitry. Control of the access portion of the decoupled access-execute processing circuitry may be dependent on a performance metric of the prefetch circuitry. Alternatively or in addition, control of the prefetch circuitry may be dependent on a performance metric of the access portion.

METHOD AND SYSTEM FOR DISTRIBUTING INSTRUCTIONS IN RECONFIGURABLE PROCESSOR AND STORAGE MEDIUM
20230068463 · 2023-03-02

The disclosure provides a method for distributing instructions in a reconfigurable processor. The reconfigurable processor includes an instruction fetch module, an instruction sync control module and an instruction queue module. The method includes: configuring a format of a Memory Sync ID Table of each instruction type, obtaining a first memory identification field and a second memory identification field of each instruction, obtaining one-hot encodings of the first and second memory identification fields, obtaining a sync table, and executing each instruction of a plurality of to-be-run instructions.
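As a rough illustration of the one-hot step only, the sketch below encodes two memory identification fields as bit masks and checks them against a mask of busy memories. The names `one_hot`, `conflicts`, `busy_mask`, and `num_memories` are hypothetical, and the overlap check is an assumed use of the encodings, not the synchronization scheme the disclosure claims.

```python
def one_hot(mem_id: int, num_memories: int) -> int:
    """Return a bit mask with only bit `mem_id` set."""
    if not 0 <= mem_id < num_memories:
        raise ValueError("memory id out of range")
    return 1 << mem_id


def conflicts(first_id: int, second_id: int, busy_mask: int, num_memories: int) -> bool:
    """Assumed check: an instruction conflicts if either of its two
    memory identification fields names a memory that is currently busy."""
    mask = one_hot(first_id, num_memories) | one_hot(second_id, num_memories)
    return (mask & busy_mask) != 0


# An instruction touching memories 0 and 1 while memory 1 is busy.
print(conflicts(0, 1, 0b0010, 4))
```

One-hot masks make the overlap test a single bitwise AND, which is the usual reason hardware uses this encoding for dependency checks.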

METHOD AND SYSTEM FOR IMPLEMENTING REMAINDER INSTRUCTION OF RISC-V INSTRUCTION SET

The invention relates to the technical field of microprocessors, in particular to a method and a system for implementing the remainder instruction of the RISC-V instruction set. In an out-of-order CPU, an instruction enters the instruction decoding unit from the fetch unit for decoding; the decoded instruction is renamed in the renaming unit, where the remainder instruction is optimized at the same time. If the remainder instruction does not meet the optimization condition, the renamed instruction enters the reservation station and then the execution unit for execution; the executed instruction is committed through the reorder buffer, and the division instruction encoding cache resources allocated in the renaming phase are released. The invention implements the remainder instruction in the renaming stage by adding a remainder instruction acceleration unit.
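For reference, the architectural result that any remainder implementation must produce can be sketched as follows. This models only the signed `REM` semantics of the RISC-V M extension for an assumed 32-bit register width; it does not model the acceleration unit or the optimization condition, which the abstract does not specify. The helper names are hypothetical.

```python
XLEN = 32                      # assumed register width (RV32)
MASK = (1 << XLEN) - 1


def to_signed(x: int) -> int:
    """Interpret an XLEN-bit pattern as a two's-complement integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def rem(rs1: int, rs2: int) -> int:
    """Signed remainder as an XLEN-bit pattern, following RISC-V REM."""
    a, b = to_signed(rs1), to_signed(rs2)
    if b == 0:
        return rs1 & MASK      # division by zero: result is the dividend
    if a == -(1 << (XLEN - 1)) and b == -1:
        return 0               # signed overflow case: remainder is 0
    r = abs(a) % abs(b)        # magnitude of the remainder
    if a < 0:
        r = -r                 # sign of the result follows the dividend
    return r & MASK


print(hex(rem(-7 & MASK, 3)))
```

Python's `%` follows the sign of the divisor, so the sketch works on magnitudes and then applies the dividend's sign, as RISC-V requires.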

METHODS AND APPARATUS FOR PREDICTING INSTRUCTIONS FOR EXECUTION

Aspects of the present disclosure relate to an apparatus comprising prediction circuitry having a plurality of hierarchical prediction units to perform respective hierarchical predictions of instructions for execution, wherein predictions higher in the hierarchy have a higher expected accuracy than predictions lower in the hierarchy. Responsive to a given prediction higher in the hierarchy being different to a corresponding prediction lower in the hierarchy, the corresponding prediction lower in the hierarchy is corrected. A prediction correction metric determination unit determines a prediction correction metric indicative of an incidence of uncorrected predictions performed by the prediction circuitry. Fetch circuitry fetches instructions predicted by at least one of said plurality of hierarchical predictions, and delays said fetching based on the prediction correction metric indicating an incidence of uncorrected predictions below a threshold.
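The interaction between correction and fetch delay might be modeled as below. This is a minimal sketch under stated assumptions: there are only two hierarchy levels, the metric is the fraction of lower-level predictions that needed no correction, and the class and method names are hypothetical rather than taken from the disclosure.

```python
class CorrectionMetric:
    """Tracks how often a lower-level prediction survives uncorrected."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.total = 0
        self.uncorrected = 0

    def record(self, lower_pred: int, higher_pred: int) -> int:
        """Compare one pair of predictions and return the one to use.
        The higher-level prediction always wins, so a mismatch is a correction."""
        self.total += 1
        if lower_pred == higher_pred:
            self.uncorrected += 1
        return higher_pred

    def uncorrected_rate(self) -> float:
        # With no history, assume every prediction so far was uncorrected.
        return self.uncorrected / self.total if self.total else 1.0

    def should_delay_fetch(self) -> bool:
        """Delay fetching when the incidence of uncorrected predictions
        falls below the threshold."""
        return self.uncorrected_rate() < self.threshold


metric = CorrectionMetric(threshold=0.5)
metric.record(0x100, 0x100)   # agreement: no correction needed
metric.record(0x104, 0x200)   # mismatch: corrected to 0x200
metric.record(0x108, 0x300)   # mismatch: corrected to 0x300
print(metric.should_delay_fetch())
```

The idea the sketch captures is that when early predictions are frequently overturned, waiting for the more accurate prediction avoids fetching instructions that would be discarded.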

SPECULATIVE RESOLUTION OF LAST BRANCH-ON-COUNT AT FETCH

A computer processor includes an instruction fetch unit (IFU) and an instruction pipeline configured to dispatch a plurality of branch-to-count (BCNT) instructions. The IFU is configured to execute an instruction loop for fetching a targeted number of BCNT instructions from the instruction pipeline and to monitor a loop counter that counts the number of BCNT instructions actually fetched from the instruction pipeline in response to executing the instruction loop. The IFU resolves a final BCNT instruction included in the instruction loop in response to the number of fetched BCNT instructions reaching a target loop count value.

Instruction Cache for Hardware Multi-Thread Microprocessor
20230066662 · 2023-03-02

Embodiments are provided for an instruction cache system for a hardware multi-thread microprocessor. In some embodiments, a cache controller device includes multiple interfaces connected to a hardware multi-thread microprocessor. A first interface of the multiple interfaces can receive a fetch request from a first execution thread during a first clock cycle. A second interface of the multiple interfaces can receive a fetch request from a second execution thread during a second clock cycle after the first clock cycle. The cache controller device also includes a multiplexer to send first response signals in response to the fetch request from the first execution thread, and also to send second response signals in response to the fetch request from the second execution thread.

Cooperative Instruction Prefetch on Multicore System

Aspects of the disclosure are directed to methods, systems, and apparatuses using an instruction prefetch pipeline architecture that provides good performance without the complexity of a full cache coherent solution deployed in conventional CPUs. The architecture can include components which can be used to construct an instruction prefetch pipeline, including instruction memory (TiMem), instruction buffer (iBuf), a prefetch unit, and an instruction router.

METHOD AND SYSTEM FOR HARDWARE-ASSISTED PRE-EXECUTION
20230061576 · 2023-03-02

One aspect provides a system for hardware-assisted pre-execution. During operation, the system determines a pre-execution code region comprising one or more instructions. The system increments a global counter upon initiating the one or more instructions. The system issues a first instruction, which involves setting, in a first entry for the first instruction in a data structure, a first prefetch region identifier with a current value of the global counter. Responsive to a head pointer of the data structure reaching the first entry, the system: determines, based on a non-zero value for the first prefetch region identifier, that the first entry is not available to be allocated; and advances the head pointer to a next entry in the data structure, which renders a load associated with the first entry as a non-blocking load. The system resets the global counter upon completing the one or more instructions.
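The counter and head-pointer behavior described above can be sketched in software as follows. This is an illustrative model only: the data structure is assumed to behave like a simple in-order buffer, and the `completed` flag, the class names, and the method names are hypothetical additions not found in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    prefetch_region_id: int = 0   # non-zero marks a pre-execution entry
    completed: bool = False       # hypothetical: set when a normal entry finishes


@dataclass
class PreExecBuffer:
    global_counter: int = 0
    entries: list = field(default_factory=list)
    head: int = 0

    def begin_region(self) -> None:
        """Increment the global counter when the region's instructions start."""
        self.global_counter += 1

    def end_region(self) -> None:
        """Reset the global counter when the region's instructions complete."""
        self.global_counter = 0

    def issue(self, name: str) -> Entry:
        """Tag the new entry with the current value of the global counter."""
        entry = Entry(name, prefetch_region_id=self.global_counter)
        self.entries.append(entry)
        return entry

    def advance_head(self) -> bool:
        """Move the head past a pre-execution entry without waiting for it,
        or past a normal entry once it has completed. Returns True if moved."""
        if self.head >= len(self.entries):
            return False
        entry = self.entries[self.head]
        if entry.prefetch_region_id != 0 or entry.completed:
            self.head += 1
            return True
        return False


buf = PreExecBuffer()
buf.begin_region()
load = buf.issue("load")
print(load.prefetch_region_id, buf.advance_head())
```

Because the head never waits on an entry tagged with a non-zero region identifier, a long-latency load in the pre-execution region does not stall the entries behind it, which is the non-blocking effect the abstract describes.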

METHOD AND SYSTEM FOR OPTIMIZING DATA TRANSFER FROM ONE MEMORY TO ANOTHER MEMORY
20230111058 · 2023-04-13

A method and system for moving data from a source memory to a destination memory by a processor is disclosed herein. The destination memory stores a sequence of instructions and the sequence of instructions comprises one or more load instructions and one or more store instructions. The processor initially moves the one or more store instructions from the destination memory to the source memory. The processor then executes the one or more load instructions from the destination memory. On executing the one or more load instructions, the data is loaded from the source memory to at least one register in the processor. The processor further initiates execution of the one or more store instructions stored in the source memory. On executing the one or more store instructions from the source memory, the processor stores the data from the at least one register to the destination memory.