G06F15/7889

System and method for field programmable gate array-assisted binary translation

Binary translation may be performed by a field programmable gate array (FPGA) integrated with a processor as a single integrated circuit. The FPGA contains multiple blocks of logic for performing different binary translations. The processor may offload the binary translation to the FPGA. The FPGA may use historical logging to skip the binary translation of source instructions that have been previously translated into target instructions.
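
As a rough illustration of the historical-logging idea, the Python sketch below caches previously translated source blocks so that repeated blocks skip the FPGA translation pass. All names are illustrative, not from the patent.

    translation_cache: dict[bytes, bytes] = {}

    def fpga_translate(source_block: bytes) -> bytes:
        """Stand-in for the FPGA's binary-translation logic blocks."""
        return b"TARGET:" + source_block  # placeholder transformation

    def translate(source_block: bytes) -> bytes:
        # Historical-log hit: reuse the prior translation and skip the FPGA pass.
        if source_block in translation_cache:
            return translation_cache[source_block]
        target = fpga_translate(source_block)
        translation_cache[source_block] = target
        return target

    assert translate(b"\x89\xd8") == translate(b"\x89\xd8")  # second call hits the log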

High throughput circuit architecture for hardware acceleration

A hardware acceleration device can include a switch communicatively linked to a host central processing unit (CPU), an adapter coupled to the switch via a control bus, wherein the control bus is configured to convey addresses of descriptors from the host CPU to the adapter, and a random-access memory (RAM) coupled to the switch through a data bus. The RAM is configured to store descriptors received from the host CPU via the data bus. The hardware acceleration device can include a compute unit coupled to the adapter and configured to perform operations specified by the descriptors. The adapter may be configured to retrieve the descriptors from the RAM via the data bus, provide arguments from the descriptors to the compute unit, and provide control signals to the compute unit to initiate the operations using the arguments.
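
The descriptor flow can be sketched behaviorally. In the Python model below (assumed structure and names, not the patent's implementation), the host writes a descriptor into RAM over the data bus, passes only the descriptor's address over the control bus, and the adapter fetches the descriptor and drives the compute unit with its arguments.

    from dataclasses import dataclass

    @dataclass
    class Descriptor:
        opcode: str
        args: tuple

    class Ram:
        def __init__(self):
            self._cells: dict[int, Descriptor] = {}
        def write(self, addr: int, d: Descriptor):
            self._cells[addr] = d
        def read(self, addr: int) -> Descriptor:
            return self._cells[addr]

    class ComputeUnit:
        def start(self, opcode: str, args: tuple):
            if opcode == "add":                # illustrative operation only
                return sum(args)
            raise ValueError(opcode)

    class Adapter:
        def __init__(self, ram: Ram, cu: ComputeUnit):
            self.ram, self.cu = ram, cu
        def on_control_bus(self, addr: int):
            d = self.ram.read(addr)                 # retrieve descriptor via data bus
            return self.cu.start(d.opcode, d.args)  # control signals + arguments

    ram, cu = Ram(), ComputeUnit()
    adapter = Adapter(ram, cu)
    ram.write(0x40, Descriptor("add", (2, 3)))  # host CPU writes over the data bus
    print(adapter.on_control_bus(0x40))         # host CPU sends the address -> 5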

CUSTOM COMPUTE CORES IN INTEGRATED CIRCUIT DEVICES
20220083487 · 2022-03-17

A system includes a processor and a hardware accelerator coupled to the processor. The hardware accelerator includes data analysis elements configured to analyze a data stream based on configuration data and to output a result, and an integrated circuit device that includes a DMA engine that writes input data to and reads output data from the data analysis elements, one or more preprocessing cores that receive the input data from the DMA engine prior to the DMA engine writing the input data to the data analysis elements and perform custom preprocessing functions on the input data, and one or more post-processing cores that receive the output data from the DMA engine after the output data is read from the data analysis elements but prior to the output data being output to the processor and perform custom post-processing functions on the output data.
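
The data path lends itself to a small behavioral model. In the Python sketch below (function names assumed, the operations purely illustrative), input passes through a preprocessing core before reaching the data analysis elements, and results pass through a post-processing core before returning to the processor.

    def preprocess(data):           # custom preprocessing core (illustrative)
        return [x.lower() for x in data]

    def analyze(data):              # data analysis elements (stand-in)
        return [x == "match" for x in data]

    def postprocess(results):       # custom post-processing core (illustrative)
        return sum(results)         # e.g. collapse per-item flags to a count

    def dma_round_trip(input_data):
        staged = preprocess(input_data)   # DMA engine -> preprocessing cores
        raw = analyze(staged)             # write staged input to analysis elements
        return postprocess(raw)           # read results -> post-processing cores

    print(dma_round_trip(["Match", "miss", "MATCH"]))  # -> 2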

Reconfigurable Parallel Processing
20220100701 · 2022-03-31

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs) that each may comprise a configuration buffer; a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs; and a gasket memory coupled to the plurality of PEs and configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
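
A minimal Python sketch (assumed structure, not the patent's design) of the sequencer/gasket interplay: the sequencer loads each PE's configuration buffer, the PEs execute one configuration, and the gasket memory carries the results into the next configuration.

    class PE:
        def __init__(self):
            self.config = None             # configuration buffer
        def execute(self, operand):
            return self.config(operand)    # run the currently loaded configuration

    def run(pes, configurations, initial):
        gasket = initial                   # gasket memory contents
        for config in configurations:      # sequencer: distribute to all PEs
            for pe in pes:
                pe.config = config
            # each PE consumes a result stored during the previous configuration
            gasket = [pe.execute(v) for pe, v in zip(pes, gasket)]
        return gasket

    pes = [PE(), PE()]
    configs = [lambda x: x + 1, lambda x: x * 10]  # two successive PE configurations
    print(run(pes, configs, [1, 2]))               # -> [20, 30]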

Multiple dies hardware processors and methods

Methods and apparatuses relating to hardware processors with multiple interconnected dies are described. In one embodiment, a hardware processor includes a plurality of physically separate dies, and an interconnect to electrically couple the plurality of physically separate dies together. In another embodiment, a method to create a hardware processor includes providing a plurality of physically separate dies, and electrically coupling the plurality of physically separate dies together with an interconnect.

Packet Transmission Method and Apparatus
20220114138 · 2022-04-14

A packet transmission apparatus includes a processor such as a central processing unit (CPU), a first processing chip, and a second processing chip. The second processing chip is disposed between, and separately connected to, the processor and the first processing chip. The first processing chip is a non-programmable chip such as an application-specific integrated circuit (ASIC) chip, and the second processing chip is a programmable chip such as a field-programmable gate array (FPGA) chip. The second processing chip supports a second function, and the second function is updatable. Both the processor and the first processing chip are configured to exchange packets with the second processing chip. The second processing chip is configured to process a received packet based on the second function and send the processed packet to the processor or the first processing chip.
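
The role of the programmable middle chip can be sketched as follows (a behavioral Python model with assumed names, not the patent's implementation): every packet crosses the second chip, which applies its updatable function and forwards the result toward the other side.

    class SecondChip:                      # programmable (FPGA) chip in the middle
        def __init__(self, fn):
            self.fn = fn                   # updatable second function
        def update(self, fn):              # field update without replacing the ASIC
            self.fn = fn
        def forward(self, packet: bytes, source: str) -> tuple[str, bytes]:
            processed = self.fn(packet)    # process based on the second function
            dest = "asic" if source == "cpu" else "cpu"
            return dest, processed

    chip = SecondChip(lambda p: p + b"|v1")
    print(chip.forward(b"pkt", source="cpu"))   # -> ('asic', b'pkt|v1')
    chip.update(lambda p: p + b"|v2")           # the second function is updatable
    print(chip.forward(b"pkt", source="asic"))  # -> ('cpu', b'pkt|v2')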

Out-of-band management of FPGA bitstreams

Mechanisms for out-of-band (OOB) management of Field Programmable Gate Array (FPGA) bitstreams and associated methods, apparatus, systems and firmware. Under a first OOB mechanism, a management component, such as a baseboard management controller (BMC), is coupled to a processor including an agent in a compute node that includes an FPGA. An FPGA bitstream file is provided to the BMC, and the agent reads the file from the BMC and streams the FPGA bitstream contents in the file to the FPGA to program it. Under second and third OOB mechanisms, a pointer identifying the location of an FPGA bitstream file accessible via a network or fabric is provided to the BMC or other management entity. The BMC/management entity forwards the pointer to the BIOS running on the compute node or to an agent on the processor. The BIOS or agent then uses the pointer to retrieve the FPGA bitstream file via the network or fabric, as applicable, and streams the FPGA bitstream to the FPGA to program it.
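
The pointer-based mechanisms reduce to a short fetch-and-stream loop. The Python sketch below (assumed flow and illustrative names, including the URL) stands in for the BIOS or processor agent dereferencing the pointer over a network and streaming the bitstream to the FPGA.

    class Fpga:
        def __init__(self):
            self.image = b""
        def stream(self, chunk: bytes):    # program the FPGA chunk by chunk
            self.image += chunk

    def fetch(pointer: str) -> bytes:
        # Stand-in for retrieving the bitstream file over a network or fabric.
        fake_store = {"tftp://host/bitstreams/app.bit": b"\xaa\x99\x55\x66" * 4}
        return fake_store[pointer]

    def program_via_pointer(pointer: str, fpga: Fpga, chunk_size: int = 4):
        data = fetch(pointer)                      # BIOS/agent dereferences the pointer
        for i in range(0, len(data), chunk_size):  # stream the contents to the FPGA
            fpga.stream(data[i:i + chunk_size])

    fpga = Fpga()
    program_via_pointer("tftp://host/bitstreams/app.bit", fpga)
    assert fpga.image == b"\xaa\x99\x55\x66" * 4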

HARDWARE OFFLOAD CIRCUITRY

Examples described herein relate to offload circuitry comprising one or more compute engines that are configurable to perform a workload offloaded from a process executed by a processor based on a descriptor particular to the workload. In some examples, the offload circuitry is configurable to perform the workload, among multiple different workloads. In some examples, the multiple different workloads include one or more of: data transformation (DT) for data format conversion, Locality Sensitive Hashing (LSH) for neural network (NN) similarity search, sparse general matrix-matrix multiplication (SpGEMM) acceleration of hash-based sparse matrix multiplication, data encode, data decode, or embedding lookup.
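
Descriptor-driven selection of a workload can be sketched in a few lines of Python (assumed interface, with two illustrative engines standing in for the listed workloads): the descriptor names the workload and carries its arguments, and the offload circuitry dispatches accordingly.

    from dataclasses import dataclass

    @dataclass
    class OffloadDescriptor:
        workload: str          # e.g. "dt", "lsh", "spgemm", "embed", ...
        payload: object

    def data_transform(payload):      # DT: illustrative format conversion
        return bytes(payload)

    def embedding_lookup(payload):    # illustrative embedding lookup
        table, idx = payload
        return table[idx]

    ENGINES = {"dt": data_transform, "embed": embedding_lookup}

    def offload(desc: OffloadDescriptor):
        return ENGINES[desc.workload](desc.payload)  # descriptor selects the engine

    print(offload(OffloadDescriptor("dt", [1, 2, 3])))               # -> b'\x01\x02\x03'
    print(offload(OffloadDescriptor("embed", ([[0.1], [0.2]], 1))))  # -> [0.2]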

Three-Dimensional Stacked Programmable Logic Fabric and Processor Design Architecture

The present disclosure is directed to a 3-D stacked architecture for programmable fabrics and central processing units (CPUs). The 3-D stacked orientation enables reconfigurability of the fabric and allows the fabric to provide coarse-grained and fine-grained acceleration for offloading CPU processing. Additionally, the programmable fabric may interface with multiple other compute chiplet components in the 3-D stacked orientation, enabling multiple compute components to communicate without offloading the data communications between the compute chiplets.

Shared memory access for reconfigurable parallel processor using a plurality of memory ports each comprising an address calculation unit

Processors, systems and methods are provided for thread level parallel processing. A processor may comprise a plurality of processing elements (PEs), each having a plurality of arithmetic logic units (ALUs) configured to execute a same instruction in parallel threads, and a plurality of memory ports (MPs) for the plurality of PEs to access a memory unit. Each of the plurality of MPs may comprise an address calculation unit configured to generate respective memory addresses for each thread to access a common area in the memory unit.
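
The address calculation unit's job can be illustrated with a simple per-thread scheme (assumed here to be base + thread_id * stride; the patent does not fix a formula): each thread of the same instruction receives its own address into the common area.

    def address_calculation_unit(base: int, stride: int, num_threads: int):
        # One address per parallel thread into the shared (common) area.
        return [base + tid * stride for tid in range(num_threads)]

    memory = [0] * 32                   # shared memory unit (illustrative)

    def simd_store(base: int, values: list[int]):
        # All ALUs execute the same store instruction, one thread per value.
        addrs = address_calculation_unit(base, stride=1, num_threads=len(values))
        for addr, v in zip(addrs, values):
            memory[addr] = v

    simd_store(base=8, values=[10, 11, 12, 13])
    print(memory[8:12])                 # -> [10, 11, 12, 13]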