G06F15/7889

System having a hybrid threading processor, a hybrid threading fabric having configurable computing elements, and a hybrid interconnection network
11579887 · 2023-02-14 · ·

Representative apparatus, method, and system embodiments are disclosed for configurable computing. In a representative embodiment, a system includes an interconnection network, a processor, a host interface, and a configurable circuit cluster. The configurable circuit cluster may include a plurality of configurable circuits arranged in an array; an asynchronous packet network and a synchronous network coupled to each configurable circuit of the array; and a memory interface circuit and a dispatch interface circuit coupled to the asynchronous packet network and to the interconnection network. Each configurable circuit includes instruction or configuration memories for selection of a current data path configuration, a master synchronous network input, and a data path configuration for a next configurable circuit.

DATA PROCESSING METHOD AND APPARATUS, DISTRIBUTED DATA FLOW PROGRAMMING FRAMEWORK, AND RELATED COMPONENT
20230004433 · 2023-01-05 ·

A data processing method, a data processing apparatus, a distributed data flow programming framework, an electronic device, and a storage medium. The data processing method includes: dividing a data processing task into a plurality of data processing subtasks (S101); determining, in a Field Programmable Gate Array (FPGA) accelerator side, a target FPGA acceleration board corresponding to each of the data processing subtasks (S102); and sending data to be computed to the target FPGA acceleration board, and executing the corresponding data processing subtask by use of each of the target FPGA acceleration boards to obtain a data processing result (S103). According to the method, a physical limitation of host interfaces on the number of FPGA acceleration boards in an FPGA accelerator side may be avoided, thereby improving the data processing efficiency.

RECONFIGURABLE SIMD ENGINE
20230214351 · 2023-07-06 · ·

An exemplary SIMD computing system comprises a SIMD processing element (SPE) configured to perform a selected operation on a portion of a processor input data word, with the operation selected by control signals read from a control memory location addressed by a decoded instruction. The SPE may comprise one or more adder, multiplier, or multiplexer coupled to the control signals. The control signals may comprise one or more bit read from the control memory. The control memory may be an MxN (M rows by N columns) memory having M possible SIMD operations and N control signals. Each instruction decoded may select an SPE operation from among N rows. A plurality of SPEs may receive the same control signals. The control memory may be rewritable, advantageously permitting customizable SIMD operations that are reconfigurable by storing in the control memory locations control signals designed to cause the SPE to perform selected operations.

Multiple dies hardware processors and methods

Methods and apparatuses relating to hardware processors with multiple interconnected dies are described. In one embodiment, a hardware processor includes a plurality of physically separate dies, and an interconnect to electrically couple the plurality of physically separate dies together. In another embodiment, a method to create a hardware processor includes providing a plurality of physically separate dies, and electrically coupling the plurality of physically separate dies together with an interconnect.

Machine learning model updates to ML accelerators
11586578 · 2023-02-21 · ·

Examples herein describe a peripheral I/O device with a hybrid gateway that permits the device to have both I/O and coherent domains. As a result, the compute resources in the coherent domain of the peripheral I/O device can communicate with the host in a similar manner as CPU-to-CPU communication in the host. The dual domains in the peripheral I/O device can be leveraged for machine learning (ML) applications. While an I/O device can be used as an ML accelerator, these accelerators previously only used an I/O domain. In the embodiments herein, compute resources can be split between the I/O domain and the coherent domain where a ML engine is in the I/O domain and a ML model is in the coherent domain. An advantage of doing so is that the ML model can be coherently updated using a reference ML model stored in the host.

METHOD FOR DATA SYNCHRONIZATION BETWEEN HOST SIDE AND FPGA ACCELERATOR
20230098879 · 2023-03-30 ·

Disclosed are a method for data synchronization between a host side and a Field Programmable Gate Array (FPGA) accelerator, a Bidirectional Memory Synchronize Engine (DMSE), a FPGA accelerator, and a data synchronization system. The method includes: in response to detection of data migration from a host side to a preset memory space, generating second state information according to first state information in a first address space, and writing the second state information to a second address space (S201); and in response to detection of the second state information in the second address space, calling Direct Memory Access (DMA) to migrate data in the preset memory space to a memory space of a FPGA accelerator, and copying the second state information to the first address space, so as to implement synchronization (S202).

Dynamic load balancing and configuration management for heterogeneous compute accelerators in a data center

An example method of managing a plurality of hardware accelerators in a computing system includes executing workload management software in the computing system configured to allocate a plurality of jobs in a job queue among a pool of resources in the computer system; monitoring the job queue to determine required hardware functionalities for the plurality of jobs; provisioning at least one hardware accelerator of the plurality of hardware accelerators to provide the required hardware functionalities; configuring a programmable device of each provisioned hardware accelerator to implement at least one of the required hardware functionalities; and notifying the workload management software that each provisioned hardware accelerator is an available resource in the pool of resources.

DIGITAL PRE-DISTORTION (DPD) ADAPTATION USING A HYBRID HARDWARE ACCELERATOR AND PROGRAMMABLE ARRAY ARCHITECTURE
20230205727 · 2023-06-29 ·

Techniques are disclosed for the use of a hybrid architecture that combines a programmable processing array and a hardware accelerator. The hybrid architecture dedicates the most computationally intensive blocks to the hardware accelerator, which may be implemented for the computation of pre-distortion (DPD) coefficients while maintaining flexibility for additional DPD computations to be performed by the programmable processing array. An interface is also described for coupling the processing array to the hardware accelerator, which achieves a division of functionality and connects the programmable processing array components to the hardware accelerator components without sacrificing flexibility. This results in a balance between power/area and flexibility.

MACHINE LEARNING MODEL UPDATES TO ML ACCELERATORS
20230195684 · 2023-06-22 ·

Examples herein describe a peripheral I/O device with a hybrid gateway that permits the device to have both I/O and coherent domains. As a result, the compute resources in the coherent domain of the peripheral I/O device can communicate with the host in a similar manner as CPU-to-CPU communication in the host. The dual domains in the peripheral I/O device can be leveraged for machine learning (ML) applications. While an I/O device can be used as an ML accelerator, these accelerators previously only used an I/O domain. In the embodiments herein, compute resources can be split between the I/O domain and the coherent domain where a ML engine is in the I/O domain and a ML model is in the coherent domain. An advantage of doing so is that the ML model can be coherently updated using a reference ML model stored in the host.

TECHNOLOGIES FOR HARDWARE MICROSERVICES ACCELERATED IN XPU

Methods, apparatus, and software and for hardware microservices accelerated in other processing units (XPUs). The apparatus may be a platform including a System on Chip (SOC) and an XPU, such as a Field Programmable Gate Array (FPGA). The FPGA is configured to implement one or more Hardware (HW) accelerator functions associated with HW microservices. Execution of microservices is split between a software front-end that executes on the SOC and a hardware backend comprising the HW accelerator functions. The software front-end offloads a portion of a microservice and/or associated workload to the HW microservice backend implemented by the accelerator functions. An XPU or FPGA proxy is used to provide the microservice front-ends with shared access to HW accelerator functions, and schedules/multiplexes access to the HW accelerator functions using, e.g., telemetry data generated by the microservice front-ends and/or the HW accelerator functions. The platform may be an infrastructure processing unit (IPU) configured to accelerate infrastructure operations.