Patent classifications
G06F15/7889
Framework to provide time bound execution of co-processor commands
When a main processor issues a command to a co-processor, a timeout value is included in the command. As the co-processor attempts to execute the command, it is determined whether the attempt is taking longer than the timeout value permits. If the timeout is exceeded, responsive action is taken, such as generation of a command-timeout failure message. The timeout condition for the command may be detected either by the co-processor that receives the command or by a watchdog timer that is separate from the co-processor. The framework also covers detection of hung co-processor conditions while a co-processor is executing a command for the main processor.
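A minimal C sketch of the described flow; the `coproc_cmd` structure, the polling loop, and the watchdog-style check are names invented here for illustration, not taken from the patent:

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical command descriptor: the main processor embeds a
 * timeout value directly in the command it issues. */
struct coproc_cmd {
    unsigned opcode;
    double   timeout_sec;   /* time budget granted by the main processor */
};

/* Watchdog-style check: either the co-processor itself or a separate
 * watchdog timer compares elapsed time against the command's budget. */
static bool cmd_timed_out(const struct coproc_cmd *cmd, clock_t start)
{
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    return elapsed > cmd->timeout_sec;
}

int main(void)
{
    struct coproc_cmd cmd = { .opcode = 0x2A, .timeout_sec = 0.5 };
    clock_t start = clock();
    bool done = false;

    while (!done) {
        /* ... co-processor works on the command here ... */
        if (cmd_timed_out(&cmd, start)) {
            /* Responsive action: emit a command-timeout failure message. */
            fprintf(stderr, "command 0x%X: timeout failure\n", cmd.opcode);
            return 1;
        }
        done = true; /* pretend the command completed in one step */
    }
    puts("command completed within its time budget");
    return 0;
}
```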
Reconfigurable Parallel Processing
Processors, systems, and methods are provided for thread-level parallel processing. A processor may comprise: a plurality of processing elements (PEs), each of which may comprise a configuration buffer; a sequencer coupled to the configuration buffer of each of the plurality of PEs and configured to distribute one or more PE configurations to the plurality of PEs; and a gasket memory coupled to the plurality of PEs and configured to store at least one PE execution result to be used by at least one of the plurality of PEs during a next PE configuration.
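A rough C model of how the named structures could relate; the types and the sequencer/gasket-memory round trip are illustrative guesses, not the patent's actual design:

```c
#include <stdio.h>

#define NUM_PES 4

/* Each processing element (PE) holds its own configuration buffer. */
struct pe {
    int config_buffer;   /* current PE configuration (simplified to an int) */
    int result;          /* result of executing that configuration */
};

/* Gasket memory: buffers PE results so a later configuration can reuse them. */
static int gasket_memory[NUM_PES];

/* Sequencer: pushes a configuration into every PE's configuration buffer. */
static void sequencer_distribute(struct pe pes[], int config)
{
    for (int i = 0; i < NUM_PES; i++)
        pes[i].config_buffer = config;
}

int main(void)
{
    struct pe pes[NUM_PES] = {0};

    /* First configuration: each PE computes a result and the gasket
     * memory captures it. */
    sequencer_distribute(pes, 1);
    for (int i = 0; i < NUM_PES; i++) {
        pes[i].result = pes[i].config_buffer * (i + 1);
        gasket_memory[i] = pes[i].result;
    }

    /* Next configuration: PEs consume the buffered results. */
    sequencer_distribute(pes, 2);
    for (int i = 0; i < NUM_PES; i++)
        printf("PE%d reuses buffered result %d under config %d\n",
               i, gasket_memory[i], pes[i].config_buffer);
    return 0;
}
```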
Custom compute cores in integrated circuit devices
A system includes a processor and a hardware accelerator coupled to the processor. The hardware accelerator includes data analysis elements configured to analyze a data stream based on configuration data and to output a result, and an integrated circuit device that includes: a DMA engine that writes input data to, and reads output data from, the data analysis elements; one or more preprocessing cores that receive the input data from the DMA engine, before the DMA engine writes it to the data analysis elements, and perform custom preprocessing functions on it; and one or more post-processing cores that receive the output data from the DMA engine, after it is read from the data analysis elements but before it is output to the processor, and perform custom post-processing functions on it.
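The data path reads naturally as a pipeline; the function names below (`preprocess`, `analyze`, `postprocess`) are placeholders for the hardware stages, not API from the patent:

```c
#include <stdio.h>

/* Custom preprocessing core: runs on input data before the DMA engine
 * writes it to the data analysis elements. */
static int preprocess(int x)  { return x * 2; }

/* Data analysis element: analyzes the stream and outputs a result. */
static int analyze(int x)     { return x + 1; }

/* Custom post-processing core: runs on output data after the DMA engine
 * reads it back, but before it reaches the host processor. */
static int postprocess(int x) { return x - 3; }

int main(void)
{
    int input[] = { 1, 2, 3 };

    for (int i = 0; i < 3; i++) {
        int staged = preprocess(input[i]);   /* pre-core, on the way in   */
        int result = analyze(staged);        /* DMA write -> analysis     */
        int out    = postprocess(result);    /* DMA read  -> post-core    */
        printf("host receives %d\n", out);   /* finally output to the CPU */
    }
    return 0;
}
```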
Reconfigurable System-On-Chip
A system-on-chip comprises: a first sub-circuit having a defined interface and a defined fixed-hardware functionality; a second, reconfigurable sub-circuit signal-connected via the interface to the first sub-circuit; and one or more terminals. The second sub-circuit is configured as an interface circuit between the terminals and the first sub-circuit. The first sub-circuit and the second sub-circuit are split into a plurality of individual first and second circuit blocks. At least one of the first circuit blocks is signal-connected, via signal connections each running through one or more of the second circuit blocks, to one or more other first circuit blocks or to one or more of the terminals. One or more of these signal connections are reconfigurable by the one or more second circuit blocks pertaining to the respective connection. The system-on-chip is reconfigurable, before or during its operation, by reconfiguring at least one of the second circuit blocks.
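One way to picture the reconfigurable signal connections is as a routing table held by the second (interface) circuit blocks; everything below, including the `route[]` table, is an illustrative abstraction rather than the patent's circuitry:

```c
#include <stdio.h>

#define NUM_ENDPOINTS 4   /* first (fixed-hardware) blocks and terminals */

/* Second circuit blocks as a reconfigurable crossbar: route[i] names the
 * endpoint that endpoint i is currently signal-connected to. */
static int route[NUM_ENDPOINTS] = { 1, 0, 3, 2 };

/* Reconfiguration, possible before or during operation: rewrite an entry. */
static void reconfigure(int from, int to) { route[from] = to; }

static void send_signal(int from, int value)
{
    /* The signal runs through a second circuit block, which forwards it
     * to whichever first block or terminal the current route selects. */
    printf("block %d -> block %d: %d\n", from, route[from], value);
}

int main(void)
{
    send_signal(0, 42);   /* initial wiring: block 0 -> block 1 */
    reconfigure(0, 3);    /* reroute the connection at run time */
    send_signal(0, 42);   /* same source, new destination       */
    return 0;
}
```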
Packet transmission method and apparatus
A packet transmission apparatus includes a processor such as a central processing unit (CPU), a first processing chip, and a second processing chip. The second processing chip is disposed between, and separately connected to, the processor and the first processing chip. The first processing chip is a non-programmable chip such as an application-specific integrated circuit (ASIC), and the second processing chip is a programmable chip such as a field-programmable gate array (FPGA). The second processing chip supports a second function, and the second function is updatable. Both the processor and the first processing chip are configured to exchange packets with the second processing chip. The second processing chip is configured to process a received packet based on the second function and send the processed packet to the processor or the first processing chip.
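A sketch of the packet path, with the second chip's updatable function modeled as a swappable function pointer; all names here are hypothetical stand-ins for the hardware:

```c
#include <stdio.h>

/* The second (FPGA) chip's function is updatable: model it as a function
 * pointer that can be replaced without touching the CPU or the ASIC. */
static int second_function_v1(int pkt) { return pkt + 1; }
static int second_function_v2(int pkt) { return pkt * 2; }

static int (*second_function)(int) = second_function_v1;

/* The FPGA sits between the CPU and the ASIC: every packet it receives
 * is processed with the current second function, then forwarded onward. */
static void forward(int pkt, const char *dest)
{
    printf("forwarding packet %d to %s\n", second_function(pkt), dest);
}

int main(void)
{
    forward(10, "processor");                 /* ASIC -> FPGA -> CPU      */
    second_function = second_function_v2;     /* field update of the FPGA */
    forward(10, "first chip (ASIC)");         /* CPU  -> FPGA -> ASIC     */
    return 0;
}
```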
High throughput circuit architecture for hardware acceleration
A hardware acceleration device can include a switch communicatively linked to a host central processing unit (CPU), an adapter coupled to the switch via a control bus, wherein the control bus is configured to convey addresses of descriptors from the host CPU to the adapter, and a random-access memory (RAM) coupled to the switch through a data bus. The RAM is configured to store descriptors received from the host CPU via the data bus. The hardware acceleration device can include a compute unit coupled to the adapter and configured to perform operations specified by the descriptors. The adapter may be configured to retrieve the descriptors from the RAM via the data bus, provide arguments from the descriptors to the compute unit, and provide control signals to the compute unit to initiate the operations using the arguments.
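The control-bus/data-bus split can be sketched as follows: the host writes a descriptor into RAM over the data bus, sends only its address over the control bus, and the adapter fetches the descriptor and drives the compute unit. All structures below are invented for illustration:

```c
#include <stdio.h>

/* Descriptor: stored in RAM by the host over the data bus; only its
 * address travels over the narrower control bus. */
struct descriptor {
    int opcode;
    int args[2];
};

static struct descriptor ram[16];   /* RAM behind the switch */

/* Compute unit: performs the operation a descriptor specifies. */
static int compute_unit(int opcode, const int args[2])
{
    return opcode == 0 ? args[0] + args[1] : args[0] * args[1];
}

/* Adapter: given a descriptor address from the control bus, fetches the
 * descriptor over the data bus, supplies arguments, and starts the op. */
static void adapter_run(int addr)
{
    struct descriptor *d = &ram[addr];
    printf("result = %d\n", compute_unit(d->opcode, d->args));
}

int main(void)
{
    ram[3] = (struct descriptor){ .opcode = 1, .args = { 6, 7 } }; /* data bus */
    adapter_run(3);                             /* control bus: address only */
    return 0;
}
```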
Systems, apparatus, methods, and architectures for a neural network workflow to generate a hardware accelerator
Methods, systems, apparatus, and circuits are provided for dynamically optimizing a circuit for the forward and backward propagation phases of neural network training, given a fixed resource budget. The circuits comprise: (1) a specialized circuit that can operate on a plurality of multi-dimensional inputs and weights for the forward propagation phase of neural networks; and (2) a specialized circuit that can operate on either gradients and inputs, or gradients and weights, for the backward propagation phase of neural networks. The method comprises: (1) an analysis step to obtain the number of operations and the precision of operations in the forward and backward propagation phases of the neural network; (2) a sampling step to obtain the number of zero-valued activations and gradients during execution of the neural network; (3) a scheduling and estimation step to obtain the runtime for the forward and backward phases of neural network execution using the specialized circuits; and (4) a builder step to apply the optimal breakdown of the resource budget between the forward and backward phases, improving the execution of neural network training for future iterations.
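The four steps reduce to a small search: estimate forward and backward runtimes for each candidate split of the fixed budget and keep the best. The cost model below is a toy stand-in for the patent's scheduling-and-estimation step, with made-up operation counts:

```c
#include <stdio.h>

#define BUDGET 16   /* fixed resource budget (e.g., compute units) */

/* Toy estimator: runtime of a phase ~ operation count divided by the
 * resources assigned to it (the real step would also fold in precision
 * and the sampled zero-valued activations/gradients). */
static double phase_runtime(double ops, int resources)
{
    return ops / resources;
}

int main(void)
{
    double fwd_ops = 100.0, bwd_ops = 220.0;   /* from the analysis step */
    double best = 1e18;
    int best_fwd = 1;

    /* Builder step: try every breakdown of the budget between phases. */
    for (int fwd = 1; fwd < BUDGET; fwd++) {
        double t = phase_runtime(fwd_ops, fwd)
                 + phase_runtime(bwd_ops, BUDGET - fwd);
        if (t < best) { best = t; best_fwd = fwd; }
    }
    printf("best split: %d forward / %d backward units, est. runtime %.2f\n",
           best_fwd, BUDGET - best_fwd, best);
    return 0;
}
```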
Method for data synchronization between host side and FPGA accelerator
Disclosed are a method for data synchronization between a host side and a Field Programmable Gate Array (FPGA) accelerator, a Bidirectional Memory Synchronize Engine (DMSE), an FPGA accelerator, and a data synchronization system. The method includes: in response to detection of data migration from the host side to a preset memory space, generating second state information according to first state information in a first address space and writing the second state information to a second address space (S201); and, in response to detection of the second state information in the second address space, calling Direct Memory Access (DMA) to migrate the data in the preset memory space to a memory space of the FPGA accelerator and copying the second state information to the first address space, so as to implement synchronization (S202).
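A host-side sketch of the handshake: first state information in one address space, derived second state information written to another, and a DMA kick once the second state is observed. The flag encoding and function names are assumptions for illustration only:

```c
#include <stdio.h>

/* Two small "address spaces" holding state information (steps S201/S202
 * in the abstract); modeled here as plain variables. */
static int first_state  = 0;   /* first address space  */
static int second_state = 0;   /* second address space */

static void dma_migrate(void) { puts("DMA: host memory -> FPGA memory"); }

int main(void)
{
    first_state = 1;   /* data has landed in the preset memory space */

    /* S201: data migration detected; derive second state from first
     * state and write it into the second address space. */
    if (first_state == 1)
        second_state = first_state + 1;

    /* S202: second state detected; call DMA to move the data to the FPGA
     * accelerator's memory, then copy the state back to stay in sync. */
    if (second_state != 0) {
        dma_migrate();
        first_state = second_state;   /* both sides now agree on the state */
    }
    printf("synchronized state = %d\n", first_state);
    return 0;
}
```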
Core-to-core start “offload” instruction(s)
Embodiments involving core-to-core offload are detailed herein. For example, a first processor core is described, comprising: performance monitoring circuitry to monitor performance of the core; an offload phase tracker to maintain status information about at least the availability of a second core to act as a helper core for the first core; decode circuitry to decode an instruction having fields for at least an opcode indicating that a task offload operation is to be started; and execution circuitry to execute the decoded instruction to cause transmission of an offload start request to at least the second core, the offload start request including one or more of: an identifier of the first core, a location where the second core can find the task to perform, an identifier of the second core, an instruction pointer into the code that the task is a proper subset of, a requesting core state, and a requesting core state location.
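The offload start request's payload can be written down as a plain struct mirroring the fields the abstract enumerates; the struct and function names are invented here, not the patent's:

```c
#include <stdint.h>
#include <stdio.h>

/* Payload of the offload start request, following the abstract's list. */
struct offload_start_request {
    uint32_t  requester_id;    /* identifier of the first core             */
    uintptr_t task_location;   /* where the helper core finds the task     */
    uint32_t  helper_id;       /* identifier of the second (helper) core   */
    uintptr_t instr_pointer;   /* IP into the code the task is a subset of */
    uintptr_t state_location;  /* where the requesting core's state lives  */
};

/* Stand-in for the execution circuitry's transmission to the helper core. */
static void send_offload_start(const struct offload_start_request *req)
{
    printf("core %u -> helper core %u: run task at %#lx\n",
           req->requester_id, req->helper_id,
           (unsigned long)req->task_location);
}

int main(void)
{
    struct offload_start_request req = {
        .requester_id   = 0,
        .helper_id      = 1,
        .task_location  = 0x1000,
        .instr_pointer  = 0x2000,
        .state_location = 0x3000,
    };
    /* Sent only if the offload phase tracker reports core 1 as available. */
    send_offload_start(&req);
    return 0;
}
```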