G06F30/331

Synchronized clock signals for circuit emulators

A system includes a first cross-point switch receiving a first plurality of clock inputs and outputting a first plurality of clock outputs, a first plurality of buffering devices receiving the first plurality of clock outputs and outputting a first plurality of buffered clock signals synchronized with each other, and a first plurality of connectors receiving the first plurality of buffered clock signals and outputting a plurality of blade signals to a plurality of blades. Each blade includes a plurality of programmable logic devices, an operation of which is synchronized based on the first plurality of clock inputs. Each blade includes a second cross-point switch to receive a blade signal of the plurality of blade signals. The second cross-point switch outputs a second plurality of clock outputs based on the received blade signal, and the second plurality of clock outputs are provided to the programmable logic devices.

PARALLEL AND SCALABLE COMPUTATION OF STRONGLY CONNECTED COMPONENTS IN A CIRCUIT DESIGN

A system identifies strongly connected components (SCCs) of a circuit design. The system receives a circuit design represented as a graph including a set of vertices and a set of edges. The system marks each vertex of the set of vertices as void. The system executes multiple threads, where each thread performs the following steps concurrently. The thread selects a vertex with void state from the set of vertices. The thread performs a depth first search starting from the selected vertex. The thread marks a vertex as processed once the depth first search started from that vertex is completed. The depth first search skips vertices marked as processed. The thread determines a candidate SCC based on the vertices traversed by the depth first search. Once a set of candidate SCCs is determined, the system eliminates some of the candidate SCCs and stores the remaining candidate SCCs as SCCs of the graph.
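
For orientation, the following is a minimal single-threaded sketch of SCC computation using Tarjan's classic algorithm. It is not the parallel method of the abstract, whose candidate-elimination step is not detailed here. It only borrows the "void" and "processed" vertex states to show how a depth first search can skip finished vertices. The function name, the extra "active" state, and the adjacency-dict input are assumptions made for illustration, and every vertex is assumed to appear as a key of the dict.

```python
from typing import Dict, Hashable, Iterable, List, Set

VOID, ACTIVE, PROCESSED = "void", "active", "processed"  # "active" is an added helper state


def strongly_connected_components(
    graph: Dict[Hashable, Iterable[Hashable]]
) -> List[Set[Hashable]]:
    """Sequential Tarjan SCC over an adjacency dict (vertex -> successors)."""
    state = {v: VOID for v in graph}   # every vertex starts as void
    index: Dict[Hashable, int] = {}    # DFS discovery order
    low: Dict[Hashable, int] = {}      # lowest discovery index reachable
    stack: List[Hashable] = []         # vertices not yet assigned to an SCC
    on_stack: Set[Hashable] = set()
    sccs: List[Set[Hashable]] = []
    counter = 0

    for root in graph:
        if state[root] != VOID:
            continue                   # only start a search from a void vertex
        index[root] = low[root] = counter
        counter += 1
        state[root] = ACTIVE
        stack.append(root)
        on_stack.add(root)
        work = [(root, iter(graph[root]))]  # explicit DFS stack, avoids recursion limits
        while work:
            v, successors = work[-1]
            descended = False
            for w in successors:
                if state[w] == VOID:
                    index[w] = low[w] = counter
                    counter += 1
                    state[w] = ACTIVE
                    stack.append(w)
                    on_stack.add(w)
                    work.append((w, iter(graph[w])))
                    descended = True
                    break
                if w in on_stack:
                    low[v] = min(low[v], index[w])
                # processed vertices that already belong to an SCC are skipped
            if descended:
                continue
            work.pop()
            state[v] = PROCESSED       # the search started from v is complete
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[v])
            if low[v] == index[v]:     # v is the root of a component
                component: Set[Hashable] = set()
                while True:
                    w = stack.pop()
                    on_stack.discard(w)
                    component.add(w)
                    if w == v:
                        break
                sccs.append(component)
    return sccs


# Hypothetical netlist graph: a three-vertex loop feeding a two-vertex loop.
example = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"], "f": []}
components = strongly_connected_components(example)
```

A parallel variant along the lines of the abstract would run many such searches at once on different void vertices and then reconcile overlapping candidates; that reconciliation is deliberately left out of this sketch.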

Heterogeneous-computing based emulator

In an approach, a processor receives an input indicative of a set of registers, the set of registers being configured for obtaining output data from a design-under-test (DUT) in a field-programmable gate array (FPGA) module. A processor executes a set of instructions for monitoring the output data in the set of registers. A processor generates data indicative of at least one portion of changes of the output data in the set of registers during the execution of the set of instructions. A processor causes a separate machine to analyze the data by using an interface to send the data to the separate machine.
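
The monitoring flow above can be illustrated with a small sketch, under stated assumptions. The abstract does not specify how registers are read, how changes are encoded, or what interface reaches the separate machine, so `record_changes`, `encode_for_analysis`, the `read_register` callback, and the JSON payload are all hypothetical stand-ins. A real system would read hardware registers and send the payload over an actual link.

```python
import json
from typing import Callable, Dict, List, Sequence


def record_changes(
    registers: Sequence[str],
    read_register: Callable[[str, int], int],
    steps: int,
) -> List[Dict[str, object]]:
    """Sample each monitored register once per step and keep only value changes."""
    previous: Dict[str, int] = {}
    changes: List[Dict[str, object]] = []
    for step in range(steps):
        for name in registers:
            value = read_register(name, step)
            if previous.get(name) != value:  # the first sample always counts as a change
                changes.append({"step": step, "register": name, "value": value})
                previous[name] = value
    return changes


def encode_for_analysis(changes: List[Dict[str, object]]) -> bytes:
    """Serialize the change log; a stand-in for the interface to the separate machine."""
    return json.dumps(changes).encode("utf-8")


# Hypothetical DUT output trace used in place of real FPGA registers.
trace = {"status": [0, 0, 1], "count": [5, 6, 6]}
log = record_changes(["status", "count"], lambda name, step: trace[name][step], steps=3)
payload = encode_for_analysis(log)
```

Recording only changes rather than every sample keeps the payload small when registers are mostly idle, which is one plausible reason to offload analysis of the change log to another machine.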

Systems and methods for intercycle gap refresh and backpressure management

A system may include a synchronization device and an emulation chip including a processor and a memory. The processor may evaluate, during a first cycle, at least one of a set of one or more execution instructions in the memory or evaluation primitives configured to emulate a circuit, and evaluate, during a second cycle, at least one of the set of one or more execution instructions or a set of configured logic primitives. The synchronization device may interpose a gap period between the first cycle and the second cycle such that, during the gap period, the processor does not evaluate instructions from the set of one or more execution instructions or re-evaluate primitives. The synchronization device may cause, during the gap period, the emulation chip to perform refreshes on the memory of the emulation chip.
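
The ordering constraint described above (evaluation only inside cycles, memory refreshes only inside the gaps between them) can be sketched as a simple event schedule. This is an illustrative model, not the patented mechanism: the function name `build_schedule`, the fixed number of refreshes per gap, and the event tuples are assumptions.

```python
from typing import List, Tuple


def build_schedule(num_cycles: int, refreshes_per_gap: int) -> List[Tuple[str, int]]:
    """Interleave evaluation cycles with gap periods used only for memory refresh."""
    events: List[Tuple[str, int]] = []
    for cycle in range(num_cycles):
        events.append(("evaluate", cycle))
        if cycle < num_cycles - 1:
            # Gap period after this cycle: no instruction evaluation, refresh only.
            events.extend(("refresh", cycle) for _ in range(refreshes_per_gap))
    return events


schedule = build_schedule(num_cycles=3, refreshes_per_gap=2)
```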

EXTENDED INTER-KERNEL COMMUNICATION PROTOCOL FOR THE REGISTER SPACE ACCESS OF THE ENTIRE FPGA POOL IN NON-STAR MODE
20220382944 · 2022-12-01

Methods and apparatus for an extended inter-kernel communication protocol for discovery of accelerator pools configured in a non-star mode. Under a discovery algorithm, discovery requests are sent from a root node to non-root nodes in the accelerator pool using an inter-kernel communication protocol comprising a data transmission protocol built over a Media Access Control (MAC) layer and transported over links coupled between IO ports on accelerators. The discovery requests are used to discover each of the nodes in the accelerator pool and determine the topology of the nodes. During this process, MAC address table entries are generated at the various nodes, comprising (key, value) pairs of MAC IO port addresses that identify the destination nodes each node can reach and the shortest path to reach those destination nodes. The discovery algorithm may also be used to discover storage-related information for the accelerators. The accelerators may comprise FPGAs or other processing units, such as GPUs and Vector Processing Units (VPUs).
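
As a rough illustration of the resulting tables, the sketch below computes, for each node of a small non-star pool, which local IO port leads toward every other node along a shortest path. It is a centralized breadth-first search, not the message-based discovery protocol itself. The `links` layout, the node names, and the `(egress port, hop count)` table values are assumptions chosen for the example.

```python
from collections import deque
from typing import Dict, Tuple

# links[node][port] -> neighbor reachable through that IO port (hypothetical layout)
Links = Dict[str, Dict[int, str]]


def build_tables(links: Links) -> Dict[str, Dict[str, Tuple[int, int]]]:
    """For each node, map destination -> (egress port, hop count) on a shortest path."""
    tables: Dict[str, Dict[str, Tuple[int, int]]] = {}
    for source in links:
        table: Dict[str, Tuple[int, int]] = {}
        visited = {source}
        queue = deque()
        # Direct neighbors are reached through their own port in one hop.
        for port, neighbor in sorted(links[source].items()):
            if neighbor not in visited:
                visited.add(neighbor)
                table[neighbor] = (port, 1)
                queue.append(neighbor)
        # Farther nodes inherit the first-hop port of the node that discovered them.
        while queue:
            node = queue.popleft()
            first_port, hops = table[node]
            for _, neighbor in sorted(links[node].items()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    table[neighbor] = (first_port, hops + 1)
                    queue.append(neighbor)
        tables[source] = table
    return tables


# A four-accelerator chain (non-star): A - B - C - D
chain: Links = {
    "A": {0: "B"},
    "B": {0: "A", 1: "C"},
    "C": {0: "B", 1: "D"},
    "D": {0: "C"},
}
tables = build_tables(chain)
```

In the chain example, node A reaches every other node through its single port, while node C uses port 0 toward A and B and port 1 toward D.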

Column data driven arithmetic expression evaluation

Methods, systems, apparatuses, and computer program products are provided for generating an instruction set for an evaluation engine. An arithmetic expression that combines multiple columns of data (e.g., a first column of data, a second column of data, etc.) is received. Instructions may be generated that, when executed by an integrated-circuit-based processor, cause the processor to evaluate the arithmetic expression. In examples, a set of instructions may be generated for each column of data represented in the arithmetic expression. For instance, the instructions may comprise a first set of instructions associated with the first column of data, a second set of instructions associated with the second column of data, and so on. The instructions may specify one or more parameters for operations associated with each column of data, such as operations to load data from a buffer, store data into a buffer, or perform arithmetic on the data.
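
A minimal sketch of per-column instruction generation follows, assuming a simple left-to-right expression such as `(price * qty) + tax`. The opcode names (`LOAD`, `STORE`, `ADD`, `SUB`, `MUL`), the single accumulator buffer `acc`, and both function names are hypothetical; the abstract does not define the actual instruction format or engine.

```python
import operator
from typing import Dict, List, Sequence, Tuple

Instruction = Tuple[str, str]  # (opcode, operand)
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def generate_instructions(
    first_column: str, steps: Sequence[Tuple[str, str]]
) -> List[List[Instruction]]:
    """Emit one instruction set per column for a left-to-right arithmetic chain."""
    program: List[List[Instruction]] = [[("LOAD", first_column), ("STORE", "acc")]]
    for op, column in steps:
        if op not in OPS:
            raise ValueError(f"unsupported operation: {op}")
        # Load the column, combine it with the accumulator buffer, store the result.
        program.append([("LOAD", column), (op.upper(), "acc"), ("STORE", "acc")])
    return program


def run(program: List[List[Instruction]], columns: Dict[str, List[float]]) -> List[float]:
    """Toy evaluation engine that executes the generated instruction sets."""
    buffers: Dict[str, List[float]] = {}
    for instruction_set in program:
        current: List[float] = []
        for opcode, operand in instruction_set:
            if opcode == "LOAD":
                current = list(columns[operand])
            elif opcode == "STORE":
                buffers[operand] = current
            else:
                fn = OPS[opcode.lower()]
                # The accumulator is the left operand, the loaded column the right.
                current = [fn(a, b) for a, b in zip(buffers[operand], current)]
    return buffers["acc"]


# (price * qty) + tax, evaluated element by element across the columns
program = generate_instructions("price", [("mul", "qty"), ("add", "tax")])
totals = run(program, {"price": [2, 3], "qty": [4, 5], "tax": [1, 1]})
```

Grouping instructions by column means each set touches exactly one input column and the shared accumulator, which is one plausible way to let an engine stream a column at a time.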
