Patent classifications
G06F15/7864
Peripheral component interconnect (PCI) backplane connectivity system on chip (SoC)
An integrated circuit. The integrated circuit comprises an interconnect communication bus and a plurality of peripheral component interconnect (PCI) multi-function endpoints (MFN-EPs) coupled to the interconnect communication bus, each PCI MFN-EP comprising a multiplexing device, a first address translation unit (ATU), and at least one PCI function circuit, each PCI function circuit comprising another ATU and a plurality of base address registers (BARs).
SCALABLE 2.5D INTERFACE CIRCUITRY
A multichip package having a main die coupled to one or more daughter dies is provided. The main die may include embedded universal interface blocks (UIB) each of which can be used to interface with a corresponding daughter die to support high bandwidth parallel or serial communications. Each UIB may include an integrated processor subsystem and associated pattern sequencing logic to perform interface initialization and margining operations. Each UIB may perform simultaneous accesses to a daughter die across one or more channels. Each UIB may also include multiple phase-locked loop circuits for providing different clock signals to different portions of the UIB and a 2x clock phase generation logic. Each UIB may include multiple IO modules, each of which may optionally include its own duty cycle correction circuit. Each IO module may include buffer circuits, each of which may have a de-emphasis control logic for adjusting buffer drive strength.
Methods and apparatus for controlling interface circuitry
A multichip package having a main die coupled to one or more daughter dies is provided. The main die may include embedded universal interface blocks (UIB) each of which can be used to interface with a corresponding daughter die to support high bandwidth parallel or serial communications. Each UIB may include an integrated processor subsystem and associated pattern sequencing logic to perform interface initialization and margining operations. Each UIB may perform simultaneous accesses to a daughter die across one or more channels. Each UIB may also include multiple phase-locked loop circuits for providing different clock signals to different portions of the UIB and a 2x clock phase generation logic. Each UIB may include multiple IO modules, each of which may optionally include its own duty cycle correction circuit. Each IO module may include buffer circuits, each of which may have a de-emphasis control logic for adjusting buffer drive strength.
Method and apparatus for performing machine learning operations in parallel on machine learning hardware
A method includes receiving a first set of data. The method also includes receiving an instruction to determine a largest value within the first set of data. The first set of data is divided into a first plurality of data portions based on a hardware architecture of a first plurality of processing elements. The first plurality of data portions is mapped to the first plurality of processing elements. Each data portion of the first plurality of data portions is mapped exclusively to a processing element of the first plurality of processing elements. Each data portion of the first plurality of data portions is processed by its respective processing element to identify a largest value from each data portion of the first plurality of data portions, wherein the processing forms a first output data comprising the largest value from each data portion of the first plurality of data portions.
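The divide, map, and per-portion maximum steps in this abstract can be sketched in software. This is a minimal illustration and not the patented hardware method: the function names, the contiguous ceiling-division split, and the final reduction over the per-portion results are all assumptions, and the processing elements are simulated sequentially.

```python
from typing import List, Sequence


def split_into_portions(data: Sequence[float], num_pes: int) -> List[Sequence[float]]:
    """Divide data into at most num_pes contiguous, non-empty portions.

    Hypothetical helper: the abstract only says the split depends on the
    hardware architecture; an even contiguous split is assumed here.
    """
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    if len(data) == 0:
        raise ValueError("data must be non-empty")
    chunk = -(-len(data) // num_pes)  # ceiling division
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]


def per_portion_max(data: Sequence[float], num_pes: int) -> List[float]:
    """Return the largest value of each portion (the 'first output data').

    Each portion is handled by exactly one simulated processing element.
    """
    return [max(portion) for portion in split_into_portions(data, num_pes)]


def largest_value(data: Sequence[float], num_pes: int) -> float:
    """Assumed final step: reduce the per-portion maxima to one value."""
    return max(per_portion_max(data, num_pes))
```

With seven values and three simulated processing elements, `per_portion_max([3, 9, 1, 7, 4, 8, 2], 3)` yields one maximum per portion, and `largest_value` reduces those to the overall maximum.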
SINGLE INSTRUCTION SET ARCHITECTURE (ISA) FORMAT FOR MULTIPLE ISAS IN MACHINE LEARNING INFERENCE ENGINE
A programmable hardware system for machine learning (ML) includes a core and an inference engine. The core receives commands from a host. The commands are in a first instruction set architecture (ISA) format. The core divides the commands into a first set for performance-critical operations, in the first ISA format, and a second set of performance non-critical operations, in the first ISA format. The core executes the second set to perform the performance non-critical operations of the ML operations and streams the first set to the inference engine. The inference engine generates a stream of the first set of commands in a second ISA format based on the first set of commands in the first ISA format. The first set of commands in the second ISA format programs components within the inference engine to execute the ML operations to infer data.
ARRAY-BASED INFERENCE ENGINE FOR MACHINE LEARNING
An array-based inference engine includes a plurality of processing tiles arranged in a two-dimensional array of a plurality of rows and a plurality of columns. Each processing tile comprises at least one on-chip memory (OCM) configured to load and maintain data from an input data stream for local access by components in the processing tile and further configured to maintain and output a result of the ML operation performed by the processing tile as an output data stream. The array includes a first processing unit (POD) configured to perform a dense and/or regular computation task of the ML operation on the data in the OCM. The array also includes a second processing unit/element (PE) configured to perform a sparse and/or irregular computation task of the ML operation on the data in the OCM and/or from the POD.
ARCHITECTURE FOR IRREGULAR OPERATIONS IN MACHINE LEARNING INFERENCE ENGINE
A processing unit of an inference engine for machine learning (ML) includes a first data load streamer, a second data load streamer, an operator component, and a store streamer. The first data load streamer streams a first data stream from an on-chip memory (OCM) to the operator component. The second data load streamer streams a second data stream from the OCM to the operator component. The operator component performs a matrix operation on the first data stream and the second data stream. The store streamer receives a data output stream from the operator component and stores the data output stream in a buffer.
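The dataflow in this abstract (two load streamers feeding an operator whose output goes to a store streamer) can be mimicked with Python generators. This is only a sketch under stated assumptions: the OCM is modeled as a dictionary of row lists, the function names are hypothetical, and element-wise addition stands in for the unspecified matrix operation.

```python
from typing import Dict, Iterable, Iterator, List

Row = List[float]


def load_streamer(ocm: Dict[str, List[Row]], key: str) -> Iterator[Row]:
    """Stream one matrix out of the modeled OCM, one row at a time."""
    for row in ocm[key]:
        yield row


def operator(stream_a: Iterable[Row], stream_b: Iterable[Row]) -> Iterator[Row]:
    """Consume both streams in lockstep and emit result rows.

    Assumption: element-wise addition is used as a stand-in for the
    matrix operation, which the abstract does not specify.
    """
    for row_a, row_b in zip(stream_a, stream_b):
        yield [x + y for x, y in zip(row_a, row_b)]


def store_streamer(out_stream: Iterable[Row], buffer: List[Row]) -> List[Row]:
    """Drain the operator's output stream into a buffer."""
    for row in out_stream:
        buffer.append(row)
    return buffer


def run_pipeline(ocm: Dict[str, List[Row]], key_a: str, key_b: str) -> List[Row]:
    """Wire the two load streamers, the operator, and the store streamer."""
    return store_streamer(
        operator(load_streamer(ocm, key_a), load_streamer(ocm, key_b)), []
    )
```

Because each stage is a generator, rows flow through the operator one at a time rather than being materialized as whole matrices first, which is the property the streamer terminology suggests.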
STREAMING ENGINE FOR MACHINE LEARNING ARCHITECTURE
A programmable hardware system for machine learning (ML) includes a core and a streaming engine. The core receives a plurality of commands and a plurality of data from a host to be analyzed and inferred via machine learning. The core transmits a first subset of the plurality of commands, comprising performance-critical operations, together with the associated data from the plurality of data, for efficient processing. The first subset of commands and the associated data are passed through via a function call. The streaming engine is coupled to the core and receives the first subset of commands and the associated data from the core. The streaming engine streams a second subset of commands of the first subset of commands and its associated data to an inference engine by executing a single instruction.
ARCHITECTURE OF CROSSBAR OF INFERENCE ENGINE
A programmable hardware system for machine learning (ML) includes a core and an inference engine. The core receives commands from a host. The commands are in a first instruction set architecture (ISA) format. The core divides the commands into a first set for performance-critical operations, in the first ISA format, and a second set of performance non-critical operations, in the first ISA format. The core executes the second set to perform the performance non-critical operations of the ML operations and streams the first set to the inference engine. The inference engine generates a stream of the first set of commands in a second ISA format based on the first set of commands in the first ISA format. The first set of commands in the second ISA format programs components within the inference engine to execute the ML operations to infer data.
ARCHITECTURE FOR DENSE OPERATIONS IN MACHINE LEARNING INFERENCE ENGINE
A processing unit of an inference engine for machine learning (ML) includes a first, a second, and a third register, and a matrix multiplication block. The first register receives a first stream of data associated with a first matrix data that is read only once. The second register receives a second stream of data associated with a second matrix data that is read only once. The matrix multiplication block performs a multiplication operation based on data from the first register and the second register resulting in an output matrix. A row associated with the first matrix is maintained while rows associated with the second matrix are fed to the matrix multiplication block to perform a multiplication operation. The process is repeated for each row of the first matrix. The third register receives the output matrix from the matrix multiplication block and stores the output matrix.
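The row-held multiplication order described in this abstract can be illustrated in plain Python. This is a sketch of the loop ordering only, under assumptions: the function name is hypothetical, the matrices are lists of rows, and the registers and read-once memory behavior of the hardware are not modeled.

```python
from typing import List

Matrix = List[List[float]]


def row_stationary_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (m x k) by b (k x n), holding one row of a at a time.

    For each held row of a, the rows of b are fed in one by one and
    accumulated into a single output row.
    """
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must match and inputs be non-empty")
    n = len(b[0])
    out: Matrix = []
    for a_row in a:                      # row of the first matrix is held
        acc = [0] * n                    # accumulator for one output row
        for j, b_row in enumerate(b):    # rows of the second matrix are fed in
            a_val = a_row[j]
            for c in range(n):
                acc[c] += a_val * b_row[c]
        out.append(acc)                  # output row is stored
    return out
```

Holding one row of the first matrix while the second matrix streams past means each output row is completed before the next held row is loaded, so only one accumulator row is live at a time.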