G06F15/17381

General-Purpose Systolic Array

A systolic array cell is described, the cell including two general-purpose arithmetic logic units (ALUs) and register-file. A plurality of the cells may be configured in a matrix or array, such that the output of the first ALU in a first cell is provided to a second cell to the right of the first cell, and the output of the second ALU in the first cell is provided to a third cell below the first cell. The two ALUs in each cell of the array allow for processing of a different instruction in each cycle.

Networked Computer With Multiple Embedded Rings
20210342284 · 2021-11-04 ·

A network comprising interconnected first and second processors, each processor comprising one or more of: multiple processing units arranged on a chip configured to execute program code; an on-chip interconnect comprising groups of exchange paths connected to receive data from corresponding groups of the processing units; external interfaces configured to communicate data off-chip as packets, each having a destination address, external interfaces of the first and second processors being connected by an external link; multiple exchange blocks, each connected to groups of the exchange paths; a routing bus configured to route packets between the exchange blocks and the external interfaces. Processing units of the first processor generate off-chip packets such that the group of processing units serviced by the first exchange block on the first processor address off-chip packets to the group of processing units on the second processor serviced by the corresponding first exchange block of the second processor.

Optimizing NOC performance using crossbars

A system including an array of processing elements, a plurality of periphery crossbars and a plurality of storage components is described. The array of processing elements is interconnected in a grid via a network on an integrated circuit. The periphery crossbars are connected to a plurality of edges of the array of processing elements. The storage components are connected to the periphery crossbars.

EXECUTION ENGINE FOR EXECUTING SINGLE ASSIGNMENT PROGRAMS WITH AFFINE DEPENDENCIES
20230334008 · 2023-10-19 ·

The execution engine is a new organization for a digital data processing apparatus, suitable for highly parallel execution of structured fine-grain parallel computations. The execution engine includes a memory for storing data and a domain flow program, a controller for requesting the domain flow program from the memory, and further for translating the program into programming information, a processor fabric for processing the domain flow programming information and a crossbar for sending tokens and the programming information to the processor fabric.

Multiprocessor system with improved secondary interconnection network

Embodiments of a multiprocessor system are disclosed that may include a plurality of processors interspersed with a plurality of data memory routers, a plurality of bus interface units, a bus control circuit, and a processor interface circuit. The data memory routers may be coupled together to form a primary interconnection network. The bus interface units and the bus control circuit may be coupled together in a daisy-chain fashion to form a secondary interconnection network. Each of the bus interface units may be configured to read or write data or instructions to a respective one of the plurality of data memory routers and a respective processor. The bus control circuit coupled with the processor interface circuit may be configured to function as a bidirectional bridge between the primary and secondary networks. The bus control circuit may also couple to other interface circuits and arbitrate their access to the secondary network.

Networked computer with multiple embedded rings

According to an aspect of the invention, there is provided a computer comprising a plurality of interconnected processing nodes arranged in a configuration with multiple stacked layers. Each layer comprises four processing nodes connected by respective links between the processing nodes. In end layers of the stack, the four processing nodes are interconnected in a ring formation by two links between the nodes, the two links adapted to operate simultaneously. Processing nodes in the multiple stacked layers provide four faces, each face comprising multiple layers, each layer comprising a pair of processing nodes. The processing nodes are programmed to operate a configuration to transmit data around embedded one-dimensional rings, each ring formed by processing nodes in two opposing faces.

Method and system for converting a single-threaded software program into an application-specific supercomputer

The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) Converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) Generating interfaces on the hardware and software parts of the application, which (i) Perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) Perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) Maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions.

Multiple Independent On-chip Interconnect

In an embodiment, a system on a chip (SOC) comprises a semiconductor die on which circuitry is formed, wherein the circuitry comprises a plurality of agents and a plurality of network switches coupled to the plurality of agents. The plurality of network switches are interconnected to form a plurality of physical and logically independent networks. A first network of the plurality of physically and logically independent networks is constructed according to a first topology and a second network of the plurality of physically and logically independent networks is constructed according to a second topology that is different from the first topology. For example, the first topology may a ring topology and the second topology may be a mesh topology. In an embodiment, coherency may be enforced on the first network and the second network may be a relaxed order network.

Systems and methods for implementing a machine perception and dense algorithm integrated circuit and enabling a flowing propagation of data within the integrated circuit

Systems and methods of propagating data within an integrated circuit includes: identifying a coarse data propagation path for distinct subsets of data of an input dataset that includes: setting inter-core data movements for the distinct subsets of data, the inter-core data movements defining a predetermined propagation of a given subset of data between two or more of a plurality of cores of an integrated circuit array of the integrated circuit; identifying a granular data propagation path for each distinct subset of data that includes: setting intra-core data movements for each distinct subset of data, the intra-core data movements defining a predetermined propagation of the given subset of data within one or more of the plurality of cores of the integrated circuit array of the integrated circuit; enabling a flow of the input dataset within the integrated circuit based on the coarse data propagation path and the granular propagation path.

Management of access restriction within a system on chip

A system includes a plurality of items of master equipment, each having a programing interface, and a plurality of slave equipment. An interconnect circuit is coupled between the items of master equipment and the items of slave equipment. Each transaction is assigned an attribute capable of taking on at least two attribute values corresponding to at least two states for the master equipment. Each item of slave equipment is associated with an identifier capable of taking on at least two values corresponding respectively to at least two properties for the slave equipment. Each item of master equipment automatically inherits the property of its programing interface. A filtering circuit is configured to, in the presence of a transaction intended for an item of slave equipment, compare the corresponding attribute value with an identifier value of the intended slave equipment and reject or not reject the transaction based on the comparison.