G06F15/17381

Method and system for converting a single-threaded software program into an application-specific supercomputer

The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) Converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) Generating interfaces on the hardware and software parts of the application, which (i) Perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) Perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) Maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions.

Networked computer with multiple embedded rings
11704270 · 2023-07-18

A network comprising interconnected first and second processors, each processor comprising one or more of: multiple processing units arranged on a chip configured to execute program code; an on-chip interconnect comprising groups of exchange paths connected to receive data from corresponding groups of the processing units; external interfaces configured to communicate data off-chip as packets, each having a destination address, external interfaces of the first and second processors being connected by an external link; multiple exchange blocks, each connected to groups of the exchange paths; a routing bus configured to route packets between the exchange blocks and the external interfaces. Processing units of the first processor generate off-chip packets such that the group of processing units serviced by the first exchange block on the first processor address off-chip packets to the group of processing units on the second processor serviced by the corresponding first exchange block of the second processor.

Parallel processing of reduction and broadcast operations on large datasets of non-scalar data

Methods, systems, and apparatus, including instructions encoded on storage media, for performing reduction of gradient vectors and similarly structured data that are generated in parallel, for example, on nodes organized in a mesh or torus topology defined by connections in at least two dimensions between the nodes. The methods provide parallel computation and communication between nodes in the topology.
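
The abstract does not state the reduction schedule. As a minimal sketch, assuming a dimension-by-dimension scheme (reduce along each row, then across rows), the following simulates an all-reduce of gradient vectors on a two-dimensional grid of nodes. The function name `grid_allreduce` and the nested-list layout are hypothetical and chosen only for illustration.

```python
def grid_allreduce(grid):
    """Simulate a sum all-reduce over a 2D grid of nodes.

    grid[r][c] is the gradient vector (a list of numbers) held by node (r, c).
    Assumption: the reduction proceeds one dimension at a time; the source
    abstract does not specify this schedule.
    """
    rows, cols = len(grid), len(grid[0])

    # Phase 1: every row reduces its vectors element-wise.
    row_sums = [[sum(vals) for vals in zip(*grid[r])] for r in range(rows)]

    # Phase 2: the per-row results are reduced element-wise across rows.
    total = [sum(vals) for vals in zip(*row_sums)]

    # Broadcast: every node ends up with its own copy of the full sum.
    return [[list(total) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    grads = [[[1, 2], [3, 4]],
             [[5, 6], [7, 8]]]
    print(grid_allreduce(grads)[0][0])
```

In a real mesh or torus the rows (and then the columns) would work concurrently over their own links, which is where the parallel communication comes from; this sketch only models the data flow.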

High bandwidth memory system with distributed request broadcasting masters

A system comprises a processor and a plurality of memory units. The processor is coupled to each of the plurality of memory units by a plurality of network connections. The processor includes a plurality of processing elements arranged in a two-dimensional array and a corresponding two-dimensional communication network communicatively connecting each of the plurality of processing elements to other processing elements on the same axes of the two-dimensional array. Each processing element that is located along a diagonal of the two-dimensional array is configured as a request broadcasting master for a respective group of processing elements located along a same axis of the two-dimensional array.
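
The diagonal placement of masters can be illustrated with a short sketch. It assumes a square n × n array and that each master serves either its own row or its own column; the abstract says only "a same axis", so the `axis` parameter and both function names are hypothetical.

```python
def broadcast_masters(n):
    """Return the coordinates of the diagonal processing elements of an n x n array."""
    return [(i, i) for i in range(n)]


def master_of(row, col, axis="row"):
    """Return the diagonal element assumed to act as master for element (row, col).

    Assumption: with axis="row" the master of a row is the diagonal element in
    that row; with axis="col" it is the diagonal element in that column.
    """
    if axis == "row":
        return (row, row)
    if axis == "col":
        return (col, col)
    raise ValueError("axis must be 'row' or 'col'")


if __name__ == "__main__":
    print(broadcast_masters(3))
    print(master_of(2, 0))
```

Because every row and every column of a square array contains exactly one diagonal element, this choice gives each group along an axis exactly one master.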

System on Chip Comprising a Connection Interface Between Master Devices and Slave Devices
20220405232 · 2022-12-22

In an embodiment a system on chip includes at least one master device, at least one slave device, a connection interface configured to route signals between the at least one master device and the at least one slave device, the connection interface configured to operate according to configuration parameters, and a configuration bus connected to the connection interface, wherein the configuration bus is configured to deliver new configuration parameters to the connection interface so as to adapt operation of the connection interface.

Embedding rings on a toroid computer network
11531637 · 2022-12-20

A computer comprising a plurality of interconnected processing nodes arranged in a toroid configuration in which multiple layers of interconnected nodes are arranged along an axis; each layer comprising a plurality of processing nodes connected in a ring in a non-axial plane by at least an intralayer respective set of links between each pair of neighbouring processing nodes, the links in each set adapted to operate simultaneously; wherein each of the processing nodes in each layer is connected to a respective corresponding node in each adjacent layer by an interlayer link to form respective rings along the axis; the computer programmed to provide a plurality of embedded one-dimensional logical paths and to transmit data around each of the embedded one-dimensional paths in such a manner that the plurality of embedded one-dimensional logical paths operate simultaneously, each logical path using all processing nodes of the computer in a sequence.

APPLIANCES AND METHODS TO PROVIDE ROBUST COMPUTATIONAL SERVICES IN ADDITION TO A/V ENCODING, FOR EXAMPLE AT EDGE OF MESH NETWORKS
20220398216 · 2022-12-15

An appliance includes a system on chip (SOC) and converter. The appliance accepts A/V data (e.g., HDMI®, SDI®, IP) from external sources, encodes A/V data using an encoder of the SOC and performs additional services via other computational components of the SOC. The SOC may be a mobile SOC. The appliance may operate as an edge appliance, edge encoder, or edge-based origin-server, for instance at an edge endpoint of a mesh network, allowing many-to-many distribution of A/V data, performing computationally efficient A/V encoding, while also making available additional computational resources (e.g., cycles of CPUs, GPUs, DSPs, AI/ML NPUs) to provide other services at the edge in addition to efficient A/V encoding.

Networked computer
11614946 · 2023-03-28

A computer comprising a plurality of processing nodes is provided. Each processing node has at least one processor configured to process input data to generate an array of data items. The processing nodes are arranged in cliques in which each processing node of a clique is connected to each other processing node in the clique by first and second clique links. The cliques are inter-connected in rings such that each processing node is a member of a single clique and a single ring. The processing nodes of all cliques are configured to exchange in each exchange step of a machine learning collective via the respective first and second clique links at least two data items with the other processing node(s) in its clique, and all processing nodes are configured to reduce each received data item with the data item in the corresponding position in the array on that processing node.
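
One way to picture this arrangement is the sketch below, which builds a neighbour table for nodes labelled (clique index, position in clique). It assumes that each ring joins the nodes holding the same position in successive cliques; the abstract does not say how ring membership is chosen, and the name `clique_ring_topology` is hypothetical.

```python
def clique_ring_topology(num_cliques, clique_size):
    """Build a neighbour table for nodes arranged in cliques joined by rings.

    Nodes are labelled (c, k): clique index c, position k within the clique.
    Assumption: ring k links node (c, k) to (c - 1, k) and (c + 1, k), modulo
    the number of cliques, so each node is in exactly one clique and one ring.
    """
    topo = {}
    for c in range(num_cliques):
        for k in range(clique_size):
            # Fully connected within the clique.
            clique_peers = [(c, j) for j in range(clique_size) if j != k]
            # Ring neighbours in the adjacent cliques (deduplicated for small rings).
            ring_peers = sorted(
                {((c - 1) % num_cliques, k), ((c + 1) % num_cliques, k)} - {(c, k)}
            )
            topo[(c, k)] = {"clique": clique_peers, "ring": ring_peers}
    return topo


if __name__ == "__main__":
    topo = clique_ring_topology(3, 2)
    print(topo[(0, 0)])
```

The sketch models the connectivity only; the paired first and second clique links and the per-step exchange of data items described in the abstract are not represented.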

DIAGONAL TORUS NETWORK

A device is disclosed that includes multiple channels and multiple processing nodes. Each processing node includes input/output (I/O) ports coupled to the channels and channel control modules coupled to the I/O ports. Each processing node is configured to select, by the channel control module in a first operation, a first I/O port of the I/O ports; communicate a first message, via the first I/O port, to a first processing node over a first channel or a second processing node over a second channel orthogonal to the first channel in a logic representation; select, by the channel control module in a second operation, a second I/O port of the I/O ports; and communicate a second message, via the second I/O port, to a third processing node over a third channel extending in a diagonal direction and non-orthogonal to the first and second channels in the logic representation.

Information processing apparatus, information processing method and non-transitory computer-readable storage medium for storing information processing program of determining relations among nodes in N-dimensional torus structure
11467876 · 2022-10-11

An information processing apparatus for controlling a plurality of nodes mutually coupled via a plurality of cables, the apparatus includes: a memory; and a processor coupled to the memory, the processor being configured to cause a first node to execute first processing to extract a coupling relationship between the plurality of nodes, the first node being one of the plurality of nodes, being sequentially allocated from each of the plurality of nodes, the first processing including executing allocation processing that allocates unique coordinate information to the first node and allocates common coordinate information to nodes excluding the first node; executing transmission processing that causes the first node to transmit first information to each of the cables coupled to the first node; and executing identification processing that identifies a node having received the first information as a neighboring node coupled to one of the plurality of cables coupled to the first node.
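
The three processing steps named in the abstract (allocation, transmission, identification) can be sketched as a small simulation. The representation of cables as node pairs, the coordinate values 1 and 0, and the function name `identify_neighbors` are assumptions made for illustration, not details from the source.

```python
def identify_neighbors(nodes, cables):
    """Simulate neighbour discovery, taking each node in turn as the 'first node'.

    nodes:  list of node names.
    cables: list of (a, b) pairs, each a cable coupling node a to node b.
    """
    neighbors = {}
    for first in nodes:
        # Allocation: a unique coordinate for the first node, a common one for the rest.
        coords = {n: (1 if n == first else 0) for n in nodes}

        # Transmission: the first node sends on every cable attached to it.
        receivers = set()
        for a, b in cables:
            if a == first:
                receivers.add(b)
            elif b == first:
                receivers.add(a)

        # Identification: nodes that received the message and hold the common
        # coordinate are the neighbours of the first node.
        neighbors[first] = sorted(n for n in receivers if coords[n] == 0)
    return neighbors


if __name__ == "__main__":
    print(identify_neighbors(["A", "B", "C"], [("A", "B"), ("B", "C")]))
```

Repeating the procedure with every node as the first node yields the full coupling relationship, which a controller could then compare against the expected torus wiring.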