G06F15/17393

Methods and apparatus for sharing nodes in a network with connections based on 1 to k+1 adjacency used in an execution array memory array (XarMa) processor
11249939 · 2022-02-15 ·

An Execution Array Memory Array (XarMa©) processor is described for signal processing and internet of things (IoT) applications, (pronounced sharma, that means happiness in Sanskrit). The XarMa© processor uses a 1 to K+1 adjacency network in an array of execution units. The 1 to K+1 adjacency refers to connections separately made in rows and in columns of execution unit and local file nodes, where the number of R.sub.ows≥K>1 and of C.sub.olumns≥K>1 and K is an odd integer. Instead of a large central multi-ported register file, a distributed set of storage files local to each execution unit is used. The instruction set architecture uses instructions that specify forwarding of execution results to execution units associated with destination instructions. This execution array is scalable to support cost effective and low power high-performance application specific processing focused on target product requirements.

Emulating broadcast in a network on chip
11411759 · 2022-08-09 · ·

An integrated circuit chip has a set of communication units, each unit being configured to operate according to a protocol in which a data packet sent by one unit is receivable by one unit only, each unit being configured to send at least one packet having one of a plurality of tiers to at least one other unit and being configured to specify, for each tier, a subset of destination units to which packets of that tier are to be sent, wherein each unit is configured to: receive a packet having one of the plurality of tiers; determine the tier of the received packet; and sequentially send packets having a different tier to the tier of the received packet to each of the respective subset of destination units for the different tier.

Emulating Broadcast in a Network on Chip
20210036880 · 2021-02-04 ·

An integrated circuit chip has a set of communication units, each unit being configured to operate according to a protocol in which a data packet sent by one unit is receivable by one unit only, each unit being configured to send at least one packet having one of a plurality of tiers to at least one other unit and being configured to specify, for each tier, a subset of destination units to which packets of that tier are to be sent, wherein each unit is configured to: receive a packet having one of the plurality of tiers; determine the tier of the received packet; and sequentially send packets having a different tier to the tier of the received packet to each of the respective subset of destination units for the different tier.

Embedding global barrier and collective in a torus network

Embodiments of the invention provide a method, system and computer program product for embedding a global barrier and global interrupt network in a parallel computer system organized as a torus network. The computer system includes a multitude of nodes. In one embodiment, the method comprises taking inputs from a set of receivers of the nodes, dividing the inputs from the receivers into a plurality of classes, combining the inputs of each of the classes to obtain a result, and sending said result to a set of senders of the nodes. Embodiments of the invention provide a method, system and computer program product for embedding a collective network in a parallel computer system organized as a torus network. In one embodiment, the method comprises adding to a torus network a central collective logic to route messages among at least a group of nodes in a tree structure.

Methods and Apparatus for Sharing Nodes in a Network with Connections Based on 1 to K+1 Adjacency Used in an Execution Array Memory Array (XarMa) Processor
20200250131 · 2020-08-06 ·

An Execution Array Memory Array (XarMa) processor is described for signal processing and internet of things (IoT) applications, (pronounced sharma, that means happiness in Sanskrit). The XarMa processor uses a 1 to K+1 adjacency network in an array of execution units. The 1 to K+1 adjacency refers to connections separately made in rows and in columns of execution unit and local file nodes, where the number of R.sub.owsK>1 and of C.sub.olumnsK>1 and K is an odd integer. Instead of a large central multi-ported register file, a distributed set of storage files local to each execution unit is used. The instruction set architecture uses instructions that specify forwarding of execution results to execution units associated with destination instructions. This execution array is scalable to support cost effective and low power high-performance application specific processing focused on target product requirements.

Opcode counting for performance measurement

Methods, systems and computer program products are disclosed for measuring a performance of a program running on a processing unit of a processing system. In one embodiment, the method comprises informing a logic unit of each instruction in the program that is executed by the processing unit, assigning a weight to each instruction, assigning the instructions to a plurality of groups, and analyzing the plurality of groups to measure one or more metrics. In one embodiment, each instruction includes an operating code portion, and the assigning includes assigning the instructions to the groups based on the operating code portions of the instructions. In an embodiment, each type of instruction is assigned to a respective one of the plurality of groups. These groups may be combined into a plurality of sets of the groups.

SERIALIZED BROADCAST COMMAND MESSAGING IN A DISTRIBUTED SYMMETRIC MULTIPROCESSING (SMP) SYSTEM

Serialized broadcast command messaging in a distributed symmetric multiprocessing (SMP) system, including: sending a serial broadcast command from a home chip to a plurality of other chips comprising serial primary chip, wherein each chip of the plurality of other chips comprises a broadcast controller mapped to a home chip broadcast controller; assigning, by the serial primary chip, a tag to the serial broadcast command; sending, from the serial primary chip, to the home chip and a remainder of the plurality of other chips, the tag; and broadcasting, by the home chip and each chip of the plurality of other chips, the serial broadcast command in an order based on the tag of the serial broadcast command.

Combining Multiple Data Streams at a Receiver for QoS Balance in Asymmetric Communications
20180287943 · 2018-10-04 · ·

A receiver for combining multiple data streams at a destination device, where at least one of the multiple data streams may be a redundant data stream sent intermittently to a destination device. The redundant data stream may be sent intermittently depending on the availability of resources available to a source device. Based on monitoring of quality of service (QoS) parameters of an original data stream on the uplink, a base station functioning as an auxiliary receiver may be intermittently activated to intercept the data stream as a redundant data stream. The auxiliary receiver may then send the redundant data stream to the destination device to it QoS for the original data stream. The receiver may receive the data stream alone, or receive both of the redundant second data stream and the original data stream simultaneously, on an intermittent basis to improve QoS for the original data stream.

OPCODE COUNTING FOR PERFORMANCE MEASUREMENT
20180203693 · 2018-07-19 ·

Methods, systems and computer program products are disclosed for measuring a performance of a program running on a processing unit of a processing system. In one embodiment, the method comprises informing a logic unit of each instruction in the program that is executed by the processing unit, assigning a weight to each instruction, assigning the instructions to a plurality of groups, and analyzing the plurality of groups to measure one or more metrics. In one embodiment, each instruction includes an operating code portion, and the assigning includes assigning the instructions to the groups based on the operating code portions of the instructions. In an embodiment, each type of instruction is assigned to a respective one of the plurality of groups. These groups may be combined into a plurality of sets of the groups.

PROCESSOR SYSTEM AND METHOD FOR INCREASING DATA-TRANSFER BANDWIDTH DURING EXECUTION OF A SCHEDULED PARALLEL PROCESS

A broadcast subsystem of a processor system includes: a set of broadcast buses, each broadcast bus in the set of broadcast buses electrically coupled to a subset of primary memory units in the set of primary memory units; a primary memory unit queue: configured to store a first set of data transfer requests associated with the set of primary memory units; electrically coupled to the data buffer a broadcast scheduler: electrically coupled to the primary memory unit queue; electrically coupled to the set of broadcast buses; and configured to transfer source data from the data buffer to a target subset of primary memory units in the set of primary memory units via the set of broadcast buses based on the set of data transfer requests stored in the primary memory unit queue.