G06F2213/0062

Arbitrating portions of transactions over virtual channels associated with an interconnect

Arbitrating among portions of multiple transactions and transmitting a winning portion over one of a multiplicity of virtual channels associated with an interconnect on a clock cycle-by-clock cycle basis. By repeating the arbitration each clock cycle, winning portions are interleaved and transmitted over the multiplicity of virtual channels across multiple clock cycles.
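
The abstract does not give the arbitration policy, so the following is a minimal sketch assuming simple round-robin arbitration: each clock cycle, one pending transaction portion wins per virtual channel, which interleaves the portions of competing transactions over successive cycles. Names such as Transaction, NUM_VCS, and arbitrate_cycle are illustrative, not from the patent.

```python
from collections import deque

NUM_VCS = 4  # assumed number of virtual channels on the interconnect

class Transaction:
    def __init__(self, tid, num_portions, vc):
        self.tid = tid
        self.vc = vc                      # virtual channel assigned to this transaction
        self.portions = deque(f"{tid}.p{i}" for i in range(num_portions))

def arbitrate_cycle(pending, rr_state):
    """Pick one winning portion per virtual channel for this clock cycle."""
    winners = {}
    for vc in range(NUM_VCS):
        candidates = [t for t in pending if t.vc == vc and t.portions]
        if not candidates:
            continue
        # Round-robin: rotate the starting point so each transaction gets a turn.
        start = rr_state.get(vc, 0) % len(candidates)
        winner = candidates[start]
        winners[vc] = winner.portions.popleft()
        rr_state[vc] = start + 1
    return winners

# Example: two transactions share VC 0, one uses VC 1; their portions interleave per cycle.
pending = [Transaction("A", 3, vc=0), Transaction("B", 3, vc=0), Transaction("C", 2, vc=1)]
rr_state, cycle = {}, 0
while any(t.portions for t in pending):
    print(f"cycle {cycle}: {arbitrate_cycle(pending, rr_state)}")
    cycle += 1
```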

Mitigating communication bottlenecks during parameter exchange in data-parallel DNN training

An interconnect topology for communication between GPUs in a computing system is determined. A set of directed spanning trees is generated and packed for transmitting data between the GPUs over the interconnect topology. The directed spanning trees define the connections between GPUs to be utilized for the transmission and the amount of data to be transmitted on each connection. Program code is generated for implementing the data transfer defined by the directed spanning trees. When the program code is executed, the directed spanning trees are used to pipeline the transmission of chunks of data, such as model parameters used during data-parallel DNN training, between the GPUs. The program code can also determine an optimal chunk size for data to be transferred between the GPUs.
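
The packing algorithm itself is not described in the abstract; the sketch below assumes a simple greedy scheme that repeatedly extracts a spanning tree rooted at a source GPU from the residual link capacities of the interconnect topology. The topology representation, capacities, and function names are illustrative.

```python
from collections import deque

def extract_tree(capacity, root, gpus):
    """BFS out from the root over links that still have capacity."""
    tree, visited, frontier = [], {root}, deque([root])
    while frontier:
        u = frontier.popleft()
        for v in gpus:
            if v not in visited and capacity.get((u, v), 0) > 0:
                tree.append((u, v))
                visited.add(v)
                frontier.append(v)
    return tree if len(visited) == len(gpus) else None  # must span all GPUs

def pack_trees(capacity, root, gpus, max_trees=8):
    """Greedily pack directed spanning trees, consuming link capacity per tree."""
    trees = []
    while len(trees) < max_trees:
        tree = extract_tree(capacity, root, gpus)
        if tree is None:
            break
        trees.append(tree)
        for edge in tree:               # consume one unit of capacity per link used
            capacity[edge] -= 1
    return trees

# Toy 3-GPU topology: every directed link can carry 2 units.
gpus = [0, 1, 2]
capacity = {(u, v): 2 for u in gpus for v in gpus if u != v}
for i, t in enumerate(pack_trees(capacity, root=0, gpus=gpus)):
    print(f"tree {i}: {t}")
```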

Network interface device that sets an ECN-CE bit in response to detecting congestion at an internal bus interface

A network device includes a Network Interface Device (NID) and multiple servers. Each server is coupled to the NID via a corresponding PCIe bus. The NID has a network port through which it receives packets destined for the servers. When the NID detects a PCIe congestion condition on the PCIe bus to a packet's destination server, rather than transferring the packet across the bus, it buffers the packet and places a pointer to the packet in an overflow queue. If the level of bus congestion is high, the NID sets the packet's ECN-CE bit. When the PCIe bus congestion subsides, the packet passes to the server. The server responds by returning an ACK whose ECE bit is set. The originating TCP endpoint in turn reduces the rate at which it sends data to the destination server, thereby reducing congestion at the PCIe bus interface within the network device.
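
A minimal sketch of the overflow-queue behavior described above, assuming a simple queue-depth threshold stands in for the NID's congestion measure: packets queued while the PCIe bus is congested get their ECN field set to CE once the queue is deep enough, and are delivered when the bus drains. The threshold, queue structure, and class names are illustrative assumptions.

```python
from collections import deque

ECN_CE = 0b11                 # ECN-CE codepoint carried in the IP header
HIGH_CONGESTION_DEPTH = 64    # assumed overflow-queue depth that counts as "high" congestion

class PcieEgress:
    def __init__(self):
        self.overflow = deque()           # pointers to buffered packets awaiting the PCIe bus

    def enqueue(self, packet):
        if len(self.overflow) >= HIGH_CONGESTION_DEPTH:
            packet["ecn"] = ECN_CE        # mark the packet instead of dropping it
        self.overflow.append(packet)

    def drain(self, transfer_to_server):
        """Called when PCIe congestion subsides: push queued packets to the server."""
        while self.overflow:
            transfer_to_server(self.overflow.popleft())

egress = PcieEgress()
packets = [{"seq": i, "ecn": 0b00} for i in range(70)]
for p in packets:
    egress.enqueue(p)
delivered = []
egress.drain(delivered.append)
print("delivered:", len(delivered), "marked CE:", sum(p["ecn"] == ECN_CE for p in packets))
```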

MITIGATING COMMUNICATION BOTTLENECKS DURING PARAMETER EXCHANGE IN DATA-PARALLEL DNN TRAINING

Technologies are disclosed herein for dynamically generating communication primitives for use in model parameter synchronization during data-parallel DNN training by packing directed spanning trees. An interconnect topology for communication between GPUs in a computing system is determined. A set of directed spanning trees is generated and packed for transmitting data between the GPUs over the interconnect topology. The directed spanning trees define the connections between GPUs to be utilized for the transmission and the amount of data to be transmitted on each connection. Program code is generated for implementing the data transfer defined by the directed spanning trees. When the program code is executed, the directed spanning trees are used to pipeline the transmission of chunks of data, such as model parameters used during data-parallel DNN training, between the GPUs. The program code can also determine an optimal chunk size for data to be transferred between the GPUs.
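
To illustrate the pipelining aspect, here is a sketch of how chunks of a parameter buffer could be staged along one directed spanning tree: at each step every edge forwards the next chunk it has received, so chunk k+1 follows chunk k through the tree one hop behind. The chunk count, tree shape, and scheduling function are illustrative assumptions, not the patent's generated program code.

```python
def pipeline_schedule(tree_edges, num_chunks):
    """Return, per time step, the (edge, chunk) transfers that run in parallel."""
    # Depth of each node from the root determines when a chunk can reach it.
    depth = {tree_edges[0][0]: 0}
    for u, v in tree_edges:           # edges assumed listed parent-before-child
        depth[v] = depth[u] + 1
    schedule = {}
    for chunk in range(num_chunks):
        for u, v in tree_edges:
            step = chunk + depth[u]   # chunk leaves the root at time `chunk`
            schedule.setdefault(step, []).append(((u, v), chunk))
    return schedule

# Chain 0 -> 1 -> 2 with 3 chunks: total steps = num_chunks + tree depth - 1.
for step, xfers in sorted(pipeline_schedule([(0, 1), (1, 2)], 3).items()):
    print(f"step {step}: {xfers}")
```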

Procedures for implementing source based routing within an interconnect fabric on a system on chip

Optimizing transaction traffic on a System on a Chip (SoC) by using procedures such as expanding transactions and consolidating responses at nodes of an interconnect fabric for broadcast, multi-cast, any-cast, and source-based-routing transactions; intra-streaming two or more transactions over a stream defined by a paired virtual channel and transaction class; trunking physical resources that share a common logical identifier; and using hashing to select among multiple physical resources sharing a common logical identifier.
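
As a small illustration of the hashing procedure named above, the sketch below hashes a transaction's address and source to pick one of several trunked physical ports that share a single logical identifier, so traffic spreads across the trunk while a given flow stays on one path. The field choices, port names, and hash function are illustrative assumptions.

```python
import zlib

PHYSICAL_PORTS = {"MEM0": ["port_a", "port_b", "port_c"]}  # logical id -> trunked physical ports

def select_port(logical_id, address, source_id):
    ports = PHYSICAL_PORTS[logical_id]
    key = f"{address:#x}:{source_id}".encode()
    return ports[zlib.crc32(key) % len(ports)]   # deterministic spread across the trunk

# Transactions to the same logical target fan out across the trunked ports,
# while a given (address, source) pair always lands on the same port.
for addr in (0x1000, 0x1040, 0x1080, 0x10C0):
    print(hex(addr), "->", select_port("MEM0", addr, source_id=3))
```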

Sizing of one or more jobs within one or more time windows

Aspects of the present invention provide systems and methods that ascertain the time windows in which a task or tasks are best performed and that size the number of tasks so they can be completed within those windows. In embodiments, a system for estimating a system resource comprises a component selection system configured to receive a selection of one or more components and/or one or more metrics to be monitored. In embodiments, an analyzer uses at least some of the gathered data to determine one or more resource capacity windows for performing a task or tasks and determines an appropriate job size or sizes for scheduling the tasks to be performed within the one or more resource capacity windows.
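
A minimal sketch of the window-finding and job-sizing idea, assuming hourly utilization samples for a monitored component: find contiguous spans where free capacity stays above a threshold, then compute how many tasks of a known per-task cost fit inside each span. The thresholds, units, and per-task cost are illustrative assumptions.

```python
def capacity_windows(utilization, threshold=0.5):
    """Yield (start_hour, end_hour) spans where free capacity stays >= threshold."""
    start = None
    for hour, used in enumerate(utilization):
        if 1.0 - used >= threshold:
            start = hour if start is None else start
        elif start is not None:
            yield (start, hour)
            start = None
    if start is not None:
        yield (start, len(utilization))

def size_job(window, utilization, task_cost=0.2):
    """Number of tasks that fit in the window, given capacity-hours consumed per task."""
    free = sum(1.0 - utilization[h] for h in range(*window))
    return int(free // task_cost)

usage = [0.9, 0.8, 0.3, 0.2, 0.25, 0.7, 0.4, 0.35]   # fraction of capacity in use per hour
for w in capacity_windows(usage):
    print(f"window {w}: schedule up to {size_job(w, usage)} tasks")
```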

Techniques for optimizing entropy computations

Techniques for data processing may include: computing an entropy value for a data chunk; determining, in accordance with the entropy value for the data chunk, whether the data chunk is compressible; and responsive to determining the data chunk is compressible based on the entropy value, compressing the data chunk. The entropy value may be determined by maintaining counters for data items, where the counters denote current frequencies of the different allowable data items in the data chunk, and performing second processing using the counters to determine the entropy value for the data chunk, wherein the second processing includes selecting a precomputed binary logarithmic value from a table for each of the counters. The table may include integer representations of binary logarithmic values. The second processing may include loading multiple data items of the chunk into a register, extracting each data item from the register, and incrementing a corresponding counter.
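
A minimal sketch of the table-driven entropy check: count byte frequencies in a chunk, look up each counter's entropy contribution in a table of precomputed, integer-scaled binary logarithm values, and compare the resulting entropy (bits per byte) against a compressibility threshold. The chunk size, table layout, scaling factor, and threshold are illustrative assumptions.

```python
import math

CHUNK_SIZE = 8192
SCALE = 1_000_000
# Precomputed -(c/N)*log2(c/N) for every possible counter value c, stored as scaled integers
# (the abstract mentions integer representations of binary logarithmic values).
LOG_TABLE = [0] + [
    round(-(c / CHUNK_SIZE) * math.log2(c / CHUNK_SIZE) * SCALE)
    for c in range(1, CHUNK_SIZE + 1)
]

def chunk_entropy(chunk: bytes) -> float:
    counters = [0] * 256                 # frequency of each allowable byte value
    for b in chunk:
        counters[b] += 1
    scaled = sum(LOG_TABLE[c] for c in counters if c)   # one table lookup per nonzero counter
    return scaled / SCALE                # entropy in bits per byte, 0..8

def is_compressible(chunk: bytes, threshold: float = 6.0) -> bool:
    return chunk_entropy(chunk) < threshold

low = bytes([0, 1, 2, 3]) * (CHUNK_SIZE // 4)          # repetitive data, low entropy
print(round(chunk_entropy(low), 3), is_compressible(low))
```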

Seamlessly Integrated Microcontroller Chip
20240126708 · 2024-04-18

Techniques in electronic systems, such as in systems comprising a CPU die and one or more external mixed-mode (analog) chips, may provide advantages in one or more of system design, performance, cost, efficiency, and programmability. In one embodiment, the CPU die comprises at least one microcontroller CPU and circuitry enabling the at least one CPU to have full and transparent connectivity to an analog chip as if they were designed as a single-chip microcontroller, while the interface between the two is extremely efficient and limited in the number of wires, yet may provide improved performance without impact to functionality or the software model.

PROCEDURES FOR IMPROVING EFFICIENCY OF AN INTERCONNECT FABRIC ON A SYSTEM ON CHIP

Optimizing transaction traffic on a System on a Chip (SoC) by using procedures such as expanding transactions and consolidating responses at nodes of an interconnect fabric for broadcast, multi-cast, any-cast, and source-based-routing transactions; intra-streaming two or more transactions over a stream defined by a paired virtual channel and transaction class; trunking physical resources that share a common logical identifier; and using hashing to select among multiple physical resources sharing a common logical identifier.
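
To complement the hashing sketch above, here is a sketch of the expand/consolidate procedure at a fabric node: one multicast request is expanded into a request per downstream target, and the node returns a single consolidated response upstream only after every per-target response has arrived. The node and target names, transaction ids, and response format are illustrative assumptions.

```python
class FabricNode:
    def __init__(self, downstream):
        self.downstream = downstream            # target id -> callable(request) -> response
        self.pending = {}                       # txn id -> set of targets still outstanding

    def expand(self, txn_id, request, targets):
        """Fan one multicast request out to every listed downstream target."""
        self.pending[txn_id] = set(targets)
        return [(t, self.downstream[t](request)) for t in targets]

    def consolidate(self, txn_id, responses):
        """Merge per-target responses into the single reply sent back to the source."""
        for target, _ in responses:
            self.pending[txn_id].discard(target)
        if self.pending[txn_id]:
            return None                          # still waiting on some targets
        del self.pending[txn_id]
        status = "ok" if all(r == "ok" for _, r in responses) else "error"
        return {"txn": txn_id, "status": status}

node = FabricNode({t: (lambda req, t=t: "ok") for t in ("mem0", "mem1", "io0")})
responses = node.expand(42, {"op": "write", "addr": 0x2000}, ["mem0", "mem1", "io0"])
print(node.consolidate(42, responses))
```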