Patent classifications
G06F15/17381
POWER MANAGEMENT CIRCUIT AND SYSTEM THEREOF
A power management circuit and system thereof are provided. The power management circuit includes M×N computing units, a first power supply unit, a second power supply unit and N-1 connection interfaces. M and N are both natural numbers greater than 1. The first power supply unit supplies power to the computing units of the Nth row, the computing units of the Nth row supply power to the computing units of the (N-1)th row, respectively, and so on until the computing units of the 2nd row supply power to the computing units of the 1st row, respectively. The second power supply unit supplies power to the M×N computing units, and the N-1 connection interfaces coupled to corresponding computing units of the 1st column of the M×N computing units, respectively.
System, apparatus and method for increasing bandwidth of edge-located agents of an integrated circuit
In one embodiment, a system on chip includes: a plurality of intellectual property (IP) agents formed on a semiconductor die; a mesh interconnect formed on the semiconductor die to couple the plurality of IP agents, and a plurality of mesh stops each to couple one or more of the plurality of IP agents to the mesh interconnect. The mesh interconnect may be formed of a plurality of rows each having one of a plurality of horizontal interconnects and a plurality of columns each having one of a plurality of vertical interconnects;, where at least one of the plurality of rows includes an asymmetrical number of mesh stops. Other embodiments are described and claimed.
Processor architecture
A processor having a functional slice architecture is divided into a plurality of functional units (“tiles”) organized into a plurality of slices. Each slice is configured to perform specific functions within the processor, which may include memory slices (MEM) for storing operand data, and arithmetic logic slices for performing operations on received operand data. The tiles of the processor are configured to stream operand data across a first dimension, and receive instructions across a second dimension orthogonal to the first dimension. The timing of data and instruction flows are configured such that corresponding data and instructions are received at each tile with a predetermined temporal relationship, allowing operand data to be transmitted between the slices of the processor without any accompanying metadata. Instead, each slice is able to determine what operations to perform on received data based upon the timing at which the data is received.
Architecture and programming in a parallel processing environment with a tiled processor having a direct memory access controller
An integrated circuit includes a plurality of tiles. Each tile includes a processor, a switch including switching circuitry to forward data over data paths from other tiles to the processor and to switches of other tiles, and a switch memory that stores instruction streams that are able to operate independently for respective output ports of the switch. Also disclosed is a direct memory access (DMA) scheme in which sizes of DMA transfers are limited according to whether a cache miss has occurred.
DISTRIBUTED GRAPHICS PROCESSOR UNIT ARCHITECTURE
The present disclosure is directed to a distributed graphics processor unit (GPU) architecture that includes an array of processing nodes. Each processing node may include a GPU node that is coupled to its own fast memory unit and its own storage unit. The fast memory unit and storage unit may be integrated into a single unit or may be separately coupled to the GPU node. The processing node may have its fast memory unit coupled to both the GPU node and the storage node. The various architectures provide a GPU-based system that may be treated as a storage unit, such as solid state drive (SSD) that performs onboard processing to perform memory-oriented operations. In this respect, the system may be viewed as a “smart drive” for big-data near-storage processing.
Programmable cache coherent node controller
A computer system includes a first group of CPU modules operatively coupled to at least one first Programmable ASIC Node Controller configured to execute transactions directly or through a first interconnect switch to at least one second Programmable ASIC Node Controller connected to a second group of CPU modules running a single instance of an operating system.
HIGH BANDWIDTH MEMORY SYSTEM WITH DISTRIBUTED REQUEST BROADCASTING MASTERS
A system comprises a processor and a plurality of memory units. The processor is coupled to each of the plurality of memory units by a plurality of network connections. The processor includes a plurality of processing elements arranged in a two-dimensional array and a corresponding two-dimensional communication network communicatively connecting each of the plurality of processing elements to other processing elements on same axes of the two-dimensional array. Each processing element that is located along a diagonal of the two-dimensional array is configured as a request broadcasting master for a respective group of processing elements located along a same axis of the two-dimensional array.
EXECUTION ENGINE FOR EXECUTING SINGLE ASSIGNMENT PROGRAMS WITH AFFINE DEPENDENCIES
The execution engine is a new organization for a digital data processing apparatus, suitable for highly parallel execution of structured fine-grain parallel computations. The execution engine includes a memory for storing data and a domain flow program, a controller for requesting the domain flow program from the memory, and further for translating the program into programming information, a processor fabric for processing the domain flow programming information and a crossbar for sending tokens and the programming information to the processor fabric.
SYSTEMS AND METHODS FOR IMPLEMENTING A MACHINE PERCEPTION AND DENSE ALGORITHM INTEGRATED CIRCUIT AND ENABLING A FLOWING PROPAGATION OF DATA WITHIN THE INTEGRATED CIRCUIT
Systems and methods of propagating data within an integrated circuit includes: identifying a coarse data propagation path for distinct subsets of data of an input dataset that includes: setting inter-core data movements for the distinct subsets of data, the inter-core data movements defining a predetermined propagation of a given subset of data between two or more of a plurality of cores of an integrated circuit array of the integrated circuit; identifying a granular data propagation path for each distinct subset of data that includes: setting intra-core data movements for each distinct subset of data, the intra-core data movements defining a predetermined propagation of the given subset of data within one or more of the plurality of cores of the integrated circuit array of the integrated circuit; enabling a flow of the input dataset within the integrated circuit based on the coarse data propagation path and the granular propagation path.
ARCHITECTURE TO SUPPORT COLOR SCHEME-BASED SYNCHRONIZATION FOR MACHINE LEARNING
A system to support a machine learning (ML) operation comprises an array-based inference engine comprising a plurality of processing tiles each comprising at least one or more of an on-chip memory (OCM) configured to maintain data for local access by components in the processing tile and one or more processing units configured to perform one or more computation tasks on the data in the OCM by executing a set of task instructions. The system also comprises a data streaming engine configured to stream data between a memory and the OCMs and an instruction streaming engine configured to distribute said set of task instructions to the corresponding processing tiles to control their operations and to synchronize said set of task instructions to be executed by each processing tile, respectively, to wait current certain task at each processing tile to finish before starting a new one.