Patent classifications
G06F15/8015
Synchronization Amongst Processor Tiles
A processing system comprising an arrangement of tiles and an interconnect between the tiles. The interconnect comprises synchronization logic for coordinating a barrier synchronization to be performed between a group of the tiles. The instruction set comprises a synchronization instruction taking an operand which selects one of a plurality of available modes, each specifying a different membership of the group. Execution of the synchronization instruction causes a synchronization request to be transmitted from the respective tile to the synchronization logic, and instruction issue to be suspended on the respective tile pending a synchronization acknowledgment being received back from the synchronization logic. In response to receiving the synchronization request from all the tiles in the group as specified by the operand of the synchronization instruction, the synchronization logic returns the synchronization acknowledgment to the tiles in the specified group.
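The request/acknowledge handshake described in this abstract can be modelled in a few lines of Python. This is only an illustrative sketch: the mode names, the group memberships in GROUPS, and the SyncLogic class are hypothetical, since the abstract says only that each mode specifies a different group membership.

```python
from dataclasses import dataclass, field

# Hypothetical modes; the abstract does not name them or define their members.
GROUPS = {
    "ROW0": {0, 1},
    "ALL": {0, 1, 2, 3},
}

@dataclass
class SyncLogic:
    # mode -> set of tiles that have sent a request and are currently suspended
    pending: dict = field(default_factory=dict)

    def request(self, tile, mode):
        """Record a sync request and return the tiles to acknowledge.

        The returned set is empty until every tile in the selected group
        has sent its request.
        """
        waiting = self.pending.setdefault(mode, set())
        waiting.add(tile)
        if waiting == GROUPS[mode]:
            del self.pending[mode]
            return set(GROUPS[mode])
        return set()

logic = SyncLogic()
first = logic.request(0, "ROW0")   # tile 0 waits: group incomplete
second = logic.request(1, "ROW0")  # group complete: both tiles acknowledged
```

In this model a tile would resume instruction issue only once its identifier appears in a returned acknowledgment set.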
PROCESSOR MEMORY ACCESS
A computing device comprising: a plurality of ALUs; a set of registers; a memory; a memory interface between the registers and the memory; a control unit controlling the ALUs by generating: at least one cycle i including both implementing at least one first computing operation by way of an arithmetic logic unit and downloading a first dataset from the memory to at least one register; at least one cycle ii, following the at least one cycle i, including implementing a second computing operation by way of an arithmetic logic unit, for which second computing operation at least part of the first dataset forms at least one operand.
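The overlap of computation and memory loading across consecutive cycles can be sketched as a simple software pipeline. The function name run_pipeline, the block layout, and the use of sum as the computing operation are assumptions made for illustration, not details taken from the abstract.

```python
def run_pipeline(memory_blocks, op):
    """In each cycle, compute on the block loaded in the previous cycle
    while loading the next block from memory into the registers."""
    registers = None   # dataset currently held in registers
    results = []
    trace = []         # (cycle, value computed, block loaded) per cycle
    for cycle in range(len(memory_blocks) + 1):
        computed = None
        if registers is not None:
            computed = op(registers)   # operands were loaded one cycle earlier
            results.append(computed)
        loaded = memory_blocks[cycle] if cycle < len(memory_blocks) else None
        trace.append((cycle, computed, loaded))
        registers = loaded             # the load completes at the end of the cycle
    return results, trace

results, trace = run_pipeline([[1, 2], [3, 4], [5, 6]], sum)
```

Here cycle 1 plays the role of "cycle i" (it computes on the first block while loading the second), and cycle 2 plays the role of "cycle ii" (it computes with the data loaded during cycle 1).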
Arithmetic processing device, information processing apparatus, and control method of the arithmetic processing device
An arithmetic processing device includes arithmetic processing units configured to perform arithmetic processing; first routers connected to the plurality of arithmetic processing units, respectively; first buses connecting the plurality of first routers in a ring shape; and second buses connecting between one of the plurality of first routers and any one of the other first routers excluding the first routers directly connected through the first buses.
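The effect of adding second buses to a ring of routers can be illustrated by comparing hop counts. The router count, the choice of chords, and the helper names build_topology and hops are hypothetical; the abstract only requires that second buses join routers that are not already neighbours on the ring.

```python
from collections import deque

def build_topology(n, chords):
    """Ring of n routers (first buses) plus chord links (second buses)."""
    adj = {r: {(r - 1) % n, (r + 1) % n} for r in range(n)}
    for a, b in chords:
        if a == b or b in adj[a]:
            raise ValueError("second buses must join routers not already linked")
        adj[a].add(b)
        adj[b].add(a)
    return adj

def hops(adj, src, dst):
    """Breadth-first search for the minimum number of bus traversals."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("unreachable")

ring_only = build_topology(8, [])
with_chords = build_topology(8, [(0, 4), (2, 6)])
```

With these assumed chords, the route from router 0 to router 4 drops from four hops on the plain ring to one hop.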
Techniques for tracking independent hardware graphics processing unit (GPU) performance
Examples described herein generally relate to indicating resource utilization by a graphics processing unit (GPU). Data indicating a hierarchy of architectural units for executing processing threads on a GPU can be obtained. An indication of a slot assigned to a collection of threads for executing on the GPU can be received, where the slot is associated with a single instruction multiple data (SIMD) module capable of concurrently executing multiple collections of threads. An architectural unit to which the slot is assigned can be determined based on the data indicating the hierarchy of architectural units. An indication of the architectural unit as executing the collection of threads can be outputted.
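Resolving a slot to its architectural unit amounts to a lookup against the hierarchy data. The sketch below assumes a hypothetical three-level hierarchy (engine, compute unit, SIMD) and a global slot numbering in traversal order; neither is specified in the abstract.

```python
# Hypothetical hierarchy data: engine -> compute unit -> SIMD -> slot count.
HIERARCHY = {
    "engine0": {"cu0": {"simd0": 2, "simd1": 2}, "cu1": {"simd0": 2}},
    "engine1": {"cu0": {"simd0": 2}},
}

def build_slot_index(hierarchy):
    """Number slots in traversal order and record each slot's unit path."""
    index, slot = {}, 0
    for engine, cus in hierarchy.items():
        for cu, simds in cus.items():
            for simd, count in simds.items():
                for _ in range(count):
                    index[slot] = (engine, cu, simd)
                    slot += 1
    return index

def unit_for_slot(index, slot):
    """Return the architectural unit path for a slot as a printable string."""
    return "/".join(index[slot])

index = build_slot_index(HIERARCHY)
```

A profiler built along these lines could call unit_for_slot each time a collection of threads is assigned to a slot and output the result.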
Apparatuses, methods, and systems for vector processor architecture having an array of identical circuit blocks
Systems, methods, and apparatuses relating to vector processor architecture having an array of identical circuit blocks are described. In one embodiment, a processor includes a single centralized circuit comprising an instruction decoder and a controller; and a plurality of circuit slices that each comprise an arithmetic logic unit, a multiplier, a register file, a local memory, and a same plurality of logic circuits and a packed data datapath in between, wherein each circuit slice includes a physical port that provides a unique identification value that identifies a circuit slice from the other circuit slices, and the controller is to broadcast a same configuration value to the plurality of circuit slices to cause a first circuit slice to enable a first logic circuit and enable a second logic circuit of the first circuit slice based on its unique identification value and the configuration value, and cause a second circuit slice to enable a same, first logic circuit and disable a same, second logic circuit of the second circuit slice based on its unique identification value and the configuration value.
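The idea that identical slices react differently to one broadcast value can be shown with a small model. The decode rule used here (bit 0 for the first logic circuit, bit slice_id + 1 for the second) is purely an assumption; the abstract states only that the outcome depends on the unique identification value and the configuration value.

```python
from dataclasses import dataclass

@dataclass
class Slice:
    slice_id: int                 # stands in for the value read from the physical port
    first_enabled: bool = False
    second_enabled: bool = False

    def configure(self, config):
        # Hypothetical decode rule, not taken from the abstract:
        # bit 0 enables the first logic circuit on every slice;
        # bit (slice_id + 1) enables the second logic circuit on this slice only.
        self.first_enabled = bool(config & 1)
        self.second_enabled = bool((config >> (self.slice_id + 1)) & 1)

def broadcast(slices, config):
    """The controller sends the same configuration value to every slice."""
    for s in slices:
        s.configure(config)

slices = [Slice(0), Slice(1)]
broadcast(slices, 0b011)
```

After this broadcast, slice 0 has both logic circuits enabled, while slice 1 has the first enabled and the second disabled, matching the behaviour the abstract describes.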
HIGH BANDWIDTH MEMORY SYSTEM WITH DISTRIBUTED REQUEST BROADCASTING MASTERS
A system comprises a processor and a plurality of memory units. The processor is coupled to each of the plurality of memory units by a plurality of network connections. The processor includes a plurality of processing elements arranged in a two-dimensional array and a corresponding two-dimensional communication network communicatively connecting each of the plurality of processing elements to other processing elements on same axes of the two-dimensional array. Each processing element that is located along a diagonal of the two-dimensional array is configured as a request broadcasting master for a respective group of processing elements located along a same axis of the two-dimensional array.
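The assignment of diagonal elements as request broadcasting masters can be written as a simple mapping. The sketch assumes a square n x n array and that each master at (i, i) serves row i by default; the abstract says only that each group lies along a same axis.

```python
def broadcast_masters(n):
    """Map each diagonal element (i, i) to the elements of row i."""
    return {(i, i): [(i, c) for c in range(n)] for i in range(n)}

def master_for(pe, axis="row"):
    """Return the diagonal element that serves the given processing element."""
    r, c = pe
    return (r, r) if axis == "row" else (c, c)

masters = broadcast_masters(3)
```

Because every row and every column crosses the diagonal exactly once, each processing element has exactly one master per axis.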
COMPUTATIONAL MEMORY WITH COOPERATION AMONG ROWS OF PROCESSING ELEMENTS AND MEMORY THEREOF
A computing device includes an array of processing elements mutually connected to perform single instruction multiple data (SIMD) operations, memory cells connected to each processing element to store data related to the SIMD operations, and a cache connected to each processing element to cache data related to the SIMD operations. Caches of adjacent processing elements are connected. The same or another computing device includes rows of mutually connected processing elements to share data. The computing device further includes a row arithmetic logic unit (ALU) at each row of processing elements. The row ALU of a respective row is configured to perform an operation with processing elements of the respective row.
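One possible reading of the row ALU is a per-row reduction over the values held by that row's processing elements. The abstract does not say which operation the row ALU performs, so the reduction and the function name row_alu below are assumptions.

```python
import operator
from functools import reduce

def row_alu(rows, op=operator.add):
    """One row ALU per row: combine the values held by that row's
    processing elements using a binary operation."""
    return [reduce(op, row) for row in rows]

pe_values = [[1, 2, 3], [4, 5, 6]]   # one inner list per row of processing elements
row_sums = row_alu(pe_values)
row_maxes = row_alu(pe_values, max)
```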
Synchronization amongst processor tiles
A processing system comprising an arrangement of tiles and an interconnect between the tiles. The interconnect comprises synchronization logic for coordinating a barrier synchronization to be performed between a group of the tiles. The instruction set comprises a synchronization instruction taking an operand which selects one of a plurality of available modes, each specifying a different membership of the group. Execution of the synchronization instruction causes a synchronization request to be transmitted from the respective tile to the synchronization logic, and instruction issue to be suspended on the respective tile pending a synchronization acknowledgment being received back from the synchronization logic. In response to receiving the synchronization request from all the tiles in the group as specified by the operand of the synchronization instruction, the synchronization logic returns the synchronization acknowledgment to the tiles in the specified group.
Shift register with reduced wiring complexity
A shift register is described. The shift register includes a plurality of cells and register space. The shift register includes circuitry having inputs to receive shifted data and outputs to transmit shifted data, wherein: i) circuitry of cells physically located between first and second logically ordered cells are configured to not perform any logical shift; ii) circuitry of cells coupled to receive shifted data transmitted by an immediately preceding logically ordered cell comprises circuitry for writing into local register space data received at an input assigned an amount of shift specified in a shift command being executed by the shift register, and, iii) circuitry of cells coupled to transmit shifted data to an immediately following logically ordered cell comprises circuitry to transmit data from an output assigned an incremented shift amount from a shift amount of an input that the data was received on.
ASYNCHRONOUS PROCESSOR ARCHITECTURE
A data processing method for a device comprising a control unit, at least one ALU, a set of registers, a memory and a memory interface. The method comprises: a) obtaining the memory addresses of the operands; b) reading the operands from memory; c) transmitting an instruction to execute computing operations to the ALU without any addressing instruction; d) executing all of the elementary operations by way of the ALU receiving, at input, each of the operands from the registers; e) storing the data forming results of the processing operation on the registers; f) obtaining a memory address for each of the data forming a result of the processing operation; g) writing the results to memory for storage, via the memory interface, by way of the obtained memory addresses.
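Steps a) to g) can be traced in a short sketch that keeps address handling separate from the ALU. The function name run_operation, the dictionary standing in for memory, and the example addresses are all hypothetical.

```python
def run_operation(memory, operand_addrs, result_addrs, alu_op):
    """Walk through steps a) to g) with the ALU seeing only register values."""
    # a) and b): the operand addresses are obtained, then the operands are
    # read from memory into registers.
    registers = [memory[addr] for addr in operand_addrs]
    # c) and d): the ALU receives only the operands, never an address.
    results = alu_op(*registers)
    # e): the results are stored back on the registers.
    registers = list(results)
    # f) and g): a memory address is obtained for each result, then the
    # results are written to memory.
    for addr, value in zip(result_addrs, registers):
        memory[addr] = value
    return memory

memory = {0x10: 6, 0x14: 7, 0x20: 0}
run_operation(memory, [0x10, 0x14], [0x20], lambda a, b: (a * b,))
```

The point of the separation is that alu_op is called with values only, so the computing instruction carries no addressing information.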