Patent classifications
G06F9/3885
Hierarchical general register file (GRF) for execution block
In an example, an apparatus comprises a plurality of execution units, and a first general register file (GRF) communicatively coupled to the plurality of execution units, wherein the first GRF is shared by the plurality of execution units. Other embodiments are also disclosed and claimed.
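The sharing arrangement the abstract describes can be sketched in software as a single register file referenced by every execution unit. This is an illustrative model only; the class and method names are assumptions, not the patent's design.

```python
# Minimal sketch (not from the patent) of a general register file (GRF)
# shared by several execution units; names are illustrative.

class SharedGRF:
    """A register file whose entries are visible to every attached unit."""
    def __init__(self, num_regs):
        self.regs = [0] * num_regs

class ExecutionUnit:
    def __init__(self, grf):
        self.grf = grf          # all units reference the same GRF instance

    def write(self, idx, value):
        self.grf.regs[idx] = value

    def read(self, idx):
        return self.grf.regs[idx]

grf = SharedGRF(16)
eus = [ExecutionUnit(grf) for _ in range(4)]
eus[0].write(3, 42)
print(eus[2].read(3))  # a write by one unit is visible to the others → 42
```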
Synchronization amongst processor tiles
A processing system comprising an arrangement of tiles and an interconnect between the tiles. The interconnect comprises synchronization logic for coordinating a barrier synchronization to be performed between a group of the tiles. The instruction set comprises a synchronization instruction taking an operand which selects one of a plurality of available modes, each specifying a different membership of the group. Execution of the synchronization instruction causes a synchronization request to be transmitted from the respective tile to the synchronization logic, and instruction issue to be suspended on the respective tile pending a synchronization acknowledgement being received back from the synchronization logic. In response to receiving the synchronization request from all the tiles in the group as specified by the operand of the synchronization instruction, the synchronization logic returns the synchronization acknowledgement to the tiles in the specified group.
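The request/acknowledge barrier described above can be sketched in software, with Python threads standing in for tiles and a condition variable standing in for the interconnect's synchronization logic. This is an illustrative model under assumed names, not the patent's hardware.

```python
# Illustrative sketch (not the patent's hardware) of the request/acknowledge
# barrier the abstract describes, using Python threads as stand-in "tiles".

import threading

class SyncLogic:
    """Collects sync requests and releases an acknowledgement only once
    every tile in the selected group has requested."""
    def __init__(self, group_size):
        self.cond = threading.Condition()
        self.group_size = group_size
        self.pending = 0
        self.generation = 0

    def sync(self):
        with self.cond:
            gen = self.generation
            self.pending += 1
            if self.pending == self.group_size:
                self.pending = 0
                self.generation += 1
                self.cond.notify_all()         # the "ack" to every tile
            else:
                while gen == self.generation:  # issue suspended until ack
                    self.cond.wait()

logic = SyncLogic(group_size=4)
order = []

def tile(tid):
    logic.sync()           # SYNC instruction: send request, wait for ack
    order.append(tid)

threads = [threading.Thread(target=tile, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(order))  # all four tiles passed the barrier → [0, 1, 2, 3]
```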
MACHINE LEARNING SPARSE COMPUTATION MECHANISM
Techniques to improve performance of matrix multiply operations are described in which a compute kernel can specify one or more element-wise operations to perform on output of the compute kernel before the output is transferred to higher levels of a processor memory hierarchy.
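The fusion idea above can be illustrated with a scalar matrix multiply that applies an element-wise operation to each output value before it is stored. The function signature is an assumption for illustration; it is not the patent's kernel interface.

```python
# Sketch (illustrative, not the patent's kernel interface) of fusing an
# element-wise operation into a matrix multiply so it is applied to the
# output before the result moves up the memory hierarchy.

def matmul_fused(a, b, elementwise=None):
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            # apply the fused op while the value is still local
            out[i][j] = elementwise(acc) if elementwise else acc
    return out

relu = lambda x: max(0.0, x)
res = matmul_fused([[1.0, -2.0]], [[1.0], [1.0]], elementwise=relu)
print(res)  # [[0.0]] — ReLU clamps the negative product sum
```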
Pipeline including separate hardware data paths for different instruction types
A first processing element is implemented in a stage of a pipeline and configured to execute an instruction. A first array of multiplexers is to provide information associated with the instruction to the first processing element in response to the instruction being in a first set of instructions. A second array of multiplexers is to provide information associated with the instruction to the first processing element in response to the instruction being in a second set of instructions. A control unit is to gate at least one of power or a clock signal provided to the first array of multiplexers in response to the instruction being in the second set.
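The steering-and-gating behavior above can be modeled in a few lines: the instruction's class selects one multiplexer array, and the control unit gates the unused one. This is a toy behavioral model with assumed names, not RTL from the patent.

```python
# Toy model (assumed structure, not the patent's RTL) of routing an
# instruction through one of two multiplexer arrays while gating the other.

SET_A = {"add", "sub"}       # first set of instructions (illustrative)
SET_B = {"load", "store"}    # second set (illustrative)

class MuxArray:
    def __init__(self, name):
        self.name = name
        self.powered = True

    def select(self, operands):
        assert self.powered, f"{self.name} is gated off"
        return operands

def dispatch(instr, operands, mux_a, mux_b):
    if instr in SET_A:
        mux_b.powered = False    # control unit gates the idle array
        mux_a.powered = True
        return mux_a.select(operands)
    if instr in SET_B:
        mux_a.powered = False
        mux_b.powered = True
        return mux_b.select(operands)
    raise ValueError(instr)

a, b = MuxArray("A"), MuxArray("B")
dispatch("add", (1, 2), a, b)
print(a.powered, b.powered)  # → True False
```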
Application And Integration Of A GPU Server System
A graphics processing unit (GPU) server having a GPU host head with one or more host graphics processing units (GPUs). The GPU server further has a GPU system with a plurality of system GPUs that are separate from the host GPUs, and that are configured to rapidly accelerate creation of images for output to a display device. The GPU server also has a mounting assembly that integrates the GPU host head and the GPU system into a single GPU server unit. The GPU host head is independently movable relative to the GPU system.
Parallel training of machine learning models
Parallel training of a machine learning model on a computerized system is described. Computing tasks of a system can be assigned to multiple workers of the system. Training data can be accessed. The machine learning model is trained, whereby the training data accessed are dynamically partitioned across the workers of the system by shuffling subsets of the training data through the workers. As a result, different subsets of the training data are used by the workers over time as training proceeds. Related computerized systems and computer program products are also provided.
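The dynamic-partitioning idea above can be sketched as an epoch-dependent shuffle that deals the shuffled data out to workers, so each worker's subset changes over time. The function and its parameters are illustrative assumptions, not the system's API.

```python
# Sketch (illustrative) of dynamically re-partitioning training data across
# workers each epoch by shuffling, so different subsets reach each worker
# over time, as the abstract describes.

import random

def partitions(data, num_workers, epoch, seed=0):
    rng = random.Random(seed + epoch)      # epoch-dependent shuffle
    idx = list(range(len(data)))
    rng.shuffle(idx)
    # deal shuffled items round-robin to the workers
    return [[data[i] for i in idx[w::num_workers]] for w in range(num_workers)]

data = list(range(12))
epoch0 = partitions(data, num_workers=3, epoch=0)
epoch1 = partitions(data, num_workers=3, epoch=1)
print(epoch0[0], epoch1[0])  # worker 0's subset differs between epochs
```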
Data Processing Method and Interaction System
A data processing method applied to a programmable chip includes a logic classification unit (LCU) that obtains, based on first data received through a data bus, at least first target classified data (TCD) and second TCD, and sends the first and second TCD to a corresponding first arithmetic logic unit (ALU) and a corresponding second ALU based on a preset mapping relationship. The LCU classifies target execution information obtained through preprocessing an entry by a ternary content addressable memory (TCAM) and service data, so that an instruction memory determines first and second information and sends the first and second information to the corresponding first and second ALUs. The first and second ALUs respectively send, through the data bus, data obtained by performing calculations based on the first TCD and the first information, and data obtained by performing calculations based on the second TCD and the second information.
UNIFIED AUTOMATION OF APPLICATION DEVELOPMENT
Unified automation of application development and delivery is provided. An automation pipeline execution coordinator may define a pipeline specification that includes actions to be performed, a triggering event definition, and a specification for determining execution context. The coordinator may concurrently detect triggering events for multiple pipelines matching the pipeline specification and, responsive to the detecting, determine execution contexts for the pipelines. The coordinator may then execute the multiple pipelines, where execution may proceed independently for pipelines with differing execution contexts. For pipelines sharing an execution context, execution of various actions of the respective pipelines may be coordinated. Execution context may be determined according to the specification for determining execution context, which may include an overridable default specification that determines context by locations of source data related to the triggering event. Pipeline specifications may be defined using pipeline specification templates and input from users obtained via various user interfaces.
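The context-grouping logic above can be sketched as follows: triggered pipelines are bucketed by an execution-context function, which defaults to the source-data location but can be overridden. All names here are illustrative assumptions, not the product's API.

```python
# Rough sketch (names are assumptions, not the patent's API) of grouping
# triggered pipelines by execution context: pipelines with differing contexts
# run independently, while those sharing a context are coordinated together.

from collections import defaultdict

def default_context(event):
    # overridable default: context derived from the source-data location
    return event["source_location"]

def plan_execution(events, context_fn=default_context):
    groups = defaultdict(list)
    for ev in events:
        groups[context_fn(ev)].append(ev["pipeline"])
    return dict(groups)

events = [
    {"pipeline": "build", "source_location": "repo-a"},
    {"pipeline": "test",  "source_location": "repo-a"},
    {"pipeline": "docs",  "source_location": "repo-b"},
]
print(plan_execution(events))
# {'repo-a': ['build', 'test'], 'repo-b': ['docs']}
```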
Networked computer
A computer comprising a plurality of processing nodes is provided. Each processing node has at least one processor configured to process input data to generate an array of data items. The processing nodes are arranged in cliques in which each processing node of a clique is connected to each other processing node in the clique by first and second clique links. The cliques are inter-connected in rings such that each processing node is a member of a single clique and a single ring. The processing nodes of all cliques are configured to exchange, in each exchange step of a machine learning collective, at least two data items with the other processing node(s) in their clique via the respective first and second clique links, and all processing nodes are configured to reduce each received data item with the data item in the corresponding position in the array on that processing node.
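One exchange-and-reduce step of the collective described above can be sketched for a two-node clique: each node sends its items to its peer and reduces each received item into the matching array position. This is a simplified illustration, not the patent's topology logic.

```python
# Simplified sketch (not the patent's topology logic) of one exchange step:
# nodes in a clique exchange at least two data items and reduce each received
# item into the corresponding position of their own arrays.

def exchange_step(arrays, positions, reduce_op=lambda x, y: x + y):
    """All-pairs exchange within a two-node clique for the given positions."""
    a, b = arrays
    for p in positions:
        # both right-hand values use the pre-exchange contents
        a[p], b[p] = reduce_op(a[p], b[p]), reduce_op(b[p], a[p])
    return arrays

node0, node1 = [1, 2, 3, 4], [10, 20, 30, 40]
exchange_step([node0, node1], positions=[0, 1])  # two items per step
print(node0, node1)  # → [11, 22, 3, 4] [11, 22, 30, 40]
```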
APPARATUS AND METHODS EMPLOYING A SHARED READ PORT REGISTER FILE
In some implementations, a processor includes a plurality of parallel instruction pipes and a register file that includes at least one shared read port configured to be shared across multiple pipes of the plurality of parallel instruction pipes. Control logic controls multiple parallel instruction pipes to read from the at least one shared read port. In certain examples, the at least one shared register file read port is coupled as a single read port for one of the parallel instruction pipes and as a shared register file read port for a plurality of other parallel instruction pipes.
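The shared-port behavior above can be modeled as control logic granting a single read port to at most one requesting pipe at a time. The arbitration policy and names here are illustrative assumptions, not the patent's design.

```python
# Toy model (assumed behavior, not the patent's design) of a register-file
# read port shared across several pipes, with control logic arbitrating access.

class SharedReadPort:
    def __init__(self, regfile):
        self.regfile = regfile

    def read(self, reg):
        return self.regfile[reg]

class ControlLogic:
    """Grants the shared port to at most one requesting pipe per cycle."""
    def arbitrate(self, requests):
        # requests: {pipe_id: reg_index}; lowest pipe id wins (illustrative)
        if not requests:
            return None, None
        pipe = min(requests)
        return pipe, requests[pipe]

regfile = {0: 7, 1: 13, 2: 99}
port = SharedReadPort(regfile)
ctrl = ControlLogic()
pipe, reg = ctrl.arbitrate({2: 1, 5: 0})
print(pipe, port.read(reg))  # pipe 2 wins and reads register 1 → 2 13
```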