G06F8/4451

Reducing stalling in a simultaneous multithreading processor by inserting thread switches for instructions likely to stall

A technique suppresses stalling caused by data dependencies other than register dependencies in an out-of-order processor. A stall-reducing program includes a handler that uses a performance monitoring unit to detect a stall occurring during execution of execution code and identifies, from the recorded dependencies, a second instruction on which a stalled first instruction is data dependent; a profiler that registers the second instruction as profile information; and an optimization module that inserts a thread yield instruction at the appropriate position in the execution code or the original code file based on the profile information, and outputs the optimized execution code.
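The insertion step described above can be sketched as a simple profile-guided rewrite. This is a hypothetical illustration, not the patented implementation: the instruction encoding, the `YIELD` mnemonic, and the use of instruction addresses as profile keys are all assumptions.

```python
def insert_thread_yields(code, stalling_addrs):
    """Return a new instruction list with a thread yield inserted
    immediately before every instruction whose address appears in
    the stall profile (hypothetical representation)."""
    optimized = []
    for addr, insn in code:
        if addr in stalling_addrs:
            # Switch to another hardware thread instead of stalling here.
            optimized.append((addr, "YIELD"))
        optimized.append((addr, insn))
    return optimized

code = [(0x10, "LOAD r1, [r2]"), (0x14, "ADD r3, r1, r4")]
profile = {0x14}  # the handler observed the ADD stalling on the LOAD's result
optimized = insert_thread_yields(code, profile)
```

In a real system the profile keys would come from the performance monitoring unit's sampled stall events rather than a hand-written set.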

Sharding for synchronous processors
12147793 · 2024-11-19 ·

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for sharding dataflow graphs for a device having multiple synchronous tiles. One of the methods includes receiving a representation of a dataflow graph comprising a plurality of nodes that each represent respective matrix operations to be performed by a device having a plurality of synchronous tiles. Candidate allocations of respective portions of the dataflow graph to each tile of the plurality of synchronous tiles are evaluated according to one or more resource constraints of the device. One of the candidate allocations is selected based on evaluating each candidate allocation.
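The evaluate-then-select loop described above can be sketched with a brute-force enumeration. This is an assumption-laden toy, not the claimed method: per-node memory costs, a per-tile memory budget as the sole resource constraint, and minimum peak load as the selection criterion are all stand-ins.

```python
from itertools import product

def evaluate_allocations(node_costs, num_tiles, tile_memory):
    """Enumerate candidate node->tile allocations, keep those that satisfy
    each tile's memory budget (the resource constraint), and select the one
    with the lowest maximum per-tile load (hypothetical selection metric)."""
    nodes = list(node_costs)
    feasible = []
    for assignment in product(range(num_tiles), repeat=len(nodes)):
        load = [0] * num_tiles
        for node, tile in zip(nodes, assignment):
            load[tile] += node_costs[node]
        if all(l <= tile_memory for l in load):
            feasible.append((max(load), dict(zip(nodes, assignment))))
    if not feasible:
        return None
    return min(feasible, key=lambda t: t[0])[1]

# Three matrix-op nodes, two tiles, 10 units of memory per tile.
alloc = evaluate_allocations({"A": 6, "B": 5, "C": 4}, 2, 10)
```

A production sharder would prune the exponential candidate space with heuristics or search, but the evaluate-under-constraints-then-select structure is the same.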

ELECTRONIC APPARATUS, PROCESSOR AND CONTROL METHOD THEREOF
20180088954 · 2018-03-29 ·

An electronic apparatus is provided for obtaining compiling data used in an external processor including a function unit with a plurality of input ports. The electronic apparatus includes a storage configured to store a plurality of instructions, and a processor configured to schedule each of the plurality of instructions in a plurality of cycles, assign a plurality of input data corresponding to the plurality of instructions to the plurality of input ports in a corresponding cycle, and, if an unassigned input port among the plurality of input ports is present in a first cycle, assign a part of the input data corresponding to an instruction scheduled in a second cycle after the first cycle to the unassigned input port in the first cycle, and obtain the compiling data by assigning the remaining data of the input data corresponding to the instruction to one of the plurality of input ports in the second cycle.
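The port-filling idea above can be sketched as a greedy pass that hoists input data from a later cycle into ports left unassigned in an earlier cycle. This is a simplified assumption-based illustration: it flattens each cycle's instruction inputs into one list and ignores hardware details such as operand routing.

```python
def assign_ports(schedule, num_ports):
    """schedule[c] is the flat list of input data for instructions
    scheduled in cycle c. Fill each cycle's input ports in order; when a
    cycle has a free port, pull part of a later instruction's input data
    forward into it, leaving the remainder in its original cycle."""
    # Queue all data in cycle order; earlier free ports drain it first.
    queue = [d for data in schedule for d in data]
    assigned = [[] for _ in schedule]
    i = 0
    for c in range(len(schedule)):
        while len(assigned[c]) < num_ports and i < len(queue):
            assigned[c].append(queue[i])
            i += 1
    return assigned

# Cycle 0 has one input for a 2-port unit, so one of cycle 1's three
# inputs is hoisted into cycle 0's unassigned port.
ports = assign_ports([["a"], ["b", "c", "d"]], num_ports=2)
```

Here `ports` places `"b"` in the first cycle's spare port and leaves `"c"` and `"d"` in the second cycle, mirroring the split assignment the abstract describes.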

SHARDING FOR SYNCHRONOUS PROCESSORS
20250045032 · 2025-02-06 ·

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for sharding dataflow graphs for a device having multiple synchronous tiles. One of the methods includes receiving a representation of a dataflow graph comprising a plurality of nodes that each represent respective matrix operations to be performed by a device having a plurality of synchronous tiles. Candidate allocations of respective portions of the dataflow graph to each tile of the plurality of synchronous tiles are evaluated according to one or more resource constraints of the device. One of the candidate allocations is selected based on evaluating each candidate allocation.

Schedulers with load-store queue awareness

In one embodiment, a computer-implemented method includes tracking a size of a load-store queue (LSQ) during compile time of a program. The size of the LSQ is time-varying and indicates how many memory access instructions of the program are on the LSQ. The method further includes scheduling, by a computer processor, a plurality of memory access instructions of the program based on the size of the LSQ.
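The time-varying LSQ tracking described above can be sketched as a compile-time model that the scheduler consults before placing each memory access. This is a hedged toy, not the patented scheduler: the greedy in-order placement, fixed per-op latencies, and capacity value are all assumptions.

```python
def schedule_with_lsq(instructions, lsq_capacity):
    """Greedy scheduler sketch: each memory access occupies a modeled
    load-store queue entry until its retire cycle; issuing a memory op
    is deferred while the modeled LSQ is full at the current cycle."""
    lsq = []          # retire cycles of in-flight memory accesses
    cycle = 0
    placement = {}    # instruction -> issue cycle
    for insn, is_mem, latency in instructions:
        if is_mem:
            lsq = [r for r in lsq if r > cycle]      # drain retired entries
            while len(lsq) >= lsq_capacity:          # LSQ full: defer issue
                cycle += 1
                lsq = [r for r in lsq if r > cycle]
            lsq.append(cycle + latency)
        placement[insn] = cycle
        cycle += 1
    return placement

# With a single-entry LSQ, the second load is pushed back until the
# first load's modeled entry retires.
plan = schedule_with_lsq(
    [("ld1", True, 3), ("ld2", True, 3), ("add", False, 1)],
    lsq_capacity=1,
)
```

The point mirrored from the abstract is that the LSQ size is a time-varying quantity the scheduler tracks, so memory access instructions are placed where the queue has room rather than purely by data dependences.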
