G06F9/3869

QED SHIFTER FOR A MEMORY DEVICE
20230065930 · 2023-03-02 ·

A memory device includes a command interface configured to receive a command from a host device. The memory device also includes a command shifter configured to receive the command. The command shifter includes a plurality of stages coupled in series and configured to delay the command. The command shifter comprises selection circuitry configured to receive the command and to select an insertion stage of the plurality of stages for the command. The selection circuitry is configured to select the insertion stage as a location to insert the command. The selected insertion stage is selected to control a duration of delay in the command shifter. The selection of the insertion stage is based at least in part on a path delay between a clock and a data pin of the memory device.

VARIABLE PIPELINE LENGTH IN A BARREL-MULTITHREADED PROCESSOR
20230106087 · 2023-04-06 ·

Devices and techniques for variable pipeline length in a barrel-multithreaded processor are described herein. A completion time for an instruction can be determined prior to insertion into a pipeline of a processor. A conflict between the instruction and a different instruction based on the completion time can be detected. Here, the different instruction is already in the pipeline and the conflict detected when the completion time equals the previously determined completion time for the different instruction. A difference between the completion time and an unconflicted completion time can then be calculated and completion of the instruction delayed by the difference.

Multiplier-accumulator circuitry having processing pipelines and methods of operating same

An integrated circuit including memory to store image data and filter weights, and a plurality of multiply-accumulator execution pipelines, each multiply-accumulator execution pipeline coupled to the memory to receive (i) image data and (ii) filter weights, wherein each multiply-accumulator execution pipeline processes the image data, using associated filter weights, via a plurality of multiply and accumulate operations. In one embodiment, the multiply-accumulator circuitry of each multiply-accumulator execution pipeline, in operation, receives a different set of image data, each set including a plurality of image data, and, using filter weights associated with the received set of image data, processes the set of image data associated therewith, via performing a plurality of multiply and accumulate operations concurrently with the multiply-accumulator circuitry of the other multiply-accumulator execution pipelines, to generate output data. Each set of image data includes all of the image that correlates to the output data generated therefrom.

Fast Recovery for Dual Core Lock Step
20230143422 · 2023-05-11 · ·

An exemplary fault-tolerant computing system comprises a secondary processor configured to execute in delayed lock step with a primary processor from a common program store, comparators in the store data and writeback paths to detect a fault based on comparing primary and secondary processor states, and a writeback path delay permitting aborting execution when a fault is detected, before writeback of invalid data. The secondary processor execution and the primary processor store data and writeback may be delayed a predetermined number of cycles, permitting fault detection before writing invalid data. Store data and writeback paths may include triple module redundancy configured to pass only majority data through the store data and writeback path delay stages. Some implementations may forward data from the store data path delay stages to the writeback stage or memory if the load data address matches the address of data in a store data path delay stage.

Handling exceptions in a multi-tile processing arrangement

A multitile processing system has an execution unit on each tile, and an interconnect which conducts communications between the tiles according to a bulk synchronous parallel scheme. Each tile performs an on-tile compute phase followed by an intertile exchange phase, where the exchange phase is held back until all tiles in a particular group have completed the compute phase. On completion of the compute phase, each tile generates a synchronisation request and pauses an issue of instructions until it receives a synchronisation acknowledgement. If a tile attains an excepted state, it raises an exception signal and pauses instruction issue until the excepted state has been resolved. However, tiles which are not in the excepted state can continue to perform their on-tile computer phase, and will issue their own synchronisation request in their own normal time frame. Synchronisation acknowledgements will not be received from all of the tiles in the group until the excepted state has been resolved on the tile with the excepted state.

Memory interface management
11650925 · 2023-05-16 · ·

A method includes receiving a signal at a memory sub-system controller to perform an operation. The method can further include, in response to receiving the signal, enabling, by the memory sub-system controller, an interface to transfer data to or from a registering clock driver (RCD) component. The RCD component is coupled to the memory sub-system controller. The method can further include transferring the data to or from the RCD component via the interface. The method can further include, in response to the enablement of the interface being unsuccessful, transferring control of a memory device to the memory sub-system controller.

Wavefront selection and execution
11656877 · 2023-05-23 · ·

Techniques are provided for executing wavefronts. The techniques include at a first time for issuing instructions for execution, performing first identifying, including identifying that sufficient processing resources exist to execute a first set of instructions together within a processing lane; in response to the first identifying, executing the first set of instructions together; at a second time for issuing instructions for execution, performing second identifying, including identifying that no instructions are available for which sufficient processing resources exist for execution together within the processing lane; and in response to the second identifying, executing an instruction independently of any other instruction.

SYSTEM AND METHODS FOR TAG-BASED SYNCHRONIZATION OF TASKS FOR MACHINE LEARNING OPERATIONS
20230205540 · 2023-06-29 ·

A new approach for supporting tag-based synchronization among different tasks of a machine learning (ML) operation. When a first task tagged with a set tag indicating that one or more subsequent tasks need to be synchronized with it is received at an instruction streaming engine, the engine saves the set tag in a tag table and transmits instructions of the first task to a set of processing tiles for execution. When a second task having an instruction sync tag indicating that it needs to be synchronized with one or more prior tasks is received at the engine, the engine matches the instruction sync tag with the set tags in the tag table to identify prior tasks that the second task depends on. The engine holds instructions of the second task until these matching prior tasks have been completed and then releases the instructions to the processing tiles for execution.

Iteration Synchronization Construct for Parallel Pipelines
20170371675 · 2017-12-28 ·

Embodiments include computing devices, apparatus, and methods implemented by the apparatus for implementing an iteration synchronization construct (ISC) for a parallel pipeline. The apparatus may initialize a first instance of the ISC for a first stage iteration of a first parallel stage of the parallel pipeline and a second instance of the ISC for a second stage iteration of the first parallel stage of the parallel pipeline. The apparatus may determine whether an execution control value is specified for the first stage iteration, and add a first execution control edge to the parallel pipeline after determining that an execution control value is specified for the first stage iteration. The apparatus may determine whether execution of the first stage iteration is complete and send a ready signal from the first instance of the ISC to the second instance if the ISC after determining that execution of the first stage iteration completed.

Arithmetic processing device, arithmetic processing system, and method for controlling arithmetic processing device
09846580 · 2017-12-19 · ·

An arithmetic processing device includes: a arithmetic cores, wherein the arithmetic core comprises: an instruction controller configured to request processing corresponding to an instruction; a memory configured to store lock information indicating that a locking target address is locked, the locking target address, and priority information of the instruction; and a cache controller configured to, when storing data of a first address in a cache memory to execute a first instruction including locking of the first address from the instruction controller, suppress updating of the memory if the lock information is stored in the memory and a priority of the priority information is higher than a first priority of the first instruction.