G06F15/78

Data input/output operations during loop execution in a reconfigurable compute fabric
11709796 · 2023-07-25

Various examples are directed to systems and methods in which a first flow controller of a first synchronous flow may receive an instruction to execute a first loop using the first synchronous flow. The first flow controller may determine a first iteration index for a first iteration of the first loop. The first flow controller may send, to a first compute element of the first synchronous flow, a first synchronous message to initiate a first synchronous flow thread for executing the first iteration of the first loop. The first synchronous message may comprise the iteration index. The first compute element may execute an input/output operation at a first location of a first compute element memory indicated by the first iteration index.
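The loop dispatch described in this abstract can be pictured with a small software model. This is only a sketch under stated assumptions: the class names `FlowController` and `ComputeElement`, the eight-word memory, and the use of a write as the input/output operation are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeElement:
    """Hypothetical compute element with a small local memory."""
    memory: list = field(default_factory=lambda: [0] * 8)

    def on_sync_message(self, iteration_index: int, value: int) -> None:
        # The iteration index carried by the message selects the
        # memory location used for the input/output operation.
        self.memory[iteration_index] = value


@dataclass
class FlowController:
    """Hypothetical flow controller that drives one loop."""
    element: ComputeElement

    def execute_loop(self, trip_count: int, data: list) -> None:
        for iteration_index in range(trip_count):
            # One message per iteration; each message carries its index.
            self.element.on_sync_message(iteration_index, data[iteration_index])


def run_loop_demo() -> list:
    element = ComputeElement()
    FlowController(element).execute_loop(4, [10, 20, 30, 40])
    return element.memory
```

In this model each iteration touches a distinct memory location because the index travels with the message, so the compute element needs no loop counter of its own.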

TECHNIQUES FOR METADATA PROCESSING
20180011708 · 2018-01-11

Techniques are described for metadata processing that can be used to encode an arbitrary number of security policies for code running on a processor. Metadata may be added to every word in the system and a metadata processing unit may be used that works in parallel with data flow to enforce an arbitrary set of policies. In one aspect, the metadata may be characterized as unbounded and software programmable to be applicable to a wide range of metadata processing policies. Techniques and policies have a wide range of uses including, for example, safety, security, and synchronization. Additionally, described are aspects and techniques in connection with metadata processing in an embodiment based on the RISC-V architecture.

USING DATA PATTERN TO MARK CACHE LINES AS INVALID

An apparatus includes a cache controller, the cache controller to receive, from a requestor, a memory access request referencing a memory address of a memory. The cache controller may identify a cache entry associated with the memory address, and responsive to determining that a first data item stored in the cache entry matches a data pattern indicating cache entry invalidity, read a second data item from a memory location identified by the memory address. The cache controller may then return, to the requestor, a response comprising the second data item.
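A minimal software sketch of the read path may help here. It assumes a sentinel value `INVALID_PATTERN`, a dictionary-backed memory, and a refill of the cache entry after the memory read; none of these details are stated in the abstract, and the sketch treats a missing entry the same as an invalid one.

```python
INVALID_PATTERN = 0xDEADBEEF  # assumed sentinel meaning "entry is invalid"


class CacheController:
    """Hypothetical model: invalidity is encoded in the data itself."""

    def __init__(self, memory: dict):
        self.memory = memory  # backing memory, address -> value
        self.cache = {}       # cache entries, address -> value

    def invalidate(self, address: int) -> None:
        # Mark the entry invalid by writing the pattern, not a valid bit.
        self.cache[address] = INVALID_PATTERN

    def read(self, address: int) -> int:
        entry = self.cache.get(address, INVALID_PATTERN)
        if entry == INVALID_PATTERN:
            # Entry matches the invalidity pattern: read from memory.
            value = self.memory[address]
            self.cache[address] = value  # refill is an assumption
            return value
        return entry
```

The sketch does not handle the case where real data happens to equal the pattern; how the apparatus deals with that is not described in the abstract.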

Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format

Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, the parallel processing operations including a ray tracing operation and a matrix multiply operation; and a second processing cluster coupled to the first processing cluster, wherein the first processing cluster includes a floating-point unit to perform floating point operations, the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format with a multiplier to multiply second and third source operands while an accumulator adds a first source operand with output from the multiplier.
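The multiply-then-accumulate step can be illustrated numerically. The sketch below assumes that the second and third source operands are rounded to BF16 (round to nearest even) and that the accumulation happens at higher precision; the function names `to_bf16` and `multiply_accumulate` are hypothetical, and NaN and overflow cases are ignored.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to bfloat16 precision and return it as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that BF16 discards.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def multiply_accumulate(src1: float, src2: float, src3: float) -> float:
    """Return src1 + src2 * src3 with BF16 multiplier inputs."""
    product = to_bf16(src2) * to_bf16(src3)  # multiplier stage
    return src1 + product                    # accumulator stage
```

For example, `multiply_accumulate(0.5, 2.0, 3.0)` gives `6.5`, since all three values are exactly representable in BF16.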

MEMORY DEVICE FOR PROCESSING OPERATION, DATA PROCESSING SYSTEM INCLUDING THE SAME, AND METHOD OF OPERATING THE MEMORY DEVICE
20230236836 · 2023-07-27

A memory device includes a memory having a memory bank, a processor in memory (PIM) circuit, and control logic. The PIM circuit includes instruction memory storing at least one instruction provided from a host. The PIM circuit is configured to process an operation using data provided by the host or data read from the memory bank and to store at least one instruction provided by the host. The control logic is configured to decode a command/address received from the host to generate a decoding result and to perform a control operation so that one of i) a memory operation on the memory bank is performed and ii) the PIM circuit performs a processing operation, based on the decoding result. A counting value of a program counter instructing a position of the instruction memory is controlled in response to the command/address instructing the processing operation be performed.

SYNCHRONIZED DATA CHAINING USING ON-CHIP CACHE
20230005095 · 2023-01-05

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating, by an image sensor of a computing device, frame data comprising sub-frames of image pixel data. A first resource of the system-on-chip provides the frame data to a second resource of the system-on-chip. The frame data is provided to the second resource using a first data path included in the system-on-chip. The first resource provides a token to the second resource using a second data path included in the system-on-chip. A processor of the system-on-chip uses the token to synchronize production of sub-frames of image pixel data provided by the first resource to the second resource and to synchronize consumption of the sub-frames of image pixel data received by the second resource from the elastic memory buffer.

Energy-aware computing system
11567561 · 2023-01-31

An energy-aware system is provided. The system includes an energy harvester adapted to supply harvested energy as an output for storage at an energy storage; and a scheduler, the scheduler being made up of, at least in part, hardware of the energy-aware system, the scheduler operable to schedule execution of operations performed by the energy-aware system, wherein the scheduler is configured to: determine if a current voltage level at the energy storage is higher than a start voltage level; and cause initiation of execution of at least a portion of one of the operations when the start voltage level of the one of the operations is lower than or equal to the current voltage level.
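The scheduling rule in this abstract reduces to a threshold comparison per operation. The following is a minimal sketch, assuming each operation carries its own start voltage level; the `Operation` type, the operation names, and the voltages are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    name: str
    start_voltage: float  # minimum storage voltage needed to start


def runnable_operations(current_voltage: float, operations: list) -> list:
    """Names of operations whose start voltage is at or below the
    current voltage level at the energy storage."""
    return [op.name for op in operations
            if op.start_voltage <= current_voltage]


# Hypothetical workload: sensing is cheap, transmitting is expensive.
OPERATIONS = [Operation("sense", 1.8), Operation("transmit", 3.0)]
```

With these values, `runnable_operations(2.5, OPERATIONS)` returns `["sense"]`, and both operations become eligible once the storage reaches 3.0.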

Control registers to store thread identifiers for threaded loop execution in a self-scheduling reconfigurable computing fabric
11567766 · 2023-01-31

Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an interconnection network; a processor; and a plurality of configurable circuit clusters. Each configurable circuit cluster includes a plurality of configurable circuits arranged in an array; a synchronous network coupled to each configurable circuit of the array; and an asynchronous packet network coupled to each configurable circuit of the array. A representative configurable circuit includes a configurable computation circuit and a configuration memory having a first, instruction memory storing a plurality of data path configuration instructions to configure a data path of the configurable computation circuit; and a second, instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices for selection of a master synchronous input, a current data path configuration instruction, and a next data path configuration instruction for a next configurable computation circuit.

POWER EFFICIENT MEMORY VALUE UPDATES FOR ARM ARCHITECTURES

Disclosed are various examples of providing power efficient waiting for detection of memory value updates for Advanced RISC Machines (ARM) architectures. An ARM processor component instructs a memory agent to perform a processing action, and executes a waiting function. The waiting function ensures that the processing action is completed by the memory agent. The waiting function performs an exclusive load at a memory location, and a wait for event (WFE) instruction that causes the ARM processor component to wait in a low-power mode for an event register to be set. Once the event register is set, the waiting function completes and a second processing action is executed by the ARM processor component.
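The control flow of the waiting function can be mimicked in software. The sketch below is an analogy only, not ARM code: a `threading.Event` stands in for the event register, a plain dictionary read stands in for the exclusive load, and a thread plays the memory agent. All names are hypothetical.

```python
import threading


def wait_for_update(memory: dict, key: str, expected: int,
                    event: threading.Event) -> None:
    """Block until memory[key] equals expected, sleeping between checks."""
    while memory[key] != expected:  # stand-in for the exclusive load
        event.wait()                # stand-in for WFE: sleep until set
        event.clear()               # re-arm before checking again


def demo() -> int:
    memory = {"flag": 0}
    event = threading.Event()

    def agent() -> None:
        memory["flag"] = 1  # first processing action completes
        event.set()         # stand-in for setting the event register

    worker = threading.Thread(target=agent)
    worker.start()
    wait_for_update(memory, "flag", 1, event)
    worker.join()
    return memory["flag"]   # a second processing action could run here
```

The waiter rechecks the value after every wake-up, so a wake-up that arrives before the update it is waiting for does not end the wait early.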

Compiler flow logic for reconfigurable architectures

The technology disclosed partitions a dataflow graph of a high-level program into memory allocations and execution fragments. The memory allocations represent creation of logical memory spaces in on-processor and/or off-processor memories for data required to implement the dataflow graph. The execution fragments represent operations on the data. The technology disclosed designates the memory allocations to virtual memory units and the execution fragments to virtual compute units. The technology disclosed partitions the execution fragments into memory fragments and compute fragments, and assigns the memory fragments to the virtual memory units and the compute fragments to the virtual compute units. The technology disclosed then allocates the virtual memory units to physical memory units and the virtual compute units to physical compute units. It then places the physical memory units and the physical compute units onto positions in the array of configurable units and routes data and control networks between the placed positions.