G06F9/3881

Dynamic Granular Memory Power Gating for Hardware Accelerators

In an embodiment, a local memory that is dedicated to one or more hardware accelerators is divided into a plurality of independently powerable sections. That is, one or more of the sections may be powered on while other ones of the plurality of sections are powered off. The hardware accelerators receive instruction words from one or more central processing units (CPUs). The instruction words may include a field that specifies an amount of the memory that is used when processing the first instruction word, and the power control circuit may be configured to power a subset of the plurality of sections to provide sufficient memory for the instruction word based on the field, while one or more of the plurality of sections are powered off.

Hybrid Memory in a Dynamically Power Gated Hardware Accelerator

In an embodiment, a local memory dedicated to one or more hardware accelerators in a system may include at least two portions: a volatile portion and a non-volatile portion. Data that is reused from iteration to iteration of the hardware accelerator (e.g. constants, instruction words, etc.) may be stored in the non-volatile portion. Data that varies from iteration to iteration may be stored in the volatile portion. Both the local memory and the hardware accelerators may be powered down between iterations, saving power. The non-volatile portion need only be initialized at a first iteration, allowing the amount of time that the hardware accelerators and the local memory are powered up to be lessened for subsequent iterations since the reused data need not be reloaded in the subsequent iterations.

Vector processor storage

A method comprising: receiving, at a vector processor, a request to store data; performing, by the vector processor, one or more transforms on the data; and directly instructing, by the vector processor, one or more storage device to store the data; wherein performing one or more transforms on the data comprises: erasure encoding the data to generate n data fragments configured such that any k of the data fragments are usable to regenerate the data, where k is less than n; and wherein directly instructing one or more storage device to store the data comprises: directly instructing the one or more storage devices to store the plurality of data fragments.

Computing accelerator for processing multiple-type instruction and operation method thereof

Disclosed is a general-purpose computing accelerator which includes a memory including an instruction cache, a first executing unit performing a first computation operation, a second executing unit performing a second computation operation, an instruction fetching unit fetching an instruction stored in the instruction cache, a decoding unit that decodes the instruction, and a state control unit controlling a path of the instruction depending on an operation state of the second executing unit. The decoding unit provides the instruction to the first executing unit when the instruction is of a first type and provides the instruction to the state control unit when the instruction is of a second type. Depending on the operation state of the second executing unit, the state control unit provides the instruction of the second type to the second executing unit or stores the instruction of the second type as a register file in the memory.

Coprocessor prefetcher

A prefetcher for a coprocessor is disclosed. An apparatus includes a processor and a coprocessor that are configured to execute processor and coprocessor instructions, respectively. The processor and coprocessor instructions appear together in code sequences fetched by the processor, with the coprocessor instructions being provided to the coprocessor by the processor. The apparatus further includes a coprocessor prefetcher configured to monitor a code sequence fetched by the processor and, in response to identifying a presence of coprocessor instructions in the code sequence, capture the memory addresses, generated by the processor, of operand data for coprocessor instructions. The coprocessor is further configured to issue, for a cache memory accessible to the coprocessor, prefetches for data associated with the memory addresses prior to execution of the coprocessor instructions by the coprocessor.

ELECTRONIC APPARATUS AND METHOD FOR REDUCING NUMBER OF COMMANDS
20230205541 · 2023-06-29 · ·

An electronic apparatus and a method for reducing the number of commands are provided. The electronic apparatus includes a central processor and a co-processor. The central processor generates a plurality of original register setting commands to set at least one bit of at least one register of the co-processor. The original register setting commands include a plurality of first original register setting commands, and a plurality of setting targets of the first original register setting commands have address continuity. The central processor merges the first original register setting commands to generate at least one merged register setting command. The central processor transmits the at least one merged register setting command to the co-processor.

PERSISTENT STORAGE DEVICE MANAGEMENT

A method comprising: receiving a request to write data at a virtual location; writing the data to a physical location on a persistent storage device; and recording a mapping from the virtual location to the physical location; wherein the physical location corresponds to a next free block in a sequence of blocks on the persistent storage device.

Acceleration circuitry for posit operations

Systems, apparatuses, and methods related to acceleration circuitry for posit operations are described. Signaling indicative of performance of an operation to write a first bit string to a first buffer resident on acceleration circuitry and a second bit string resident on the acceleration circuitry can be received at an DMA controller couplable to the acceleration circuitry. The acceleration circuitry can be configured to perform arithmetic operations, logical operations, or both on bit strings formatted in a unum or posit format. Signaling indicative of an arithmetic operation, a logical operation, or both, to be performed using the first and second bit strings can be transmitted to the acceleration circuitry. The arithmetic operation, the logical operation, or both can be performed via the acceleration circuitry and according to the signaling. Signaling indicative of a result of the arithmetic operation, the logical operation, or both can be transmitting to the DMA controller.

Methods of breaking down coarse-grained tasks for fine-grained task re-scheduling

A method of scheduling instructions in a processing system comprising a processing unit and one or more co-processors comprises dispatching a plurality of instructions from a master processor to a co-processor of the one or more co-processors, wherein each instruction of the plurality of instructions comprises one or more additional fields, wherein at least one field comprises grouping information operable to consolidate the plurality of instructions for decomposition, and wherein at least one field comprises control information. The method also comprises decomposing the plurality of instructions into a plurality of fine-grained instructions, wherein the control information comprises rules associated with decomposing the plurality of instructions into the plurality of fine-grained instructions. Further, the method comprises scheduling the plurality of fine-grained instructions to execute on the co-processor, wherein the scheduling is performed in a non-sequential order.

Dynamic granular memory power gating for hardware accelerators

In an embodiment, a local memory that is dedicated to one or more hardware accelerators is divided into a plurality of independently powerable sections. That is, one or more of the sections may be powered on while other ones of the plurality of sections are powered off. The hardware accelerators receive instruction words from one or more central processing units (CPUs). The instruction words may include a field that specifies an amount of the memory that is used when processing the first instruction word, and the power control circuit may be configured to power a subset of the plurality of sections to provide sufficient memory for the instruction word based on the field, while one or more of the plurality of sections are powered off.