Multiported parity scoreboard circuit
11455171 · 2022-09-27

A fast and frugal item-state tracking scoreboard circuit is disclosed. The scoreboard maintains per-item partial states across multiple memory circuits, enabling multiple lookups per clock cycle and multiple state updates per clock cycle. In an embodiment a scoreboard is used to schedule instructions in an out-of-order processor. Each clock cycle the scoreboard indicates the busy state of an instruction's registers and may update the busy state of the destination registers of issuing instructions and completing instructions. Applications include register tracking, function-unit tracking, and cache-line state tracking, in embodiments including processor cores (including superscalar, superpipelined, and multithreaded processors), accelerators, memory systems, and networks. In an embodiment, a register-busy scoreboard circuit is implemented using FPGA LUT RAM memory. In an embodiment, a three-read/two-write per cycle register file scoreboard of 64 registers uses 16 LUTs and indicates whether an instruction is issuable in two LUT delays.
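
As a rough software model of the lookup logic, the sketch below tracks busy state as a single 64-bit vector with three lookups and two updates per cycle, matching the 64-register, three-read/two-write example above; the flat bit-vector organization and the function names are illustrative and simpler than the patent's banked partial-state memories.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative 64-entry register-busy scoreboard, one bit per register. */
    typedef struct { uint64_t busy; } scoreboard_t;

    /* Three lookups per cycle: an instruction is issuable only if none of
       its two source registers or its destination register is busy. */
    static bool issuable(const scoreboard_t *sb,
                         unsigned r1, unsigned r2, unsigned rd) {
        uint64_t mask = (1ULL << r1) | (1ULL << r2) | (1ULL << rd);
        return (sb->busy & mask) == 0;
    }

    /* Two updates per cycle: mark the destination of an issuing
       instruction busy and clear that of a completing instruction. */
    static void update(scoreboard_t *sb,
                       unsigned rd_issue, unsigned rd_complete) {
        sb->busy |= (1ULL << rd_issue);
        sb->busy &= ~(1ULL << rd_complete);
    }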

Page modification encoding and caching

Modifying a page stored in non-volatile storage includes receiving one or more requests to modify data stored in the page with new data. One or more lines are identified in the page that include data to be modified by the one or more requests; each identified line corresponds to a byte range of a predetermined size in the page. Encoded data is created based on the new data and the respective locations of the identified lines in the page. The encoded data is cached, and at least a portion of the cached encoded data is used to rewrite the page in the non-volatile storage to include at least a portion of the new data.
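
A minimal sketch of the encoding step, assuming 64-byte lines in a 4 KiB page and line-aligned modification requests; the record layout (line index plus new bytes) and all names are illustrative, not the patent's actual format.

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define LINE_SIZE 64                 /* predetermined byte-range size */
    #define NUM_LINES (PAGE_SIZE / LINE_SIZE)

    /* One illustrative encoded record: which line changed, plus its new bytes. */
    typedef struct {
        uint8_t line_index;              /* location of the line in the page */
        uint8_t data[LINE_SIZE];         /* new data for that line */
    } encoded_line_t;

    /* Encode the lines touched by a modification request. The caller caches
       the records and later replays them to rewrite the page. Assumes offset
       and len are multiples of LINE_SIZE for brevity; a real implementation
       would merge partial lines with the old page contents. */
    static size_t encode_modified_lines(const uint8_t *new_data, size_t offset,
                                        size_t len, encoded_line_t *out) {
        size_t n = 0;
        for (size_t i = offset / LINE_SIZE; i < (offset + len) / LINE_SIZE; i++, n++) {
            out[n].line_index = (uint8_t)i;
            memcpy(out[n].data, new_data + n * LINE_SIZE, LINE_SIZE);
        }
        return n;
    }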

QUALITY OF SERVICE DIRTY LINE TRACKING
20210390057 · 2021-12-16

Systems, apparatuses, and methods for generating a measurement of write memory bandwidth are disclosed. A control unit monitors writes to a cache hierarchy. If a write to a cache line is a first time that the cache line is being modified since entering the cache hierarchy, then the control unit increments a write memory bandwidth counter. Otherwise, if the write is to a cache line that has already been modified since entering the cache hierarchy, then the write memory bandwidth counter is not incremented. The first write to a cache line is a proxy for write memory bandwidth since this will eventually cause a write to memory. The control unit uses the value of the write memory bandwidth counter to generate a measurement of the write memory bandwidth. Also, the control unit can maintain multiple counters for different thread classes to calculate the write memory bandwidth per thread class.
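
A minimal sketch of the counting rule, assuming a per-line modified-since-fill bit and one counter per thread class; the structures, sizes, and names are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_LINES   1024
    #define NUM_CLASSES 4

    static bool     modified_since_fill[NUM_LINES]; /* cleared on line fill */
    static uint64_t write_bw_counter[NUM_CLASSES];  /* first-writes per class */

    /* Called on a line fill: the line enters the hierarchy unmodified. */
    static void on_fill(unsigned line) { modified_since_fill[line] = false; }

    /* Called on every write hit: only the first modification since the line
       entered the hierarchy is counted, because only that event corresponds
       to an eventual writeback to memory. */
    static void on_write(unsigned line, unsigned thread_class) {
        if (!modified_since_fill[line]) {
            modified_since_fill[line] = true;
            write_bw_counter[thread_class]++;
        }
    }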

APPLICATION PROGRAMMING INTERFACE FOR FINE GRAINED LOW LATENCY DECOMPRESSION WITHIN PROCESSOR CORE

Methods and apparatus relating to an Application Programming Interface (API) for fine-grained, low-latency decompression within a processor core are described. In an embodiment, a decompression API receives an input handle to a data object. The data object includes compressed data and metadata. Decompression Engine (DE) circuitry decompresses the compressed data to generate uncompressed data. The DE circuitry decompresses the compressed data in response to invocation of a decompression instruction by the decompression API. The metadata comprises a first operand to indicate a location of the compressed data, a second operand to indicate a size of the compressed data, a third operand to indicate a location at which the data decompressed by the DE circuitry is to be stored, and a fourth operand to indicate a size of the decompressed data. Other embodiments are also disclosed and claimed.
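
A hypothetical sketch of what the four-operand metadata could look like; the struct layout and the decompress signature are assumptions for illustration, not the actual API.

    #include <stddef.h>

    /* Hypothetical layout of the data object's metadata: the four
       operands described in the abstract. */
    typedef struct {
        const void *src;       /* operand 1: location of compressed data    */
        size_t      src_size;  /* operand 2: size of compressed data        */
        void       *dst;       /* operand 3: where DE output is stored      */
        size_t      dst_size;  /* operand 4: size of the decompressed data  */
    } de_metadata_t;

    /* Hypothetical data object: metadata plus the compressed payload. */
    typedef struct {
        de_metadata_t meta;
        /* compressed data follows */
    } data_object_t;

    /* Illustrative API entry point: takes an input handle to the data object
       and invokes the core's decompression instruction on meta's operands. */
    int decompress(const data_object_t *handle);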

Computing apparatus incorporating quantum effects that performs high-speed computation on inverse problems or computational optimization problems requiring exhaustive search
11341425 · 2022-05-24

A computing apparatus that does not need quantum coherence or a cryogenic cooling apparatus is provided for problems that require an exhaustive search. The system is led to the ground state of a system in which a problem is set, wherein each spin variable s_j^z follows a local effective magnetic field B_j^z. The spin state at t=0 is initialized with a transverse field (in the x-direction), which corresponds to s_j^z = 0. With time t, the magnetic field in the z-axis direction and the inter-spin interactions are gradually added, and finally each spin is directed to the +z- or −z-direction, so that the z component of spin s_j is s_j^z = +1 or −1. In the process where the orientation of the spin s_j^z follows that of the effective magnetic field B_j^z, correction parameters originating in quantum-mechanical effects are introduced, which improves the ground-state-maintaining performance.
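
For context, the schedule described above matches the standard transverse-field Ising annealing form; the parameterization below (couplings J_ij, local fields g_j, transverse amplitude γ(t), anneal time τ) is the textbook formulation, not necessarily the patent's exact one.

    % The transverse term is ramped down while the problem terms are ramped up
    % over 0 <= t <= tau.
    H(t) = -\gamma(t) \sum_j s_j^x
           - \frac{t}{\tau} \Big( \sum_{i<j} J_{ij}\, s_i^z s_j^z
                                  + \sum_j g_j\, s_j^z \Big),
    \qquad
    B_j^z(t) = \frac{t}{\tau} \Big( \sum_{i \neq j} J_{ij}\, s_i^z + g_j \Big)

In this form, each spin s_j^z relaxes toward alignment with its local effective field B_j^z(t) as the problem terms grow.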

Storage device configured to perform an alignment operation and storage system including the same

A storage device includes a non-volatile memory including a plurality of memory blocks. The storage device performs an alignment operation in response to receipt of an align command. The alignment operation converts a received logical address of a logical segment into a physical address and allocates the physical address to a physical block address corresponding to a free block. The storage device is further configured to perform garbage collection in units of the physical block address, where one physical block address indicates one memory block.
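
A minimal sketch of the align path, assuming a flat logical-to-physical mapping table and a free-block list; the sizes and names are illustrative.

    #include <stdint.h>

    /* Illustrative device state: logical segment -> physical block mapping. */
    typedef struct {
        uint32_t l2p[1024];        /* logical segment -> physical block address */
        uint32_t free_list[64];    /* physical block addresses of free blocks   */
        uint32_t free_count;
    } storage_dev_t;

    /* Align command: map the logical segment onto a whole free block so that
       later garbage collection can operate in units of one memory block. */
    static int align_segment(storage_dev_t *dev, uint32_t logical_segment) {
        if (dev->free_count == 0)
            return -1;                         /* no free block available */
        uint32_t pba = dev->free_list[--dev->free_count];
        dev->l2p[logical_segment] = pba;       /* allocate to the free block */
        return 0;
    }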

PIPELINED READ-MODIFY-WRITE OPERATIONS IN CACHE MEMORY

In described examples, a processor system includes a processor core that generates memory write requests, a cache memory, and a memory pipeline of the cache memory. The memory pipeline has a holding buffer, an anchor stage, and an RMW pipeline. The anchor stage determines whether a data payload of a write request corresponds to a partial write. If so, the data payload is written to the holding buffer and conforming data is read from a corresponding cache memory address to merge with the data payload. The RMW pipeline has a merge stage and a syndrome generation stage. The merge stage merges the data payload in the holding buffer with the conforming data to make merged data. The syndrome generation stage generates an ECC syndrome using the merged data. The memory pipeline writes the data payload and ECC syndrome to the cache memory.
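
A minimal sketch of the merge and syndrome stages for one 64-bit word with a byte-enable mask; a plain XOR parity byte stands in for a real SEC-DED ECC syndrome, and all names are illustrative.

    #include <stdint.h>

    /* Merge stage: combine the partial-write payload with the conforming
       data read from the cache, selecting bytes by the byte-enable mask. */
    static uint64_t merge(uint64_t payload, uint64_t conforming, uint8_t byte_en) {
        uint64_t mask = 0;
        for (int b = 0; b < 8; b++)
            if (byte_en & (1u << b))
                mask |= 0xFFull << (8 * b);
        return (payload & mask) | (conforming & ~mask);
    }

    /* Syndrome generation stage: a real design would compute an ECC
       syndrome; simple XOR parity stands in here to show the dataflow. */
    static uint8_t syndrome(uint64_t merged) {
        uint8_t s = 0;
        for (int b = 0; b < 8; b++)
            s ^= (uint8_t)(merged >> (8 * b));
        return s;
    }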

System and method for dynamic enforcement of store atomicity
11334485 · 2022-05-17

A computer system for dynamic enforcement of store atomicity includes multiple processor cores, a local cache memory for each processor core, a shared memory, a separate store buffer for each processor core holding executed stores that are not yet performed, and a coherence mechanism. A load on a first processor core receives a value at a first time from a first-processor-core store in the store buffer and prevents any other load on that core that is younger in program order from committing until a second time, when the store is performed. Between the first time and the second time, any load younger in program order than the first load whose address is matched by a coherence invalidation or by an eviction is squashed.
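
A heavily simplified model of the squash rule, assuming an array of load entries held in program order; the structures and names are illustrative and the commit-blocking machinery is omitted.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative per-load bookkeeping. */
    typedef struct {
        uint64_t addr;
        bool     forwarded;   /* received its value from the local store buffer */
        bool     squashed;
    } load_entry_t;

    /* Between the forwarding time and the time the store is performed, any
       load younger than the forwarding load whose address is matched by a
       coherence invalidation or an eviction must be squashed and replayed. */
    static void on_invalidation_or_eviction(load_entry_t *loads, int n,
                                            int forwarding_load, uint64_t addr) {
        for (int i = forwarding_load + 1; i < n; i++)  /* younger in program order */
            if (loads[i].addr == addr)
                loads[i].squashed = true;
    }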

Coprocessor Operation Bundling
20220137975 · 2022-05-05

In an embodiment, a processor includes a buffer in an interface unit. The buffer may be used to accumulate coprocessor instructions to be transmitted to a coprocessor. In an embodiment, the processor issues the coprocessor instructions to the buffer when they are ready to be issued to the coprocessor. The interface unit may accumulate the coprocessor instructions in the buffer, generating a bundle of instructions. The bundle may be closed based on various predetermined conditions, and then the bundle may be transmitted to the coprocessor. In an embodiment, if a sequence of coprocessor instructions appears consecutively in a program, the average rate at which the instructions are provided to the coprocessor at least matches the rate at which the coprocessor consumes them.
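
A minimal sketch of the accumulate-and-close behavior, with an illustrative fixed capacity as the only close condition; real close conditions (timeouts, fences, instruction types) and the coprocessor interface itself are not modeled.

    #include <stdint.h>
    #include <stdbool.h>

    #define BUNDLE_CAP 8   /* illustrative bundle capacity */

    typedef struct {
        uint32_t insn[BUNDLE_CAP];  /* accumulated coprocessor instructions */
        unsigned count;
    } bundle_t;

    /* Issue one coprocessor instruction into the interface unit's buffer;
       returns true when the bundle closes (here simply "buffer full").
       The caller must transmit the bundle once it closes. */
    static bool bundle_push(bundle_t *b, uint32_t insn) {
        b->insn[b->count++] = insn;
        return b->count == BUNDLE_CAP;
    }

    /* Transmit the closed bundle to the coprocessor and reset the buffer. */
    static void bundle_send(bundle_t *b) {
        /* send b->insn[0..b->count-1] over the coprocessor interface */
        b->count = 0;
    }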