Patent classifications
G06F9/3824
Apparatus and method for burst mode data storage
An apparatus and a method are disclosed. In the apparatus, a memory management unit includes: a first cache unit, adapted to store a plurality of first source operands and one first write address; a second cache unit, adapted to store at least one pair of a second source operand and a second destination address; a write cache module, adapted to discriminate between destination addresses of a plurality of store instructions, so as to store, in the first cache unit, a plurality of source operands corresponding to consecutive destination addresses, and to store, in the second cache unit, non-consecutive destination addresses and source operands corresponding to the non-consecutive destination addresses, where the first write address is an initial address of the consecutive destination addresses; and a bus transmission module, adapted to transmit the plurality of first source operands and the first write address in the first cache unit to a memory through a bus in a write burst transmission mode. In embodiments of the present invention, a burst transmission mode is established between a processor and the bus. This can help reduce occupation of bus address bandwidth and accelerate write efficiency of the memory.
Hierarchical register file device based on spin transfer torque-random access memory
The embodiments provide a register file device which increases energy efficiency using a spin transfer torque-random access memory for a register file used to compute a general purpose graphic processing device, and hierarchically uses a register cache and a buffer together with the spin transfer torque-random access memory, to minimize leakage current, reduce a write operation power, and solve the write delay.
Matrix data broadcast architecture
Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. The processor core executes a software application with matrix operations. The processor core supports the broadcast of shared data to multiple compute units of the processor core. A compiler or other code assigns thread groups to compute units based on detecting shared data among the compute units. Rather than send multiple read accesses to a memory subsystem for the shared data, the processor core generates a single access request. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcasted by the processor core.
Inserting a proxy read instruction in an instruction pipeline in a processor
Inserting a proxy read instruction in an instruction pipeline in a processor is disclosed. A scheduler circuit is configured to recognize when a produced value generated by execution of a producer instruction in the instruction pipeline will not be available through a data forwarding path to be consumed for processing of a subsequent consumer instruction. In this case, the scheduling circuit is configured to insert a proxy read instruction in the instruction pipeline to cause execution of an operation to generate the same produced value as was generated by previous execution of producer instruction in the instruction pipeline. Thus, the produced value will remain available in the instruction pipeline to again be available through a data forwarding path to an earlier stage of the instruction pipeline to be consumed by a consumer instruction, which may avoid a pipeline stall.
APPARATUSES, METHODS, AND SYSTEMS TO PRECISELY MONITOR MEMORY STORE ACCESSES
Systems, methods, and apparatuses relating to circuitry to precisely monitor memory store accesses are described. In one embodiment, a system includes a memory, a hardware processor core comprising a decoder to decode an instruction into a decoded instruction, an execution circuit to execute the decoded instruction to produce a resultant, a store buffer, and a retirement circuit to retire the instruction when a store request for the resultant from the execution circuit is queued into the store buffer for storage into the memory, and a performance monitoring circuit to mark the retired instruction for monitoring of post-retirement performance information between being queued in the store buffer and being stored in the memory, enable a store fence after the retired instruction to be inserted that causes previous store requests to complete within the memory, and on detection of completion of the store request for the instruction in the memory, store the post-retirement performance information in storage of the performance monitoring circuit.
Coalescing adjacent gather/scatter operations
According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read contiguous a first and a second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location.
Hardware-implemented universal floating-point instruction set architecture for computing directly with human-readable decimal character sequence floating-point representation operands
A universal floating-point Instruction Set Architecture (ISA) compute engine implemented entirely in hardware. The ISA compute engine computes directly with human-readable decimal character sequence floating-point representation operands without first having to explicitly perform a conversion-to-binary-format process in software. A fully pipelined convertToBinaryFromDecimalCharacter hardware operator logic circuit converts one or more human-readable decimal character sequence floating-point representations to IEEE 754-2008 binary floating-point representations every clock cycle. Following computations by at least one hardware floating-point operator, a convertToDecimalCharacterFromBinary hardware conversion circuit converts the result back to a human-readable decimal character sequence floating-point representation.
DECOUPLED ACCESS-EXECUTE PROCESSING AND PREFETCHING CONTROL
Apparatuses and methods are provided, relating to the control of data processing in devices which comprise both decoupled access-execute processing circuitry and prefetch circuitry. Control of the access portion of the decoupled access-execute processing circuitry may be dependent on a performance metric of the prefetch circuitry. Alternatively or in addition, control of the prefetch circuitry may be dependent on a performance metric of the access portion.
SPARSE MATRIX CALCULATIONS UTILIZING TIGHTLY COUPLED MEMORY AND GATHER/SCATTER ENGINE
A processor for sparse matrix calculation includes an on-chip memory, a cache, a gather/scatter engine, and a core. The on-chip memory stores a first matrix or vector, and the cache stores a compressed sparse second matrix data structure. The compressed sparse second matrix data structure includes a value array including non-zero element values of the sparse second matrix, where each entry includes a given number of element values; and a column index array where each entry includes the given number of offsets matching the value array. The gather/scatter engine gathers element values of the first matrix or vector using the column index array of the sparse second matrix. In a hybrid horizontal/vertical implementation, the gather/scatter engine gathers sets of element values from sets of rows and from different sub-banks within the same rows based on the column index array of the sparse matrix.
SYSTEM ARCHITECTURES FOR BIG DATA PROCESSING
Provided are systems and methods for big data processing and related architectures. Various embodiments include a configurable load store unit, a computational register file, and related methods, systems, and devices. Requests to utilize at least one of a memory and a storage can be received at a computing system comprising a local memory and local storage. Systems and methods can determine availability of a remote memory and a remote storage at one or more remote nodes accessible by the computing system, determine a distribution among the local memory, local storage, and one or more remote nodes to fulfill the request, and based on the determination, utilize at least one of: a memory associated with a first set of one or more remote nodes via a first interconnect, and a storage associated with a second set of one or more remote nodes via a second interconnect.