Patent classifications
G06F9/3009
Fetching Instructions in an Instruction Fetch Unit
A method in an instruction fetch unit configured to initiate a fetch of an instruction bundle from a first memory and to initiate a fetch of an instruction bundle from a second memory, wherein a fetch from the second memory takes a predetermined fixed plurality of processor cycles, the method comprising: identifying that an instruction bundle is to be selected for fetching from the second memory in a predetermined future processor cycle; and initiating a fetch of the identified instruction bundle from the second memory a number of processor cycles prior to the predetermined future processor cycle based upon the predetermined fixed plurality of processor cycles taken to fetch from the second memory
Method and system for constructing persistent memory index in non-uniform memory access architecture
A method for constructing a persistent memory index in a non-uniform memory access architecture includes: maintaining partial persistent views in a persistent memory and maintaining a global volatile view in a DRAM; an underlying persistent memory index processing a request in a foreground thread when cold data is accessed; when hot data is accessed, reading a key-value pair for a piece of hot data in the global volatile view in response to a query operation carried in the request, and in response to an insert/update/delete operation carried in the request, updating a local partial persistent view and the global volatile view; and in response to a hotspot migration, a background thread generating new partial persistent views and a new global volatile view, and recycling the partial persistent views and the global volatile view for old hot data into the underlying persistent memory index.
PERFORMING A SYNCHRONIZATION OPERATION ON AN ELECTRONIC DEVICE
In an illustrative example, a method of operation of an electronic device includes identifying a plurality of threads. Each thread of the plurality of threads is configured to execute a plurality of instructions including a barrier instruction corresponding to a target of a synchronization operation. The method further includes selecting a master thread to perform one or more operations associated with the synchronization operation. The method also includes providing an indication of a number of threads included in the plurality of threads to the master thread.
Generation and use of memory access instruction order encodings
Apparatus and methods are disclosed for controlling execution of memory access instructions in a block-based processor architecture using a hardware structure that indicates a relative ordering of memory access instruction in an instruction block. In one example of the disclosed technology, a method of executing an instruction block having a plurality of memory load and/or memory store instructions includes selecting a next memory load or memory store instruction to execute based on dependencies encoded within the block, and on a store vector that stores data indicating which memory load and memory store instructions in the instruction block have executed. The store vector can be masked using a store mask. The store mask can be generated when decoding the instruction block, or copied from an instruction block header. Based on the encoded dependencies and the masked store vector, the next instruction can issue when its dependencies are available.
Method and apparatus for vector based finite impulse response (FIR) filtering
A method is provided that includes performing, by a processor in response to a vector finite impulse response (VFIR) filter instruction, generating of a plurality of filter outputs using a plurality of coefficients and a plurality of sequential data elements, the plurality of coefficients specified by a coefficient operand of the VFIR filter instruction and the plurality of sequential data elements specified by a data operand of the VFIR filter instruction, and storing the filter outputs in a storage location specified by the VFIR filter instruction.
Scheduling resuming of ready to run virtual processors in a distributed system
Dynamic scheduling is disclosed. A plurality of physical nodes is included in a computer system. Each node includes a plurality of processors. Each processor includes a plurality of hyperthreads. In response to receiving an indication of an event occurring, a search is performed for a queue in a set of queues on which to place a virtual processor that had been waiting on the event. Queues in the set of queues correspond to hyperthreads in a physical node in the plurality of physical nodes. The queues in the set of queues are visited according to a predetermined traversal order.
ASYNCHRONOUS SEQUENTIAL PROCESSING EXECUTION
The described technology provides a system and method for sequential execution of one or more operation segments in an asynchronous event driven architecture. One or more operation segments may be associated and grouped into an activity of operation segments. The operation segments of an activity may be sequentially executed based on a queue structure of references to operation segments stored in a context memory associated with the activity. Any initiated operation segment may be placed on the queue structure upon completion of an associated I/O action.
TECHNIQUES FOR EFFICIENTLY TRANSFERRING DATA TO A PROCESSOR
A technique for block data transfer is disclosed that reduces data transfer and memory access overheads and significantly reduces multiprocessor activity and energy consumption. Threads executing on a multiprocessor needing data stored in global memory can request and store the needed data in on-chip shared memory, which can be accessed by the threads multiple times. The data can be loaded from global memory and stored in shared memory using an instruction which directs the data into the shared memory without storing the data in registers and/or cache memory of the multiprocessor during the data transfer.
Ticket Locks with Enhanced Waiting
A computer comprising one or more processors and memory may implement multiple threads that perform a lock operation using a data structure comprising an allocation field and a grant field. Upon entry to a lock operation, a thread allocates a ticket by atomically copying a ticket value contained in the allocation field and incrementing the allocation field. The thread compares the allocated ticket to the grant field. If they are unequal, the thread determines a number of waiting threads. If the number is above the threshold, the thread enters a long term wait operation comprising determining a location for long term wait value and waiting on changes to that value. If the number is below the threshold or the long term wait operation is complete, the thread waits for the grant value to equal the ticket to indicate that the lock is allocated.
APPLICATION PROGRAMMING INTERFACE TO LIMIT MEMORY
Apparatuses, systems, and techniques to limit memory during execution of one or more kernels and/or thread groups during PPU execution. In at least one embodiment, a process indicates a memory limit for one or more kernels and/or thread groups to a parallel processing library, and said parallel processing library restricts memory allocation for said one or more kernels and/or thread groups according to said memory limit.