Patent classifications
G06F9/3009
Non-cached loads and stores in a system having a multi-threaded, self-scheduling processor
Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute instructions; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In a representative embodiment, the processor core is further adapted to execute a non-cached load instruction to designate a general purpose register rather than a data cache for storage of data received from a memory circuit. The core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, and to generate one or more work descriptor data packets to another circuit for execution of corresponding execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.
METHOD AND APPARATUS FOR VECTOR SORTING USING VECTOR PERMUTATION LOGIC
A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.
Hypervisor task execution management for virtual machines
A system enabling a hypervisor to assign processor resources for specific tasks to be performed by a virtual machine. An example method may comprise: receiving, by a hypervisor running on a host computer system, a virtual processor (“vCPU”) assignment request from a virtual device driver running on a virtual machine managed by the hypervisor, assigning a vCPU for executing a task associated with the assignment request, and causing the virtual device driver to execute the task using the vCPU.
Control registers to store thread identifiers for threaded loop execution in a self-scheduling reconfigurable computing fabric
Representative apparatus, method, and system embodiments are disclosed for configurable computing. A representative system includes an interconnection network; a processor; and a plurality of configurable circuit clusters. Each configurable circuit cluster includes a plurality of configurable circuits arranged in an array; a synchronous network coupled to each configurable circuit of the array; and an asynchronous packet network coupled to each configurable circuit of the array. A representative configurable circuit includes a configurable computation circuit and a configuration memory having a first, instruction memory storing a plurality of data path configuration instructions to configure a data path of the configurable computation circuit; and a second, instruction and instruction index memory storing a plurality of spoke instructions and data path configuration instruction indices for selection of a master synchronous input, a current data path configuration instruction, and a next data path configuration instruction for a next configurable computation circuit.
COOPERATIVE GARBAGE COLLECTION BARRIER ELISION
Techniques are disclosed for eliding load and store barriers while maintaining garbage collection invariants. Embodiments described herein include techniques for identifying an instruction, such as a safepoint poll, that checks whether to pause a thread between execution of a dominant and dominated access to the same data field. If a poll instruction is identified between the two data accesses, then a pointer for the data field may be recorded in an entry associated with the poll instruction. When the thread is paused to execute a garbage collection operation, the recorded information may be used to update values associated with the data field in memory such that the dominated access may be executed without any load or store barriers.
METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.
Reducing file system consistency check downtime
Provided is a method for performing a file system consistency check. The method comprises calculating, by a first thread that does not have access to an inode table, file block addresses for one or more files to be checked by the thread. The method further comprises collecting validity information for the one or more files. The method further comprises reading information relating to the one or more files from the inode table. The reading is performed in response to the thread being given access to the inode table after the calculating operation. The method further comprises validating the information by comparing the information from the inode table to the validity information.
LOCK FREE HIGH THROUGHPUT RESOURCE STREAMING
Methods, systems and apparatuses may provide for technology that conducts, via a plurality of concurrent threads, transfers of graphics resources into and out of graphics memory, wherein the transfers bypass lock operations between the plurality of concurrent threads, generates frames based on the graphics resources in the graphics memory, and streams the frames to a display. In one example, the transfers also bypass explicit wait operations for the graphics resources to be fully resident in the graphics memory.
MEMORY BARRIER ELISION FOR MULTI-THREADED WORKLOADS
A system includes a memory, at least one physical processor in communication with the memory, and a plurality of threads executing on the at least one physical processor. A first thread of the plurality of threads is configured to execute a plurality of instructions that includes a restartable sequence. Responsive to a different second thread in communication with the first thread being pre-empted while the first thread is executing the restartable sequence, the first thread is configured to restart the restartable sequence prior to reaching a memory barrier.
Compiler-optimized context switching with compiler-inserted data table for in-use register identification at a preferred preemption point
Compiler-optimized context switching may include receiving an instruction indicating a preferred preemption point comprising an instruction address; storing the preferred preemption point in a data structure; determining, based on the data structure, that the preferred preemption point has been reached by a first thread; determining that preemption of the first thread for a second thread has been requested; and performing a context switch to the second thread.