Patent classifications
G06F9/30065
Stream reference register with double vector and dual single vector operating modes
A streaming engine employed in a digital signal processor specifies a fixed read only data stream. Once fetched, the data stream is stored in two head registers for presentation to functional units in the fixed order. Data use by the functional unit is preferably controlled using the input operand fields of the corresponding instruction. A first read only operand coding supplies data from the first head register. A first read/advance operand coding supplies data from the first head register and also advances the stream to the next sequential data elements. Corresponding second read only operand coding and second read/advance operand coding operate similarly with the second head register. A third read only operand coding supplies double width data from both head registers.
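The operand codings described in this abstract can be sketched as a toy software model. This is only an illustration under stated assumptions: the class name `StreamSketch`, the `width` parameter, and the rule that an advance moves the stream by one head-register width are hypothetical and are not taken from the patent.

```python
class StreamSketch:
    """Toy model of a read-only stream exposed through two head registers."""

    def __init__(self, data, width):
        self.data = list(data)
        self.width = width  # elements per head register (assumed)
        self.pos = 0        # index of the first element in head register 0

    def _head(self, index):
        # Head register `index` holds `width` elements starting at its offset.
        start = self.pos + index * self.width
        return self.data[start:start + self.width]

    def read(self, head):
        """Read-only coding: return a head register without advancing."""
        return self._head(head)

    def read_advance(self, head):
        """Read/advance coding: return a head register, then advance.

        Assumption: an advance moves the stream by one register width.
        """
        value = self._head(head)
        self.pos += self.width
        return value

    def read_double(self):
        """Double-width read-only coding: both head registers together."""
        return self._head(0) + self._head(1)
```

In this sketch, reading never changes the stream position; only the read/advance coding does, which mirrors the distinction the abstract draws between the two kinds of operand coding.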
INSTRUCTION TO VECTORIZE LOOPS WITH BACKWARD CROSS-ITERATION DEPENDENCIES
Methods and apparatus relating to techniques for vectorizing loops with backward cross-iteration dependencies are described. In an embodiment, execution of one or more instructions resolves a cross-iteration dependency of one or more operations of a loop. The execution of the one or more instructions resolves the cross-iteration dependency of the one or more operations based at least in part on one or more distance count computations to a preceding iteration of the loop. Other embodiments are also disclosed and claimed.
UNIVERSAL PERIPHERAL EXTENDER ARCHITECTURE, SYSTEM, AND METHOD
A universal peripheral extender architecture, system, and method are disclosed that address the need to communicatively connect peripheral I/O devices and smart host devices in legacy, medical, and industrial applications. As disclosed, a universal peripheral extender includes an I/O device translation & management module that has a device-side utility, a host-side I/O device translation & management utility, and a host/device translation & management scheduler utility.
LOOP EXECUTION IN A RECONFIGURABLE COMPUTE FABRIC
Various examples are directed to systems and methods for executing a loop in a reconfigurable compute fabric. A first flow controller may initiate a first thread at a first synchronous flow to execute a first portion of a first iteration of the loop. A second flow controller may receive a first asynchronous message instructing the second flow controller to initiate a first thread at a second synchronous flow to execute a second portion of the first iteration. The second flow controller may determine that the first iteration of the loop is the last iteration of the loop to be executed and initiate the first thread at the second synchronous flow with a last iteration flag set.
Repeat Instruction for Loading and/or Executing Code in a Claimable Repeat Cache a Specified Number of Times
A processor is disclosed including: a barrel-threaded execution unit for executing concurrent threads, and a repeat cache shared between the concurrent threads. The processor's instruction set includes a repeat instruction which takes a repeat count operand. When the repeat cache is not claimed and the repeat instruction is executed in a first thread, a portion of code is cached from the first thread into the repeat cache, the state of the repeat cache is changed to record it as claimed, and the cached code is executed a number of times. When the repeat instruction is then executed in a further thread, then the already-cached portion of code is again executed a respective number of times, each time from the repeat cache. For each of the first and further instructions, the repeat count operand in the respective instruction specifies the number of times to execute the cached code.
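The claim-then-reuse behaviour of the repeat cache can be sketched in software. This is a minimal model, assuming code is represented as a list of Python callables; the class name `RepeatCache` and its `repeat` method are hypothetical stand-ins for the hardware repeat instruction, and thread interleaving is not modelled.

```python
class RepeatCache:
    """Toy model of a repeat cache shared between threads."""

    def __init__(self):
        self.claimed = False
        self.code = None

    def repeat(self, code, count):
        """Model of the repeat instruction with a repeat count operand.

        If the cache is unclaimed, `code` is cached and the cache is marked
        as claimed. If it is already claimed, `code` is ignored and the
        already-cached code runs instead.
        """
        if not self.claimed:
            self.code = list(code)
            self.claimed = True
        for _ in range(count):
            for op in self.code:
                op()


log = []
cache = RepeatCache()
cache.repeat([lambda: log.append("a")], 2)  # first thread claims the cache
cache.repeat([lambda: log.append("b")], 3)  # further thread reuses cached code
```

After both calls, `log` holds five `"a"` entries: each call supplies its own repeat count, but only the first call decides what code is cached.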
Offloading execution of a multi-task parameter-dependent operation to a network device
A network device includes a network interface, a host interface and processing circuitry. The network interface is configured to connect to a communication network. The host interface is configured to connect to a host including a processor. The processing circuitry is configured to receive from the processor, via the host interface, a notification specifying an operation for execution by the network device, the operation including (i) multiple tasks that are executable by the network device, and (ii) execution dependencies among the tasks. In response to the notification, the processing circuitry is configured to determine a schedule for executing the tasks, the schedule complying with the execution dependencies, and to execute the operation by executing the tasks of the operation in accordance with the schedule.
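One way to obtain a schedule that complies with execution dependencies is a topological sort. The abstract does not say how the schedule is determined, so the sketch below is an assumption rather than the patented method. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the function names and the task names in the example are hypothetical.

```python
from graphlib import TopologicalSorter


def schedule_tasks(dependencies):
    """Return a task order in which every task follows its prerequisites.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


def execute_operation(tasks, dependencies):
    """Run each task in schedule order, passing it the earlier results."""
    results = {}
    for name in schedule_tasks(dependencies):
        results[name] = tasks[name](results)
    return results


# Hypothetical operation: two receives feed a reduce, which feeds a send.
dependencies = {
    "recv_a": set(),
    "recv_b": set(),
    "reduce": {"recv_a", "recv_b"},
    "send": {"reduce"},
}
tasks = {
    "recv_a": lambda r: 1,
    "recv_b": lambda r: 2,
    "reduce": lambda r: r["recv_a"] + r["recv_b"],
    "send": lambda r: f"sent {r['reduce']}",
}
results = execute_operation(tasks, dependencies)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies are circular, which is a reasonable failure mode for an operation that cannot be scheduled.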
Multithreaded processor core with hardware-assisted task scheduling
Embodiments of apparatuses, methods, and systems for scheduling tasks to hardware threads are described. In an embodiment, a processor includes multiple hardware threads and a task manager. The task manager is to issue a task to a hardware thread. The task manager includes a hardware task queue to store a descriptor for the task. The descriptor is to include a field to store a value to indicate whether the task is a single task, a collection of iterative tasks, or a linked list of tasks.
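The descriptor field that distinguishes the three task shapes can be sketched as a small data structure. All names here (`TaskKind`, `TaskDescriptor`, `expand`, and the `entry`, `iterations`, and `next` fields) are hypothetical; the abstract only states that a field indicates which of the three kinds a task is.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskKind(Enum):
    SINGLE = 0
    ITERATIVE = 1
    LINKED_LIST = 2


@dataclass
class TaskDescriptor:
    kind: TaskKind                           # the field named in the abstract
    entry: str                               # assumed: label of the task's code
    iterations: int = 1                      # assumed: used by ITERATIVE only
    next: Optional["TaskDescriptor"] = None  # assumed: used by LINKED_LIST only


def expand(desc):
    """Return the individual task labels that one descriptor stands for."""
    if desc.kind is TaskKind.SINGLE:
        return [desc.entry]
    if desc.kind is TaskKind.ITERATIVE:
        return [f"{desc.entry}[{i}]" for i in range(desc.iterations)]
    # LINKED_LIST: follow the chain of descriptors until it ends.
    labels = []
    node = desc
    while node is not None:
        labels.append(node.entry)
        node = node.next
    return labels
```

A task manager modelled this way would inspect `kind` once per descriptor and then issue the expanded tasks to whichever hardware threads are free.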
Tuning of loop orders in blocked dense basic linear algebra subroutines
An example includes a sequence generator to generate a plurality of sequence pairs, a first one of the sequence pairs including: (i) a first input sequence representing first accesses to first tensors in a first loop nest of a first computer program, and (ii) a first output sequence representing a first tuned loop nest corresponding to the first accesses to the first tensors in the first loop nest; a model trainer to train a recurrent neural network based on the sequence pairs as training data, the recurrent neural network to be trained to tune loop ordering of a second computer program based on a second input sequence representing second accesses to a second tensor in a second loop nest of the second computer program; and a memory interface to store, in memory, a trained model corresponding to the recurrent neural network.
Method, device and computer program product for event ordering
Event ordering is provided in a distributed file system. For instance, events are generated that are associated with an object collected from nodes in the distributed file system, and an event loop indicates causal dependencies among the events, and comprises one or more reliable edges and unreliable edges. Degrees of reliability of the unreliable edges in the event loop are determined, and then at least one unreliable edge is removed from the event loop based on the determined degrees of reliability. Causal dependencies among the events in the distributed file system are analyzed by using a statistical method, and the most unreliable edge in the event loop can be removed by computing a degree of reliability of each unreliable edge, thereby avoiding the occurrence of the event loop.
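The edge-removal step can be sketched as follows: while the dependency graph contains a cycle, remove the unreliable edge on that cycle with the lowest degree of reliability. This is a simplified illustration, not the patented procedure. The degrees of reliability are taken as given inputs (the abstract derives them with a statistical method that is not reproduced here), and the function names and event names are hypothetical.

```python
def find_cycle(edges):
    """Return the edges of one cycle in a directed graph, or None."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    state = {}  # node -> "active" (on the current path) or "done"
    path = []

    def visit(node):
        state[node] = "active"
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == "active":
                # Back edge found: the cycle runs from nxt around to nxt.
                nodes = path[path.index(nxt):] + [nxt]
                return list(zip(nodes, nodes[1:]))
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = "done"
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


def break_event_loops(reliable, unreliable):
    """Remove the least reliable edges until no cycle remains.

    `reliable` is a set of (cause, effect) edges; `unreliable` maps each
    unreliable edge to its degree of reliability. Returns the surviving
    unreliable edges.
    """
    unreliable = dict(unreliable)
    while True:
        cycle = find_cycle(list(reliable) + list(unreliable))
        if cycle is None:
            return unreliable
        candidates = [edge for edge in cycle if edge in unreliable]
        if not candidates:
            raise ValueError("cycle consists only of reliable edges")
        weakest = min(candidates, key=unreliable.get)
        del unreliable[weakest]


reliable = {("create", "write")}
unreliable = {("write", "delete"): 0.9, ("delete", "create"): 0.2}
kept = break_event_loops(reliable, unreliable)
```

In this example the three edges form a loop, and the edge from `delete` to `create` is dropped because it has the lowest degree of reliability.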