Patent classifications
G06F9/544
Read-write page replication for multiple compute units
In general, an application executes on a compute unit, such as a central processing unit (CPU) or graphics processing unit (GPU), to perform some function(s). In some circumstances, improved performance of an application, such as a graphics application, may be provided by executing the application across multiple compute units. However, when using multiple compute units in this manner, synchronization must be provided between the compute units. Synchronization, including the sharing of the data, is typically accomplished through memory. While a shared memory may cause bottlenecks, employing local memory for each compute unit may itself require synchronization (coherence) which can be costly in terms of resources, delay, etc. The present disclosure provides read-write page replication for multiple compute units that avoids the traditional challenges associated with coherence.
Parallel scheduling of encryption engines and decryption engines to prevent side channel attacks
This disclosure describes systems on a chip (SOCs) that prevent side channel attacks on encryption and decryption engines of an electronic device. The SoCs of this disclosure concurrently operate key-diverse encryption and decryption datapaths to obfuscate the power trace signature exhibited by the device that includes the SoC. An example SoC includes an encryption engine configured to encrypt transmission (Tx) channel data using an encryption key and a decryption engine configured to decrypt encrypted received (Rx) channel data using a decryption key that is different from the encryption key. The SoC also includes a scheduler configured to establish concurrent data availability between the encryption and decryption engines and activate the encryption engine and the decryption engine to cause the encryption engine to encrypt the Tx channel data concurrently with the decryption engine decrypting the encrypted Rx channel data using the decryption key that is different from the encryption key.
Systems, methods, and devices for queue availability monitoring
A method may include determining, with a queue availability module, that an entry is available in a queue, asserting a bit in a register based on determining that an entry is available in the queue, determining, with a processor, that the bit is asserted, and processing, with the processor, the entry in the queue based on determining that the bit is asserted. The method may further include storing the register in a tightly coupled memory associated with the processor. The method may further include storing the queue in the tightly coupled memory. The method may further include determining, with the queue availability module, that an entry is available in a second queue, and asserting a second bit in the register based on determining that an entry is available in the second queue. The method may further include finding the first bit in the register using a find first instruction.
Index space mapping using static code analysis
A method for computing includes providing software source code defining a processing pipeline including multiple, sequential stages of parallel computations, in which a plurality of processors apply a computational task to data read from a buffer. A static code analysis is applied to the software source code so as to break the computational task into multiple, independent work units, and to define an index space in which the work units are identified by respective indexes. Based on the static code analysis, mapping parameters that define a mapping between the index space and addresses in the buffer are computed, indicating by the mapping the respective ranges of the data to which the work units are to be applied. The source code is compiled so that the processors execute the work units identified by the respective indexes while accessing the data in the buffer in accordance with the mapping.
SYSTEMS, METHODS, AND DEVICES FOR QUEUE AVAILABILITY MONITORING
A method may include determining, with a queue availability module, that an entry is available in a queue, asserting a bit in a register based on determining that an entry is available in the queue, determining, with a processor, that the bit is asserted, and processing, with the processor, the entry in the queue based on determining that the bit is asserted. The method may further include storing the register in a tightly coupled memory associated with the processor. The method may further include storing the queue in the tightly coupled memory. The method may further include determining, with the queue availability module, that an entry is available in a second queue, and asserting a second bit in the register based on determining that an entry is available in the second queue. The method may further include finding the first bit in the register using a find first instruction.
VARIABLE PIPELINE LENGTH IN A BARREL-MULTITHREADED PROCESSOR
Devices and techniques for variable pipeline length in a barrel-multithreaded processor are described herein. A completion time for an instruction can be determined prior to insertion into a pipeline of a processor. A conflict between the instruction and a different instruction based on the completion time can be detected. Here, the different instruction is already in the pipeline and the conflict detected when the completion time equals the previously determined completion time for the different instruction. A difference between the completion time and an unconflicted completion time can then be calculated and completion of the instruction delayed by the difference.
BUBBLE SORTING FOR SCHEDULING TASK EXECUTION IN COMPUTING SYSTEMS
One or more embodiments of the present disclosure relate to determining a first execution schedule for execution of a plurality of runnables, the plurality of runnables corresponding to a process executed using a plurality of compute engines. Additionally or alternatively, one or more embodiments may relate to modifying the first execution schedule to generate a second execution schedule. The modifying may include moving one or more runnables of the plurality of runnables to populate one or more gaps in the first execution schedule. The moving of the one or more runnables may be performed in view of one or more moving constraints.
WAVE THROTTLING BASED ON A PARAMETER BUFFER
A graphics pipeline includes a first shader that generates first wave groups, a shader processor input (SPI) that launches the first wave groups for execution by shaders, and a scan converter that generates second waves for execution on the shaders based on results of processing the first wave groups the one or more shaders. The first wave groups are selectively throttled based on a comparison of in-flight first wave groups and second waves pending execution on the at least one second shader. A cache holds information that is written to the cache in response to the first wave groups finishing execution on the shaders. Information is read from the cache in response to read requests issued by the second waves. In some cases, the first wave groups are selectively throttled by comparing how many first wave groups are in-flight and how many read requests to the cache are pending.
Distributing matrix multiplication processing among processing nodes
Based on a predetermined number of available processor sockets, a plurality of candidate matrix decompositions are identified, which correspond to a multiplication of matrices. Based on a first comparative relationship of a variation of first sizes of the plurality of candidate matrix decompositions along a first dimension and a second comparative relationship of a variation of second sizes of the plurality of candidate matrix decomposition sizes along a second dimension, a given candidate matrix decomposition is selected. Processing of the multiplication among the processor sockets is distributed based on the given candidate matrix decomposition.
I/O to unpinned memory supporting memory overcommit and live migration of virtual machines
Systems and methods of error handling in a network interface card (NIC) are provided. For a data packet destined for a local virtual machine (VM), if the NIC cannot determine a valid translation memory address for a virtual memory address in a buffer descriptor from a receive queue of the VM, the NIC can retrieve a backup buffer descriptor from a hypervisor queue, and store the packet in a host memory location indicated by an address in the backup buffer descriptor. For a transmission request from a local VM, if the NIC cannot determine a valid translated address for a virtual memory address in the packet descriptor from a transmit queue of the VM, the NIC can send a message to a hypervisor backup queue, and generate and transmit a data packet based on data in a memory page reallocated by the hypervisor.