G06F9/3004

Lazy push strategies for vectorized D-Heaps

Techniques are provided for lazy push optimization, allowing for constant time push operations. A d-heap is used as the underlying data structure for indexing values being inserted. The d-heap is vectorized by storing values in a contiguous memory array. Heapify operations are delayed until a retrieve operation occurs, improving insert performance of vectorized d-heaps that use horizontal aggregation SIMD instructions at the cost of slightly lower retrieve performance.

ISA ACCESSIBLE PHYSICAL UNCLONABLE FUNCTION

Techniques for encrypting data using a key generated by a physical unclonable function (PUF) or a virtual PUF key are described. An apparatus according to the present disclosure may include decoder circuitry to decode an instance of a single instruction having a field for an opcode to indicate that execution circuitry is to encrypt at least encrypt secret information from an input data structure with either a physical unclonable function (PUF) generated encryption key or a virtual PUF key, bind the wrapped secret information to an identified target, update the input data structure, generate a MAC over the updated data structure, store the MAC in the input data structure to generate a wrapped output data structure, store the wrapped output data structure having the encrypted secret information and an indication of the target;

Implementing state-based frame barriers to process colorless roots during concurrent execution

An application thread executes concurrently with a garbage collection (GC) thread traversing a call stack of the application thread. Frames of the call stack that have been processed by the GC thread assume a global state associated with the GC thread. The application thread may attempt to return to a target frame that has not yet assumed the global state. The application thread hits a frame barrier, preventing return to the target frame. The application thread determines a frame state of the target frame. The application thread selects appropriate operations for bringing the target frame to the global state based on the frame state. The selected operations are performed to bring the target frame to the global state. The application thread returns to the target frame.

Using a vector processor to configure a direct memory access system for feature tracking operations in a system on a chip

In various examples, a VPU and associated components may be optimized to improve VPU performance and throughput. For example, the VPU may include a min/max collector, automatic store predication functionality, a SIMD data path organization that allows for inter-lane sharing, a transposed load/store with stride parameter functionality, a load with permute and zero insertion functionality, hardware, logic, and memory layout functionality to allow for two point and two by two point lookups, and per memory bank load caching capabilities. In addition, decoupled accelerators may be used to offload VPU processing tasks to increase throughput and performance, and a hardware sequencer may be included in a DMA system to reduce programming complexity of the VPU and the DMA system. The DMA and VPU may execute a VPU configuration mode that allows the VPU and DMA to operate without a processing controller for performing dynamic region based data movement operations.

SYNCHRONIZING SCHEDULING TASKS WITH ATOMIC ALU
20230033355 · 2023-02-02 ·

A method of synchronizing a group of scheduled tasks within a parallel processing unit into a known state is described. The method uses a synchronization instruction in a scheduled task which triggers, in response to decoding of the instruction, an instruction decoder to place the scheduled task into a non-active state and forward the decoded synchronization instruction to an atomic ALU for execution. When the atomic ALU executes the decoded synchronization instruction, the atomic ALU performs an operation and check on data assigned to the group ID of the scheduled task and if the check is passed, all scheduled tasks having the particular group ID are removed from the non-active state.

RETURN VALUE STORAGE FOR ATOMIC FUNCTIONS
20230095207 · 2023-03-30 ·

A memory architecture may provide support for any number of direct memory access (DMA) operations at least partially independent of the CPU coupled to the memory. DMA operations may involve data movement between two or more memory locations and may involve minor computations. At least some DMA operations may include any number of atomic functions, and at least some of the atomic functions may include a corresponding return value. A system includes a first direct memory access (DMA) engine to request a DMA operation. The DMA operation includes an atomic operation. The system also includes a second DMA engine to receive a return value associated with the atomic operation and store the return value at a source memory.

APPARATUS AND METHOD FOR SECURE, EFFICIENT MICROCODE PATCHING

An apparatus and method for efficient microcode patching. For example, one embodiment of an apparatus comprises: a package comprising one or more integrated circuit dies, the one or more integrated circuit dies comprising: a plurality of cores; and a security controller coupled to the plurality of cores, a first core of the plurality of cores comprising: a decoder to decode a microcode patching instruction, the microcode patching instruction comprising an operand to be used to identify an address; and execution circuitry to execute the microcode patching instruction, wherein responsive to the microcode patching instruction, the execution circuitry and/or security controller are to: retrieve a microcode patch from a location in memory based on the address, validate the microcode patch, apply the microcode patch to update or replace microcode associated with the one or more integrated circuit dies, and transmit the microcode patch to a persistent storage device; wherein the microcode patch is to be subsequently retrieved from the persistent storage device by one or more external security controllers of one or more external integrated circuit dies, the one or more external security controllers to cause the microcode patch to be applied to update or replace microcode associated with the one or more external integrated circuit dies.

MEMORY TAGGING AND TRACKING FOR OFFLOADED FUNCTIONS AND CALLED MODULES

An apparatus includes a first processor to be communicatively coupled to a main memory having instructions stored therein. The first processor is to execute the instructions to assign a first tag to a plurality of granules in a first portion of memory allocated for an offloaded function invoked by a module running on a second processor, detect an exception raised for a tag check failure for a memory access operation based on a first memory address in the first portion of the memory, and update a modified address list to include information associated with the first memory address. The instructions are executed further to synchronize, based on the modified address list, a second portion of the memory allocated to the module with the first portion of the memory.

NPU IMPLEMENTED FOR ARTIFICIAL NEURAL NETWORKS TO PROCESS FUSION OF HETEROGENEOUS DATA RECEIVED FROM HETEROGENEOUS SENSORS
20220348229 · 2022-11-03 · ·

A neural processing unit (NPU) includes a controller including a scheduler, the controller configured to receive from a compiler a machine code of an artificial neural network (ANN) including a fusion ANN, the machine code including data locality information of the fusion ANN, and receive heterogeneous sensor data from a plurality of sensors corresponding to the fusion ANN; at least one processing element configured to perform fusion operations of the fusion ANN including a convolution operation and at least one special function operation; a special function unit (SFU) configured to perform a special function operation of the fusion ANN; and an on-chip memory configured to store operation data of the fusion ANN, wherein the schedular is configured to control the at least one processing element and the on-chip memory such that all operations of the fusion ANN are processed in a predetermined sequence according to the data locality information.

EFFICIENT COMPRESSED VERBATIM COPY

Compressed verbatim copy can enable more efficient copying of compressed data. In one example, a compressed verbatim copy method involves receiving a command to copy compressed data from a source address of the memory device to a destination address. In response to the receipt of the command, the method involves copying the compressed data in a compressed format from the source address to the destination address without first decompressing the data. A second source address and a second destination address of metadata for the compressed data is determined, and the metadata is copied from the second source address to the second destination address.