Patent classifications
G06F8/4442
METHOD AND DEVICE FOR PROVIDING A VECTOR STREAM INSTRUCTION SET ARCHITECTURE EXTENSION FOR A CPU
A method and device for providing a vector stream instruction set architecture extension for a CPU. In one aspect, there is provided a vector stream engine unit comprising: a first fast memory storage for temporarily storing data of vector data streams from a memory for loading into a vector register file; a second fast memory storage for temporarily storing data of the vector data streams from the vector register file for loading into the memory; a prefetcher configured to prefetch data of the vector data streams from the memory into the first fast memory storage, and to prefetch data of the vector data streams from the vector register file into the second fast memory storage; and a stream configuration table (SCT) storing stream information for prefetching data from the vector data streams.
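The sketch below is a minimal, illustrative software model of the arrangement described in the abstract: a stream configuration table holds per-stream descriptors, a prefetcher stages memory data into a small load buffer ahead of the vector register file, and a store buffer drains register-file data back to memory. The descriptor fields (base, stride, count) and buffer depths are assumptions for illustration, not details taken from the patent.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class StreamDescriptor:          # one SCT entry (illustrative fields)
    base: int                    # starting address of the stream
    stride: int                  # distance between consecutive elements
    count: int                   # elements remaining to prefetch
    next_addr: int = 0

    def __post_init__(self):
        self.next_addr = self.base

class VectorStreamEngine:
    def __init__(self, memory, buffer_depth=8):
        self.memory = memory                             # backing memory (list of words)
        self.sct = {}                                    # stream configuration table
        self.load_buffer = deque(maxlen=buffer_depth)    # memory -> register file
        self.store_buffer = deque(maxlen=buffer_depth)   # register file -> memory

    def configure_stream(self, stream_id, base, stride, count):
        self.sct[stream_id] = StreamDescriptor(base, stride, count)

    def prefetch(self, stream_id):
        """Pull the next elements of a configured stream into the load buffer."""
        desc = self.sct[stream_id]
        while desc.count > 0 and len(self.load_buffer) < self.load_buffer.maxlen:
            self.load_buffer.append(self.memory[desc.next_addr])
            desc.next_addr += desc.stride
            desc.count -= 1

    def pop_into_register_file(self):
        """Deliver one prefetched element toward the vector register file."""
        return self.load_buffer.popleft() if self.load_buffer else None

    def push_from_register_file(self, addr, value):
        """Stage a register-file value for write-back to memory."""
        self.store_buffer.append((addr, value))

    def drain_stores(self):
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value

# Example: stream every other word of a 16-word memory toward the register file.
mem = list(range(16))
vse = VectorStreamEngine(mem)
vse.configure_stream(0, base=0, stride=2, count=8)
vse.prefetch(0)
print([vse.pop_into_register_file() for _ in range(4)])   # [0, 2, 4, 6]
```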
METHOD AND SYSTEM FOR PROVIDING A CONTEXT-SENSITIVE, NON-INTRUSIVE DATA PROCESSING OPTIMIZATION FRAMEWORK
A method of performing a data search in a data source in which an operator of a data search pipeline is just-in-time optimized and compiled, using an operator optimization module that optimizes and compiles an intermediate representation of the operator, considering runtime information and optimization rules, to produce an operator that is optimized for the data search being performed. The method can be applied with one operator or with many operators applied in any sequence or tree structure according to a query plan, as determined by runtime information and optimization rules.
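A minimal sketch of the just-in-time operator idea follows: an operator is kept as a small intermediate representation, an optimization rule reorders it using runtime information (here, estimated predicate selectivities), and the result is "compiled" into a callable that the search pipeline executes. The IR layout, the rule, and the statistics are illustrative assumptions, not the framework disclosed above.

```python
def optimize_operator(ir_predicates, runtime_stats):
    """Rule: evaluate the most selective predicate first to fail fast."""
    return sorted(ir_predicates,
                  key=lambda p: runtime_stats.get(p["column"], 1.0))

def compile_operator(ir_predicates):
    """'Compile' the optimized IR into a single filter callable."""
    def op(row):
        return all(p["test"](row[p["column"]]) for p in ir_predicates)
    return op

# Operator IR: filter rows where status == "active" and score > 90.
ir = [
    {"column": "status", "test": lambda v: v == "active"},
    {"column": "score",  "test": lambda v: v > 90},
]
# Runtime info says `score > 90` is far more selective, so it should run first.
stats = {"status": 0.6, "score": 0.05}

op = compile_operator(optimize_operator(ir, stats))
rows = [{"status": "active", "score": 95}, {"status": "idle", "score": 99}]
print([r for r in rows if op(r)])   # [{'status': 'active', 'score': 95}]
```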
Digitally coordinated dynamically adaptable clock and voltage supply apparatus and method
An apparatus and method are described that digitally coordinate a dynamically adaptable clock and voltage supply to significantly reduce the energy consumed by a processor without impacting its performance or latency. A signal is generated that indicates a long-latency operation. This signal is used to reduce the power supply voltage and the frequency of the adaptable clock. An early resume indicator is generated a few nanoseconds before normal operations are about to resume. This early resume signal is used to power up the powered-down voltage regulator, and/or to increase the frequency and/or supply voltage back to their normal levels before normal processor operations resume.
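The following is a toy event model of that coordination, assuming made-up voltage and frequency levels: the long-latency signal drops the adaptive clock and supply, and the early resume signal restores them before normal operation continues. None of the numbers or names come from the patent.

```python
class PowerClockController:
    NOMINAL   = {"freq_mhz": 3000, "vdd_mv": 900, "regulator_on": True}
    RETENTION = {"freq_mhz": 400,  "vdd_mv": 550, "regulator_on": False}

    def __init__(self):
        self.state = dict(self.NOMINAL)

    def on_long_latency_start(self):
        """Asserted when a long-latency operation (e.g. a stall) begins:
        lower the adaptable clock frequency and the supply voltage."""
        self.state = dict(self.RETENTION)

    def on_early_resume(self):
        """Asserted slightly before the stall resolves: power the regulator
        back up and ramp frequency/voltage so no performance is lost."""
        self.state = dict(self.NOMINAL)

ctrl = PowerClockController()
ctrl.on_long_latency_start()
print(ctrl.state)   # reduced frequency/voltage, regulator powered down
ctrl.on_early_resume()
print(ctrl.state)   # back to nominal before normal operations resume
```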
Prefetch kernels on data-parallel processors
Embodiments include methods, systems, and non-transitory computer-readable media including instructions for executing a prefetch kernel with reduced intermediate state storage resource requirements. These include executing a prefetch kernel on a graphics processing unit (GPU), such that the prefetch kernel begins executing before a processing kernel. The prefetch kernel performs memory operations that are based upon at least a subset of memory operations in the processing kernel.
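A host-side toy model of the idea is sketched below: before the processing kernel runs, a lightweight prefetch kernel issues only the processing kernel's memory reads (no arithmetic, so it needs little intermediate state) to warm a cache. The cache and kernels are simulated in plain Python purely for illustration; real embodiments run on a GPU.

```python
class Cache:
    def __init__(self, backing):
        self.backing, self.lines, self.hits, self.misses = backing, set(), 0, 0

    def read(self, addr):
        if addr in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            self.lines.add(addr)
        return self.backing[addr]

def prefetch_kernel(cache, addresses):
    for a in addresses:          # memory operations only; no compute state kept
        cache.read(a)

def processing_kernel(cache, addresses):
    return sum(cache.read(a) * 2 for a in addresses)   # compute plus the same reads

data = list(range(64))
cache = Cache(data)
addrs = range(0, 64, 4)
prefetch_kernel(cache, addrs)            # launched ahead of the processing kernel
result = processing_kernel(cache, addrs)
print(result, cache.hits, cache.misses)  # all processing-kernel reads now hit
```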
Circuitry with adaptive memory assistance capabilities
A system for running one or more applications is provided. Each application may require memory services that can be accelerated using configurable memory assistance circuits associated with different levels of a memory hierarchy. Integrated circuit design tools may be used to generate configuration data for programming the configurable memory assistance circuits. During compile time, the design tools may identify memory service patterns in a source code, match the identified memory service patterns to corresponding templates, parameterize the matching templates, and then synthesize the parameterized templates to produce the configuration data. During run time, a memory assistance scheduler may map the memory services required by each application to available memory assistance circuits in the system. The mapped memory assistance circuits are programmed by the configuration data to provide the desired memory service capability.
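The two phases described above can be sketched schematically as follows: at compile time, memory access patterns found in the source are matched to templates and parameterized into configuration records; at run time, a scheduler maps the services an application needs onto available memory assistance circuits. The pattern names, template fields, and circuit inventory are assumptions invented for the example.

```python
TEMPLATES = {
    "strided_stream": {"circuit": "prefetcher",  "params": ["stride", "depth"]},
    "scatter_gather": {"circuit": "gather_unit", "params": ["index_width"]},
}

def compile_time_configure(detected_patterns):
    """Match detected memory service patterns to templates and emit configuration data."""
    config = []
    for pattern in detected_patterns:
        template = TEMPLATES[pattern["kind"]]
        config.append({
            "circuit": template["circuit"],
            "params": {k: pattern[k] for k in template["params"]},
        })
    return config

def run_time_schedule(config, available_circuits):
    """Map each required memory service to a free assistance circuit."""
    mapping, free = [], dict(available_circuits)
    for entry in config:
        kind = entry["circuit"]
        if free.get(kind, 0) > 0:
            free[kind] -= 1
            mapping.append((kind, entry["params"]))
    return mapping

patterns = [{"kind": "strided_stream", "stride": 64, "depth": 8}]
cfg = compile_time_configure(patterns)
print(run_time_schedule(cfg, {"prefetcher": 2, "gather_unit": 1}))
```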
ANALYSIS FOR MODELING DATA CACHE UTILIZATION
Aspects include modeling data cache utilization for each loop in a loop nest; estimating total data cache lines fetched in one iteration of the loop; and determining the possibility of data cache reuse across loop iterations using data cache lines fetched and associativity constraints. Aspects also include estimating, for memory reference pairs, reuse by one reference of data cache line fetched by another; estimating total number of cache misses for all iterations of the loop; and estimating total number of cache misses of a reference for iterations of a next outer loop as equal to total cache misses for an entire inner loop. Aspects further include estimating memory cost of a loop unroll and jam transformation, without performing the transformation; and extending a data cache model to estimate best unroll-and-jam factors for the loop nest, capable of minimizing total cache misses incurred by the memory references in the loop body.
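A simplified analytical sketch in the spirit of this abstract: estimate the distinct cache lines a strided reference fetches over a loop's iterations, then use that to compare unroll-and-jam factors for a two-deep nest without actually transforming the loop. The cache line size, element size, and the "no reuse between jammed groups" simplification are assumptions, not the patented model.

```python
import math

LINE_SIZE = 64   # bytes per cache line (assumed)

def lines_fetched(trip_count, stride_bytes, line_size=LINE_SIZE):
    """Distinct cache lines touched by one reference across all loop iterations."""
    if stride_bytes == 0:
        return 1
    span = trip_count * abs(stride_bytes)
    return min(trip_count, math.ceil(span / line_size))

def nest_misses_after_unroll_and_jam(n_outer, n_inner, unroll_factor, elem=8):
    """Estimated misses for:  for i: for j: ... a[i][j] ... b[j] ...
    after unroll-and-jam of the i loop, without performing the transform."""
    inner_lines = lines_fetched(n_inner, elem)
    a_misses = n_outer * inner_lines                  # a[i][j]: fresh lines for every i
    groups = math.ceil(n_outer / unroll_factor)       # jammed bodies executed
    b_misses = groups * inner_lines                   # b[j]: reused within a jammed body
    return a_misses + b_misses

# Larger unroll-and-jam factors amortize b[j]'s fetches across more i values.
for uf in (1, 2, 4, 8):
    print(uf, nest_misses_after_unroll_and_jam(128, 1024, uf))
```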
PREFETCH KERNELS ON DATA-PARALLEL PROCESSORS
Embodiments include methods, systems, and non-transitory computer-readable media including instructions for executing a prefetch kernel that includes memory accesses for prefetching data for a processing kernel into a memory, and, subsequent to executing at least a portion of the prefetch kernel, executing the processing kernel, where the processing kernel includes accesses to data that is stored into the memory as a result of executing the prefetch kernel.
Shared Compilation Cache Verification System
Example embodiments of the present disclosure provide, in one example aspect, an example computer-implemented method for verification of a shared cache. The example method can include retrieving a precompiled shared cache entry corresponding to a shared cache key, the shared cache key being associated with an operation request. The example method can include obtaining a directly compiled resource associated with the operation request. The example method can include certifying one or more portions of the shared cache based at least in part on a comparison of the precompiled shared cache entry and the directly compiled resource.
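A small sketch of that verification flow: look up the precompiled entry for an operation's cache key, compile the same operation directly, and certify the cache entry only if the two artifacts match. The key derivation, stand-in "compiler", and artifact format are invented for the example and are assumptions, not the disclosed system.

```python
import hashlib

shared_cache = {}   # cache_key -> precompiled artifact (bytes)

def cache_key(operation_request: str) -> str:
    return hashlib.sha256(operation_request.encode()).hexdigest()

def compile_directly(operation_request: str) -> bytes:
    """Stand-in for the real compiler producing an artifact for the request."""
    return b"compiled:" + operation_request.encode()

def certify(operation_request: str) -> bool:
    """Certify the shared cache entry for this request by direct comparison."""
    key = cache_key(operation_request)
    precompiled = shared_cache.get(key)
    if precompiled is None:
        return False
    direct = compile_directly(operation_request)
    return precompiled == direct        # certify only on an exact match

# Populate the shared cache, then verify one entry against a direct compile.
request = "add_kernel(v1)"
shared_cache[cache_key(request)] = compile_directly(request)
print(certify(request))                 # True: the shared entry is certified
print(certify("mul_kernel(v2)"))        # False: no entry to certify
```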
EFFICIENT AND CONCURRENT MODEL EXECUTION
An accelerator is disclosed. A circuit may process data to produce processed data. A first tier storage may include a first capacity and a first latency. A second tier storage may include a second capacity and a second latency. The second capacity may be larger than the first capacity, and the second latency may be slower than the first latency. A bus may be used to transfer at least one of the data or the processed data between the first tier storage and the second tier storage.
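A toy model of that tiered arrangement is sketched below: a small, fast first tier sits in front of a larger, slower second tier, with data and processed results moved between them over a shared bus. The capacities, eviction policy, and stand-in processing circuit are illustrative assumptions.

```python
from collections import OrderedDict

class TieredStorage:
    def __init__(self, tier1_capacity=4):
        self.tier1 = OrderedDict()      # small, fast (e.g. on-chip memory)
        self.tier2 = {}                 # larger, slower (e.g. off-chip memory)
        self.capacity = tier1_capacity
        self.bus_transfers = 0

    def put(self, key, value):
        """Keep recent data in tier 1, spilling the oldest entry to tier 2 over the bus."""
        self.tier1[key] = value
        self.tier1.move_to_end(key)
        if len(self.tier1) > self.capacity:
            old_key, old_val = self.tier1.popitem(last=False)
            self.bus_transfers += 1
            self.tier2[old_key] = old_val

    def get(self, key):
        if key in self.tier1:
            return self.tier1[key]
        self.bus_transfers += 1         # fetch over the bus from tier 2
        return self.tier2[key]

def process(x):                         # stand-in for the accelerator circuit
    return x * x

store = TieredStorage()
for i in range(8):
    store.put(i, process(i))            # processed data lands in tier 1 first
print(store.get(0), store.bus_transfers)  # older result now comes from tier 2
```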