G06F8/4442

Method and device for distributing partitions on a multicore processor
11175963 · 2021-11-16

A method and a device for distributing partitions of a sequence of partitions on the cores of a multicore processor are provided. The method makes it possible to identify parameters characterizing the hardware architecture of a multicore processor, and parameters characterizing an initial ordering of the partitions of a sequence; and then to profile and classify each partition of the sequence in order to assign the execution of each partition to a core of the multicore processor while maintaining the initial sequential ordering of the partitions.
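
The profile, classify, and assign steps in the abstract above might be sketched as follows. This is only an illustrative sketch: the `cache_misses` metric, the threshold, the two class names, and the round-robin choice of core are all assumptions, not details from the patent. The one property taken from the abstract is that the output keeps the initial sequential ordering of the partitions.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    cache_misses: int  # hypothetical profiling metric, not from the abstract


def classify(partition, threshold=1000):
    """Assumed two-way classification based on the hypothetical metric."""
    return "memory_bound" if partition.cache_misses >= threshold else "compute_bound"


def distribute(sequence, cores_by_class):
    """Assign each partition to a core; the result keeps the input order."""
    next_slot = {cls: 0 for cls in cores_by_class}
    schedule = []
    for partition in sequence:  # iterate in the initial sequential order
        cls = classify(partition)
        cores = cores_by_class[cls]
        core = cores[next_slot[cls] % len(cores)]  # round-robin within the class
        next_slot[cls] += 1
        schedule.append((partition.name, core))
    return schedule


sequence = [
    Partition("P1", 1500),
    Partition("P2", 10),
    Partition("P3", 2000),
    Partition("P4", 5),
]
schedule = distribute(sequence, {"memory_bound": [0, 1], "compute_bound": [2]})
```

Here `schedule` lists the partitions in their original order, each paired with the core chosen for its class.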

Systems and methods for data processing

A method for data processing is provided. The method may include: preprocessing initial data to obtain preprocessed data; storing the preprocessed data; receiving a data request made through an application, the data request including information relating to a storage path of contents that are requested; in response to the data request, determining, by a nearby proxy of a first proxy cluster in a first region, whether the contents requested in the data request are cached locally; and in response to a determination that the contents are cached locally, providing, by the nearby proxy, the contents to the application; or in response to a determination that the contents are not cached locally, acquiring, by the nearby proxy, the contents based on the information relating to the storage path of the contents; and providing, by the nearby proxy, the contents to the application.
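
The request path described above (check the local cache, otherwise fetch by storage path) could look roughly like the sketch below. The class name `NearbyProxy`, the dictionary standing in for the backing storage, and the choice to cache contents after a miss are assumptions made for illustration; the abstract only states that the proxy acquires and provides the contents.

```python
class NearbyProxy:
    """Toy model of a nearby proxy with a local cache."""

    def __init__(self, storage):
        self.storage = storage   # stand-in for remote storage: path -> contents
        self.cache = {}          # locally cached contents: path -> contents
        self.storage_fetches = 0

    def handle(self, storage_path):
        # Cached locally: provide the contents directly.
        if storage_path in self.cache:
            return self.cache[storage_path]
        # Not cached locally: acquire the contents using the storage path.
        contents = self.storage[storage_path]
        self.storage_fetches += 1
        self.cache[storage_path] = contents  # assumption: keep a local copy
        return contents


storage = {"/datasets/train/0001.bin": b"preprocessed-bytes"}
proxy = NearbyProxy(storage)
first = proxy.handle("/datasets/train/0001.bin")   # miss, fetched from storage
second = proxy.handle("/datasets/train/0001.bin")  # hit, served from the cache
```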

CODE PREFETCH INSTRUCTION

Embodiments of apparatuses, methods, and systems for code prefetching are described. In an embodiment, an apparatus includes an instruction decoder, load circuitry, and execution circuitry. The instruction decoder is to decode a code prefetch instruction. The code prefetch instruction is to specify a first instruction to be prefetched. The load circuitry is to prefetch the first instruction in response to the decoded code prefetch instruction. The execution circuitry is to execute the first instruction at a fetch stage of a pipeline.

Systems and methods for array structure processing

A compiler optimization for structure peeling an array of structures (AOS) into a structure of arrays (SOA), by which a pointer to an array in the original program is transformed into a tagged index that includes both an array index and a memory identifier tagging the array index. Once processed by the compiler, each array index is identified by a respective memory identifier; hence, if the program instructions call for redefining an array during run time, its array elements can still be retrieved by referring to the memory identifier the index is tagged with.
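
One way to picture a tagged index is as an integer that packs a memory identifier above an array index. The sketch below is a hand-written model, not compiler output: the 32-bit split, the `PeeledPoints` class, and its field names are assumptions chosen for illustration.

```python
INDEX_BITS = 32                      # assumed width of the array-index part
INDEX_MASK = (1 << INDEX_BITS) - 1


def make_tagged(memory_id, index):
    """Pack a memory identifier and an array index into one integer."""
    return (memory_id << INDEX_BITS) | index


def split_tagged(tagged):
    """Recover (memory_id, index) from a tagged index."""
    return tagged >> INDEX_BITS, tagged & INDEX_MASK


class PeeledPoints:
    """Structure-of-arrays storage for records with fields x and y."""

    def __init__(self):
        self.xs = {}       # memory_id -> list of x fields
        self.ys = {}       # memory_id -> list of y fields
        self._next_id = 0

    def allocate(self, points):
        """Peel a list of (x, y) records; return a tagged index to element 0."""
        memory_id = self._next_id
        self._next_id += 1
        self.xs[memory_id] = [p[0] for p in points]
        self.ys[memory_id] = [p[1] for p in points]
        return make_tagged(memory_id, 0)

    def get_x(self, tagged):
        memory_id, index = split_tagged(tagged)
        return self.xs[memory_id][index]

    def get_y(self, tagged):
        memory_id, index = split_tagged(tagged)
        return self.ys[memory_id][index]


store = PeeledPoints()
a = store.allocate([(1, 2), (3, 4)])
b = store.allocate([(5, 6)])   # a second array allocated at run time
x_of_a1 = store.get_x(a + 1)   # index arithmetic stays within a's memory identifier
y_of_b0 = store.get_y(b)
```

Because the memory identifier travels with the index, elements of `a` remain reachable after `b` is allocated.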

Methods and apparatus for intra-wave texture looping
11640647 · 2023-05-02

The present disclosure relates to methods and devices for graphics processing including an apparatus, e.g., a GPU. The apparatus may determine whether to divide a group of threads into a plurality of sub-groups of threads, each thread of the group of threads being associated with a shader program. The apparatus may also divide, upon determining to divide the group of threads into the plurality of sub-groups of threads, the group of threads into the plurality of sub-groups of threads. Additionally, the apparatus may execute, upon dividing the group of threads into the plurality of sub-groups of threads, a subsection of the shader program for each sub-group of threads of the plurality of sub-groups of threads.

Shared Compilation Cache Verification System
20230350656 · 2023-11-02

Example embodiments of the present disclosure provide, in one example aspect, an example computer-implemented method for verification of a shared cache. The example method can include retrieving a precompiled shared cache entry corresponding to a shared cache key, the shared cache key being associated with an operation request. The example method can include obtaining a directly compiled resource associated with the operation request. The example method can include certifying one or more portions of the shared cache based at least in part on a comparison of the precompiled shared cache entry and the directly compiled resource.
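
The comparison step above can be illustrated with a small sketch. Everything concrete here is an assumption: the SHA-256 cache key, the stand-in `compile_directly` function, and the use of byte equality as the comparison. The abstract only says that certification is based at least in part on comparing the precompiled entry with the directly compiled resource.

```python
import hashlib


def cache_key(operation_request):
    """Hypothetical key derivation: hash of the request text."""
    return hashlib.sha256(operation_request.encode("utf-8")).hexdigest()


def compile_directly(operation_request):
    """Stand-in for a real compiler; returns deterministic bytes."""
    return operation_request.upper().encode("utf-8")


def verify_entry(shared_cache, operation_request, certified):
    """Compare the cached entry with a direct compile and record the outcome."""
    key = cache_key(operation_request)
    precompiled = shared_cache.get(key)           # retrieve the precompiled entry
    direct = compile_directly(operation_request)  # obtain the directly compiled resource
    matches = precompiled == direct
    if matches:
        certified.add(key)       # certify this portion of the cache
    else:
        certified.discard(key)   # a mismatch leaves the entry uncertified
    return matches


shared_cache = {
    cache_key("add x y"): b"ADD X Y",
    cache_key("mul x y"): b"corrupted",
}
certified = set()
good = verify_entry(shared_cache, "add x y", certified)
bad = verify_entry(shared_cache, "mul x y", certified)
```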

Instruction prefetch mechanism

An apparatus to facilitate data prefetching is disclosed. The apparatus includes a cache; one or more execution units (EUs) to execute program code; prefetch logic to maintain tracking information of memory instructions in the program code that trigger a cache miss; and compiler logic to receive the tracking information, insert one or more prefetch instructions in updated program code to prefetch data from a memory for execution of one or more of the memory instructions that triggered a cache miss, and download the updated program code for execution by the one or more EUs.

Removing branching paths from a computer program
11567744 · 2023-01-31

Methods and systems are described for removing branches from a computer program. The system receives code for a computer program, with the code including a number of branches. Each branch is part of a branching path and includes a jump instruction. The system executes the code, and upon encountering a branching path at runtime, the system proceeds with a number of steps. First, the system computes the result of the branch, then prefetches independent instructions outside of the branch to be executed. The system then executes one or more of the prefetched independent instructions and removes an if statement within the jump instruction of the branch at the computed result of the branching path. The system then executes the jump instruction of the branch at the computed result of the branching path.

METHODS AND APPARATUS FOR INTRA-WAVE TEXTURE LOOPING
20220284537 · 2022-09-08

The present disclosure relates to methods and devices for graphics processing including an apparatus, e.g., a GPU. The apparatus may determine whether to divide a group of threads into a plurality of sub-groups of threads, each thread of the group of threads being associated with a shader program. The apparatus may also divide, upon determining to divide the group of threads into the plurality of sub-groups of threads, the group of threads into the plurality of sub-groups of threads. Additionally, the apparatus may execute, upon dividing the group of threads into the plurality of sub-groups of threads, a subsection of the shader program for each sub-group of threads of the plurality of sub-groups of threads.

Systems and methods for increased bandwidth utilization regarding irregular memory accesses using software pre-execution

Systems and methods are configured to receive code containing an original loop that includes irregular memory accesses. The original loop can be split. A pre-execution loop that contains code to prefetch content of the memory can be generated. Execution of the pre-execution loop can access memory inclusively between a starting location and the starting location plus a prefetch distance. A modified loop that can perform at least one computation based on the content prefetched with execution of the pre-execution loop can be generated. Execution of the main loop can follow the execution of the pre-execution loop. The original loop can be replaced with the pre-execution loop and the modified loop.
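
The loop split described above can be modeled structurally as in the sketch below. Python has no prefetch instruction, so the pre-execution loop here simply reads the irregularly indexed elements into a local buffer; the function names, the chunked driver loop, and the summation used as the computation are assumptions made for illustration.

```python
def original_loop(data, index):
    """Original loop with irregular (indirect) accesses data[index[i]]."""
    total = 0
    for i in range(len(index)):
        total += data[index[i]]
    return total


def split_loop(data, index, prefetch_distance):
    """Replace the original loop with a pre-execution loop plus a modified loop."""
    total = 0
    n = len(index)
    start = 0
    while start < n:
        # Inclusive range [start, start + prefetch_distance], clipped to the array.
        end = min(start + prefetch_distance, n - 1)

        # Pre-execution loop: touch the memory ahead of the computation.
        prefetched = []
        for i in range(start, end + 1):
            prefetched.append(data[index[i]])

        # Modified loop: compute on the content brought in above.
        for value in prefetched:
            total += value

        start = end + 1
    return total


data = list(range(100))
index = [7, 93, 12, 58, 3, 41, 76]  # irregular access pattern
expected = original_loop(data, index)
result = split_loop(data, index, prefetch_distance=2)
```

In this model the two versions compute the same sum; in a compiled language the pre-execution loop would instead issue prefetches so that the modified loop finds its operands already in cache.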