Patent classifications
G06F2209/507
DYNAMIC UPDATE OF THE NUMBER OF ARCHITECTED REGISTERS ASSIGNED TO SOFTWARE THREADS USING SPILL COUNTS
A computer system includes a processor, main memory, and controller. The processor includes a plurality of hardware threads configured to execute a plurality of software threads. The main memory includes a first register table configured to contain a current set of architected registers for the currently running software threads. The controller is configured to change a first number of the architected registers assigned to a given one of the software threads to a second number of architected registers when a result of monitoring current usage of the registers by the software threads indicates that the change will improve performance of the computer system. The processor includes a second register table configured to contain a subset of the architected registers and a mapping table for each software thread indicating whether the architected registers referenced by the corresponding software thread are located in the first register table or the second register table.
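The abstract does not say how the controller decides that a change will improve performance; the title only indicates that spill counts are involved. The following is a minimal sketch of one possible policy, assuming that the thread with the most spills receives registers from the thread with the fewest. The function name `rebalance_registers` and the parameters `step` and `min_regs` are hypothetical and do not come from the patent.

```python
def rebalance_registers(assigned, spill_counts, step=4, min_regs=8):
    """Return a new register assignment after one rebalancing pass.

    assigned:     thread id -> number of architected registers assigned
    spill_counts: thread id -> spills observed in the last monitoring window
    step, min_regs: assumed tuning knobs, not taken from the patent
    """
    needy = max(spill_counts, key=spill_counts.get)   # spills the most
    donor = min(spill_counts, key=spill_counts.get)   # spills the least
    updated = dict(assigned)
    # No change if every thread spills equally or the donor is already minimal.
    if spill_counts[needy] == spill_counts[donor]:
        return updated
    if updated[donor] - step < min_regs:
        return updated
    updated[donor] -= step
    updated[needy] += step
    return updated


assigned = {"t0": 32, "t1": 32, "t2": 32}
spills = {"t0": 120, "t1": 3, "t2": 40}
updated = rebalance_registers(assigned, spills)
print(updated)
```

The total number of registers stays constant in this sketch; only the split between threads changes.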
METHOD AND SYSTEM FOR REPLICATING CORE CONFIGURATIONS
A system and method to efficiently configure an array of processing cores to perform functions of a program. A function of the program is converted to a configuration of cores. The configuration is laid out in a first subset of the array of cores. The configuration is stored. The configuration is replicated to perform the function on a second subset of the array of cores.
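One way to picture the store-and-replicate step is to keep the configuration as a layout relative to an origin and translate it to a second region of the array. This is only an illustrative sketch; the dictionary layout, the role names, and the helper `place` are assumptions, not details from the patent.

```python
def place(config, origin, rows, cols):
    """Translate a stored, origin-relative layout onto absolute core coordinates.

    config: (row_offset, col_offset) -> role of the core at that offset
    origin: top-left core of the target subset
    rows, cols: dimensions of the whole core array
    """
    r0, c0 = origin
    placed = {}
    for (r, c), role in config.items():
        rr, cc = r0 + r, c0 + c
        if not (0 <= rr < rows and 0 <= cc < cols):
            raise ValueError(f"core ({rr}, {cc}) lies outside the array")
        placed[(rr, cc)] = role
    return placed


# Stored configuration for one function (hypothetical roles).
config = {(0, 0): "load", (0, 1): "multiply", (1, 0): "add", (1, 1): "store"}

first = place(config, origin=(0, 0), rows=4, cols=4)    # first subset of cores
second = place(config, origin=(2, 2), rows=4, cols=4)   # replicated on a second subset
```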
Management of resources within a computing environment
Resources in a computing environment are managed, for example, by a hardware controller that controls the dispatching of resources from one or more pools of resources to be used in execution of threads. The controlling includes conditionally dispatching resources from the pool(s) to one or more low-priority threads of the computing environment based on current usage of resources in the pool(s) relative to an associated resource usage threshold. The management further includes monitoring resource dispatching from the pool(s) to one or more high-priority threads of the computing environment, and based on the monitoring, dynamically adjusting the resource usage threshold used in the conditional dispatching of resources from the pool(s) to the low-priority thread(s).
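The threshold mechanism can be sketched as a small pool model. The class `ResourcePool`, the stall-based adjustment rule, and the `step` size are assumptions made for illustration; the abstract states only that the threshold is adjusted based on monitoring of high-priority dispatching.

```python
class ResourcePool:
    def __init__(self, capacity, threshold):
        self.capacity = capacity
        self.in_use = 0
        # Usage level at or above which low-priority requests are refused.
        self.threshold = threshold

    def dispatch(self, high_priority):
        """Try to hand out one resource; return True on success."""
        if self.in_use >= self.capacity:
            return False
        if not high_priority and self.in_use >= self.threshold:
            return False
        self.in_use += 1
        return True

    def release(self):
        self.in_use = max(0, self.in_use - 1)

    def adjust_threshold(self, high_priority_stalls, step=1):
        """Assumed policy: tighten the threshold when high-priority threads
        were stalled in the last window, otherwise relax it."""
        if high_priority_stalls > 0:
            self.threshold = max(0, self.threshold - step)
        else:
            self.threshold = min(self.capacity, self.threshold + step)


pool = ResourcePool(capacity=4, threshold=2)
low = [pool.dispatch(high_priority=False) for _ in range(3)]
high = pool.dispatch(high_priority=True)
pool.adjust_threshold(high_priority_stalls=1)
```

In this run the third low-priority request is refused because usage has reached the threshold, while the high-priority request still succeeds.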
Hierarchical staging areas for scheduling threads for execution
One embodiment of the present invention is a computer-implemented method for scheduling a thread group for execution on a processing engine that includes identifying a first thread group included in a first set of thread groups that can be issued for execution on the processing engine, where the first thread group includes one or more threads. The method also includes transferring the first thread group from the first set of thread groups to a second set of thread groups, allocating hardware resources to the first thread group, and selecting the first thread group from the second set of thread groups for execution on the processing engine. One advantage of the disclosed technique is that a scheduler only allocates limited hardware resources to thread groups that are, in fact, ready to be issued for execution, thereby conserving those resources in a manner that is generally more efficient than conventional techniques.
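The two-stage flow described above can be sketched as follows, under the assumption that the only hardware resource is a register budget. The class `StagedScheduler` and its method names are hypothetical.

```python
from collections import deque


class StagedScheduler:
    def __init__(self, register_budget):
        self.pending = deque()   # first set: issuable, no resources held yet
        self.staged = deque()    # second set: resources already allocated
        self.free_registers = register_budget

    def submit(self, name, registers_needed):
        self.pending.append((name, registers_needed))

    def promote(self):
        """Move the oldest pending group to the staged set if its resources fit."""
        if not self.pending:
            return None
        name, need = self.pending[0]
        if need > self.free_registers:
            return None
        self.pending.popleft()
        self.free_registers -= need   # allocation happens only at this point
        self.staged.append((name, need))
        return name

    def issue(self):
        """Select the oldest staged group for execution.
        Releasing its registers on completion is omitted from this sketch."""
        if not self.staged:
            return None
        name, _ = self.staged.popleft()
        return name


sched = StagedScheduler(register_budget=64)
sched.submit("group_a", 48)
sched.submit("group_b", 32)
first = sched.promote()    # group_a fits in the budget
second = sched.promote()   # group_b does not fit in the remaining 16 registers
issued = sched.issue()
```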
ALLOCATION OF MEMORY BY MAPPING REGISTERS REFERENCED BY DIFFERENT INSTANCES OF A TASK TO INDIVIDUAL LOGICAL MEMORIES
Methods of memory allocation in which registers referenced by different groups of instances of the same task are mapped to individual logical memories. Other example methods describe the mapping of registers referenced by a task to different banks within a single logical memory and in various examples this mapping may take into consideration which bank is likely to be the dominant bank for the particular task and the allocation for one or more other tasks.
PROVIDING INTERRUPT SERVICE ROUTINE (ISR) PREFETCHING IN MULTICORE PROCESSOR-BASED SYSTEMS
Providing interrupt service routine (ISR) prefetching in multicore processor-based systems is disclosed. In one aspect, a multicore processor-based system provides an ISR prefetch control circuit communicatively coupled to an interrupt controller and a plurality of instruction fetch units (IFUs) of a corresponding plurality of processor elements (PEs). Upon receiving an interrupt directed to a target PE of the plurality of PEs, the interrupt controller provides an interrupt request (IRQ) identifier to the ISR prefetch control circuit. Based on the IRQ identifier, the ISR prefetch control circuit fetches an ISR pointer to an ISR corresponding to the IRQ identifier. The ISR prefetch control circuit next selects a prefetch PE of the plurality of PEs to perform a prefetch operation to retrieve the ISR on behalf of the target PE, and provides an ISR prefetch request, including the ISR pointer, to an IFU of the prefetch PE.
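The sequence of steps (IRQ identifier, ISR pointer lookup, prefetch PE selection, prefetch request) can be sketched in software as below. The selection rule (first idle PE other than the target, falling back to the target) is an assumption, as are the function name `handle_interrupt` and the dictionary-based vector table; the abstract does not specify how the prefetch PE is chosen.

```python
def handle_interrupt(irq_id, target_pe, vector_table, pe_busy):
    """Build an ISR prefetch request for an interrupt directed at target_pe.

    vector_table: IRQ identifier -> ISR pointer (address of the routine)
    pe_busy:      list of booleans, one per PE
    """
    # Fetch the ISR pointer that corresponds to the IRQ identifier.
    isr_pointer = vector_table[irq_id]
    # Assumed policy: prefer an idle PE other than the target.
    idle = [pe for pe, busy in enumerate(pe_busy) if not busy and pe != target_pe]
    prefetch_pe = idle[0] if idle else target_pe
    # The request would be sent to the instruction fetch unit of prefetch_pe.
    return {"prefetch_pe": prefetch_pe, "isr_pointer": isr_pointer, "target_pe": target_pe}


request = handle_interrupt(
    irq_id=5,
    target_pe=0,
    vector_table={5: 0x8000},
    pe_busy=[False, True, False, True],
)
```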
Multiple-patch SIMD dispatch mode for domain shaders
To use SIMD lanes efficiently for domain shader execution, domain point data from different domain shader patches may be packed together into a single SIMD thread. To generate an efficient code sequence, each domain point occupies one SIMD lane and all attributes for the domain point reside in their own partition of General Register File (GRF) space. This technique is called the multiple-patch SIMD dispatch mode.
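The packing idea can be illustrated by flattening the domain points of several patches and filling SIMD threads lane by lane. This sketch assumes a SIMD width of 8 and allows a patch to straddle two threads; whether an actual implementation does so is not stated in the abstract. The function name `pack_domain_points` is hypothetical.

```python
def pack_domain_points(patches, simd_width=8):
    """Pack domain points from several patches into SIMD threads.

    patches: patch id -> list of domain points (here, (u, v) tuples)
    Returns a list of threads; each thread is a list of (patch id, point) lanes.
    """
    lanes = [(pid, pt) for pid, pts in patches.items() for pt in pts]
    return [lanes[i:i + simd_width] for i in range(0, len(lanes), simd_width)]


# Four small patches with four domain points each.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patches = {f"patch{i}": list(corners) for i in range(4)}

packed = pack_domain_points(patches)
threads_one_patch_each = len(patches)   # single-patch dispatch: one thread per patch
```

With one patch per thread, four threads would each use 4 of 8 lanes; the packed form needs two fully occupied threads.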
Computing Device for Fast Weighted Sum Calculation in Neural Networks
A computing device for fast weighted sum calculation in neural networks is disclosed. The computing device comprises an array of processing elements configured to accept an input array. Each processing element comprises a plurality of multipliers and multiple levels of accumulators. A set of weights associated with the inputs and a target output are provided to a target processing element to compute the weighted sum for the target output. The device according to the present invention reduces the computation time from M clock cycles to log2(M) clock cycles, where M is the size of the input array.
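The logarithmic cycle count follows from summing the products pairwise in a tree rather than one after another. The sketch below models this in software, assuming that all multiplications happen in parallel and that each accumulator level halves the number of partial sums; the function name `weighted_sum_tree` is hypothetical.

```python
def weighted_sum_tree(inputs, weights):
    """Return (weighted sum, number of accumulator levels used)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # In hardware, all M multipliers would operate in the same cycle.
    level = [x * w for x, w in zip(inputs, weights)]
    levels = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)   # pad odd-sized levels with a neutral element
        # Each accumulator level adds neighbouring pairs in parallel.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return (level[0] if level else 0), levels


total, levels = weighted_sum_tree([1, 2, 3, 4, 5, 6, 7, 8], [2, 1, 0, 1, 2, 1, 0, 1])
print(total, levels)
```

For M = 8 inputs the tree needs three accumulator levels, compared with eight sequential accumulation steps.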
Allocation of resources to tasks
A method of managing resources in a graphics processing pipeline includes, in response to selecting a task for execution within a texture/shading unit, allocating to the task both a static allocation of temporary registers for the entire task and a dynamic allocation of temporary registers. The dynamic allocation comprises temporary registers used by a first phase of the task only and the static allocation of temporary registers comprises any temporary registers that are used by the program and are live at a boundary between two phases. When the task subsequently reaches a boundary between two phases, the dynamic allocation of temporary registers is freed and a new dynamic allocation of temporary registers for a next phase of the task is allocated to the task.
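The split between static and per-phase dynamic allocations can be modelled with a simple counter of free registers. This is a sketch under the assumption that the per-phase register counts are known when the task is selected; the classes `RegisterPool` and `Task` are hypothetical.

```python
class RegisterPool:
    def __init__(self, total):
        self.free = total

    def take(self, count):
        if count > self.free:
            raise RuntimeError("not enough temporary registers")
        self.free -= count

    def give(self, count):
        self.free += count


class Task:
    def __init__(self, pool, static_regs, dynamic_per_phase):
        self.pool = pool
        self.static_regs = static_regs              # live across phase boundaries
        self.dynamic_per_phase = dynamic_per_phase  # used within one phase only
        self.phase = 0
        # Static allocation for the whole task plus dynamic allocation for phase 0.
        pool.take(static_regs + dynamic_per_phase[0])

    def next_phase(self):
        """At a phase boundary, free the old dynamic allocation and take the new one."""
        if self.phase + 1 >= len(self.dynamic_per_phase):
            raise RuntimeError("task has no further phases")
        self.pool.give(self.dynamic_per_phase[self.phase])
        self.phase += 1
        self.pool.take(self.dynamic_per_phase[self.phase])

    def finish(self):
        self.pool.give(self.static_regs + self.dynamic_per_phase[self.phase])


pool = RegisterPool(total=16)
task = Task(pool, static_regs=4, dynamic_per_phase=[8, 2])
free_in_phase_0 = pool.free
task.next_phase()
free_in_phase_1 = pool.free
task.finish()
```

Registers released at the phase boundary become available to other tasks while this task is still running.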
TASK EXECUTION IN A SIMD PROCESSING UNIT WITH PARALLEL GROUPS OF PROCESSING LANES
A SIMD processing unit processes a plurality of tasks which each include up to a predetermined maximum number of work items. The work items of a task are arranged for executing a common sequence of instructions on respective data items. The data items are arranged into blocks, with some of the blocks including at least one invalid data item. Work items which relate to invalid data items are invalid work items. The SIMD processing unit comprises a group of processing lanes configured to execute instructions of work items of a particular task over a plurality of processing cycles. A control module assembles work items into the tasks based on the validity of the work items, so that invalid work items of the particular task are temporally aligned across the processing lanes. In this way the number of wasted processing slots due to invalid work items may be reduced.
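A simplified model can show why temporal alignment helps. It assumes that each lane processes one block, that cycle i handles item i of every lane, and that a cycle can be skipped when no lane holds a valid item in that position. Grouping blocks by validity mask is one possible assembly strategy chosen for illustration; the abstract does not prescribe it, and all function names are hypothetical.

```python
from collections import defaultdict


def cycles_needed(task):
    """task: list of validity masks (tuples of bool), one mask per lane.
    A cycle is needed only if at least one lane has a valid item in that slot."""
    return sum(1 for slot in zip(*task) if any(slot))


def assemble_naive(blocks, lanes):
    """Fill tasks in arrival order, ignoring validity."""
    return [blocks[i:i + lanes] for i in range(0, len(blocks), lanes)]


def assemble_aligned(blocks, lanes):
    """Group blocks with identical validity masks so invalid slots line up."""
    by_mask = defaultdict(list)
    for mask in blocks:
        by_mask[mask].append(mask)
    tasks = []
    for group in by_mask.values():
        tasks.extend(group[i:i + lanes] for i in range(0, len(group), lanes))
    return tasks


T, F = True, False
blocks = [(T, T, F, F), (F, F, T, T), (T, T, F, F), (F, F, T, T)]

naive_cycles = sum(cycles_needed(t) for t in assemble_naive(blocks, lanes=2))
aligned_cycles = sum(cycles_needed(t) for t in assemble_aligned(blocks, lanes=2))
print(naive_cycles, aligned_cycles)
```

In this example the naive assembly needs eight cycles and the aligned assembly needs four, because whole cycles of invalid work items are skipped.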