Patent classifications
G06F9/3808
SEQUENTIAL MONITORING AND MANAGEMENT OF CODE SEGMENTS FOR RUN-TIME PARALLELIZATION
A processor includes an instruction pipeline and control circuitry. The instruction pipeline is configured to process instructions of program code. The control circuitry is configured to monitor the processed instructions at run-time, to construct an invocation data structure comprising multiple entries, wherein each entry (i) specifies an initial instruction that is a target of a branch instruction, (ii) specifies a portion of the program code that follows one or more possible flow-control traces beginning from the initial instruction, and (iii) specifies, for each possible flow-control trace specified in the entry, a next entry that is to be processed following processing of that possible flow-control trace, and to configure the instruction pipeline to process segments of the program code, by continually traversing the entries of the invocation data structure.
Dynamic generation of logic for computing systems
Some embodiments provide a non-transitory machine-readable medium that stores a program. The program observes a parameter associated with a computing system. Upon receiving a change associated with the parameter, the program further determines a routine definition from a set of routine definitions associated with the parameter. Each routine definition in the set of routine definitions specifies a set of instructions associated with a particular parameter associated with the computing system. The program also executes the set of instructions specified in the determined routine definition.
VERIFYING COMPRESSED STREAM FUSED WITH COPY OR TRANSFORM OPERATIONS
Methods and apparatus relating to verifying a compressed stream fused with copy or transform operation(s) are described. In an embodiment, compression logic circuitry compresses input data and stores the compressed data in a temporary buffer. The compression logic circuitry determines a first checksum value corresponding to the compressed data stored in the temporary buffer. Decompression logic circuitry performs a decompress-verify operation and a copy operation. The decompress-verify operation decompresses the compressed data stored in the temporary buffer to determine a second checksum value corresponding to the decompressed data from the temporary buffer. The copy operation transfers the compressed data from the temporary buffer to a destination buffer in response to a match between the first checksum value and the second checksum value. Other embodiments are also disclosed and claimed.
POWER MANAGEMENT OF BRANCH PREDICTORS IN A COMPUTER PROCESSOR
A computer processor includes a branch prediction unit that includes a local branch predictor and a global branch predictor. Managing power consumption in such a computer processor includes, for each of a plurality of branch instructions: performing, by the local branch predictor, a local branch prediction; performing, by each of the global branch predictors, a global branch prediction; determining to utilize the local branch prediction over the global branch predictions as a branch prediction for the branch instruction; incrementing a value of a counter; determining whether the value of the counter exceeds a predetermined threshold; and if the value of the counter exceeds the predetermined threshold, powering down at least one of the global branch predictors and configuring the branch prediction unit to bypass the powered down global branch predictor for branch predictions of subsequent branch instructions.
Instruction generation process multiplexing method and device
Aspects of reusing neural network instructions are described herein. The aspects may include a computing device configured to calculate a hash value of a neural network layer based on the layer information thereof. A determination unit may be configured to determine whether the hash value exists in a hash table. If the hash value is included in the hash table, one or more neural network instructions that correspond to the hash value may be reused.
Bit string lookup data structure
Systems, apparatuses, and methods related to bit string operations using a computing tile are described. An example apparatus includes computing device (or “tile”) that includes a processing unit and a memory resource configured as a cache for the processing unit. A data structure can be coupled to the computing device. The data structure can be configured to receive a bit string that represents a result of an arithmetic operation, a logical operation, or both and store the bit string that represents the result of the arithmetic operation, the logical operation, or both. The bit string can be formatted in a format different than a floating-point format.
Instruction and logic to provide stride-based vector load-op functionality with mask duplication
Instructions and logic provide vector load-op and/or store-op with stride functionality. Some embodiments, responsive to an instruction specifying: a set of loads, a second operation, destination register, operand register, memory address, and stride length; execution units read values in a mask register, wherein fields in the mask register correspond to stride-length multiples from the memory address to data elements in memory. A first mask value indicates the element has not been loaded from memory and a second value indicates that the element does not need to be, or has already been loaded. For each having the first value, the data element is loaded from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. Then the second operation is performed using corresponding data in the destination and operand registers to generate results. The instruction may be restarted after faults.
PROCESSOR WITH INSTRUCTION LOOKAHEAD ISSUE LOGIC
A processor having an instruction cache for storing a plurality of instructions is provided. The processor further includes annotation logic configured to determine a lookahead distance associated with an instruction and annotate the at least one instruction cache with the lookahead distance. The lookahead distance may correspond to a number of instructions that separates an instruction that references a register from the most recent register definition. The lookahead distance may indicate the shortest distance to a later instruction that references a register that this instruction defines.
Reordering buffer for memory access locality
Systems and methods for scheduling instructions for execution on a multi-core processor reorder the execution of different threads to ensure that instructions specified as having localized memory access behavior are executed over one or more sequential clock cycles to benefit from memory access locality. At compile time, code sequences including memory access instructions that may be localized are delineated into separate batches. A scheduling unit ensures that multiple parallel threads are processed over one or more sequential scheduling cycles to execute the batched instructions. The scheduling unit waits to schedule execution of instructions that are not included in the particular batch until execution of the batched instructions is done so that memory access locality is maintained for the particular batch. In between the separate batches, instructions that are not included in a batch are scheduled so that threads executing non-batched instructions are also processed and not starved.
SINGLE-THREAD SPECULATIVE MULTI-THREADING
A processor includes a pipeline and control circuitry. The pipeline is configured to process instructions of program code and includes one or more fetch units. The control circuitry is configured to predict at run-time one or more future flow-control traces to be traversed in the program code, to define, based on the predicted flow-control traces, two or more regions of the program code from which instructions are to be fetched, wherein the number of regions is greater than the number of fetch units, and to instruct the pipeline to fetch instructions alternately from the two or more regions of the program code using the one or more fetch units, and to process the fetched instructions.