Patent classifications
G06F9/383
STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces addresses of data elements for the nested loops. A stream head register stores data elements next to be supplied to functional units for use as operands. A stream template specifies a loop count and loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently, enabling a trade-off between the number of loops supported and the size of the loop counts and loop dimensions.
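The address pattern such a template describes can be modeled as nested counters, one per loop, each with its own count and byte offset. The sketch below is a minimal behavioral model under assumed names (`counts`, `dims`); it does not reflect the patent's actual register layout or field widths.

```python
def stream_addresses(base, counts, dims):
    """Yield element addresses for a nested-loop stream.

    counts[i] is the iteration count of loop i and dims[i] its byte
    offset per iteration; loop 0 is innermost. Illustrative only.
    """
    n = len(counts)
    idx = [0] * n
    while True:
        yield base + sum(idx[i] * dims[i] for i in range(n))
        # Advance the innermost loop; carry into outer loops on wrap.
        for i in range(n):
            idx[i] += 1
            if idx[i] < counts[i]:
                break
            idx[i] = 0
        else:
            return  # outermost loop exhausted
```

The format field's trade-off is visible here: with a fixed number of template bits, supporting more entries in `counts`/`dims` means each entry must be narrower.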
Throttling Schemes in Multicore Microprocessors
An electronic device includes a cache, a processing cluster having one or more processors, and prefetch throttling circuitry that determines a congestion level of the processing cluster based on an extent to which data retrieval requests sent from the processors to the cache are not satisfied by the cache. Congestion criteria require that the congestion level of the cluster be above a cluster congestion threshold. In accordance with a determination that the congestion level of the cluster satisfies the congestion criteria, the prefetch throttling circuitry causes one of the processors to limit prefetch requests to the cache to prefetch requests of at least a threshold quality. In accordance with a determination that the congestion level of the cluster does not satisfy the congestion criteria, the prefetch throttling circuitry forgoes causing the processors to limit prefetch requests to the cache to prefetch requests of at least the threshold quality.
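The decision logic reduces to two comparisons: is the cluster congested, and if so, is this prefetch good enough to keep? A minimal sketch, assuming congestion is measured as a count of unsatisfied core-to-cache requests and quality as an abstract score (both are simplifications of whatever metrics the circuitry actually uses):

```python
def throttle_decision(outstanding_misses, cluster_threshold,
                      request_quality, quality_threshold):
    """Return True if a prefetch request should be dropped.

    Congestion is modeled as the count of requests the cache has not
    yet satisfied; thresholds and quality scores are illustrative.
    """
    congested = outstanding_misses > cluster_threshold
    if congested:
        # Congested: keep only prefetches of at least threshold quality.
        return request_quality < quality_threshold
    # Not congested: forgo limiting, let all prefetches through.
    return False
```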
Data write system and method with registers defining address range
A data write system includes a processor circuit, a first memory, at least one register, and a second memory. The first memory is coupled to the processor circuit. The at least one register is configured to define at least one address range. The second memory is coupled to the first memory. If a cache miss occurs and the access address of a read command falls within the at least one range in the second memory, a predetermined amount of data corresponding to the access address is written from the second memory into at least one first way of the first memory.
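The miss path can be sketched as a range check followed by a bulk fill. The model below uses dicts for the memories, a tuple list for the register-defined ranges, and an assumed burst size; all names are illustrative, not the patent's structures.

```python
def handle_read_miss(addr, ranges, second_memory, first_memory_way, burst=4):
    """On a cache miss, if addr falls in a register-defined range,
    copy a predetermined amount of data from the second memory into
    one way of the first memory. Returns True if a bulk fill occurred."""
    for lo, hi in ranges:
        if lo <= addr <= hi:
            for off in range(burst):
                first_memory_way[addr + off] = second_memory.get(addr + off, 0)
            return True   # bulk fill performed into the designated way
    return False          # address outside all ranges: ordinary miss handling
```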
Vector prefetching for computing systems
Described is a computing system for vector prefetching which includes a hierarchical memory including multiple caches, a missing address storage unit (MASU) associated with each cache which stores prefetch requests suffering a cache miss, a prefetcher which sends prefetch requests towards the hierarchical memory, and a vector prefetch unit. The vector prefetch unit determines existence of at least one of a relationship between a cache block associated with the prefetch request and cache blocks associated with one or more entries in a MASU, or a relationship between cache blocks associated with different entries in a MASU, and sends a vector prefetch request based on related prefetch requests including indicators indicating a starting cache block and a number of related cache blocks to a higher memory level to obtain data associated with each cache block. The hierarchical memory stores the data received in at least one response message from the higher memory level if available.
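The core relationship the vector prefetch unit exploits is spatial adjacency between cache blocks pending in a MASU. A minimal sketch of coalescing pending block numbers into (starting block, count) vector requests, which is one plausible reading of the relationship check, not the patented logic itself:

```python
def coalesce_prefetches(miss_blocks):
    """Group pending prefetch cache-block numbers into vector requests
    of the form (start_block, count) covering runs of consecutive
    blocks. Block numbers are illustrative integers."""
    reqs = []
    for b in sorted(set(miss_blocks)):
        if reqs and b == reqs[-1][0] + reqs[-1][1]:
            # Extends the current run: grow the count of the last request.
            reqs[-1] = (reqs[-1][0], reqs[-1][1] + 1)
        else:
            # Unrelated block: start a new vector request.
            reqs.append((b, 1))
    return reqs
```

One request with a start indicator and a count replaces several individual prefetches to the higher memory level, which is the bandwidth saving the abstract describes.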
Prefetch strategy control for parallel execution of threads based on one or more characteristics of a stream of program instructions indicative that a data access instruction within a program is scheduled to be executed a plurality of times
A single instruction multiple thread (SIMT) processor includes execution circuitry, prefetch circuitry and prefetch strategy selection circuitry. The prefetch strategy selection circuitry serves to detect one or more characteristics of a stream of program instructions that are being executed to identify whether or not a given data access instruction within a program will be executed a plurality of times. The prefetch strategy to use is selected from a plurality of selectable prefetch strategies in dependence upon such detected characteristics.
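A simple model of this selection keys off how many times the access instruction at a given program counter has executed. The strategy names, the counter table, and the threshold below are all assumptions for illustration; the patent covers the selection principle, not these specifics.

```python
def select_strategy(execution_counts, pc, repeat_threshold=2):
    """Pick a prefetch strategy for the data access instruction at
    program counter pc. execution_counts maps pc -> observed execution
    count; names and threshold are illustrative."""
    if execution_counts.get(pc, 0) >= repeat_threshold:
        return "stride"   # re-executed (e.g. in a loop): prefetch ahead
    return "none"         # single-shot access: prefetching unlikely to pay
```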
Streaming engine with deferred exception reporting
This invention is a streaming engine employed in a digital signal processor. A fixed data stream sequence is specified by a control register. The streaming engine fetches stream data ahead of use by a central processing unit and stores it in a stream buffer. Upon occurrence of a fault reading data from memory, the streaming engine identifies the data element triggering the fault, preferably storing its address in a fault address register. The streaming engine defers signaling the fault to the central processing unit until this data element is used as an operand. If the data element is never used by the central processing unit, the streaming engine never signals the fault. The streaming engine preferably stores data identifying the fault in a fault source register. The fault address register and the fault source register are preferably extended control registers accessible only via a debugger.
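The deferral behavior can be modeled with a buffer that records a fetch fault instead of raising it, and only raises when the faulting element is actually consumed. This is a behavioral sketch with assumed names (`fetch`, `consume`), not the hardware interface.

```python
class StreamBuffer:
    """Sketch of deferred fault reporting: a fault during prefetch is
    recorded (modeling the fault address register) but only signaled
    when the faulting element is used as an operand."""

    def __init__(self):
        self.data = {}
        self.fault_addr = None  # models the fault address register

    def fetch(self, addr, memory):
        try:
            self.data[addr] = memory[addr]
        except KeyError:            # fault while prefetching ahead of use
            self.fault_addr = addr  # record it; do not signal yet

    def consume(self, addr):
        if addr == self.fault_addr:
            raise RuntimeError(f"deferred fault at {addr:#x}")
        return self.data[addr]
```

Note that if `consume` is never called on the faulting address, no fault is ever signaled, matching the abstract.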
Coprocessor Prefetcher
A prefetcher for a coprocessor is disclosed. An apparatus includes a processor and a coprocessor that are configured to execute processor and coprocessor instructions, respectively. The processor and coprocessor instructions appear together in code sequences fetched by the processor, with the coprocessor instructions being provided to the coprocessor by the processor. The apparatus further includes a coprocessor prefetcher configured to monitor a code sequence fetched by the processor and, in response to identifying a presence of coprocessor instructions in the code sequence, capture the memory addresses, generated by the processor, of operand data for the coprocessor instructions. The coprocessor prefetcher is further configured to issue, for a cache memory accessible to the coprocessor, prefetches for data associated with the memory addresses prior to execution of the coprocessor instructions by the coprocessor.
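The monitoring step amounts to scanning the fetched sequence for coprocessor instructions and prefetching their operand addresses before the coprocessor gets to them. In this sketch, instructions are (opcode, operand) tuples, the cache is a set of resident addresses, and `operand_addr` stands in for the processor's address generation; all are illustrative assumptions.

```python
def coprocessor_prefetch(code_sequence, operand_addr, cache):
    """Scan a fetched code sequence; for each coprocessor instruction,
    prefetch its operand address into the coprocessor-accessible cache
    ahead of execution. Returns the addresses actually prefetched."""
    issued = []
    for op, arg in code_sequence:
        if op == "coproc":              # coprocessor instruction identified
            addr = operand_addr(arg)    # address generated by the processor
            if addr not in cache:
                cache.add(addr)         # issue prefetch; data arrives early
                issued.append(addr)
    return issued
```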
Systems and methods to load a tile register pair
Embodiments detailed herein relate to systems and methods to load a tile register pair. In one example, a processor includes: decode circuitry to decode a load matrix pair instruction having fields for an opcode and source and destination identifiers to identify source and destination matrices, respectively, each matrix having a PAIR parameter equal to TRUE; and execution circuitry to execute the decoded load matrix pair instruction to load every element of left and right tiles of the identified destination matrix from corresponding element positions of left and right tiles of the identified source matrix, respectively, wherein the executing operates on one row of the identified destination matrix at a time, starting with the first row.
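The execution semantics reduce to copying the left and right source tiles into the left and right destination tiles row by row, starting with row 0. The sketch below models tiles as lists of row lists; the real instruction operates on tile registers with fixed dimensions, so this is purely illustrative.

```python
def load_tile_pair(dst_left, dst_right, src_left, src_right):
    """Copy a source tile pair into a destination tile pair one row at
    a time, starting with the first row, mirroring the row-by-row
    execution the abstract describes. Tiles are lists of row lists."""
    for r in range(len(src_left)):       # operate on one row at a time
        dst_left[r] = list(src_left[r])
        dst_right[r] = list(src_right[r])
```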
METHOD AND APPARATUS FOR IMPLIED BIT HANDLING IN FLOATING POINT MULTIPLICATION
A method is provided that includes performing, by a processor in response to a floating point multiply instruction, multiplication of floating point numbers, wherein determination of values of implied bits of leading bit encoded mantissas of the floating point numbers is performed in parallel with multiplication of the encoded mantissas, and storing, by the processor, a result of the floating point multiply instruction in a storage location indicated by the floating point multiply instruction.
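In leading-bit-encoded (IEEE-style) formats the implied bit is 1 for a normal number and 0 when the exponent field is zero (subnormal), so it can be determined from the exponent fields alone, independently of the mantissa product. The serial model below shows that arithmetic for an assumed 23-bit mantissa field; the parallelism the claim describes is a hardware timing property this sketch does not capture.

```python
def fp_mult_mantissas(exp_a, mant_a, exp_b, mant_b, mant_bits=23):
    """Determine implied bits from the exponent fields (1 if normal,
    0 if subnormal) and multiply the full significands. Field widths
    follow single precision for illustration."""
    implied_a = 0 if exp_a == 0 else 1   # depends only on exponent field
    implied_b = 0 if exp_b == 0 else 1
    sig_a = (implied_a << mant_bits) | mant_a  # prepend implied bit
    sig_b = (implied_b << mant_bits) | mant_b
    return sig_a * sig_b                 # full significand product
```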