Patent classifications
G06F9/383
EFFICIENT WORK EXECUTION IN A PARALLEL COMPUTING SYSTEM
A computing device performs parallel computations using a set of thread processing units and a memory shuffle engine. The memory shuffle engine includes a register array to store an array of data elements retrieved from a memory buffer, and an array of input selectors. According to a first control signal, each input selector transfers at least a first data element from a corresponding subset of the register array, which is coupled to the input selector via input lines, to one or more corresponding thread processing units. According to a second control signal, each input selector transfers at least a second data element from another subset of the register array, which is coupled to another input selector via other input lines, to the one or more corresponding thread processing units.
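The selector behavior described above can be sketched in a few lines. This is a minimal model, not the patented circuit: the function name `shuffle`, the contiguous grouping of registers per selector, the `lane` parameter, and the choice of "next selector, wrapping around" as the other subset are all assumptions made for illustration.

```python
def shuffle(registers, num_selectors, control, lane=0):
    """Hypothetical model of the input selector array.

    Selector i is assumed to be wired to a contiguous group of registers.
    control == 0: selector i forwards an element from its own group.
    control == 1: selector i forwards an element from the group wired to
                  the next selector (wrapping around), an assumed mapping.
    """
    group = len(registers) // num_selectors
    out = []
    for i in range(num_selectors):
        src = i if control == 0 else (i + 1) % num_selectors
        out.append(registers[src * group + lane])
    return out


regs = list(range(8))               # register array loaded from a memory buffer
own = shuffle(regs, 4, control=0)   # each selector reads its own subset
other = shuffle(regs, 4, control=1) # each selector reads a neighbor's subset
```

Each output position stands for the element handed to one thread processing unit; the second control signal changes which subset feeds that unit without reloading the register array.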
Multiple data prefetchers that defer to one another based on prefetch effectiveness by memory access type
A processor includes a first prefetcher that prefetches data in response to memory accesses and a second prefetcher that prefetches data in response to memory accesses. Each of the memory accesses has an associated memory access type (MAT) of a plurality of predetermined MATs. The processor also includes a table that holds first scores that indicate effectiveness of the first prefetcher to prefetch data with respect to the plurality of predetermined MATs and second scores that indicate effectiveness of the second prefetcher to prefetch data with respect to the plurality of predetermined MATs. The first and second prefetchers selectively defer to one another with respect to data prefetches based on their relative scores in the table and the associated MATs of the memory accesses.
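A minimal sketch of score-based deferral, assuming the table is a pair of dictionaries keyed by MAT and that a higher score means a more effective prefetcher. The function name, the MAT labels, and the tie rule (both prefetchers proceed) are hypothetical; the abstract does not specify them.

```python
def choose_prefetcher(first_scores, second_scores, mat):
    """Return which prefetcher handles an access of the given MAT.

    Assumption: the prefetcher with the lower score for this MAT defers
    to the other; on a tie neither defers.
    """
    a = first_scores.get(mat, 0)
    b = second_scores.get(mat, 0)
    if a > b:
        return "first"
    if b > a:
        return "second"
    return "both"


# Hypothetical per-MAT effectiveness scores.
first_scores = {"load": 7, "store": 2}
second_scores = {"load": 3, "store": 5}
winner = choose_prefetcher(first_scores, second_scores, "store")
```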
Two-dimensional zero padding in a stream of matrix elements
Software instructions are executed on a processor within a computer system to configure a streaming engine with stream parameters to define a multidimensional array. The stream parameters define a size for each dimension of the multidimensional array and a specified width for two selected dimensions of the array. Data is fetched from a memory coupled to the streaming engine responsive to the stream parameters. A stream of vectors is formed for the multidimensional array responsive to the stream parameters from the data fetched from memory. When either selected dimension in the stream of vectors exceeds a respective specified width, the streaming engine inserts null elements into each portion of a respective vector for the selected dimension that exceeds the specified width in the stream of vectors. Stream vectors that are completely null are formed by the streaming engine without accessing the system memory for respective data.
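The padding rule can be illustrated with a two-dimensional sketch. All names here are hypothetical: `rows` and `cols` stand in for the dimension sizes, `valid_rows` and `valid_cols` for the specified widths, and memory is assumed to hold only the valid region in row-major order. Zero is used as the null element.

```python
def padded_stream(memory, rows, cols, valid_rows, valid_cols, vec_len):
    """Form vectors for a rows x cols array, padding beyond the valid widths.

    Returns (vectors, reads), where reads counts the vectors that needed
    at least one memory access. Fully null vectors need none.
    """
    vectors, reads = [], 0
    for r in range(rows):
        for c0 in range(0, cols, vec_len):
            vec, touched = [], False
            for c in range(c0, min(c0 + vec_len, cols)):
                if r < valid_rows and c < valid_cols:
                    vec.append(memory[r * valid_cols + c])  # real data
                    touched = True
                else:
                    vec.append(0)  # null element, no fetch needed
            vectors.append(vec)
            if touched:
                reads += 1
    return vectors, reads


# A 2 x 3 block of data streamed as a padded 3 x 4 array.
vectors, reads = padded_stream([1, 2, 3, 4, 5, 6], 3, 4, 2, 3, vec_len=4)
```

In this example the third vector is entirely null, so it is produced without touching `memory`, which mirrors the last sentence of the abstract.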
STREAMING ENGINE WITH SHORT CUT START INSTRUCTIONS
A streaming engine employed in a digital data processor specifies a fixed read only data stream recalled from memory. Streams are started by one of two types of stream start instructions. A stream start ordinary instruction specifies a register storing a stream start address and a register storing a stream definition template which specifies stream parameters. A stream start short-cut instruction specifies a register storing a stream start address and an implied stream definition template. A functional unit is responsive to a stream operand instruction to receive at least one operand from a stream head register. The stream template supports plural nested loops with short-cut start instructions limited to a single loop. The stream template supports data element promotion to larger data element size with sign extension or zero extension. A set of allowed stream short-cut start instructions includes various data sizes and promotion factors.
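Element promotion with sign or zero extension is standard bit arithmetic and can be shown on its own. The function below is an illustrative sketch; its name and parameters are assumptions, not part of any instruction set described here.

```python
def promote(raw, from_bits, to_bits, signed):
    """Widen a from_bits-wide bit pattern to to_bits bits.

    signed=True replicates the sign bit (sign extension);
    signed=False fills the upper bits with zeros (zero extension).
    """
    raw &= (1 << from_bits) - 1              # keep only the source bits
    if signed and raw >> (from_bits - 1):    # sign bit set?
        raw -= 1 << from_bits                # reinterpret as negative
    return raw & ((1 << to_bits) - 1)        # wrap into the wider pattern


sign_extended = promote(0xFF, 8, 16, signed=True)   # 0xFFFF
zero_extended = promote(0xFF, 8, 16, signed=False)  # 0x00FF
```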
DUAL DATA STREAMS SHARING DUAL LEVEL TWO CACHE ACCESS PORTS TO MAXIMIZE BANDWIDTH UTILIZATION
A streaming engine employed in a digital data processor specifies fixed first and second read only data streams. Corresponding stream address generators produce addresses of data elements of the two streams. Corresponding stream head registers store data elements next to be supplied to functional units for use as operands. The two streams share two memory ports. A toggling preference of stream to port ensures fair allocation. The arbiters permit one stream to borrow the other's interface when the other interface is idle. Thus one stream may issue two memory requests, one from each memory port, if the other stream is idle. This spreads the bandwidth demand for each stream across both interfaces, ensuring neither interface becomes a bottleneck.
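One way to picture the arbitration is a per-cycle function that assigns the two ports to the two streams. This is a simplified sketch under stated assumptions: the function name, the representation of pending requests as counts, and the exact toggling rule are hypothetical and are not taken from the patent.

```python
def arbitrate(pending, prefer):
    """Assign two memory ports for one cycle.

    pending: [n0, n1], outstanding requests for stream 0 and stream 1.
    prefer:  index of the stream that gets port 0 when both compete.
    Returns (grants, next_prefer); grants[p] is the stream using port p,
    or None if the port is unused.
    """
    first, second = prefer, 1 - prefer
    grants = [None, None]
    if pending[first] and pending[second]:
        grants = [first, second]   # one port each
        prefer = 1 - prefer        # toggle preference for fairness
    else:
        for s in (0, 1):
            if pending[s]:
                # The other stream is idle, so s may borrow its port.
                grants = [s, s if pending[s] > 1 else None]
    return grants, prefer


grants, prefer = arbitrate([3, 0], prefer=0)  # stream 0 takes both ports
```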
CACHE MANAGEMENT OPERATIONS USING STREAMING ENGINE
A stream of data is accessed from a memory system using a stream of addresses generated in a first mode of operating a streaming engine in response to executing a first stream instruction. A block cache management operation is performed on a cache in the memory using a block of addresses generated in a second mode of operating the streaming engine in response to executing a second stream instruction.
MANAGING PREFETCH REQUESTS BASED ON STREAM INFORMATION FOR PREVIOUSLY RECOGNIZED STREAMS
Managing prefetch requests associated with memory access requests includes storing stream information associated with multiple streams. At least one stream was recognized based on an initial subset of memory access requests within a previously performed set of related memory access requests and is associated with stream information that includes stream matching information and stream length information. After the previously performed set has ended, a matching memory access request is identified that matches with a corresponding matched stream based at least in part on stream matching information within stream information associated with the matched stream. In response to identifying the matching memory access request, the managing determines whether or not to perform a prefetch request for data at an address related to a data address in the matching memory access request based at least in part on stream length information within the stream information associated with the matched stream.
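The decision step can be sketched as a table lookup. Everything concrete here is an assumption for illustration: the stream table is a dictionary, the matching key is treated as an opaque value (for example a load instruction address), the prefetch target is the next 64-byte line, and a stream is considered worth prefetching only if its recorded length reaches `min_length`.

```python
LINE_BYTES = 64  # assumed cache line size


def prefetch_address(stream_table, key, addr, min_length=4):
    """Return an address to prefetch, or None to skip prefetching.

    stream_table maps stream matching information to stored stream
    information, which includes the length observed the last time the
    stream ran.
    """
    info = stream_table.get(key)
    if info is None:
        return None            # no previously recognized stream matches
    if info["length"] < min_length:
        return None            # stream was short last time; not worth it
    return addr + LINE_BYTES   # address related to the matching access


table = {0x4010: {"length": 32}, 0x4020: {"length": 2}}
target = prefetch_address(table, 0x4010, addr=0x1000)
```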
Stream reference register with double vector and dual single vector operating modes
A streaming engine employed in a digital signal processor specifies a fixed read only data stream. Once fetched, the data stream is stored in two head registers for presentation to functional units in the fixed order. Data use by the functional unit is preferably controlled using the input operand fields of the corresponding instruction. A first read only operand coding supplies data from the first head register. A first read/advance operand coding supplies data from the first head register and also advances the stream to the next sequential data elements. Corresponding second read only operand coding and second read/advance operand coding operate similarly with the second head register. A third read only operand coding supplies double width data from both head registers.
Variable latency instructions
Techniques related to executing instructions by a processor comprise receiving a first instruction for execution, determining a first latency value based on an expected amount of time needed for the first instruction to be executed, storing the first latency value in a writeback queue, beginning execution of the first instruction on the instruction execution pipeline, adjusting the first latency value based on an amount of time passed since beginning execution of the first instruction, outputting a first result of the first instruction based on the first latency value, receiving a second instruction, determining that the second instruction is a variable latency instruction, storing a ready value indicating that a second result of the second instruction is not ready in the writeback queue, beginning execution of the second instruction on the instruction execution pipeline, updating the ready value to indicate that the second result is ready, and outputting the second result.
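The two kinds of writeback queue entry can be modeled with a small class. This is a sketch only: the class and method names are hypothetical, latency is counted in whole cycles, and instruction names are assumed to be unique within the queue.

```python
class WritebackQueue:
    """Toy model: fixed-latency entries count down, variable-latency
    entries wait for a ready flag."""

    def __init__(self):
        self.entries = []

    def issue_fixed(self, name, latency):
        # Store the expected latency for a fixed-latency instruction.
        self.entries.append({"name": name, "latency": latency, "ready": None})

    def issue_variable(self, name):
        # Store a ready value that starts as "not ready".
        self.entries.append({"name": name, "latency": None, "ready": False})

    def mark_ready(self, name):
        for e in self.entries:
            if e["name"] == name:
                e["ready"] = True

    def tick(self):
        """Advance one cycle and return the names whose results are output."""
        done = []
        for e in self.entries:
            if e["latency"] is not None:
                e["latency"] -= 1          # adjust for elapsed time
                if e["latency"] <= 0:
                    done.append(e["name"])
            elif e["ready"]:
                done.append(e["name"])
        self.entries = [e for e in self.entries if e["name"] not in done]
        return done


q = WritebackQueue()
q.issue_fixed("add", 2)
q.issue_variable("load")
```

Calling `q.tick()` twice outputs `add`; `load` is output only on a tick after `q.mark_ready("load")` has been called.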
STREAMING ENGINE WITH CACHE-LIKE STREAM DATA STORAGE AND LIFETIME TRACKING
A streaming engine employed in a digital data processor specifies a fixed read only data stream defined by plural nested loops. An address generator produces addresses of data elements. A stream head register stores data elements next to be supplied to functional units for use as operands. The streaming engine fetches stream data ahead of use by the central processing unit core in a stream buffer constructed like a cache. The stream buffer cache includes plural cache lines, each including tag bits, at least one valid bit and data bits. Cache lines are allocated to store newly fetched stream data. Cache lines are deallocated upon consumption of the data by a central processing unit core functional unit. Instructions preferably include operand fields with a first subset of codings corresponding to registers, a stream read only operand coding and a stream read and advance operand coding.
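The allocate-on-fetch, deallocate-on-consume lifetime can be sketched with a small buffer model. The class name, the dictionary layout of a line, and the policy of refusing allocation when the buffer is full are assumptions for illustration, not details from the patent.

```python
class StreamBuffer:
    """Toy cache-like stream buffer: each line has a tag, a valid bit,
    and data."""

    def __init__(self, num_lines):
        self.lines = [
            {"tag": None, "valid": False, "data": None}
            for _ in range(num_lines)
        ]

    def allocate(self, tag, data):
        """Store newly fetched stream data; False means no free line."""
        for line in self.lines:
            if not line["valid"]:
                line.update(tag=tag, valid=True, data=data)
                return True
        return False

    def consume(self, tag):
        """Hand data to a functional unit and free the line."""
        for line in self.lines:
            if line["valid"] and line["tag"] == tag:
                line["valid"] = False   # deallocate upon consumption
                return line["data"]
        return None


buf = StreamBuffer(2)
buf.allocate(0x100, "a")
buf.allocate(0x140, "b")
```

With both lines valid, a third `allocate` returns `False` until `consume` frees a line, which is how fetch-ahead stays bounded by the buffer size in this model.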