G06F9/3818

Reconfigurable Multi-Thread Processor for Simultaneous Operations on Split Instructions and Operands
20220308889 · 2022-09-29 · ·

A superscalar processor has a thread mode of operation for supporting multiple instruction execution threads which are full data path wide instructions, and a micro-thread mode of operation where each thread supports two micro-threads which independently execute instructions. An executed instruction sets a micro-thread mode and an executed instruction sets the thread mode.

Method and apparatus for permuting streamed data elements

A method is provided that includes receiving, in a permute network, a plurality of data elements for a vector instruction from a streaming engine, and mapping, by the permute network, the plurality of data elements to vector locations for execution of the vector instruction by a vector functional unit in a vector data path of a processor.

Efficient implementation of complex vector fused multiply add and complex vector multiply

Disclosed embodiments relate to efficient complex vector multiplication. In one example, an apparatus includes execution circuitry, responsive to an instruction having fields to specify multiplier, multiplicand, and summand complex vectors, to perform two operations: first, to generate a double-even multiplicand by duplicating even elements of the specified multiplicand, and to generate a temporary vector using a fused multiply-add (FMA) circuit having A, B, and C inputs set to the specified multiplier, the double-even multiplicand, and the specified summand, respectively, and second, to generate a double-odd multiplicand by duplicating odd elements of the specified multiplicand, to generate a swapped multiplier by swapping even and odd elements of the specified multiplier, and to generate a result using a second FMA circuit having its even product negated, and having A, B, and C inputs set to the swapped multiplier, the double-odd multiplicand, and the temporary vector, respectively.

Opportunity multithreading in a multithreaded processor with instruction chaining capability

A computing device determines that a current software thread of a plurality of software threads having an issuing sequence does not have a first instruction waiting to be issued to a hardware thread during a clock cycle. The computing device identifies one or more alternative software threads in the issuing sequence having instructions waiting to be issued. The computing device selects, during the clock cycle by the computing device, a second instruction from a second software thread among the one or more alternative software threads in view of determining that the second instruction has no dependencies with any other instructions among the instructions waiting to be issued. Dependencies are identified by the computing device in view of the values of a chaining bit extracted from each of the instructions waiting to be issued. The computing device issues the second instruction to the hardware thread.

Page modification encoding and caching

Modifying a page stored in a non-volatile storage includes receiving one or more requests to modify data stored in the page with new data. One or more lines are identified in the page that include data to be modified by the one or more requests. The identified one or more lines correspond to one or more respective byte ranges each of a predetermined size in the page. Encoded data is created based on the new data and respective locations of the one or more identified lines in the page. The encoded data is cached, and at least a portion of the cached encoded data is used to rewrite the page in the non-volatile storage to include at least a portion of the new data.

SYSTEMS, METHODS, AND APPARATUS FOR TILE CONFIGURATION

Embodiments detailed herein relate to matrix (tile) operations. For example, decode circuitry to decode an instruction having fields for an opcode and a memory address; and execution circuitry to execute the decoded instruction to set a tile configuration for the processor to utilize tiles in matrix operations based on a description retrieved from the memory address, wherein a tile a set of 2-dimensional registers are discussed.

Circuit for increasing voltage swing of a local oscillator waveform signal

A bootstrap circuit for increasing the voltage swing of a local oscillator waveform signal. The bootstrap circuit comprises a driver stage for driving at an output thereof a local oscillator waveform signal having an increased voltage swing. The driver stage comprises a first supply voltage node and a second supply voltage node. The bootstrap circuit further comprises at least one energy storage component arranged to store energy within an energy storage element when the voltage level at the input node of the driver stage comprises the second voltage state and use the energy stored within the energy storage element to generate an increased voltage level, and to apply the increased voltage level to the first supply voltage node of the driver stage when the voltage level at the input node of the driver stage comprises the first voltage state.

Systems, Apparatuses, And Methods For Fused Multiply Add

Embodiments of systems, apparatuses, and methods for fused multiple add. In some embodiments, a decoder decodes a single instruction having an opcode, a destination field representing a destination operand, and fields for a first, second, and third packed data source operand, wherein packed data elements of the first and second packed data source operand are of a first, different size than a second size of packed data elements of the third packed data operand. Execution circuitry then executes the decoded single instruction to perform, for each packed data element position of the destination operand, a multiplication of a M N-sized packed data elements from the first and second packed data sources that correspond to a packed data element position of the third packed data source, add of results from these multiplications to a full-sized packed data element of a packed data element position of the third packed data source, and storage of the addition result in a packed data element position destination corresponding to the packed data element position of the third packed data source, wherein M is equal to the full-sized packed data element divided by N.

MANAGING PREFETCH REQUESTS BASED ON STREAM INFORMATION FOR PREVIOUSLY RECOGNIZED STREAMS
20210406184 · 2021-12-30 ·

Managing prefetch requests associated with memory access requests includes storing stream information associated with multiple streams. At least one stream was recognized based on an initial subset of memory access requests within a previously performed set of related memory access requests and is associated with stream information that includes stream matching information and stream length information. After the previously performed set has ended, a matching memory access request is identified that matches with a corresponding matched stream based at least in part on stream matching information within stream information associated with the matched stream. In response to identifying the matching memory access request, the managing determines whether or not to perform a prefetch request for data at an address related to a data address in the matching memory access request based at least in part on stream length information within the stream information associated with the matched stream.

Mixed-precision compression with random access
11211944 · 2021-12-28 · ·

A data compressor includes a zero-value remover, a zero bit mask generator and a non-zero values packer. The zero-value remover receives 2.sup.N bit streams of values and outputs 2.sup.N non-zero-value bit streams having zero values removed from each respective bit stream based on a selected granularity of compression for values contained in the bit streams. The zero bit mask generator receives the 2.sup.N bit streams of values and generates a zero bit mask corresponding to the selected granularity of compression. Each zero bit mask indicates a location of a zero value based on the selected granularity of compression. The non-zero values packer receives the 2.sup.N non-zero-value bit streams and forms at least one first group of packed non-zero values.