Patent classifications
G06F9/30018
INDEXING EXTERNAL MEMORY IN A RECONFIGURABLE COMPUTE FABRIC
Various examples are directed to systems and methods in which a flow controller of a first synchronous flow may receive an instruction to execute a first loop using the first synchronous flow. The flow controller may determine a first iteration index for a first iteration of the first loop. The flow controller may send, to a first compute element of the first synchronous flow, a first synchronous message to initiate a first synchronous flow thread for executing the first iteration of the first loop. The first synchronous message may comprise the iteration index. The first compute element may execute an input/output operation at a first location of a first compute element memory indicated by the first iteration index.
Instructions for vector operations with constant values
Disclosed embodiments relate to instructions for vector operations with immediate values. In one example, a system includes a memory and a processor that includes fetch circuitry to fetch the instruction from a code storage, the instruction including an opcode, a destination identifier to specify a destination vector register, a first immediate, and a write mask identifier to specify a write mask register, the write mask register including at least one bit corresponding to each destination vector register element, the at least one bit to specify whether the destination vector register element is masked or unmasked, decode circuitry to decode the fetched instruction, and execution circuitry to execute the decoded instruction, to, use the write mask register to determine unmasked elements of the destination vector register, and, when the opcode specifies to broadcast, broadcast the first immediate to one or more unmasked vector elements of the destination vector register.
METHOD AND APPARATUS FOR VECTOR SORTING USING VECTOR PERMUTATION LOGIC
A method for sorting of a vector in a processor is provided that includes performing, by the processor in response to a vector sort instruction, generating a control input vector for vector permutation logic comprised in the processor based on values in lanes of the vector and a sort order for the vector indicated by the vector sort instruction and storing the control input vector in a storage location.
Vector SIMD VLIW data path architecture
A Very Long Instruction Word (VLIW) digital signal processor particularly adapted for single instruction multiple data (SIMD) operation on various operand widths and data sizes. A vector compare instruction compares first and second operands and stores compare bits. A companion vector conditional instruction performs conditional operations based upon the state of a corresponding predicate data register bit. A predicate unit performs data processing operations on data in at least one predicate data register including unary operations and binary operations. The predicate unit may also transfer data between a general data register file and the predicate data register file.
APPARATUS AND METHOD FOR PROPAGATING CONDITIONALLY EVALUATED VALUES IN SIMD/VECTOR EXECUTION USING AN INPUT MASK REGISTER
An apparatus and method for propagating conditionally evaluated values are disclosed. For example, a method according to one embodiment comprises: reading each value contained in an input mask register, each value being a true value or a false value and having a bit position associated therewith; for each true value read from the input mask register, generating a first result containing the bit position of the true value; for each false value read from the input mask register following the first true value, adding the vector length of the input mask register to a bit position of the last true value read from the input mask register to generate a second result; and storing each of the first results and second results in bit positions of an output register corresponding to the bit positions read from the input mask register.
Clock mesh-based power conservation in a coprocessor based on in-flight instruction characteristics
A pipeline includes a first portion configured to process a first subset of bits of an instruction and a second portion configured to process a second subset of the bits of the instruction. A first clock mesh is configured to provide a first clock signal to the first portion of the pipeline. A second clock mesh is configured to provide a second clock signal to the second portion of the pipeline. The first and second clock meshes selectively provide the first and second clock signals based on characteristics of in-flight instructions that have been dispatched to the pipeline but not yet retired. In some cases, a physical register file is configured to store values of bits representative of instructions. Only the first subset is stored in the physical register file in response to the value of the zero high bit indicating that the second subset is equal to zero.
Method and apparatus for retaining optimal width vector operations in arbitrary/flexible vector width architecture
A method and apparatus to optimize a list of vector instructions using dynamic programming, in particular memoization, by generating a table containing instruction subvectors having individual (parts), contiguous (superparts) and repeated (broadcasts) lanes. Because the instructions in the table are subvectors selected to have individual, contiguous and repeated lanes in the registers, compiler optimizations can be enhanced. Introduction of such dynamic programming allows for speculative lane optimizations, as well as improved analysis-guided optimizations, either of which can be performed alone or in combination with other optimizations, whether or not they make use of dynamic programming.
Channel-parallel compression with random memory access
A data compressor a zero-value remover, a zero bit mask generator, a non-zero values packer, and a row-pointer generator. The zero-value remover receives 2.sup.N bit streams of values and outputs 2.sup.N non-zero-value bit streams having zero values removed from each respective bit stream. The zero bit mask generator receives the 2.sup.N bit streams of values and generates a zero bit mask for a predetermined number of values of each bit stream in which each zero bit mask indicates a location of a zero value in the predetermined number of values corresponding to the zero bit mask. The non-zero values packer receives the 2.sup.N non-zero-value bit streams and forms a group of packed non-zero values. The row-pointer generator that generates a row-pointer for each group of packed non-zero values.
Write cache circuit, data write method, and memory
The present disclosure provides a write cache circuit, a data write method, and a memory. The write cache circuit includes: a control circuit configured to generate, on the basis of a mask write instruction, a first write pointer and a pointer to be positioned, generate a second write pointer on the basis of a write command, generate a first output pointer on the basis of a mask write shift instruction, and generate a second output pointer on the basis of a write shift instruction; a first cache circuit configured to cache, on the basis of the first write pointer, the pointer to be positioned and output a positioned pointer on the basis of the first output pointer, the positioned pointer being configured to instruct a second cache circuit to output a write address written by the second write pointer generated according to the mask write instruction.
Instruction execution that broadcasts and masks data values at different levels of granularity
An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and, to replicate the second data structure when executing the second data instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity.