Patent classifications
G06F9/3802
Method and apparatus for stateless parallel processing of tasks and workflows
In a method for parallel processing of a data stream, a processing task is received to process the data stream that includes a plurality of segments. A split operation is performed on the data stream to split the plurality of segments into N sub-streams. Each of the N sub-streams includes one or more segments of the plurality of segments. The N is a positive integer. N sub-processing tasks are performed on the N sub-streams to generate N processed sub-streams. A merge operation is performed on the N processed sub-streams based on a merge buffer to generate a merged output data stream. The merge buffer includes an output iFIFO buffer and N sub-output iFIFO buffers coupled to the output iFIFO buffer. The merged output data stream is identical to an output data stream that is generated when the processing task is applied directly to the data stream without the split operation.
Instruction execution that broadcasts and masks data values at different levels of granularity
An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and, to replicate the second data structure when executing the second data instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity.
METHOD AND APPARATUS TO SORT A VECTOR FOR A BITONIC SORTING ALGORITHM
A method is provided that includes performing, by a processor in response to a vector sort instruction, sorting of values stored in lanes of the vector to generate a sorted vector, wherein the values in a first portion of the lanes are sorted in a first order indicated by the vector sort instruction and the values in a second portion of the lanes are sorted in a second order indicated by the vector sort instruction; and storing the sorted vector in a storage location.
Instructions for vector multiplication of unsigned words with rounding
Disclosed embodiments relate to executing a vector multiplication instruction. In one example, a processor includes fetch circuitry to fetch the vector multiplication instruction having fields for an opcode, first and second source identifiers, and a destination identifier, decode circuitry to decode the fetched instruction, execution circuitry to, on each of a plurality of corresponding pairs of fixed-sized elements of the identified first and second sources, execute the decoded instruction to generate a double-sized product of each pair of fixed-sized elements, the double-sized product being represented by at least twice a number of bits of the fixed size, and generate an unsigned fixed-sized result by rounding the most significant fixed-sized portion of the double-sized product to fit into the identified destination.
ON-THE-FLY CONVERSION
A data processing apparatus that converts a plurality of signed digits representing an input value in redundant representation comprises receiver circuitry to receive, at each of a plurality of iterations, a signed digit from the plurality of signed digits, and previous intermediate data from a previous iteration. Concatenation circuitry performs a concatenation of bits corresponding to the signed digit and bits of the previous intermediate data to produce updated intermediate data. Output circuitry provides the updated intermediate data as previous intermediate data of a next iteration. The previous intermediate data comprises S3[i] in non-redundant representation, which is at least part of the input value multiplied by 3 in non-redundant representation.
PROCESSOR CIRCUIT AND DATA PROCESSING METHOD
A processor circuit includes an instruction decode unit, an instruction detector, an address generator and a data buffer. The instruction decode unit is configured to decode a first load instruction included in a plurality of load instructions to generate a first decoding result. The instruction detector, coupled to the instruction decode unit, is configured to detect if the load instructions use a same register. The address generator, coupled to the instruction decode unit, is configured to generate a first address requested by the first load instruction according to the first decoding result. The data buffer is coupled to the instruction detector and the address generator. When the instruction detector detects that the load instructions use the same register, the data buffer is configured to store the first address generated from the address generator, and store data requested by the first load instruction according to the first address.
LOCATION AGNOSTIC DATA ACCESS
Apparatuses, systems, and techniques to enable a program to access data regardless of where said data is stored. In at least one embodiment, a system enables a program to access data regardless of where said data is stored, based on, for example, one or more locations encoding one or more addresses of said data.
Method and apparatus to process SHA-2 secure hashing algorithm
A processor includes an instruction decoder to receive a first instruction to process a secure hash algorithm 2 (SHA-2) hash algorithm, the first instruction having a first operand associated with a first storage location to store a SHA-2 state and a second operand associated with a second storage location to store a plurality of messages and round constants. The processor further includes an execution unit coupled to the instruction decoder to perform one or more iterations of the SHA-2 hash algorithm on the SHA-2 state specified by the first operand and the plurality of messages and round constants specified by the second operand, in response to the first instruction.
Neural network processor and neural network computation method
The present disclosure provides a neural network processor and neural network computation method that deploy a memory and a cache to perform a neural network computation, where the memory may be configured to store data and instructions of the neural network computation, the cache may be connected to the memory via a memory bus, thereby, the actual compute ability of hardware may be fully utilized, the cost and power consumption overhead may be reduced, parallelism of the network may be fully utilized, and the efficiency of the neural network computation may be improved.
MISS-DRIVEN INSTRUCTION PREFETCHING
A processor may initialize a fetch of a first instruction. The processor may determine whether there is an icache miss for the first instruction. The processor may fetch the next instruction from a cache.