Patent classifications
G06F9/3016
EXECUTING SYSTEM CALL VECTORED INSTRUCTIONS IN A MULTI-SLICE PROCESSOR
Executing system call vectored (SCV) instructions in a multi-slice processor including receiving, by an instruction fetch unit, a SCV instruction, wherein the SCV instruction is a system call from an operating system; sending the SCV instruction to a branch issue queue; determining, by the branch issue queue, that the SCV instruction is next-to-complete; issuing the SCV instruction to a branch resolution unit; and executing the SCV instruction by the branch resolution unit.
Method and apparatus for permuting streamed data elements
A method is provided that includes receiving, in a permute network, a plurality of data elements for a vector instruction from a streaming engine, and mapping, by the permute network, the plurality of data elements to vector locations for execution of the vector instruction by a vector functional unit in a vector data path of a processor.
Method and apparatus for decompression acceleration in multi-cycle decoder based platforms
In one embodiment, an apparatus comprises a decompression engine to perform a non-speculative decode operation on a first portion of a first compressed payload comprising a first plurality of codes; and perform a speculative decode operation on a second portion of the first compressed payload, wherein the non-speculative decode operation and the speculative decode operation share at least one decode path and the non-speculative decode operation is to utilize bandwidth of the at least one decode path that is not used by the non-speculative decode operation.
Method and apparatus for implied bit handling in floating point multiplication
A method is provided that includes performing, by a processor in response to a floating point multiply instruction, multiplication of floating point numbers, wherein determination of values of implied bits of leading bit encoded mantissas of the floating point numbers is performed in parallel with multiplication of the encoded mantissas, and storing, by the processor, a result of the floating point multiply instruction in a storage location indicated by the floating point multiply instruction.
SYSTEM AND METHOD ENABLING ONE-HOT NEURAL NETWORKS ON A MACHINE LEARNING COMPUTE PLATFORM
One embodiment provides for a graphics processor comprising a cache memory and a graphics core coupled with the cache memory. The graphics core includes circuitry configured to generate an approximate weight matrix including a set of one-hot coded weights, perform a forward compute pass with mini batch samples to compute a loss function, perform a backward compute pass to compute a gradient update via stochastic gradient descent according to a loss update, and update the approximate weight matrix based on the gradient update to generate an updated weight matrix.
Enhanced Low Cost Microcontroller
An 8-bit microprocessor has a program memory having a 16-bit instruction word size and a data memory having an 8-bit data size. An instruction word has a payload size for an address of up to 12 bits. The microprocessor furthermore has a central processing unit coupled with the program memory and the data memory, a bank select register configured to select one of up to 64 memory banks, and an indirect addressing register operable to address up to 16 KB of data memory. The CPU is configured to execute a first move instruction having two instruction words and being configured to only access the lower 4 KB of the data memory and a second move instruction having three instruction words and configured to access the entire data memory.
COMPUTATION PROCESSING APPARATUS AND METHOD OF PROCESSING COMPUTATION
A computation processing apparatus includes: a memory; and a processor coupled to the memory and configured to: decode instructions; execute the instructions which is decoded and operate as a plurality of sub-computation processing apparatuses in accordance with a bit width of data to be computed; and observe an operation state of the computation processing apparatus, wherein, when observing that a subset of the plurality of sub-computation processing apparatuses does not execute an instruction or instructions, the processor parallelizes the instructions and outputs the parallelized instructions.
Systems and methods for performing 16-bit floating-point vector dot product instructions
Disclosed embodiments relate to systems and methods for performing 16-bit floating-point vector dot product instructions. In one example, a processor includes fetch circuitry to fetch an instruction having fields to specify an opcode and locations of first source, second source, and destination vectors, the opcode to indicate execution circuitry is to multiply N pairs of 16-bit floating-point formatted elements of the specified first and second sources, and accumulate the resulting products with previous contents of a corresponding single-precision element of the specified destination, decode circuitry to decode the fetched instruction, and execution circuitry to respond to the decoded instruction as specified by the opcode.
Sharing instruction encoding space between a coprocessor and auxiliary execution circuitry
Data processing apparatuses, methods of data processing, and non-transitory computer-readable media on which computer-readable code is stored defining logical configurations of processing devices are disclosed. In an apparatus, fetch circuitry retrieves a sequence of instructions and execution circuitry performs data processing operations with respect to data values in a set of registers. An auxiliary execution circuitry interface and a coprocessor interface to provide a connection to a coprocessor outside the apparatus are provided. Decoding circuitry generates control signals in dependence on the instructions of the sequence of instructions, wherein the decoding circuitry is responsive to instructions in a first subset of an instruction encoding space to generate the control signals to control the execution circuitry to perform the first data processing operations, and the decoding circuitry is responsive to instructions in a second subset of the instruction encoding space to generate the control signals in dependence on a configuration condition: to generate the control signals for the auxiliary execution circuitry interface when the configuration condition has a first state; and to generate the control signals for the coprocessor interface when the configuration condition has a second state.
Systems and methods for combining low-mantissa units to achieve and exceed FP64 emulation of matrix multiplication
The present disclosure relates to an apparatus that includes decoding circuitry that decodes a single instruction. The single instruction includes an identifier of a first source operand, an identifier of a second source operand, an identifier of a destination, and an opcode indicative of execution circuitry is to multiply from the identified first source operand and the identified second source operand and store a result in the identified destination. Additionally, the apparatus includes execution circuitry to execute the single decoded instruction to calculate a dot product by calculating a plurality of products using data elements of the identified first and second operands using values less precise than the identified first and second source operands, summing the calculated products, and storing the summed products in the destination.