Patent classifications
G06F9/3895
INSTRUCTIONS AND LOGIC TO PROVIDE SIMD SM4 CRYPTOGRAPHIC BLOCK CIPHER FUNCTIONALITY
Instructions and logic provide for a Single Instruction Multiple Data (SIMD) SM4 round slice operation. Embodiments of an instruction specify a first and a second source data operand set, and substitution function indicators, e.g. in an immediate operand. Embodiments of a processor may include encryption units, responsive to the first instruction, to: perform a slice of SM4-round exchanges on a portion of the first source data operand set with corresponding keys from the second source data operand set in response to a substitution function indicator that indicates a first substitution function, perform a slice of SM4 key generations using another portion of the first source data operand set with corresponding constants from the second source data operand set in response to a substitution function indicator that indicates a second substitution function, and store a set of result elements of the first instruction in a SIMD destination register.
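The selection between round exchanges and key generations can be illustrated with a minimal Python sketch. It is not the patented instruction: the function names and the boolean indicator are hypothetical, the slice is modeled as a plain loop over 32-bit words, and the SM4 S-box table is passed in as a parameter rather than reproduced here. The two linear transforms use the rotation amounts from the published SM4 algorithm.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def tau(word, sbox):
    """Apply the byte-wise S-box to each of the four bytes of a word."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= sbox[(word >> shift) & 0xFF] << shift
    return out


def linear_round(b):
    """Linear transform used in SM4 encryption rounds."""
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)


def linear_key(b):
    """Linear transform used in SM4 key generation."""
    return b ^ rotl32(b, 13) ^ rotl32(b, 23)


def sm4_slice(state, words, key_generation, sbox):
    """Run len(words) steps on a four-word state (hypothetical helper).

    words holds round keys when key_generation is False and
    key-schedule constants when it is True; the flag stands in
    for the substitution function indicator.
    """
    linear = linear_key if key_generation else linear_round
    x = list(state)
    for w in words:
        t = linear(tau(x[1] ^ x[2] ^ x[3] ^ w, sbox))
        x = [x[1], x[2], x[3], x[0] ^ t]
    return x
```

Calling `sm4_slice` with four round keys advances an encryption state by four rounds, while calling it with four constants and `key_generation=True` yields the next four key words, provided the caller supplies the standard 256-entry S-box.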
Processor and instruction scheduling method
A processor and an instruction scheduling method for X-channel interleaved multi-threading are provided, where X is an integer greater than one. The processor includes a decoding unit and a processing unit. The decoding unit is configured to obtain one instruction from each of Z predefined threads in each cyclic period, decode the Z obtained instructions to obtain Z decoding results, and send the Z decoding results to the processing unit, where each cyclic period includes X sending periods, one decoding result is sent to the processing unit in each sending period, a decoding result of the Z decoding results may be repeatedly sent by the decoding unit in a plurality of sending periods, wherein 1≤Z<X or Z=X, and wherein Z is an integer. The processing unit is configured to execute the instruction based on the decoding result.
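As a rough illustration of the sending pattern, the Python sketch below fills the X sending periods of one cyclic period from Z decoding results. The abstract does not say which results are repeated when Z is smaller than X; the round-robin repetition here is an assumption, and `sending_schedule` is a hypothetical name.

```python
def sending_schedule(decoding_results, x):
    """Return the decoding result sent in each of the X sending periods.

    Assumption: when there are fewer results than sending periods,
    results are repeated in round-robin order.
    """
    z = len(decoding_results)
    if x <= 1:
        raise ValueError("X must be an integer greater than one")
    if not 1 <= z <= x:
        raise ValueError("Z must satisfy 1 <= Z <= X")
    return [decoding_results[i % z] for i in range(x)]
```

For example, with X = 4 and two threads, each thread's decoding result is sent twice per cyclic period under this assumption.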
Vector processing engines (VPEs) employing a tapped-delay line(s) for providing precision filter vector processing operations with reduced sample re-fetching and power consumption, and related vector processor systems and methods
Vector processing engines (VPEs) employing a tapped-delay line(s) for providing precision filter vector processing operations with reduced sample re-fetching and power consumption are disclosed. Related vector processor systems and methods are also disclosed. The VPEs are configured to provide filter vector processing operations. To minimize re-fetching of input vector data samples from memory to reduce power consumption, a tapped-delay line(s) is included in the data flow paths between a vector data file and execution units in the VPE. The tapped-delay line(s) is configured to receive and provide input vector data sample sets to execution units for performing filter vector processing operations. The tapped-delay line(s) is also configured to shift the input vector data sample set for filter delay taps and provide the shifted input vector data sample set to execution units, so the shifted input vector data sample set does not have to be re-fetched during filter vector processing operations.
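The benefit of shifting samples instead of re-fetching them can be seen in a small software analogy. The Python sketch below is not the hardware design; it models the tapped-delay line as a fixed-length `deque`, and the function name and the fetch counter are assumptions added for illustration.

```python
from collections import deque


def fir_with_tapped_delay_line(samples, coefficients):
    """Compute FIR filter outputs, fetching each input sample once.

    Returns (outputs, fetches). The deque plays the role of the
    tapped-delay line: each new sample shifts the older ones along.
    """
    taps = deque([0] * len(coefficients), maxlen=len(coefficients))
    outputs = []
    fetches = 0
    for sample in samples:
        fetches += 1              # one fetch per input sample
        taps.appendleft(sample)   # shift; the oldest sample drops off
        outputs.append(sum(c * t for c, t in zip(coefficients, taps)))
    return outputs, fetches
```

Without the delay line, a filter with N taps would read N samples for every output; here the number of fetches equals the number of input samples regardless of N.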
System and methods for expandably wide processor instructions
Expandably wide operations are disclosed in which operands wider than the data path between a processor and memory are used in executing instructions. The expandably wide operands reduce the influence of the characteristics of the associated processor in the design of functional units performing calculations, including the width of the register file, the processor clock rate, the exception subsystem of the processor, and the sequence of operations in loading and use of the operand in a wide cache memory.
Super multiply add (super madd) instruction
A method of processing an instruction is described that includes fetching and decoding the instruction. The instruction has separate destination address, first operand source address and second operand source address components. The first operand source address identifies a location of a first mask pattern in mask register space. The second operand source address identifies a location of a second mask pattern in the mask register space. The method further includes fetching the first mask pattern from the mask register space; fetching the second mask pattern from the mask register space; merging the first and second mask patterns into a merged mask pattern; and, storing the merged mask pattern at a storage location identified by the destination address.
Unified and compressed statistical analysis data
Systems and methods for compression and/or unification of statistical analysis system (SAS) data are provided. In one embodiment, a request to open a unified and compressed SAS data step view file is received. The unified and compressed SAS data step view file includes: an SAS data step view; compressed payload data to be used in the SAS data step view when decompressed; and a set of metadata describing characteristics of variables of the SAS data step view. Upon receiving the request, the compressed payload data is automatically decompressed, such that the decompressed payload data is usable with the SAS data step view to render the SAS data step view and decompressed payload data on an electronic display of a client or host providing the request.
Programmable coarse grained and sparse matrix compute hardware with advanced scheduling
- Eriko Nurvitadhi
- Balaji Vembu
- Nicolas C. Galoppo Von Borries
- Rajkishore Barik
- Tsung-Han Lin
- Kamal Sinha
- Nadathur Rajagopalan Satish
- Jeremy Bottleson
- Farshad Akhbari
- Altug Koker
- Narayan Srinivasa
- Dukhwan Kim
- Sara S. Baghsorkhi
- Justin E. Gottschlich
- Feng Chen
- Elmoustapha Ould-Ahmed-Vall
- Kevin Nealis
- Xiaoming Chen
- Anbang Yao
One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex machine learning compute operation.
MICROPROCESSOR WITH HIGH-EFFICIENCY DECODING OF COMPLEX INSTRUCTIONS
Microcode combination of complex instructions is shown. A microprocessor includes an instruction queue, an instruction decoder, and a microcode controller. The instruction decoder is coupled to the instruction queue. The microcode controller is coupled to the instruction decoder and has a memory. The memory stores a combined microcode for M complex instructions arranged in a specific order, where M is an integer greater than 1. When the M complex instructions in the specific order have popped out of the first to M-th entries of the instruction queue, the instruction decoder operates the microcode controller to read the memory for the combined microcode, so that the microcode read trap occurs just once.
Microprocessor with high-efficiency decoding of complex instructions
Microcode combination of complex instructions is shown. A microprocessor includes an instruction queue, an instruction decoder, and a microcode controller. The instruction decoder is coupled to the instruction queue. The microcode controller is coupled to the instruction decoder and has a memory. The memory stores a combined microcode for M complex instructions arranged in a specific order, where M is an integer greater than 1. When the M complex instructions in the specific order have popped out of the first to M-th entries of the instruction queue, the instruction decoder operates the microcode controller to read the memory for the combined microcode, so that the microcode read trap occurs just once.
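The saving from a combined microcode can be sketched in Python as a trap count over an instruction queue. This is a software analogy under stated assumptions, not the microprocessor's logic: the opcode strings, the two lookup tables, and both function names are hypothetical.

```python
def fetch_microcode(queue, combined, single):
    """Return (microcode, instructions_consumed) for the front of the queue.

    combined maps a tuple of M opcodes in a specific order to one
    combined microcode; single maps one opcode to its own microcode.
    """
    for pattern, routine in combined.items():
        m = len(pattern)
        if tuple(queue[:m]) == pattern:
            return routine, m      # one read covers all M instructions
    return single[queue[0]], 1


def count_traps(queue, combined, single):
    """Count microcode reads needed to drain the queue."""
    traps = 0
    position = 0
    while position < len(queue):
        _, consumed = fetch_microcode(queue[position:], combined, single)
        traps += 1
        position += consumed
    return traps
```

With a combined entry for the pair `("A", "B")`, the queue `["A", "B", "A"]` needs two reads instead of three.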
PROGRAMMABLE COARSE GRAINED AND SPARSE MATRIX COMPUTE HARDWARE WITH ADVANCED SCHEDULING
- Eriko Nurvitadhi
- Balaji Vembu
- Nicolas C. Galoppo Von Borries
- Rajkishore Barik
- Tsung-Han Lin
- Kamal Sinha
- Nadathur Rajagopalan Satish
- Jeremy Bottleson
- Farshad Akhbari
- Altug Koker
- Narayan Srinivasa
- Dukhwan Kim
- Sara S. Baghsorkhi
- Justin E. Gottschlich
- Feng Chen
- Elmoustapha Ould-Ahmed-Vall
- Kevin Nealis
- Xiaoming Chen
- Anbang Yao
One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex compute operation.