Patent classifications
G06F9/3822
Graphics processing unit systems for performing data analytics operations in data science
Systems and methods are provided for efficiently performing processing-intensive operations, such as those involving large volumes of data, to accelerate the processing time of these operations. In at least one embodiment, a system includes a graphics processing unit (GPU) including a memory and a plurality of cores. The plurality of cores perform a plurality of data analytics operations on a respectively allocated portion of a dataset, each of the plurality of cores using only the memory to store the data input for each of the plurality of data analytics operations performed by the plurality of cores. The data storage for the plurality of data analytics operations performed by the plurality of cores is likewise provided solely by the memory.
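For illustration only, a minimal C sketch of the dataset-partitioning idea described above: each simulated core reduces its respectively allocated slice of the dataset using only its own working buffer. NUM_CORES and core_reduce are hypothetical stand-ins, not the patented GPU implementation.

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_CORES 4   /* hypothetical core count */

/* Each "core" reduces its slice of the dataset in place; the slice
 * pointer stands in for GPU-resident memory, with no host-side staging. */
static double core_reduce(const double *slice, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += slice[i];
    return acc;
}

int main(void) {
    size_t total = 1000;
    double *data = malloc(total * sizeof *data);
    for (size_t i = 0; i < total; i++)
        data[i] = (double)i;

    double sum = 0.0;
    size_t per_core = total / NUM_CORES;
    for (int c = 0; c < NUM_CORES; c++) {
        /* Allocate a respective portion of the dataset to each core;
         * the last core absorbs any remainder. */
        const double *slice = data + (size_t)c * per_core;
        size_t n = (c == NUM_CORES - 1) ? total - (size_t)c * per_core
                                        : per_core;
        sum += core_reduce(slice, n);
    }
    printf("sum = %f\n", sum);
    free(data);
    return 0;
}
```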
STREAMING ENGINE WITH FLEXIBLE STREAMING ENGINE TEMPLATE SUPPORTING DIFFERING NUMBER OF NESTED LOOPS WITH CORRESPONDING LOOP COUNTS AND LOOP OFFSETS
A streaming engine employed in a digital data processor specifies a fixed read-only data stream defined by plural nested loops. An address generator produces the addresses of the data elements for the nested loops. A stream head register stores the data elements next to be supplied to functional units for use as operands. A stream template specifies a loop count and a loop dimension for each nested loop. A format definition field in the stream template specifies the number of loops and the stream template bits devoted to the loop counts and loop dimensions. This permits the same bits of the stream template to be interpreted differently, enabling a trade-off between the number of loops supported and the size of the loop counts and loop dimensions.
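As a rough sketch of the address-generation scheme (not the actual hardware or template encoding): the struct below stands in for a stream template, with num_loops playing the role of the format definition field, and an odometer-style walk emits the byte address of every element of the nested loops.

```c
#include <stdio.h>

#define MAX_LOOPS 6  /* hypothetical maximum nesting depth */

/* Simplified stream template: a per-loop iteration count and byte
 * offset (dimension). num_loops selects how many count/offset pairs
 * are meaningful, mimicking the format definition field. */
struct stream_template {
    int num_loops;
    unsigned count[MAX_LOOPS];
    long offset[MAX_LOOPS];   /* byte stride applied per iteration of loop k */
};

/* Walk the nested loops and emit each element's byte address, as the
 * address generator would. idx[] holds the current iteration of each loop. */
static void generate_addresses(const struct stream_template *t, long base) {
    unsigned idx[MAX_LOOPS] = {0};
    for (;;) {
        long addr = base;
        for (int k = 0; k < t->num_loops; k++)
            addr += (long)idx[k] * t->offset[k];
        printf("0x%lx\n", addr);

        /* Advance the innermost loop first, carrying outward. */
        int k = 0;
        while (k < t->num_loops && ++idx[k] == t->count[k])
            idx[k++] = 0;
        if (k == t->num_loops)
            break;  /* all loops exhausted */
    }
}

int main(void) {
    /* Two nested loops: inner count 4 with stride 8, outer count 2 with
     * stride 64 (values chosen arbitrarily for the example). */
    struct stream_template t = { 2, {4, 2}, {8, 64} };
    generate_addresses(&t, 0x1000);
    return 0;
}
```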
Convolutional neural network processor and data processing method thereof
A convolutional neural network processor includes an information decode unit and a convolutional neural network inference unit. The information decode unit is configured to receive a program input and weight parameter inputs and includes a decoding module and a parallel processing module. The decoding module receives the program input and produces an operational command according to the program input. The parallel processing module is electrically connected to the decoding module, receives the weight parameter inputs, and includes a plurality of parallel processing sub-modules for producing a plurality of weight parameter outputs. The convolutional neural network inference unit is electrically connected to the information decode unit and includes a computing module. The computing module is electrically connected to the parallel processing module and produces output data according to input data and the weight parameter outputs.
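A toy C sketch of the described dataflow, under loose assumptions: the program-word encoding, the sub-module transform, and the single 1-D convolution tap are all invented placeholders, meant only to show the decode, parallel weight processing, and compute chain.

```c
#include <stdio.h>

#define KERNEL 3   /* hypothetical kernel width */

/* Decoding module: turn the program input into an operational command.
 * Here the "command" is just a kernel size (a made-up encoding). */
static int decode(unsigned program_word) {
    return (int)(program_word & 0xF);
}

/* Each parallel sub-module prepares one weight parameter output
 * (e.g., dequantizes it); modeled as a simple scale. */
static float submodule(float w_in) {
    return w_in * 0.5f;  /* placeholder transform */
}

int main(void) {
    unsigned program = 0x3;                 /* program input */
    float w_in[KERNEL]  = {1.f, 2.f, 3.f};  /* weight parameter inputs */
    float x[KERNEL]     = {4.f, 5.f, 6.f};  /* input data */
    float w_out[KERNEL];

    int ksize = decode(program);            /* operational command */

    /* Parallel processing module: sub-modules work on disjoint weights. */
    for (int i = 0; i < ksize; i++)
        w_out[i] = submodule(w_in[i]);

    /* Computing module: one 1-D convolution tap as an inner product. */
    float y = 0.f;
    for (int i = 0; i < ksize; i++)
        y += x[i] * w_out[i];
    printf("output = %f\n", y);
    return 0;
}
```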
SCALABLE TOGGLE POINT CONTROL CIRCUITRY FOR A CLUSTERED DECODE PIPELINE
Systems, methods, and apparatuses relating to circuitry to implement toggle point insertion for a clustered decode pipeline are described. In one example, a hardware processor core includes a first decode cluster comprising a plurality of decoder circuits, a second decode cluster comprising a plurality of decoder circuits, and a toggle point control circuit to toggle the sending of instructions requested for decoding between the first decode cluster and the second decode cluster, wherein the toggle point control circuit is to: determine a location in an instruction stream as a candidate toggle point at which to switch the sending of the instructions between the first decode cluster and the second decode cluster; track the number of times a characteristic of multiple previous decodes of the instruction stream is present for the location; and, based on that number of times, cause insertion of a toggle point at the location to switch the sending of the instructions between the first decode cluster and the second decode cluster.
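A simplified C model of the counting behavior, assuming a hypothetical THRESHOLD and treating "the characteristic" abstractly; the real circuit tracks this per instruction-stream location in hardware.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define THRESHOLD 3   /* hypothetical confidence threshold */

/* Tracks, for one candidate location, how many previous decodes of the
 * instruction stream exhibited the characteristic. Once seen often
 * enough, a toggle point is committed so later fetches switch decode
 * clusters at that address. */
struct toggle_candidate {
    uint64_t location;   /* instruction address */
    unsigned seen;       /* occurrence count across previous decodes */
    bool inserted;       /* toggle point committed? */
};

/* Called each time a decode pass observes the characteristic here. */
static void observe(struct toggle_candidate *c) {
    if (c->inserted)
        return;
    if (++c->seen >= THRESHOLD) {
        c->inserted = true;
        printf("toggle point inserted at 0x%llx\n",
               (unsigned long long)c->location);
    }
}

int main(void) {
    struct toggle_candidate c = { 0x401000, 0, false };
    for (int pass = 0; pass < 4; pass++)
        observe(&c);   /* the third observation commits the toggle point */
    return 0;
}
```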
Method and device for simultaneously decoding data in parallel to improve quality of service
The present disclosure generally relates to a method and device for simultaneously decoding data. Rather than sending data to be decoded to a single decoder, the data can be sent to multiple available decoders so that it can be decoded in parallel. The decoded data from the first decoder to complete decoding is delivered to the host device. All remaining data that was decoded in parallel is discarded. The decoders operating simultaneously in parallel can operate using different parameters, such as different calculation precisions (power levels). By utilizing multiple decoders simultaneously in parallel, the full decoding capability of the data storage device is utilized without increasing latency.
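A minimal POSIX-threads sketch of the first-decoder-wins behavior, with per-decoder latency standing in for the different precision/power parameters; an atomic compare-and-swap picks the winner, and later finishers discard their output.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_DECODERS 3

static atomic_int winner = -1;   /* index of the first decoder to finish */

struct decoder_arg {
    int id;
    useconds_t latency_us;   /* stand-in for per-decoder precision/power */
};

/* Each decoder works on the same data with different parameters; only
 * the first to finish publishes its result, the rest discard theirs. */
static void *decode(void *p) {
    struct decoder_arg *a = p;
    usleep(a->latency_us);            /* placeholder for real decoding work */
    int expected = -1;
    if (atomic_compare_exchange_strong(&winner, &expected, a->id))
        printf("decoder %d delivered its output to the host\n", a->id);
    else
        printf("decoder %d finished late; output discarded\n", a->id);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_DECODERS];
    struct decoder_arg args[NUM_DECODERS] = {
        {0, 30000}, {1, 10000}, {2, 20000}
    };
    for (int i = 0; i < NUM_DECODERS; i++)
        pthread_create(&tid[i], NULL, decode, &args[i]);
    for (int i = 0; i < NUM_DECODERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```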
METHOD AND APPARATUS FOR IMPLIED BIT HANDLING IN FLOATING POINT MULTIPLICATION
A method is provided that includes performing, by a processor in response to a floating-point multiply instruction, multiplication of floating-point numbers, wherein the determination of the values of the implied bits of the leading-bit-encoded mantissas of the floating-point numbers is performed in parallel with the multiplication of the encoded mantissas, and storing, by the processor, a result of the floating-point multiply instruction in a storage location indicated by the floating-point multiply instruction.
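A C sketch of the implied-bit rule for IEEE-754 single precision: the hidden bit depends only on whether the exponent field is nonzero, which is why hardware can resolve it in parallel with the mantissa multiply; here the two steps are simply computed one after the other.

```c
#include <stdint.h>
#include <stdio.h>

/* IEEE-754 single-precision field layout. */
#define FRAC_BITS 23
#define FRAC_MASK ((1u << FRAC_BITS) - 1)

/* Multiply the 24-bit significands of two floats. The implied (hidden)
 * leading bit is derived from the exponent field alone: 1 for normal
 * numbers, 0 for subnormals. */
static uint64_t significand_product(uint32_t a_bits, uint32_t b_bits) {
    uint32_t a_exp = (a_bits >> FRAC_BITS) & 0xFF;
    uint32_t b_exp = (b_bits >> FRAC_BITS) & 0xFF;

    uint32_t a_implied = (a_exp != 0);   /* hidden bit of operand a */
    uint32_t b_implied = (b_exp != 0);   /* hidden bit of operand b */

    uint32_t a_sig = (a_implied << FRAC_BITS) | (a_bits & FRAC_MASK);
    uint32_t b_sig = (b_implied << FRAC_BITS) | (b_bits & FRAC_MASK);

    return (uint64_t)a_sig * (uint64_t)b_sig;   /* up to 48 bits wide */
}

int main(void) {
    /* 1.5f = 0x3FC00000, 2.0f = 0x40000000 */
    uint64_t p = significand_product(0x3FC00000u, 0x40000000u);
    printf("significand product = 0x%llx\n", (unsigned long long)p);
    return 0;
}
```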
System and method of VLIW instruction processing using reduced-width VLIW processor
Very long instruction word (VLIW) instruction processing using a reduced-width processor is disclosed. In a particular embodiment, a VLIW processor includes a control circuit configured to receive a VLIW packet that includes a first number of instructions and to distribute the instructions to a second number of instruction execution paths. The first number is greater than the second number. The VLIW processor also includes physical registers configured to store results of executing the instructions and a register renaming circuit that is coupled to the control circuit.
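Sketched in C under simple assumptions (PACKET_SIZE and NUM_PATHS are hypothetical), the control circuit's distribution step amounts to spreading one wide packet across the narrower machine over several issue cycles; the register renaming that preserves packet semantics is not modeled here.

```c
#include <stdio.h>

#define PACKET_SIZE 4   /* instructions per VLIW packet */
#define NUM_PATHS   2   /* execution paths in the reduced-width core */

struct insn { const char *text; };

/* Control-circuit sketch: a packet of PACKET_SIZE instructions is
 * spread over NUM_PATHS paths, taking multiple issue cycles whenever
 * the packet is wider than the machine. */
static void distribute(const struct insn *pkt) {
    int cycle = 0;
    for (int i = 0; i < PACKET_SIZE; i += NUM_PATHS, cycle++)
        for (int p = 0; p < NUM_PATHS && i + p < PACKET_SIZE; p++)
            printf("cycle %d, path %d: %s\n", cycle, p, pkt[i + p].text);
}

int main(void) {
    struct insn pkt[PACKET_SIZE] = {
        {"add r1, r2, r3"}, {"mul r4, r5, r6"},
        {"ld  r7, [r8]"},   {"st  [r9], r10"},
    };
    distribute(pkt);   /* 4-wide packet issued on a 2-wide machine */
    return 0;
}
```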
Information processing apparatus, information processing method and non-transitory computer-readable storage medium storing an information processing program for determining relations among nodes in an N-dimensional torus structure
An information processing apparatus for controlling a plurality of nodes mutually coupled via a plurality of cables includes: a memory; and a processor coupled to the memory, the processor being configured to cause a first node to execute first processing to extract the coupling relationships between the plurality of nodes, the first node being one of the plurality of nodes and being selected sequentially from among the plurality of nodes, the first processing including: executing allocation processing that allocates unique coordinate information to the first node and allocates common coordinate information to the nodes excluding the first node; executing transmission processing that causes the first node to transmit first information to each of the cables coupled to the first node; and executing identification processing that identifies a node having received the first information as a neighboring node coupled to one of the plurality of cables coupled to the first node.
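A toy C model of the three steps (allocation, transmission, identification) on a 1-D torus, i.e. a ring; the wiring table stands in for the physical cables, and the coordinate values are placeholders.

```c
#include <stdio.h>
#include <string.h>

#define NUM_NODES  4
#define MAX_CABLES 2    /* a 1-D torus (ring): two cables per node */
#define COMMON     (-1) /* common coordinate given to all other nodes */

struct node {
    int coord;                 /* unique for the probing node, COMMON otherwise */
    int neighbor[MAX_CABLES];  /* node id discovered on each cable, or -1 */
};

int main(void) {
    struct node nodes[NUM_NODES];
    int wiring[NUM_NODES][MAX_CABLES];  /* stands in for the physical cables */
    for (int i = 0; i < NUM_NODES; i++) {
        wiring[i][0] = (i + 1) % NUM_NODES;              /* clockwise cable */
        wiring[i][1] = (i + NUM_NODES - 1) % NUM_NODES;  /* counter-clockwise */
    }

    /* Take each node in turn as the "first node". */
    for (int first = 0; first < NUM_NODES; first++) {
        /* Allocation: a unique coordinate to the first node, a common
         * one to all the rest. */
        for (int i = 0; i < NUM_NODES; i++) {
            nodes[i].coord = (i == first) ? first : COMMON;
            memset(nodes[i].neighbor, -1, sizeof nodes[i].neighbor);
        }
        /* Transmission + identification: the first node sends its id
         * down every cable; whichever node receives it is recorded as
         * the neighbor on that cable. */
        for (int c = 0; c < MAX_CABLES; c++) {
            int receiver = wiring[first][c];
            nodes[first].neighbor[c] = receiver;
            printf("node %d (coord %d): cable %d reaches node %d\n",
                   first, nodes[first].coord, c, receiver);
        }
    }
    return 0;
}
```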
Microprocessor with shared functional unit for executing multi-type instructions
A microprocessor that includes a shared functional unit, a first execution queue and a second execution queue is introduced. The first execution queue includes a plurality of entries, wherein each entry of the first execution queue includes a first count value which is decremented until it reaches 0. The first execution queue dispatches the first-type instruction to the shared functional unit when the first count value reaches 0. The second execution queue includes a plurality of entries, wherein each entry of the second execution queue comprises a second count value which is decremented until it reaches 0. The second execution queue dispatches the second-type instruction to the shared functional unit when the second count value reaches 0. An issue unit resolves all data dependencies and resource conflicts by presetting the first and second count values so that the first-type and second-type instructions are each executed by the shared functional unit at an exact, non-conflicting time in the future.
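A small C simulation of the count-down dispatch scheme; the preset count values are chosen by hand here to play the issue unit's role of keeping the two queues from reaching the shared unit in the same cycle.

```c
#include <stdio.h>

#define QLEN 4

/* An execution-queue entry carries a countdown preset at issue time;
 * when it hits 0, the instruction is dispatched to the shared unit. */
struct entry {
    const char *insn;
    int count;     /* cycles until dispatch; preset by the issue unit */
    int valid;
};

/* One simulated cycle for one queue: dispatch entries whose count has
 * reached 0, decrement the rest. */
static void tick(struct entry *q, const char *qname) {
    for (int i = 0; i < QLEN; i++) {
        if (!q[i].valid)
            continue;
        if (q[i].count == 0) {
            printf("%s dispatches %s to the shared unit\n", qname, q[i].insn);
            q[i].valid = 0;
        } else {
            q[i].count--;
        }
    }
}

int main(void) {
    /* Counts 0/2 and 1/3: the two queues interleave on the shared
     * unit with no same-cycle conflict. */
    struct entry q1[QLEN] = { {"fmul r1", 0, 1}, {"fmul r2", 2, 1} };
    struct entry q2[QLEN] = { {"fdiv r3", 1, 1}, {"fdiv r4", 3, 1} };

    for (int cycle = 0; cycle < 5; cycle++) {
        printf("cycle %d:\n", cycle);
        tick(q1, "queue1");
        tick(q2, "queue2");
    }
    return 0;
}
```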