G06F7/575

Graphics processors and graphics processing units having dot product accumulate instruction for hybrid floating point format

Described herein is a graphics processing unit (GPU) comprising a first processing cluster to perform parallel processing operations, including a ray tracing operation and a matrix multiply operation, and a second processing cluster coupled to the first processing cluster. The first processing cluster includes a floating-point unit to perform floating-point operations; the floating-point unit is configured to process an instruction using a bfloat16 (BF16) format, with a multiplier to multiply second and third source operands while an accumulator adds a first source operand to the output of the multiplier.
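
As a minimal software model of the described instruction (not the patented hardware), the sketch below treats BF16 as a float32 with the low 16 bits truncated; the function names and the truncating rounding mode are illustrative assumptions.

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float32 to bfloat16 precision: keep the sign,
    the 8-bit exponent, and the top 7 mantissa bits."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

def dp_accumulate(src0: float, src1: float, src2: float) -> float:
    """Model of the described operation: multiply the second and
    third source operands at BF16 precision while the accumulator
    adds the first source operand to the multiplier's output."""
    return src0 + to_bf16(src1) * to_bf16(src2)

# acc = acc + b * c, with b and c rounded to BF16 first
print(dp_accumulate(1.0, 1.5, 2.0))  # 4.0 (1.5 and 2.0 are exact in BF16)
```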

Neural networks for embedded devices
20230237331 · 2023-07-27

A neural network architecture reduces the processing load of implementing the neural network, making it suitable for reduced-bit processing devices. The architecture may limit the number of bits used for processing and reduce computation to prevent data overflow at individual calculations of the neural network. To implement this architecture, the number of bits used to represent inputs at each level of the network, and the related filter masks, may be modified to ensure that the bit width of the output does not exceed the capacity of the reduced-bit processor. To further reduce the load, the network may implement a “starconv” structure that incorporates nearby nodes in a layer, balancing processing requirements while permitting the network to learn from the context of other nodes.
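
As a rough sketch of the bit-budget arithmetic the abstract alludes to (the bound, the formula, and the names below are assumptions for illustration, not the patent's method), one can estimate the accumulator width a convolution needs and, conversely, the widest inputs a fixed-capacity processor can accept:

```python
import math

def accumulator_bits(in_bits: int, weight_bits: int, n_taps: int) -> int:
    """Worst-case bit width of a sum of n_taps products of
    in_bits x weight_bits operands (a conservative bound)."""
    return in_bits + weight_bits + math.ceil(math.log2(n_taps))

def max_input_bits(capacity_bits: int, weight_bits: int, n_taps: int) -> int:
    """Largest input width whose convolution output still fits
    the reduced-bit processor's accumulator capacity."""
    return capacity_bits - weight_bits - math.ceil(math.log2(n_taps))

# Example: 3x3 filter (9 taps), 8-bit weights, 32-bit accumulator
print(accumulator_bits(8, 8, 9))  # 20 bits needed -> no overflow at 32
print(max_input_bits(32, 8, 9))   # inputs may use up to 20 bits
```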

Adder circuit using lookup tables

A four-input lookup table (“LUT4”) is modified to operate in a first mode as an ordinary LUT4 and in a second mode as a 1-bit adder providing a sum output and a carry output. A six-input lookup table (“LUT6”) is modified to operate in a first mode as an ordinary LUT6 with a single output and in a second mode as a 2-bit adder providing a sum output and a carry output. Both possible results, one for each possible carry input, can be computed in advance and selected between once the carry input is available, implementing a 2-bit carry-select adder in the second mode while retaining the ability to operate as an ordinary LUT6 in the first mode. Using this LUT6 design in a chip fabric allows a 2-bit adder slice to be built that makes efficient use of the LUT6 without requiring additional logic blocks.
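
A behavioral sketch of the second (adder) mode, assuming 2-bit operands and treating the LUT6 purely as a truth table; the function name and the slice chaining below are illustrative, not the circuit itself:

```python
def lut6_adder_mode(a: int, b: int, cin: int) -> tuple[int, int]:
    """Behavioral model of the LUT6's second mode: a 2-bit
    carry-select adder. Both candidate results (for cin=0 and
    cin=1) exist up front; the late carry input only selects."""
    a &= 0b11
    b &= 0b11
    sum0, carry0 = (a + b) & 0b11, (a + b) >> 2          # assume cin = 0
    sum1, carry1 = (a + b + 1) & 0b11, (a + b + 1) >> 2  # assume cin = 1
    return (sum1, carry1) if cin else (sum0, carry0)     # cin acts as a mux

# Chain two 2-bit slices into a 4-bit adder: 0b1011 + 0b0101 = 0b10000
lo_sum, lo_carry = lut6_adder_mode(0b11, 0b01, 0)
hi_sum, hi_carry = lut6_adder_mode(0b10, 0b01, lo_carry)
print(hi_carry, hi_sum, lo_sum)  # 1 0 0 -> carry-out 1, sum 0b0000
```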

Accelerated mathematical engine

Various embodiments of the disclosure relate to an accelerated mathematical engine. In certain embodiments, the engine is applied to image processing such that convolution of an image is accelerated by a two-dimensional matrix processor comprising sub-circuits that each include an ALU, an output register, and a shadow register. The engine uses a clocked, two-dimensional architecture in which image data and weights are multiplied in a synchronized manner, allowing a large number of mathematical operations to be performed in parallel.
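
A behavioral sketch of such an array (a software analogue, not the patented circuit; class and function names are invented for illustration): each cell multiply-accumulates into its output register on every clock, then latches the finished sum into its shadow register so the array can begin the next tile:

```python
class MacCell:
    """One sub-circuit of the 2D matrix processor: an ALU that
    multiply-accumulates, an output register holding the running
    sum, and a shadow register that latches the finished result."""
    def __init__(self):
        self.output_reg = 0.0
        self.shadow_reg = 0.0

    def mac(self, data: float, weight: float):
        self.output_reg += data * weight   # one ALU op per clock

    def latch(self):
        self.shadow_reg = self.output_reg  # free the output register
        self.output_reg = 0.0              # ready for the next tile

def matmul_2d_array(data, weights):
    """Clocked sweep over an R x C grid of MAC cells computing
    data @ weights; every cell is driven in lockstep."""
    rows, inner, cols = len(data), len(weights), len(weights[0])
    grid = [[MacCell() for _ in range(cols)] for _ in range(rows)]
    for k in range(inner):                 # one "clock" per k-step
        for r in range(rows):
            for c in range(cols):
                grid[r][c].mac(data[r][k], weights[k][c])
    for row in grid:                       # results move to shadow regs
        for cell in row:
            cell.latch()
    return [[cell.shadow_reg for cell in row] for row in grid]

print(matmul_2d_array([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```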
