G06F7/523

PARTIAL SUM MANAGEMENT AND RECONFIGURABLE SYSTOLIC FLOW ARCHITECTURES FOR IN-MEMORY COMPUTATION
20230047364 · 2023-02-16 ·

Methods and apparatus for performing machine learning tasks, and in particular, to a neural-network-processing architecture and circuits for improved handling of partial accumulation results in weight-stationary operations, such as operations occurring in compute-in-memory (CIM) processing elements (PEs). One example PE circuit for machine learning generally includes an accumulator circuit, a flip-flop array having an input coupled to an output of the accumulator circuit, a write register, and a first multiplexer having a first input coupled to an output of the write register, having a second input coupled to an output of the flip-flop array, and having an output coupled to a first input of the first accumulator circuit.

PARTIAL SUM MANAGEMENT AND RECONFIGURABLE SYSTOLIC FLOW ARCHITECTURES FOR IN-MEMORY COMPUTATION
20230047364 · 2023-02-16 ·

Methods and apparatus for performing machine learning tasks, and in particular, to a neural-network-processing architecture and circuits for improved handling of partial accumulation results in weight-stationary operations, such as operations occurring in compute-in-memory (CIM) processing elements (PEs). One example PE circuit for machine learning generally includes an accumulator circuit, a flip-flop array having an input coupled to an output of the accumulator circuit, a write register, and a first multiplexer having a first input coupled to an output of the write register, having a second input coupled to an output of the flip-flop array, and having an output coupled to a first input of the first accumulator circuit.

COMBINING REGULAR AND SYMBOLIC NTTS USING CO-PROCESSORS

Various embodiments relate to a method for multiplying a first and a second polynomial in a ring custom-character.sub.q[X]/(X.sup.n+1) to perform a cryptographic operation in a data processing system where q is a positive integer, the method for use in a processor of the data processing system, comprising: receiving the first polynomial and the second polynomial by the processor; mapping the first polynomial into k smaller third polynomials over k smaller rings based upon primitive roots of unity, where k is a positive integer; mapping the second polynomial into k smaller fourth polynomials over the k smaller rings based upon primitive roots of unity; applying an isomorphism to the k third polynomials resulting in k fifth polynomials; applying the isomorphism to the k fourth polynomials resulting in k sixth polynomials; applying a Kronecker substitution on the k fifth polynomials and the k sixth polynomials and perform the multiplication of the k fifth polynomials and the k sixth polynomials to produce a multiplication result; applying an inverse of the isomorphism to the multiplication result to obtain the multiplication of the first polynomial and the second polynomial; and mapping the k inverted polynomials to a single polynomial in the ring mapping the k inverted polynomials to a single polynomial in the ring custom-character.sub.q[X]/(X.sup.n+1.

COMBINING REGULAR AND SYMBOLIC NTTS USING CO-PROCESSORS

Various embodiments relate to a method for multiplying a first and a second polynomial in a ring custom-character.sub.q[X]/(X.sup.n+1) to perform a cryptographic operation in a data processing system where q is a positive integer, the method for use in a processor of the data processing system, comprising: receiving the first polynomial and the second polynomial by the processor; mapping the first polynomial into k smaller third polynomials over k smaller rings based upon primitive roots of unity, where k is a positive integer; mapping the second polynomial into k smaller fourth polynomials over the k smaller rings based upon primitive roots of unity; applying an isomorphism to the k third polynomials resulting in k fifth polynomials; applying the isomorphism to the k fourth polynomials resulting in k sixth polynomials; applying a Kronecker substitution on the k fifth polynomials and the k sixth polynomials and perform the multiplication of the k fifth polynomials and the k sixth polynomials to produce a multiplication result; applying an inverse of the isomorphism to the multiplication result to obtain the multiplication of the first polynomial and the second polynomial; and mapping the k inverted polynomials to a single polynomial in the ring mapping the k inverted polynomials to a single polynomial in the ring custom-character.sub.q[X]/(X.sup.n+1.

Signed multiplication using unsigned multiplier with dynamic fine-grained operand isolation

An N×N multiplier may include a N/2×N first multiplier, a N/2×N/2 second multiplier, and a N/2×N/2 third multiplier. The N×N multiplier receives two operands to multiply. The first, second and/or third multipliers are selectively disabled if an operand equals zero or has a small value. If the operands are both less than 2.sup.N/2, the second or the third multiplier are used to multiply the operands. If one operand is less than 2.sup.N/2 and the other operand is equal to or greater than 2.sup.N/2, the first multiplier is used or the second and third multipliers are used to multiply the operands. If both operands are equal to or greater than 2.sup.N/2, the first, second and third multipliers are used to multiply the operands.

Signed multiplication using unsigned multiplier with dynamic fine-grained operand isolation

An N×N multiplier may include a N/2×N first multiplier, a N/2×N/2 second multiplier, and a N/2×N/2 third multiplier. The N×N multiplier receives two operands to multiply. The first, second and/or third multipliers are selectively disabled if an operand equals zero or has a small value. If the operands are both less than 2.sup.N/2, the second or the third multiplier are used to multiply the operands. If one operand is less than 2.sup.N/2 and the other operand is equal to or greater than 2.sup.N/2, the first multiplier is used or the second and third multipliers are used to multiply the operands. If both operands are equal to or greater than 2.sup.N/2, the first, second and third multipliers are used to multiply the operands.

Small multiplier after initial approximation for operations with increasing precision
11579844 · 2023-02-14 · ·

In an aspect, a processor includes circuitry for iterative refinement approaches, e.g., Newton-Raphson, to evaluating functions, such as square root, reciprocal, and for division. The circuitry includes circuitry for producing an initial approximation; which can include a LookUp Table (LUT). LUT may produce an output that (with implementation-dependent processing) forms an initial approximation of a value, with a number of bits of precision. A limited-precision multiplier multiplies that initial approximation with another value; an output of the limited precision multiplier goes to a full precision multiplier circuit that performs remaining multiplications required for iteration(s) in the particular refinement process being implemented. For example, in division, the output being calculated is for a reciprocal of the divisor. The full-precision multiplier circuit requires a first number of clock cycles to complete, and both the small multiplier and the initial approximation circuitry complete within the first number of clock cycles.

Small multiplier after initial approximation for operations with increasing precision
11579844 · 2023-02-14 · ·

In an aspect, a processor includes circuitry for iterative refinement approaches, e.g., Newton-Raphson, to evaluating functions, such as square root, reciprocal, and for division. The circuitry includes circuitry for producing an initial approximation; which can include a LookUp Table (LUT). LUT may produce an output that (with implementation-dependent processing) forms an initial approximation of a value, with a number of bits of precision. A limited-precision multiplier multiplies that initial approximation with another value; an output of the limited precision multiplier goes to a full precision multiplier circuit that performs remaining multiplications required for iteration(s) in the particular refinement process being implemented. For example, in division, the output being calculated is for a reciprocal of the divisor. The full-precision multiplier circuit requires a first number of clock cycles to complete, and both the small multiplier and the initial approximation circuitry complete within the first number of clock cycles.

Neural network processor for handling differing datatypes
11580353 · 2023-02-14 · ·

Embodiments relate to a neural engine circuit that includes an input buffer circuit, a kernel extract circuit, and a multiply-accumulator (MAC) circuit. The MAC circuit receives input data from the input buffer circuit and a kernel coefficient from the kernel extract circuit. The MAC circuit contains several multiply-add (MAD) circuits and accumulators used to perform neural networking operations on the received input data and kernel coefficients. MAD circuits are configured to support fixed-point precision (e.g., INT8) and floating-point precision (FP16) of operands. In floating-point mode, each MAD circuit multiplies the integer bits of input data and kernel coefficients and adds their exponent bits to determine a binary point for alignment. In fixed-point mode, input data and kernel coefficients are multiplied. In both operation modes, the output data is stored in an accumulator, and may be sent back as accumulated values for further multiply-add operations in subsequent processing cycles.

Neural network processor for handling differing datatypes
11580353 · 2023-02-14 · ·

Embodiments relate to a neural engine circuit that includes an input buffer circuit, a kernel extract circuit, and a multiply-accumulator (MAC) circuit. The MAC circuit receives input data from the input buffer circuit and a kernel coefficient from the kernel extract circuit. The MAC circuit contains several multiply-add (MAD) circuits and accumulators used to perform neural networking operations on the received input data and kernel coefficients. MAD circuits are configured to support fixed-point precision (e.g., INT8) and floating-point precision (FP16) of operands. In floating-point mode, each MAD circuit multiplies the integer bits of input data and kernel coefficients and adds their exponent bits to determine a binary point for alignment. In fixed-point mode, input data and kernel coefficients are multiplied. In both operation modes, the output data is stored in an accumulator, and may be sent back as accumulated values for further multiply-add operations in subsequent processing cycles.