G06F5/012

METHOD AND APPARATUS WITH FLOATING POINT PROCESSING
20230042954 · 2023-02-09

A processor-implemented method includes receiving a first floating point operand and a second floating point operand, each having an n-bit format comprising a sign field, an exponent field, and a significand field; normalizing a binary value obtained by performing arithmetic operations on corresponding fields of the first and second floating point operands for an n-bit multiplication operation; determining whether the normalized binary value is a number that is representable in the n-bit format or an extended normal number that is not representable in the n-bit format; according to a result of the determining, encoding the normalized binary value using an extension bit format in which an extension bit identifying whether the normalized binary value is the extended normal number is added to the n-bit format; and outputting the encoded binary value in the extension bit format as a result of the n-bit multiplication operation.
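The encode-and-flag step can be sketched in Python; the field widths, the (n+1)-bit packing, and the overflow handling below are illustrative assumptions, not the patent's exact scheme:

```python
def encode_extended(sign, exponent, significand, exp_bits=5, sig_bits=10):
    """Pack an FP value into an (n+1)-bit word whose top bit (the
    extension bit) flags an extended normal number, i.e. an exponent
    the n-bit format cannot represent. Widths are illustrative."""
    max_exp = (1 << exp_bits) - 1
    extended = exponent > max_exp or exponent < 0
    if extended:
        exponent &= max_exp  # keep only the representable low bits (sketch)
    word = (int(extended) << (1 + exp_bits + sig_bits)) \
        | (sign << (exp_bits + sig_bits)) \
        | (exponent << sig_bits) \
        | significand
    return word, extended
```

A downstream consumer would test the extension bit before interpreting the remaining n bits in the ordinary way.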

Quantizing neural networks using approximate quantization function
11494657 · 2022-11-08

Some embodiments of the invention provide a novel method for training a quantized machine-trained network. Some embodiments provide a method of scaling a feature map of a pre-trained floating-point neural network in order to match the range of output values provided by quantized activations in a quantized neural network. A quantization function is modified, in some embodiments, to be differentiable to fix the mismatch between the loss function computed in forward propagation and the loss gradient used in backward propagation. Variational information bottleneck, in some embodiments, is incorporated to train the network to be insensitive to multiplicative noise applied to each channel. In some embodiments, channels that finish training with large noise, for example, exceeding 100%, are pruned.
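A common way to make a quantization function differentiable for backpropagation is a straight-through-style surrogate gradient; the sketch below assumes uniform symmetric quantization and is not necessarily the exact modification used in these embodiments:

```python
import numpy as np

def fake_quantize(x, num_bits=8, x_max=1.0):
    """Quantize-dequantize forward pass: uniform symmetric
    quantization with an illustrative clipping range x_max."""
    levels = 2 ** (num_bits - 1) - 1
    scale = x_max / levels
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale

def ste_grad(upstream, x, x_max=1.0):
    """Straight-through estimator (an assumed, standard surrogate):
    pass gradients unchanged inside the clipping range, zero outside."""
    return upstream * (np.abs(x) <= x_max)
```

The forward pass uses the true (piecewise-constant) quantizer while the backward pass substitutes the identity inside the clip range, removing the forward/backward mismatch described above.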

Floating point to fixed point conversion
11573766 · 2023-02-07

A binary logic circuit converts a number in floating point format, having an exponent E of ew bits, an exponent bias B given by B = 2^(ew−1) − 1, and a significand comprising a mantissa M of mw bits, into a fixed point format with an integer width of iw bits and a fractional width of fw bits. The circuit includes a shifter operable to receive a significand input comprising a contiguous set of the most significant bits of the significand and configured to left-shift the significand input by a number of bits equal to the value represented by the k least significant bits of the exponent to generate a shifter output, wherein min{(ew−1), bitwidth(iw−2−s_y)} ≤ k ≤ (ew−1), where s_y = 1 for a signed floating point number and s_y = 0 for an unsigned floating point number; and a multiplexer coupled to the shifter and configured to: receive an input comprising a contiguous set of bits of the shifter output; and output the input if the most significant bit of the exponent is equal to one.
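A software sketch of the overall conversion (not of the shifter/multiplexer datapath itself), for a normalized input and with illustrative binary32-like widths:

```python
def float_to_fixed(sign, exponent, mantissa, ew=8, mw=23, iw=16, fw=16):
    """Convert a normalized float, given as raw fields, to a signed
    fixed-point integer with iw integer and fw fractional bits.
    Widths are illustrative; no overflow/rounding handling."""
    bias = (1 << (ew - 1)) - 1           # B = 2^(ew-1) - 1
    significand = (1 << mw) | mantissa   # restore the implicit leading 1
    e = exponent - bias                  # unbiased exponent
    shift = e + fw - mw                  # align the binary point to fw
    value = significand << shift if shift >= 0 else significand >> -shift
    return -value if sign else value
```

For example, the binary32 fields of 1.5 (exponent 127, mantissa 0x400000) map to 1.5 · 2^16 in Q16.16.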

TININESS DETECTION
20230035159 · 2023-02-02

An apparatus comprises floating-point processing circuitry to perform a floating-point operation with rounding to generate a floating-point result value; and tininess detection circuitry to detect a tininess status indicating whether an outcome of the floating-point operation is tiny. A tiny outcome corresponds to a non-zero number with a magnitude smaller than a minimum non-zero magnitude representable as a normal floating-point number in a floating-point format to be used for the floating-point result value. The tininess detection circuitry comprises hardware circuit logic configured to support both before rounding tininess detection and after rounding tininess detection for detecting the tininess status.
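The two detection modes can be sketched as follows, using binary32's minimum normal exponent as an illustrative default; `result_before_round` stands in for the infinitely precise intermediate value:

```python
def tininess(result_before_round, result_after_round, emin=-126):
    """Detect tininess before and after rounding for a format whose
    minimum normal magnitude is 2**emin (binary32 value, illustrative)."""
    min_normal = 2.0 ** emin
    before = result_before_round != 0 and abs(result_before_round) < min_normal
    after = result_after_round != 0 and abs(result_after_round) < min_normal
    return before, after
```

The two modes disagree exactly when rounding pushes a sub-minimum result up to the minimum normal magnitude, which is why hardware supporting both needs distinct detection logic.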

SINGLE-CYCLE KULISCH ACCUMULATOR
20230092574 · 2023-03-23

A processor calculates a floating-point dot-product by receiving a sequence of first and second floating-point numbers, each having a sign, a mantissa value, and an exponent value. A floating-point unit determines the floating-point dot-product of the sequences by adding the exponent values to determine an exponent product, calculating a shift amount as a one's complement of a low exponent, multiplying the mantissas of the sequences to determine a product value of the mantissas, right-shifting the product value of the mantissas by the shift amount to generate a shifted product, selecting segments of an accumulator based on a high exponent, and adding the selected segments to the shifted product to generate a sum. The sum is then written into the selected segments of the accumulator.
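The core idea of a Kulisch accumulator, accumulating every product exactly in one wide fixed-point register, can be sketched with Python's arbitrary-precision integers; the register scaling and the use of `frexp` are illustrative and do not model the single-cycle segment-select datapath:

```python
import math

class KulischAccumulator:
    """Exact multiply-accumulate into a wide fixed-point register.
    frac_bits is an illustrative choice of binary-point position."""
    def __init__(self, frac_bits=64):
        self.acc = 0
        self.frac_bits = frac_bits

    def mac(self, a, b):
        ma, ea = math.frexp(a)                 # a = ma * 2**ea, 0.5 <= |ma| < 1
        mb, eb = math.frexp(b)
        ia, ib = int(ma * 2**53), int(mb * 2**53)   # integer mantissas
        e = (ea - 53) + (eb - 53) + self.frac_bits  # product's alignment shift
        prod = ia * ib
        # align the product to the accumulator's binary point and add
        self.acc += prod << e if e >= 0 else prod >> -e

    def value(self):
        return self.acc / 2**self.frac_bits
```

Because additions happen in a single fixed-point register, the accumulation order cannot introduce rounding error (down to the register's least significant bit).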

PERFORMING COMPARISON OPERATIONS USING EXTENDED EXPONENT RANGE FLOATING POINT VALUES
20220350566 · 2022-11-03

A method and a processing module for performing a particular comparison operation using floating point values received in one or more input formats. The exponent range of the floating point values is extended. One or more of the following is performed: (a) a floating point value of zero is replaced with a non-zero substitute floating point value whose magnitude is small enough to behave like zero if all other values involved in the particular comparison operation are non-zero finite values in their input format; (b) one or more of the floating point values are shifted by a non-zero amount which is small enough to behave like zero if all other values involved in the particular comparison operation are non-zero finite values in their input format, wherein said non-zero amount is too small to be representable using the one or more input formats but is representable using the extended exponent range; and (c) a floating point value of infinity is replaced with a finite substitute floating point value whose magnitude is large enough to behave like infinity if all other values involved in the particular comparison operation are non-zero finite values in their input format, wherein said finite substitute floating point value has a magnitude that is too large to be representable using the one or more input formats but is representable using the extended exponent range.
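Substitutions (a) and (c) can be sketched as follows; the tiny and huge magnitudes are arbitrary illustrative stand-ins for values representable only in the extended exponent range:

```python
import math

def extend_for_compare(x, tiny=2.0 ** -200, huge=2.0 ** 200):
    """Replace zero and infinity with sign-preserving finite substitutes
    before a comparison. Magnitudes are illustrative placeholders for
    values outside the input format's exponent range."""
    if x == 0.0:
        return math.copysign(tiny, x)   # (a) zero -> tiny non-zero
    if math.isinf(x):
        return math.copysign(huge, x)   # (c) infinity -> large finite
    return x
```

After substitution, the comparison sees only finite non-zero operands, yet its outcome matches what the original zeros and infinities would have produced against in-range values.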

PROCESSING-IN-MEMORY DEVICES HAVING MULTIPLE OPERATION CIRCUITS
20220342639 · 2022-10-27

A processing-in-memory (PIM) device may include a plurality of memory banks configured to provide plural groups of weight data, a global buffer configured to provide plural sets of vector data, and a plurality of multiplication/accumulation (MAC) operators configured to perform MAC operations on the plural groups of weight data and the plural sets of vector data. Each of the plurality of MAC operators includes a plurality of multiple operation circuits. Each of the plurality of multiple operation circuits is configured to perform an arithmetic operation in a first operation mode, a second operation mode, or a third operation mode according to first to third selection signals.
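The per-bank MAC operation reduces to an elementwise multiply-accumulate; a minimal sketch that ignores the three operation modes and their selection signals:

```python
def pim_mac(weight_groups, vector_sets):
    """One MAC operator per bank: multiply each bank's weight group
    elementwise with its vector set and accumulate to a scalar.
    Shapes and integer operands are illustrative."""
    return [sum(w * v for w, v in zip(ws, vs))
            for ws, vs in zip(weight_groups, vector_sets)]
```

Each inner sum corresponds to one MAC operator's output; the banks' results are produced in parallel in the actual device.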

Cryptographic Computer Machines with Novel Switching Devices
20230125560 · 2023-04-27

Operational n-state digital circuits and n-state switching operations, with n an integer greater than 2, execute Finite Lab-transformed (FLT) n-state switching functions to process n-state signals provided on at least 2 inputs to generate an n-state signal on an output. The FLT is an enhancement of a computer architecture. Cryptographic apparatus and methods apply circuits characterized by FLT-ed addition and/or multiplication over finite field GF(n), or by addition and/or multiplication modulo-n, modified in accordance with reversible n-state inverters so that they are no longer standard, known operations. Cryptographic methods processed with FLT-modified machine instructions include encryption/decryption, public key generation, and digital signature methods, including post-quantum methods. They include modifications of isogeny-based, NTRU-based, and McEliece-based cryptographic machines.
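Reading the abstract's description, an FLT-ed operation passes the inputs and output of a base operation through reversible n-state inverters (permutations); a small sketch of that construction, where the particular permutation is an arbitrary illustration:

```python
def lab_transform(op, inv_in, inv_out):
    """Return a transformed n-state switching function: inputs pass
    through a reversible inverter (permutation list), the base
    operation is applied, and the result passes through an output
    inverter. A sketch of the general construction, not a claimed circuit."""
    def f(a, b):
        return inv_out[op(inv_in[a], inv_in[b])]
    return f

# modulo-3 addition conjugated by the inverter [2, 0, 1]
inv = [2, 0, 1]
inv_rev = [1, 2, 0]  # inverse permutation, used as the output inverter
add3 = lab_transform(lambda a, b: (a + b) % 3, inv, inv_rev)
```

The transformed table is no longer plain modulo-3 addition (for instance it maps (0, 0) to 2 rather than 0), matching the statement that the modified operations are no longer known operations.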

ELECTRONIC DEVICE INCLUDING NEURAL PROCESSING UNIT SUPPORTING DIFFERENT DATA TYPES AND METHOD FOR CONTROLLING THE SAME
20230123312 · 2023-04-20

An operational circuit may include a combiner to combine, based on a request for multiplication of integer numbers (as distinct from floating-point numbers), a first integer number and a second integer number. The operational circuit may include a multiplier including first and second ports. A third integer number may be inputted to the first port, and a fourth integer number indicating a combination of the first integer number and the second integer number may be inputted to the second port. The operational circuit may include a converter to output, based on a fifth integer number indicating a multiplication of the third integer number and the fourth integer number from a third port of the multiplier, a sixth integer number indicating a multiplication of the first integer number and the third integer number, and a seventh integer number indicating a multiplication of the second integer number and the third integer number.
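The combine-multiply-split idea can be sketched with integers; the bit widths and guard spacing below are illustrative assumptions for non-negative operands:

```python
def packed_multiply(a, b, c, width=8):
    """Compute a*b and a*c with a single multiplication by packing b
    and c into one operand. Assumes non-negative a, b, c that each fit
    in `width` bits; the gap keeps the partial products from overlapping."""
    gap = 2 * width                  # room for a*c, which needs 2*width bits
    combined = (b << gap) | c        # the combined (fourth) operand
    product = a * combined           # one pass through the multiplier
    ab = product >> gap              # high segment: a*b
    ac = product & ((1 << gap) - 1)  # low segment:  a*c
    return ab, ac
```

The converter in the abstract corresponds to the final shift-and-mask step that splits the single wide product back into the two requested products.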

Constraint-based dynamic quantization adjustment for fixed-point processing

Aspects of the present disclosure address systems and methods for fixed-point quantization using a dynamic quantization level adjustment scheme. Consistent with some embodiments, a method comprises accessing a neural network comprising floating-point representations of filter weights corresponding to one or more convolution layers. The method further includes determining a peak value of interest from the filter weights and determining a quantization level for the filter weights based on a number of bits in a quantization scheme. The method further includes dynamically adjusting the quantization level based on one or more constraints. The method further includes determining a quantization scale of the filter weights based on the peak value of interest and the adjusted quantization level. The method further includes quantizing the floating-point representations of the filter weights using the quantization scale to generate fixed-point representations of the filter weights.
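The described flow can be sketched end-to-end; the constraint-driven adjustment is reduced here to a single hypothetical `level_scale` factor standing in for the dynamic adjustment step:

```python
import numpy as np

def quantize_weights(weights, num_bits=8, level_scale=1.0):
    """Sketch of the described flow: find the peak magnitude, derive a
    quantization level from the bit width, adjust it by a hypothetical
    constraint factor, compute the scale, and quantize to fixed point."""
    peak = float(np.max(np.abs(weights)))            # peak value of interest
    level = (2 ** (num_bits - 1) - 1) * level_scale  # dynamically adjusted level
    scale = peak / level if level else 1.0           # quantization scale
    fixed = np.round(weights / scale).astype(np.int32)
    return fixed, scale
```

In the actual method the adjustment would be driven by the stated constraints rather than a fixed factor, but the peak-to-level scale derivation is the same.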