MULTIPART NUMERICAL ENCODING
20250373268 · 2025-12-04
Inventors
CPC classification
H03M7/6047
ELECTRICITY
International classification
Abstract
A multipart encoded representation or data type may be particularly useful for limited-capacity processors, such as those used in at-memory or single-instruction, multiple data (SIMD) devices. To encode a source number, the source number is quantized as a first binary representation defined by a first set of exponent bits and a first set of mantissa bits to obtain a first encoded part. A difference between the source number and a nearest dequantization of the first encoded part is computed. The difference is quantized as a second binary representation defined by a second set of exponent bits and a second set of mantissa bits to obtain a second encoded part. The first encoded part and the second encoded part are stored as an encoded representation of the source number. A computational operation may be performed using the encoded representation. The encoded representation may be decoded using an inverse process.
Claims
1. A device comprising: one or more processors configured to collectively: quantize a source number as a first binary representation defined by a first set of exponent bits and a first set of mantissa bits to obtain a first encoded part; compute a difference between the source number and a nearest dequantization of the first encoded part; quantize the difference as a second binary representation defined by a second set of exponent bits and a second set of mantissa bits to obtain a second encoded part; store the first encoded part and the second encoded part as an encoded representation of the source number; and perform a computational operation using the encoded representation.
2. The device of claim 1, wherein: the first binary representation is further defined by a first exponent bias; the second binary representation is further defined by a second exponent bias; and the second exponent bias is selected based on a gap between quantized values of the first binary representation.
3. The device of claim 2, wherein the second exponent bias is selected based on a maximum gap between the quantized values of the first binary representation.
4. The device of claim 1, wherein: the first binary representation is further defined by a first sign bit; the second binary representation is further defined by a second sign bit that is the same as the first sign bit; and the difference is computed as a positive remainder.
5. The device of claim 1, wherein: the first binary representation is further defined by a first sign bit; the second binary representation is further defined by a second sign bit that is independent of the first sign bit; and the difference is computed as positive or negative to set the second sign bit.
6. The device of claim 1, wherein the one or more processors comprises: an array of single-instruction, multiple data (SIMD) processing elements; wherein each processing element of the array is configured to perform: a first multiply-accumulate with the first encoded part and a value; and a second multiply-accumulate with the second encoded part and the value.
7. The device of claim 1, wherein: the first set of exponent bits and the second set of exponent bits are equal in number.
8. The device of claim 1, wherein: the first set of mantissa bits and the second set of mantissa bits are equal in number.
9. The device of claim 1, wherein the difference is a first difference and the one or more processors are configured to collectively: compute a second difference between the first difference and a nearest dequantization of the second encoded part; quantize the second difference as a third binary representation defined by a third set of exponent bits and a third set of mantissa bits to obtain a third encoded part; store the first encoded part, the second encoded part, and the third encoded part as an encoded representation of the source number.
10. The device of claim 1, wherein: the source number is a fixed-point number or a floating-point number; and the first encoded part and the second encoded part are floating-point numbers, and the first encoded part and the second encoded part represent a quantized fixed-point number or a requantized floating-point number.
11. A device comprising: one or more processors configured to collectively: determine a first encoded part of an encoded representation of a source number; dequantize the first encoded part based on a first binary representation defined by a first set of exponent bits and a first set of mantissa bits to obtain a first decoded part; determine a second encoded part of the encoded representation of the source number; dequantize the second encoded part based on a second binary representation defined by a second set of exponent bits and a second set of mantissa bits to obtain a second decoded part; and add the second decoded part to the first decoded part to obtain a decoded number.
12. The device of claim 11, wherein: the first binary representation is further defined by a first exponent bias; the second binary representation is further defined by a second exponent bias; and the second exponent bias is selected based on a gap between quantized values of the first binary representation.
13. The device of claim 12, wherein the second exponent bias is selected based on a maximum gap between the quantized values of the first binary representation.
14. The device of claim 11, wherein: the encoded representation is unsigned; and the first decoded part and the second decoded part are positive.
15. The device of claim 11, wherein: the first binary representation is further defined by a first sign bit; the second binary representation is further defined by a second sign bit that is independent of the first sign bit; and the first decoded part and the second decoded part are each positive or negative.
16. The device of claim 11, wherein the one or more processors comprises: an array of single-instruction, multiple data (SIMD) processing elements; wherein each processing element of the array is configured to perform: a first multiply-accumulate with the first encoded part and a value; and a second multiply-accumulate with the second encoded part and the value.
17. The device of claim 11, wherein: the first set of exponent bits and the second set of exponent bits are equal in number.
18. The device of claim 11, wherein: the first set of mantissa bits and the second set of mantissa bits are equal in number.
19. The device of claim 11, wherein the one or more processors are configured to collectively: determine a third encoded part of the encoded representation of the source number; dequantize the third encoded part based on a third binary representation defined by a third set of exponent bits and a third set of mantissa bits to obtain a third decoded part; and add the third decoded part, the second decoded part, and the first decoded part to obtain a decoded number.
20. The device of claim 11, wherein: the first encoded part and the second encoded part are represented as floating-point numbers; and the decoded number is a fixed-point number.
21. A device comprising: an array of single-instruction, multiple data (SIMD) processing elements; and a controller connected to the array of processing elements, wherein the controller is configured to: load a processing element of the array with an encoded representation of a source number, wherein the encoded representation is defined by first exponent and mantissa bits and second exponent and mantissa bits; control the processing element to perform a computational operation using the encoded representation; and output a result of the computational operation.
22. The device of claim 21, wherein the controller is further configured to: compute the encoded representation by performing a first quantization on the source number based on the first exponent and mantissa bits and by performing a second quantization on a remainder of the first quantization based on the second exponent and mantissa bits.
23. The device of claim 21, wherein the controller is further configured to: compute the result as a decoded representation by performing a first dequantization on a first part of the encoded representation based on the first exponent and mantissa bits, performing a second dequantization on a second part of the encoded representation based on the second exponent and mantissa bits, and combining results of the first and second dequantizations.
Description
BRIEF DESCRIPTION OF THE FIGURES
DETAILED DESCRIPTION
[0012] Disclosed herein are data types (encoded representations), related methods, and devices operable with same. The techniques provided herein aim to improve the tradeoff between simplicity and precision. The techniques provided herein are particularly suited for an at-memory or single-instruction, multiple data (SIMD) hardware architecture, in which each of a multitude of individual processing elements is highly simplified and optimized. Rather than supporting large data types to improve precision, this disclosure provides smaller, combinable data types that allow the tailoring of precision to specific needs. Software developers are thus provided with more tools to better mitigate the negative effects of the tradeoff.
[0014] The one or more processors 102 (generally processor) may include an array of SIMD processing elements, a microcontroller, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or a combination of such. The processor 102 cooperates with the medium 104 and may also include or cooperate with volatile memory, such as a random-access memory (RAM), to execute instructions 106.
[0015] The non-transitory machine-readable medium 104 may include an electronic, magnetic, optical, or other type of non-volatile physical storage device that encodes the instructions 106 that implement the functionality discussed herein. Examples of such storage devices include read-only memory (ROM), electrically-erasable programmable read-only memory (EEPROM), flash memory, a solid-state drive (SSD), a hard drive (HD), or a combination of such.
[0016] The instructions 106 may be directly executed, such as binary or machine code, and/or may include interpretable code, bytecode, source code, or similar instructions that may undergo additional processing to be executed. All of such examples may be considered executable instructions.
[0017] The computing device 100 may be configured to convert a source number 108, such as a real number that is represented as a floating-point number, to a multipart encoded representation 110, of which each part is a floating-point number. The computing device 100 may be configured to convert a multipart encoded representation 110 to a source number 108. In various examples, the computing device 100 is configured to perform both encoding and decoding. In other examples, a computing device 100 may be configured to perform one of encoding and decoding. The needs of a particular implementation may determine the encoding/decoding functionality provided.
[0018] It should be noted that, while the examples herein discuss multipart encoded representations with two parts, the techniques described herein are readily applicable to three, four, or more parts. Any practical number of parts may be used in various implementations, as would be readily understood by the person of ordinary skill in the art given the benefit of this disclosure. Accordingly, when terms such as two or both are used in the description and claims with reference to encoded or decoded parts, it should be kept in mind that two or more of such parts may be used.
[0019] With regard to encoding, the processor 102, as configured by the instructions 106, quantizes the source number 108 as a first encoded part 112. Quantization generally leaves a remainder that would otherwise be lost. Rounding or truncation is used to obtain a suitable quantization for the first encoded part 112. The processor 102 computes this remainder or difference 114 between the source number 108 and a dequantization of the first encoded part 112. To do this, the first encoded part 112 may be dequantized and the difference 114 with the source number 108 may be computed by subtraction.
[0020] The processor 102 is further configured to quantize the difference 114 as a second encoded part 116. Quantization of the second encoded part 116 uses a higher-precision but smaller scale than quantization of the first encoded part 112, so that the difference 114, which is the non-quantized remainder from the first quantization, is suitably quantized.
[0021] The scale and precision of the encoded parts 112, 116 may be defined by respective sets of exponent and mantissa bits, as well as an exponent bias, as will be discussed in detail below.
[0022] The processor 102 may repeat the above process and compute a second difference between the first difference 114 and a nearest dequantization of the second encoded part 116, and then quantize the second difference to generate a third encoded part, and so on.
[0023] The processor 102 is configured to then store the first encoded part 112 and the second encoded part 116 (and the third encoded part, and so on, if used) as a multipart encoded representation 110 of the source number 108.
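The encoding sequence of the preceding paragraphs can be sketched in Python as follows. This is an illustrative sketch only, not the disclosed implementation: it assumes unsigned parts, a truncating (round-down) first quantization, exhaustive search instead of bit manipulation, and the example parameters P_0 = (0, 2, 2, 2) and P_1 = (0, 2, 2, 5) used later in this description; the helper names are hypothetical.

```python
def dequant(q: int, e: int, m: int, b: int) -> float:
    """Dequantize an unsigned code q under parameters (e, m, b)."""
    expo = (q >> m) & ((1 << e) - 1)
    mant = q & ((1 << m) - 1)
    if expo == 0:
        expo = 1            # subnormal case: exponent becomes 1
    else:
        mant |= 1 << m      # normal case: restore implicit leading 1
    return mant * 2.0 ** (expo - b)

def quant_floor(r, e, m, b):
    """Truncating quantization: largest code not exceeding r (r >= 0)."""
    return max((q for q in range(1 << (e + m)) if dequant(q, e, m, b) <= r),
               key=lambda q: dequant(q, e, m, b))

def encode(r, p0=(2, 2, 2), p1=(2, 2, 5)):
    q0 = quant_floor(r, *p0)       # first encoded part
    r1 = r - dequant(q0, *p0)      # remainder left by the first quantization
    q1 = quant_floor(r1, *p1)      # second encoded part quantizes the remainder
    return q0, q1
```

For example, encoding r = 3.3 with these parameters quantizes the first part to 3.0 and the second part captures 0.25 of the remaining 0.3, for a decoded total of 3.25.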
[0024] The processor 102 may then perform a computational operation using the multipart encoded representation 110. Example operations include addition, multiplication, and a multiply-accumulate operation.
[0025] Regarding decoding a multipart encoded representation 110, which may represent the result of a computational operation, the processor 102 performs the inverse of the above functions. This may include identifying the first and second encoded parts 112, 116 in the multipart encoded representation 110, dequantizing the encoded parts 112, 116, and adding the dequantized values together to obtain a decoded number. Decoding is discussed further below.
[0027] At block 202, a source number 108 is encoded as a first binary representation defined by a first set of exponent bits, a first set of mantissa bits, and a first exponent bias to obtain a first encoded part 112. The first binary representation may also include a first sign bit.
[0028] Bias is a tunable value that is subtracted from the exponent. This allows the range of values represented by a binary representation to be scaled by powers of 2.
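As a toy illustration of this effect (not from the disclosure), raising the bias by one halves every value a representation can express, since the bias is subtracted in the exponent:

```python
def scale(expo: int, b: int) -> float:
    # contribution of an exponent field expo under bias b
    return 2.0 ** (expo - b)

# raising the bias from 2 to 3 shifts the whole range down by a power of 2
assert scale(3, 2) == 2.0              # 2**(3-2)
assert scale(3, 3) == 1.0              # 2**(3-3)
assert scale(3, 3) == scale(3, 2) / 2  # halved
```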
[0029] Thus, a binary representation may be defined as a tuple P = (s, e, m, b) of quantization parameters: s: sign bits, e: exponent bits, m: mantissa bits, and b: bias. Quantized values are thus q ∈ Q_P = {0, . . . , 2^(s+e+m) − 1}.
[0030] At block 204, the first encoded part 112 is dequantized to the nearest value, and the difference 114 between the source number 108 and the dequantization of the first encoded part 112 is computed.
[0031] Dequantization of a value q to a real value r ∈ R_P, which is represented as a floating-point number in a computing device, may be performed with the following dequantize function (Python):
TABLE-US-00001
def dequantize(q: Q_P) -> R_P:
    # extract bits from quantized float
    sign = (q >> e >> m) & s
    expo = (q >> m) & ((1 << e) - 1)
    mant = (q >> 0) & ((1 << m) - 1)
    # decode sign
    sign = -1 if sign == 1 else +1
    if expo == 0:
        # subnormal case: exponent becomes 1
        expo = 1
    else:
        # normal case: append leading 1 to mantissa
        mant |= 1 << m
    # dequantize to float
    return sign * mant * 2 ** (expo - b)
[0032] At block 206, the difference 114 is quantized as a second binary representation defined by a second set of exponent bits, a second set of mantissa bits, and a second exponent bias to obtain a second encoded part 116. The second exponent bias is selected based on a gap, such as a maximum gap, between quantized values of the first binary representation. The second binary representation may also include a second sign bit.
[0033] The number of bits used for the first and second exponents may be equal or different. Likewise, the number of bits used for the first and second mantissas may be equal or different. The quantities of the first and second exponent bits and the first and second mantissa bits may be selected according to implementation needs.
[0034] At block 208, the first encoded part 112 and the second encoded part 116 are stored as an encoded representation 110 of the source number 108.
[0035] At block 210, a computational operation, such as an addition, multiplication, or multiply-accumulate, is performed using the encoded representation 110.
[0037] At block 302, a first encoded part 112 of an encoded representation 110 of a source number 108 (
[0038] At block 304, the first encoded part 112 is dequantized based on the first binary representation defined by a first sign bit (if used), a first set of exponent bits, a first set of mantissa bits, and a first bias to obtain a first decoded part 312.
[0039] At block 306, a second encoded part 116 of the encoded representation 110 of the source number 108 (
[0040] At block 308, the second encoded part 116 is dequantized based on the second binary representation defined by a second sign bit (if used), a second set of exponent bits, a second set of mantissa bits, and a second bias to obtain a second decoded part 314.
[0041] At block 310, the second decoded part 314 and the first decoded part 312 are combined, i.e., added, to obtain a decoded number 316.
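The decode path of blocks 302-310 can be sketched as below, using a two-parameter variant of the dequantize function shown earlier. This is illustrative only; it assumes unsigned parts and the example parameters P_0 = (0, 2, 2, 2) and P_1 = (0, 2, 2, 5) used elsewhere in this description.

```python
def decode(q0, q1, p0=(2, 2, 2), p1=(2, 2, 5)):
    def dq(q, e, m, b):
        # same bit layout as the dequantize function above (unsigned)
        expo = (q >> m) & ((1 << e) - 1)
        mant = q & ((1 << m) - 1)
        if expo == 0:
            expo = 1        # subnormal case
        else:
            mant |= 1 << m  # normal case: implicit leading 1
        return mant * 2.0 ** (expo - b)
    # dequantize each part under its own parameters, then add (block 310)
    return dq(q0, *p0) + dq(q1, *p1)
```

For example, decode(6, 5) dequantizes the first part to 3.0 and the second part to 0.3125, giving a decoded number of 3.3125.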
[0043] The quantized real numbers R_P can be partitioned into sets L_k (0 ≤ k < 2^e), each of which contains the elements with exponent equal to k. Each such set has been assigned a symbol in the figure (circle, square, diamond, and X). It should be apparent that the density doubles as k → 0, with the exception of k = 0, whose elements are the subnormals that linearly span the gap between 0 and the smallest normal number.
[0044] Quantizing a real number r may be performed by inverting the dequantize function (above). This may be done by rounding, i.e., a process equivalent to selecting q ∈ Q_P such that |r − dequantize(q)| is minimized.
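Round-to-nearest quantization as described here can be implemented directly as a search over all codes. This sketch uses the unsigned example parameters P = (0, 2, 2, 2); the exhaustive search is illustrative only, as a practical implementation would use bit manipulation instead.

```python
def quantize(r: float, e: int = 2, m: int = 2, b: int = 2) -> int:
    def dq(q):
        # unsigned dequantization under (e, m, b)
        expo = (q >> m) & ((1 << e) - 1)
        mant = q & ((1 << m) - 1)
        if expo == 0:
            expo = 1
        else:
            mant |= 1 << m
        return mant * 2.0 ** (expo - b)
    # pick the code whose dequantization is closest to r
    return min(range(1 << (e + m)), key=lambda q: abs(r - dq(q)))
```

For example, quantize(3.2) returns the code 6, which dequantizes to 3.0, the nearest representable value.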
[0046] With a first binary representation defined by quantization parameters P_0 = (s_0, e_0, m_0, b_0), gaps between real values R_P0
[0047] Sets of useful quantization parameters (P_0, P_1) may be predetermined based on operational needs.
[0048] In the example shown in
[0049] With a first binary representation defined by quantization parameters P = (0, 2, 2, 2), a second binary representation may be defined by quantization parameters (0, 2, 2, b_1) with a bias b_1 = 2 + (2^2 − 2 − 2) + (2 + 1) = 5. This leads to the dual representation shown in
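The arithmetic behind this bias choice can be checked numerically. The sketch below (illustrative; it assumes the unsigned, integer-mantissa dequantization shown earlier) enumerates all values of P_0 = (0, 2, 2, 2) and P_1 = (0, 2, 2, 5), confirming that the largest gap between consecutive P_0 values is 2.0 while the largest P_1 value is 1.75, so the second representation's range sits just inside the first representation's widest gap.

```python
def dq(q, e, m, b):
    # unsigned dequantization under (e, m, b)
    expo = (q >> m) & ((1 << e) - 1)
    mant = q & ((1 << m) - 1)
    if expo == 0:
        expo = 1
    else:
        mant |= 1 << m
    return mant * 2.0 ** (expo - b)

vals0 = sorted(dq(q, 2, 2, 2) for q in range(16))  # all P_0 values
vals1 = sorted(dq(q, 2, 2, 5) for q in range(16))  # all P_1 values
max_gap0 = max(hi - lo for lo, hi in zip(vals0, vals0[1:]))
# max_gap0 is 2.0 (between 12 and 14); max(vals1) is 1.75
```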
[0051] A signed dual representation may be a symmetrical version of the unsigned case (
[0052] More efficient use of the second binary representation q_1 may be realized by using a round-to-nearest operation for the first quantization quantize_0. Thus, instead of the remainder r_1 being limited to a positive gap, i.e., 0 ≤ r_1 < G_0, when the source number r_0 is positive and being limited to a corresponding negative gap when the source number r_0 is negative, the remainder r_1 is limited to positive and negative values spanning an equivalent gap centered on zero, i.e., |r_1| ≤ G_0/2. The sign of the remainder r_1 (and its quantized value q_1) is thus independent of the sign of the source number being encoded r_0 (and its quantized value q_0).
[0053] In addition, when allowing for positive and negative remainders, the value of the remainder r_1 is capped at half the size it would be for only positive (or negative) values. Accordingly, the range of P_1 may be reduced by a factor of two. This may be effected by a suitable selection of the bias for the second quantization, i.e., b_1 = b_0 + (2^(e_0)
[0055] The computing device 700 includes an array of processing elements or PEs 702. Processing elements 702 may be logically and, optionally, physically arranged in a two-dimensional array. Such an array may be considered to have rows and columns.
[0056] Each processing element 702 includes registers, an arithmetic logic unit (ALU), and other circuitry configured to perform computational operations.
[0057] Each processing element 702 includes or is connected to working memory dedicated to that processing element 702. A processing element 702 may be connected with one or more neighboring processing elements 702 to share information. Processing element interconnections may be provided in the row direction, the column direction, or both.
[0058] The computing device 700 further includes a controller 706 connected to a subset of processing elements 702 (e.g., a row or column of PEs). The controller 706 controls the connected processing elements 702 to perform the same operation on data contained in each processing element 702. The controller 706 may further control loading/retrieving of data to/from the processing elements 702, control the communication among processing elements 702, and/or control other functions for the processing elements 702. Any suitable number of controllers 706 may be provided to control the processing elements 702. Controllers 706 may be connected to each other for mutual communications.
[0059] The computing device 700 may further include a bus for communications among the controllers 706 and/or processing elements 702, direct memory access hardware to share information between memory of the processing elements, and other components (not shown).
[0060] The computing device 700 may be connected to a host system 708 that provides a program 710 to the computing device 700 and that expects output of the program during and/or after execution by the computing device 700. Such connection may be between the host system 708 and the controller(s) 706. The host system 708 may also provide a user interface and other components to support operations of the computing device 700. The host system 708 may be a conventional computing device, such as a desktop/notebook computer, server, smartphone, or vehicle-based computer.
[0061] The controller 706 is a processor (e.g., microcontroller, etc., see above) that may be configured with instructions to control the processing elements 702 to operate with encoded representations of numbers, as discussed above.
[0062] The controller 706 loads one or more processing elements 702 with an encoded representation 110 of a source number 108. The encoded representation 110 is defined as discussed above. The encoded representation 110 may be loaded into a single processing element 702 (see
[0063] Prior to loading an encoded representation 110 into a processing element 702, the controller 706 may compute the encoded representation 110 by performing a first quantization on the source number 108 to obtain a first part 112 and by performing a second quantization on a remainder of the first quantization to obtain a second part 116, as discussed above. Alternatively, the host system 708 may be configured to compute the encoded representation 110 from the source number 108.
[0064] The controller 706 controls the processing element 702 to perform a computational operation using the encoded representation 110. Example operations include addition, multiplication, and a multiply-accumulate operation. For instance, the controller 706 may command the processing element to perform a first multiplication with the first encoded part 112 and a value 712 and perform a second multiplication with the second encoded part 116 and the value 712. The results of the first and second multiplications may be combined (e.g., added) to obtain a total result 714.
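The two-part multiply-accumulate described above relies on distributivity: multiplying each part separately and summing gives the same result as multiplying the fully decoded number. A minimal sketch (illustrative; `d0` and `d1` stand for the dequantized first and second parts, `v` for the value 712):

```python
def mac_dual(d0: float, d1: float, v: float, acc: float = 0.0) -> float:
    # one multiply-accumulate per encoded part; by distributivity the
    # total equals (d0 + d1) * v added into the accumulator
    acc += d0 * v   # first multiply-accumulate
    acc += d1 * v   # second multiply-accumulate
    return acc
```

For example, with decoded parts 3.0 and 0.3125 and a value of 2.0, the two partial products sum to 6.625, matching (3.0 + 0.3125) * 2.0.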
[0065] The result 714 may be an intermediate result of a multiply-accumulate operation. The multiplication process mentioned above may be repeated with first and second parts 112, 116 of the intermediate result 714 being multiplied by another value 712 to obtain another result 714, which may be yet another intermediate result 714, in which case another value 712 is used for another multiplication, or a final result 714.
[0066] The controller 706 may output a result 714 of the computational operation. Output may be fed back into the computing device 700 and/or provided to the host 708 as, for example, output of the program 710.
[0067] Prior to outputting the result 714, the controller 706 may decode the result 714 (see
[0068] The controller 706 may perform or facilitate the above operations simultaneously or near simultaneously with a multitude of the processing elements 702.
[0069] In summary, depending on implementation needs, the domain of multipart encoded representations 110 may be a) all of the host 708, the controllers 706, and the processing elements 702, where the host 708 converts between the encoded representation 110 and other data types; b) the controllers 706 and the processing elements 702, where the controllers 706 convert between the encoded representation 110 and other data types; or c) only the processing elements 702, which convert between the encoded representation 110 and other data types.
[0072] Regarding the dynamic range of the techniques discussed herein, if D = (P_0, P_1) = ((s_0, e_0, m_0, b_0), (s_1, e_1, m_1, b_1)) defines a dual data type as discussed herein and it is assumed that P_1 has at least as much dynamic range as P_0, then the maximum value of the dual data type is simply the sum of the maximum values of its constituents, i.e., HI_D = HI_0 + HI_1.
[0073] The minimum positive value of a dual data type is the minimum of the minimum positive values of its constituents. By assumption, this is that of the second constituent, i.e., LO_D = LO_1 = 2^(1 − b_1).
[0074] Then, the dynamic range is HI_D/LO_D.
[0075] As for precision, consider a random binary fixed-point number r_0 in normalized form, i.e., with a leading one. Assuming it is neither clipped nor subnormal, quantizing it to q_0 with P_0 yields m_0 + 1 bits of precision: the leading 1, followed by m_0 mantissa bits. Since q_0 is obtained via rounding, the first m_0 + 2 bits of r_1 = r_0 − dequantize(q_0) are guaranteed to be zero. Assuming a uniform distribution, the leading one in r_1 will be at position m_0 + 2 + k with probability 2^(−k), for k ≥ 1. Summing the infinite series gives an expected location for the leading one of m_0 + 4. Quantizing r_1 to q_1 with P_1 will take this leading one, followed by an additional m_1 mantissa bits. Thus, the expected precision of D is m_0 + m_1 + 4.
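The expected-value step in this argument, that the leading one of r_1 sits on average two positions past the first possible position, follows from summing the series k·2^(−k) over k ≥ 1, which converges to exactly 2. This can be checked numerically:

```python
# E[k] for P(k) = 2**-k, k >= 1: the series sums to 2, giving an
# expected leading-one position of (m0 + 2) + 2 = m0 + 4
expected_k = sum(k * 2.0 ** -k for k in range(1, 64))
assert abs(expected_k - 2.0) < 1e-12
```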
[0076] In summary, it should be apparent from the above that techniques provided herein related to multipart encoded representation of numbers improve the tradeoff between simplicity and precision. At-memory or SIMD operations with high-precision operands may be simplified and optimized. The loss of precision or throughput caused by limited-capacity processing elements may be reduced or eliminated. The need to support data types with high bitlengths is reduced or eliminated. The disclosed combinable data types allow the tailoring of precision to specific implementations.
[0077] It should be recognized that features and aspects of the various examples provided above can be combined into further examples that also fall within the scope of the present disclosure. In addition, the figures are not to scale and may have size and shape exaggerated for illustrative purposes.