SQUARE ROOT CALCULATIONS ON AN ASSOCIATIVE PROCESSING UNIT
20230221925 · 2023-07-13
Inventors
Cpc classification
International classification
Abstract
A method for calculating a square root B having N bits of a number X having 2N bits includes iterating on bits b.sub.i of square root B starting from the most significant bit until the least significant bit of square root B. For each iteration, the method includes locating a 1 at the squared location of bit b.sub.i in a CHECK variable, determining the value of bit b.sub.i from the result of a comparison of number X with a function of all previously found bits and a previous comparison outcome, shifting all previously found bits right 1 location in a CHECK variable, and adding the determined value of bit b.sub.i into its squared location in the CHECK variable.
Claims
1. A method for calculating a square root B having N bits of a number X having 2N bits, the method comprising: iterating on bits b.sub.i of said square root B starting from the most significant bit until the least significant bit of square root B and for each iteration, performing the following; locating a 1 at a squared location of bit b.sub.i in a CHECK variable; determining a value of bit b.sub.i from a result of a comparison of number X with a function of all previously found bits and a previous comparison outcome; shifting all previously found bits right 1 location in said CHECK variable; and adding the determined value of bit b.sub.i into its squared location in said CHECK variable.
2. The method according to claim 1 wherein said CHECK variable is of length 2N and, wherein said determining uses only the relevant portion of said CHECK variable.
3. The method according to claim 1 wherein, in said locating, said squared location is two bits to the right of the current locations of all previously found bits in said CHECK variable.
4. The method according to claim 1 wherein said adding is implemented as an OR operation.
5. The method according to claim 1 which is implemented on an associative memory device.
6. The method according to claim 1 which is implemented on a CPU.
7. A square root calculator for calculating a square root B having N bits of a number X having 2N bits, the calculator comprising: a central processing unit (CPU) and; a memory array having a plurality of memory cells organized into rows and columns, the memory array having one row for a CHECK variable and a second row for a PREV variable, said PREV variable being aligned with said CHECK variable; said CPU to iterate on bits b.sub.i of said square root B starting from the most significant bit until the least significant bit of square root B and, for each iteration, said CPU to: locate a 1 at a squared location of bit b.sub.i in said CHECK variable; determine a value of bit b.sub.i from a result of a comparison of number X with a function of all previously found bits and a previous comparison outcome; shift all previously found bits right 1 location in said CHECK variable; and add the determined value of bit b.sub.i into its squared location in said CHECK variable.
8. The calculator according to claim 7 wherein said CHECK variable is of length 2N and, wherein said CPU uses only the relevant portion of said CHECK variable.
9. The calculator according to claim 7 wherein said squared location is two bits to the right of the current locations of all previously found bits in said CHECK variable.
10. A square root calculator for calculating a square root B having N bits of a number X having 2N bits, the calculator comprising: an associative processing unit (APU), said APU comprising: a memory array having a plurality of memory cells organized into rows and columns, the memory array having one row for a CHECK variable and a second row for a PREV variable, said PREV variable being aligned with said CHECK variable; a multiple row decoder to activate multiple rows at a time; and a controller to activate said multiple row decoder to iterate on bits b.sub.i of said square root B starting from the most significant bit until the least significant bit of square root B and, for each iteration, said controller to instruct the following operations: writing a 1 at a squared location of bit b.sub.i in said CHECK variable; determining a value of bit b.sub.i from a result of a comparison of number X with a function of all previously found bits and a previous comparison outcome; shifting all previously found bits right 1 location in said CHECK variable; and OR'ing the determined value of bit b.sub.i in its squared location with said CHECK variable.
11. The calculator according to claim 10 wherein said CHECK variable is of length 2N and, wherein said OR′ing uses only the relevant portion of said CHECK variable.
12. The calculator according to claim 10 wherein said squared location is two bits to the right of the current locations of all previously found bits in said CHECK variable.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0027] The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
[0028]
[0029]
[0030]
[0031]
[0032]
[0033]
[0034]
[0035] It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
DETAILED DESCRIPTION OF THE PRESENT INVENTION
[0036] In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.
[0037] Applicant has realized that there is no need to consider the value of the entire guessed value, as the prior art does. Instead, Applicant has realized that very few powers of 2 are involved in the initial iterations and that this fact can be exploited to simplify the calculations to bit operations, such as shift and OR. As a result, the complexity of the square root computation for an N-bit result is reduced from O(N.sup.2) to O(N).
[0038] Applicant has also realized that the result of the current test calculation can be used in the next iteration when the resultant value of the current bit is 1 while the result of the previous current test calculation can be used when the resultant value of the current bit is 0.
[0039] Take for example, a 32-bit number X whose square root value B is a 16-bit number. Square root B can be represented in the following manner: [0040] b.sub.15b.sub.14b.sub.13b.sub.12b.sub.11b.sub.10b.sub.09b.sub.08b.sub.07b.sub.06b.sub.05b.sub.04b.sub.03b.sub.02b.sub.01b.sub.00
This can be written mathematically as:
B=Σ.sub.0.sup.15b.sub.i2.sup.i (2)
where the bit b.sub.i is 0 or 1.
[0041] Reference is now made to
[0042] A 32-bit number (also called a “vector”) is stored in 32 separate memory cells 22 of one of rows 20. While the memory cells 22 may not be adjacent, typically, a 32-bit number is stored in order, with the most significant bit (MSB) stored in the leftmost location of the set of memory cells 22 and the least significant bit (LSB) stored in the rightmost location. Since bit locations are numbered from 0 to 31, the MSB is stored in bit location 31 and the LSB is stored in bit location 0.
[0043]
[0044] For the 15.sup.th bit b.sub.15, variable PREV may store 32-bit number X whose square root is to be found and variable CHECK may initially be set to all 0's.
[0045] To find the value of the 15th bit b.sub.15, the check is simply to determine if the difference between the 32-bit number X and the square of 15.sup.th bit b.sub.15 is positive when b.sub.15 is set temporarily to 1. Mathematically this is written as:
X−(b.sub.152.sup.15).sup.2>=0 when b.sub.15 is 1 (3)
[0046] However, in accordance with the present invention, since b.sub.15 has been set to 1 and the square of 2.sup.15 is 2.sup.30, equation 3 can be rewritten as:
X−[(1*2.sup.30]>=0 (4)
[0047] In accordance with a preferred embodiment of the present invention, to subtract 1*2.sup.30 from X, CPU 10 may write a 1 in bit location 30 (which is the “squared location” of bit 15) of row 20b storing variable CHECK.
[0048] CPU 10 may then subtract the variable CHECK from the variable PREV to generate a TEST variable:
TEST=PREV−CHECK (5)
[0049] CPU 10 may then check if TEST is positive. If so, it may set b.sub.15 to 1 and, in accordance with the present invention, may set bit 30 of CHECK to 1 (leaving the remaining bits at their initialized values of 0) and may update variable PREV in row 20a to the value of variable TEST.
[0050] Otherwise, CPU 10 may set b.sub.15 to 0, may change bit 30 of variable CHECK back to 0, and may make no change to variable PREV since there was no change to bit 30.
[0051] CPU 10 may now operate to determine the value of the 14.sup.th bit, b.sub.14. To do so, it has to check if the difference between the 32-bit number X and the square of the sum of the 15.sup.th bit b.sub.15 (found previously) with the 14.sup.th bit b.sub.14 is positive when b.sub.14 is set temporarily to 1. Mathematically, this is written as:
X−(b.sub.152.sup.15+b.sub.142.sup.14).sup.2>=0 (6)
[0052] Using the identity (a+b)(a+b)=a.sup.2+2ab+b.sup.2 and setting 14.sup.th bit b.sub.14 temporarily to 1, we get:
X−(b.sub.152.sup.15).sup.2−[2(b.sub.152.sup.15)*(1*2.sup.14)+(1*2.sup.14).sup.2]>=0 (7)
[0053] Recall that the square of 2.sup.15 is 2.sup.30 and that variable PREV stores the variable TEST from the previous iteration which equaled X−b.sub.152.sup.30. Similarly, the square of 2.sup.14 is 2.sup.28. Also note that the 2ab term, 2(b.sub.152.sup.15)*(1*2.sup.14), is equivalent to (b.sub.152.sup.15)*(1*2.sup.15). Thus, equation 7 may be rewritten as:
PREV−[(b.sub.152.sup.15)*(1*2.sup.15)+(1*2.sup.28)]>=0 (8)
Which, when consolidated, becomes:
PREV−[(b.sub.152.sup.30+(1*2.sup.28)]>=0 (9)
[0054] In accordance with a preferred embodiment of the present invention, to implement equation 6, CPU 10 may write a 1 in bit location 28 (which is the squared location of bit 14) of row 20b storing variable CHECK. Note that, as shown in
[0055] The next steps may be similar to the previous iteration. CPU 10 may implement equation 5 (i.e., subtract variable CHECK from variable PREV to generate variable TEST).
[0056] CPU 10 may then check if variable TEST is positive. If so, it may set b.sub.14 to 1 and may update variable PREV in row 20a to the value of variable TEST. Otherwise, CPU 10 may set bit b.sub.14 to 0, and may make no change to variable PREV since there was no change to bit 28.
[0057] In accordance with a preferred embodiment of the present invention and in preparation for the next iteration described in more detail hereinbelow, in order to update variable CHECK, CPU 10 may first set location 28 back to 0, may shift variable CHECK right one bit, which may move bit b.sub.15 to location 29 (as shown in
[0058] CPU 10 may now operate to determine the value of the 13.sup.th bit, b.sub.13. To do so, it has to check if the difference between the 32-bit number X and the square of the sum of the 15.sup.th bit b.sub.15, the 14.sup.th bit b.sub.14, and the 13.sup.th bit b.sub.13, is positive when b.sub.13 is set temporarily to 1. Mathematically, this is written as:
X−(b.sub.152.sup.15+b.sub.142.sup.14±b.sub.132.sup.13).sup.2>=0 (10)
[0059] Equation 10 is more complex; however, Applicant has realized that variables PREV and CHECK from the previous iteration store useful information.
[0060] Using the identity (a+b)(a+b)=a.sup.2+2ab+b.sup.2, defining ‘a’ as the previous result (i.e., b.sub.152.sup.15 b.sub.142.sup.14), and setting 13.sup.th bit b.sub.13 temporarily to 1, we get:
X−[(b.sub.152.sup.15+b.sub.142.sup.14).sup.2][2(b.sub.152.sup.15+b.sub.142.sup.14)*(1*2.sup.13)+(1*2.sup.13).sup.2]>=0 (11)
[0061] Recall that that variable PREV stores the variable TEST from the previous iteration which equaled X−(b.sub.152.sup.15+b.sub.142.sup.14).sup.2, and that the square of 2.sup.13 is 2.sup.26. Also note that the 2ab term, 2(b.sub.152.sup.15+b.sub.142.sup.14)*(1*2.sup.13), is equivalent to (b.sub.152.sup.15+b.sub.142.sup.14)*(1*2.sup.14). Thus, equation 11 may be rewritten as:
PREV−[(b.sub.152.sup.29+b.sub.142.sup.28)+(1*2.sup.26)]>=0 (12)
[0062] The second term in Equation 12 will become the updated version of variable CHECK. However, note that its bit locations are as follows: b.sub.15 in bit location 29 (to where it was shifted in the last iteration, in preparation for equation 12), b.sub.14 in bit location 28 (where it was placed at the end of the last iteration) and a 1 temporarily in bit location 26 (i.e. the squared location of b.sub.13). Applicant has realized that this is the value of variable CHECK after the previous iteration, OR′ d with a 1 in bit 13's squared location.
[0063] The next steps may be similar to the previous iteration. CPU 10 may implement equation 5 (i.e., subtract variable CHECK from variable PREV to generate variable TEST).
[0064] CPU 10 may then check if variable TEST is positive. If so, it may set bit b.sub.13 to 1 and may update variable PREV in row 20a to the value of variable TEST. Otherwise, CPU 10 may set bit b.sub.13 to 0, and may make no change to variable PREV since there was no change to bit 26.
[0065] As in the previous iteration, in order to update variable CHECK, CPU 10 may first set location 26 back to 0, may shift variable CHECK right one bit, which may move bit b.sub.15 to location 28 and bit b.sub.14 to location 27, after which CPU 10 may add the determined value for bit b.sub.13 to location 26.
[0066] It will be appreciated that, each iteration i to determine bit b has to calculate the following:
PREV−[( . . . b.sub.i+22.sup.2i+3+b.sub.i+12.sup.2i+2)+(1*2.sup.2i)]>=0 (13)
where variable PREV exists from the previous iteration and the second term in Equation 13 is built from the previous version of variable CHECK. As shown in
[0067] In other words, the updated version of variable CHECK is the final version of variable CHECK from the previous iteration OR′ d with a 1 in the squared location of bit b.sub.i.
[0068] The method for most iterations is shown in
[0069] It will be appreciated that, when the process finishes, variable CHECK has shifted totally to the right and the 16 bits b of square root B have been determined.
[0070] It will be appreciated that the, as mentioned hereinabove, the addition operation can be replaced with an OR operation since the operands PREV and CHECK are disjoint since there are 2 bits between b.sub.i in its squared location (2.sup.2i), and the previously solved bits (from 2.sup.2i+2 on). In other words, the new bits in each iteration do not overlap old bits from previous iterations and therefore, there will never be a carry value.
[0071] It will further be appreciated that the OR operation reduces the complexity of each subtraction operation from O(N) to O(1).
[0072] Reference is now made to
[0073] In this embodiment, step 38 is followed by step 39, where CPU 10 adds a 1 to the (2i+1)th location since variable TEST is positive. Then CPU 10 shifts (step 41) variable RESULT right by 1 bit location. In step 43, CPU 10 updates variable CHECK to the value of variable RESULT. In this manner, at the end of the iterations, variable RESULT will be shifted from its wide state holding the number of bits in variable X (32 bits in the example above) to its square root state (16 bits in this example) and, in the square root state it will store the results (i.e., square root B).
[0074] Reference is now made to
[0075] Applicant has noted that, for the first 8 iterations (i.e., C15-C8), there is no data in the lower part of variable CHECK (i.e., in bits 15-0). Therefore, the lower bits do not have to be included in the subtraction operation until C7.
[0076] Similarly, for the final 8 iterations, there is no data in the upper part of variable CHECK (i.e., in bits 24-31), due to the shifting. Therefore, the higher 8 bits do not have to be included in the subtraction operation after C7. Instead, at this point, CPU 10 may include the lower bits (i.e., bits 0-15) and the middle bits (16-23) in the subtraction operation.
[0077] However, it is typically very difficult for a CPU, such as CPU 10, to perform bit-wise operations and thus, CPU 10 cannot easily include only the upper or only the middle and lower bits in a subtraction operation. Moreover, it is especially difficult for a CPU to do so on a plurality of values at one time.
[0078] Applicant has realized that the proposed method and system is particularly efficient when performed on an associative processing unit (APU), such as the Gemini, commercially available from GSI Technology Inc. As described in US patents, U.S. Pat. No. 8,238,173, entitled “Using Storage Cells to Perform Computation”, U.S. Pat. No. 9,418,719 entitled “In-Memory Computational Device”, and U.S. Pat. No. 9,558,812, entitled “SRAM Multi-Cell Operations”, all assigned to the common assignee of the present invention and incorporated herein by reference, the APU operates separately on each bit and thus, can easily operate on only some of the bits, such as only on the upper bits or only on the middle and lower bits of the variables PREV and CHECK. Furthermore, the APU operates on 32K values in parallel, and can thus perform addition and subtraction on selected bits of multiple numbers at the same time.
[0079] Reference is now briefly made to
[0080] A number to be operated upon is stored in one column 62 and there are typically 32K columns 62 storing 32K numbers. The cells 58 in a column 62 are connected by a bit line processor 64, connected to multiple column decoder 54, capable of enabling computation on its column 62. Each column 62 is divided into multiple sections 68 and typically, each section operates on a single bit of a multi-bit number stored in the column.
[0081] Unique to APUs, multiple row decoder 52 may activate multiple rows 60 concurrently and multiple column decoder 54 may activate multiple columns 62 concurrently, in each section 68. Decoders 52 and 54 are controlled by controller 56 to implement desired methods and algorithms. Each column 62 of each section 68 may perform the needed computation for a single bit and activating multiple columns of the same section 68 results in concurrent computation of the same bit but of multiple numbers.
[0082] Since each section 68 is separately operatable, APU 48 operates separately on each bit and thus, can select which sections of variables PREV and CHECK to operate at any time.
[0083] It will be appreciated that the method and system described hereinabove can be implemented on numbers, or vectors, with more than the standard 32 bits. Since most working spaces are divided into 16-bit partitions, then a calculation on a 48-bit vector may have three partitions—a high partition, a middle partition and a lower partition. At the beginning, controller 56 implements operations only on the high partition. After 16 bits, controller 56 adds the middle partition until the method no longer works on any bits in the high partition. At this point, controller 56 adds the lower partition. Thus, for a number that contains more than 32 bits, there is a point where the subtraction no longer operates on the high partition.
[0084] It will be appreciated that the present invention provides an efficient method for computing the square root of a number. Moreover, when implemented on an associative memory device, it may concurrently and efficiently compute the square root of multiple numbers.
[0085] Applicant has also realized that building the square root from the left side (using the square value of a guess) and shifting the result right in each iteration (to get the square value of the current guess) reduces the complexity of the square root computation for N-bit number X from O(N.sup.2) to O(N). In addition, since the operands are disjoint, APU 48 implements the addition operation with an OR operation, which is a single cycle operation on APU 48, as opposed to the addition operation which takes 12 cycles for 32K elements. This reduces the complexity of each addition from O(N) to O(1).
[0086] It will also be appreciated that, in APU 48, the update operation (i.e., steps 42, 44 and 46 of
[0087] While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.