Squaring circuit
09684489 ยท 2017-06-20
Assignee
Inventors
Cpc classification
G06F2207/5523
PHYSICS
International classification
Abstract
Methods, apparatuses, and computer program products for squaring an operand include identifying a fixed-point value with a fixed word size and a substring size for substrings of the fixed-point value, wherein the fixed-point value comprises a binary bit string. A square of the fixed-point value can be determined using the fixed point value, the substring size, and least significant bits of the fixed-point value equal to the substring size.
Claims
1. An apparatus comprising: a squaring circuit configured to: identify a fixed-point value with a fixed word size and a substring size for substrings of the fixed-point value, wherein the fixed-point value comprises a string of digits and an initial operand; for an initial iteration, determine a square of least significant digits of the initial operand; for each subsequent iteration: determine an operand for that iteration by decatenating least significant digits from an operand in a previous iteration, wherein a length of the least significant digits from the operand of the previous iteration is equal to the substring size; and determine a square using the least significant digits of the operand for that iteration and the substring size; and concatenate the squares from each iteration to estimate a square of the fixed-point value.
2. The apparatus of claim 1, wherein each digit in the string of digits is in base 2.
Description
DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
DETAILED DESCRIPTION
(7) The present disclosure describes an iterative squaring technique that produces a 2 nm-bit length result, .sup.2, based on an input operand (often referred to as a squarand) of nm-bits in length. The circuit produces 2m bits of the output .sup.2 during each iterative step. By considering an m-bit grouping within the squarand as representing a single radix-2.sup.m digit, the circuit can be considered a digit-serial implementation that produces two m-bit digits per iteration.
(8) This digit-serial architecture may allow for a tradeoff between bit-serial and parallel architectures by allowing for the digit to be represented by m bits. Because 2m bits of the result are computed in each iterative step, varying m can yield more or less parallelism while inversely affecting required circuit area. Thus, a minimal or otherwise reduced area circuit can be realized when m is small (bit-serial for the case m=1) and a large parallel circuit results at the other extreme when m is set to the wordsize of the squarand. Designers may be able to choose an appropriate value of m such that performance requirements are met while minimizing or otherwise reducing the amount of circuitry required.
(9) Arithmetically, the technique assumes the squarand is represented as a higher-radix digit string where each digit is represented by an m-bit substring. Furthermore, the technique may yield two digits of output squared value during each iterative step; hence, a total of 2m bits of the squared result are computed at each iterative step.
(10)
(11) The device 102 includes a squaring module 106. The squaring module 106 (described in more detail in
(12) Network 104 facilitates wireless or wireline communication between device 102 and other devices. Network 104 may be all or a portion of an enterprise or secured network. In another example, network 104 may be a VPN between device 102 and other devices across a wireline or wireless link. Such an example wireless link may be via 802.11a, 802.11b, 802.11g, 802.11n, 802.20, WiMax, and many others. The wireless link may also be via cellular technologies such as 3GPP GSM, UMTS, LTE, etc. While illustrated as a single or continuous network, network 104 may be logically divided into various sub-nets or virtual networks without departing from the scope of this disclosure, so long as at least portion of network 104 may facilitate communications between senders and recipients of requests and results. In other words, network 104 encompasses any internal and/or external network, networks, sub-network, or combination thereof operable to facilitate communications between various computing components in system 100. Network 104 may communicate, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, and other suitable information between network addresses. Network 104 may include one or more local area networks (LANs), radio access networks (RANs), metropolitan area networks (MANs), wide area networks (WANs), all or a portion of the global computer network known as the Internet, and/or any other communication system or systems at one or more locations.
(13) The following notation may be used in the description of the digit-serial fixed-point squaring algorithm:
(14) represents the radix or base of a number system. may be in the set of natural numbers, .
(15) The radix polynomial form of a value a is written as an n-term polynomial of the form:
=a.sub.n1.sub.n1+a.sub.n2.sub.n2+ . . . +a.sub.2.sub.2+a.sub.1.sub.1+a.sub.0.sub.0
(16) A value can also be represented in the radix- number system in the form of a positional string of n characters denoted by =[a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.1 a.sub.0]. For clarity, the character strings denoting the positional digit representations of a value may be enclosed by square brackets. The digits a.sub.i are the coefficients of the radix-polynomial form and their position within the string inherently denotes the exponent of the radix .
(17) Each character a.sub.i in a positional string representing a value is referred to as a digit regardless of the radix of the number system. Binary digits may alternatively be referred to as bits.
(18) Digits are restricted to the natural numbers when 10, and are members of the set:
{a.sub.i|0a.sub.i1}.
For the case where >10, alternative single characters are used to represent a digit such as the characters A through F for the case of =16.
(19) Where necessary for clarity, digit strings are subscripted by the radix of the particular number system being used, =[a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.1 a.sub.0].sub..
(20) LSD(,k) and MSD(,k) are operators that yield k least significant or most significant digits, respectively, in the digit string representing a value . LSD(,1) represents the least significant digit of , LSD(,1)=a.sub.0. Likewise the most significant digit is given as MSD(,1)=a.sub.n1.
(21) {A,B,C} denotes concatenation of the content of registers A, B, and C which can be of any size and whose individual sizes may differ.
(22) SHL(A,k,B) denotes the operation of shifting the content of register A to the left by k bits and setting the least significant k bits to the content of register B. A can be of any size greater than or equal to the size of B and B must be of size k.
(23) SHR(A,k,B) denotes the operation of shifting the content of register A to the right by k bits and setting the most significant k bits of A to the content of register B. A can be of any size greater than or equal to the size of B and B must be of size k.
(24) AB denotes the operation of setting the content of register A with that of register B. A and B can be the same size in some implementations.
(25) The radix- value A is defined as A=a.sub.0. Expressed as a positional n-digit string:
A=[a.sub.n1a.sub.n2 . . . a.sub.2a.sub.10].sub..
Thus, A can be formed by replacing LSD(,1)=a.sub.0 with the zero digit [0].sub. or as:
A={SHR([0].sub.,1,[0].sub.),[0].sub.}.
(26) The present disclosure describes a circuit and algorithm such that the choice of radix allows for a trade-off in logic circuit area versus throughput performance in the computation of .sup.2 when is represented as a binary bit string. Higher values of allow more bits to be produced per iterative step in the resulting representation of .sup.2. A tradeoff occurs in that the amount of computation or logic required at each iterative step increases for higher radix values.
(27) In the basis of the algorithm as stated here, it is assumed that the squarand is of the form of a binary bit string. Intermediate computations can be efficiently implemented when the radix is in the form =2.sup.m where m is a positive integer m2. Efficiency results since =2.sup.m allows each higher radix digit in the string representing to be equivalent to an m-bit substring within . , in terms of a higher-radix digit string, is simply the concatenation of the disjoint m-bit substrings of in binary form where LSD(,1) is the least significant m bits, the subsequent next significant higher-radix digit is represented by the next group of m bits to the left of LSD(,1), and so on.
(28) For convenience in specifying the basis of the algorithm, Equation (1) can be written with the restriction that =2.sup.m and some of the individual terms on the right-hand side of the equation can be denoted as T.sub.1, T.sub.2, and T.sub.3. .sup.2 can be written as:
(29)
The terms T.sub.1, T.sub.2, and T.sub.3 are explicitly defined as follows:
(30)
(31) The idea behind the algorithm may be to compute terms T.sub.1, T.sub.2 and T.sub.3 during each iterative step and accumulate them with the previous result. Subsequent iterations use A/ from the (A/).sup.2 term in Equation (1) as a squarand. The subsequent operand A/ for each iterative step is a digit string containing one less digit than the squarand in the previous step indicating that the iterative algorithm requires O(n/m) iterations to complete. The 2.sup.2m shifting factor of the first term in Equation (1) illustrates the fact that two digits (2m length bitstrings) are produced at each step and they represent digits in .sup.2 that are produced in the order of the lesser significant digits first.
(32) Several observations may be used to more efficiently implement the computation of the three terms T.sub.1, T.sub.2, and T.sub.3 in the squaring algorithm. First, the term A/ may be efficiently obtained by shifting the digit string representing one position to the right and discarding a.sub.0, A/=[a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.1].sub.. Second, values that are multiplied by a factor of =2.sup.km may be easily obtained by shifting the value to left by km bit positions and inserting a radix- zero digit place holder [0].sub. for the vacated least significant digits. Third, the term /2 is always of the form of a single radix- digit. Expressed as an m-bit binary string /2=[10 . . . 0].sub.2. Finally, the term (/2).sup.2 is always of the form of two radix- digits with the most significant digit of value /4 and the least significant digit of value zero. Hence, expressed as a 2m-bit binary string, (/2).sup.2=[010 . . . 0].sub.2.
(33) Term T.sub.1 can be computed in a single operation. Making use of the first and second observations, the value (A/)2.sup.2m is obtained by forming the digit string [a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.100].sub.. Furthermore, based on the fourth observation, T.sub.3=(/2).sup.2 can always be expressed as two radix-2.sup.m digits (2m bits) denoted as [q.sub.1q.sub.0].sub.. Thus, T.sub.1 is obtained by forming the string [a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.1 q.sub.1 q.sub.0].sub.. From the fourth observation, q.sub.1=/4 and q.sub.0=0 so that (/2).sup.2=[q.sub.1 q.sub.0].sub.=[(/4)0].sub.. Thus, the digit string representation for T1 is [a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.1(/4)0].sub..
(34) Term T2 is computed by first forming a digit string representing 2(A+/2) and then multiplying this string with the single radix- digit b. Relying on the first, second, and third observations, A=a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.1 0].sub. and /2 may be represented as a single unsigned radix-2.sup.m digit (m-bit string). Therefore, (A+/2)=[a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.1 /2]. To account for the multiplicative factor of 2, the (A+/2)=[a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.1(/2)].sub. digit string is then shifted by one bit position to the left resulting in 2(A+/2). The multiplicative factor 2 would in general be implemented through the use of an addition operation, 2(A+/2)=(A+/2)+(A+/2), when a higher-valued radix is used that is not an integral power of two since this can be considered a fractional digit shift, if 2.sup.m.
(35) The final step in the formation of term T.sub.2 involves the multiplication of 2(A+/2)=[a.sub.n1 a.sub.n2 . . . a.sub.2 a.sub.1(/2)].sub. by the signed single radix-2.sup.m digit of b=a.sub.02. Because b is a single digit value, this multiplication may be accomplished with a minimal or reduced amount of computation or circuitry as compared to a general purpose multiply operation or circuit. Clearly, as the value m is increased resulting in a higher valued radix, 2.sup.m, both computational complexity and overall algorithm throughput may increase. The actual implementation of the multiplication by b may be dependent upon the value m and may be carefully considered for a given realization of the algorithm. Relatively small values of m generally allow for a simple logic circuit or lookup table to be used.
(36) Term T.sub.3=b.sup.2 relies on the computation of the square of the residual value b. The implementation of this computation may also be dependent upon the size of m, which dictates the number of bits required to represent a radix-2.sup.m digit. For smaller values of m, the direct calculation of b.sup.2 can be very efficiently implemented as a small combinational logic circuit or through a lookup table. As m increases, the computation of b.sup.2 becomes more complex and other methods may be employed.
(37) For large values of m, the computation of T.sub.3b.sup.2 can be accomplished in parallel with the computation of the other two terms T.sub.1 and T.sub.2 since accumulation of T.sub.1+T.sub.2+T.sub.3 with overall result can occur at the end of each iterative step.
(38) After terms T.sub.1, T.sub.2, and T.sub.3 are formulated, they are summed together and accumulated with the previous result. The accumulation takes into account the process of multiplying subsequent iterative operands by 2.sup.2m and the fact that two independent radix- digits (or, 2m bits) of the final result are produced at each iterative step. This can be implemented in a variety of ways, including using registers. The size of the register may be 2 nm bits where n is the number of radix- digits representing a and m denotes the radix. The final operation of each iterative step of the algorithm is to shift the result register 2m bits to the right and insert the 2m least significant bits of T.sub.1+T.sub.2+T.sub.3 into the most significant positions of the shifted result register. Insertion of the two radix-2.sup.m digits in the most significant portion of the result register instead of performing a multi-bit left shift before adding them to the previously accumulated result allows the algorithm to be implemented without the need for an inclusion of a multi-bit left shift operation or the use of a barrel shifting circuit in a hardware realization.
(39) The algorithm uses an iteration index i to determine if all digits of the squarand have been produced. For an n-digit radix- squarand, the squared result consists of 2n digits. Because two digits are produced per iterative step, the index i ranges from zero to (n/2)1. Initially, when i=0, is the original squarand. During intermediate computations, when 0<i<n/2, the algorithm iterates and sets the intermediate squarand =A/. In the final iterative step, the squarand argument becomes =0; however, this step is performed to account for circumstances when the residual b is not zero-valued.
(40) Any given implementation of the algorithm should include careful consideration of the manner in which the signed digit b is represented. When explicitly represented using a radix-complement or a signed-magnitude form, m+1 bits are required to account for the sign. Furthermore, depending upon the definition of the residual, b can take on integer values in either of the ranges [(/2),(/21)] (as is the case in this formulation) or [(/2)+1,(/2)]. However, because there is a one-to-one relationship between the a.sub.0 and b values (since b=a.sub.0/2), the m-bit string representing a.sub.0 can be used as an encoding for the corresponding b value.
(41) The algorithm formulated in the previous section makes use of several registers. For succinctness, the registers used within the algorithm statement are defined in Table 1, shown below:
(42) TABLE-US-00001 TABLE 1 Registers Used in Squaring Algorithm name size (bits) content AB (n 1)m A/ R 2nm .sup.2 i log.sub.2(n/m) iteration matrix B m residual b encoded as LSD(, 1) ACC 2 nm T.sub.1 + T.sub.2 + T.sub.3 T1 2 nm T.sub.1 T2 2 nm T.sub.2 T3 2 nm T.sub.3 B2 m /2 B4 2 m (/2).sup.2 = (.sup.2/4) = [(/4)0].sub.
A statement of the algorithm is given below. Intermediate locations within the algorithm are denoted by labels in the form STEP k. The labels are included for convenience in referring to certain portions of the algorithm and they also indicate clock boundaries in that the results of STEP k1 are registered before computation occurs in STEP k. As an example, the T2{AB,B2} operation of STEP 3 must complete before the T2SHL(T2,1,[0]2) operation of STEP 4 can proceed. Breaking up the computation of term T2 into multiple intermediate registered operations is an example of pipelining the datapath and allows for the overall circuit clock speed to be increased. The steps are described below:
(43) TABLE-US-00002 INPUT: : nm-bit fixed-point squarand m: log.sub.2()-bit value, indicates working radix 2.sup.m OUTPUT: .sup.2: 2nm-bit value in register R STEP 1: i0 /* iteration index */ R0 /* initialize result register */ B2[10...0].sub.2 /* m-bits with MSB=1 */ B4[010...0].sub.2 /* 2m-bits with MSBs=01 */ AB /* squarand value */ STEP 2: BLSD(AB,m) /* encode b as LSD(AB,1) */ ABSHR(AB,m,[0..0].sub.2) /* MS squarand digits */ STEP 3: T1{AB,B4,[0..0].sub.2} /* form T1, m LSbs=0 */ T2{AB,B2} /* form A+/2 */ T3bb /* compute single digit square, uses a.sub.0 in B */ STEP 4: T2SHL(T2,1,[0].sub.2) /* form 2(A+/2) */ ACCT1+T3 /* form T.sub.1+T.sub.3 */ STEP 5: T2T2b /* form 2(A+/2)b, uses a.sub.0 in B */ STEP 6: ACCACC+T2 /* form T.sub.1+T.sub.2+T.sub.3 */ STEP 7: RSHR(R,2m,LSD(ACC,2)) /* update result */ ii+1 /* increment iteration counter */ STEP 8: if (i=n/m) /* check iteration count */ HALT /* computation complete */ else GO TO STEP 2
The example algorithm shown above can undergo n/m iterations producing 2m bits of .sup.2 during each iterative step. Therefore, the algorithm has temporal complexity equivalent to O(n/m). In terms of required computational resources, the algorithm requires circuitry to perform shifting, bit-string concatenation, 2 nm-bit operand addition, m2 nm-bit multiplication, and m-bit operand squaring. While 2 nm-bit operand addition operations are required in STEPs 4 and 6, it is noted that a single 2 nm-bit addition circuit can be used since these sums may be formed sequentially allowing for reuse of the single 2 nm-bit adder. The multiplication and single-digit squaring operations can be implemented in a variety of forms although it is noted that due to the relatively small size of the operands (m bits) very compact and fast circuits such as lookup tables are a practical choice.
(44)
(45) TABLE-US-00003 TABLE 2 Encoded Values of b for Radix-4 Number System a.sub.i b Encoded b in Register B 0 2 [00].sub.2 1 1 [01].sub.2 2 0 [10].sub.2 3 1 [11].sub.2
The computation of T2T2b in STEP 5 of the algorithm is accomplished by using a 4:1 multiplexer as a simple lookup structure with data paths of size 2 nm and an m-bit control signal driven by the content of register B. The idea behind this circuit is similar to that of the T3bb computation in STEP 3 with the important difference that the possible T2b values are computed during each iterative step rather than being precomputed and stored before circuit operation. Fortunately, these values are easily and efficiently computed since, for the quaternary implementation, they consist of the value 2(A+/2) multiplied by only one of b{2, 1, 0, 1}. Thus, a negated version of 2(A+/2)=[2(A+/2)] and a single-bit left-shifted shifted version of [2(A+/2)] are used as well as 2(A+/2) and [0 . . . 0] to drive the data inputs of the multiplexer.
(46) The output of the multiplexer 204 is received by combinational logic 206. Combinational logic 206 includes several outputs: one output is coupled to the input of the multiplexer 204. The other outputs of combinational logic 206 are coupled to an adder array 208. Each of the multiplexer 204, the combinational logic 206, and the adder array 208 also include as inputs control signals from a clocked synchronous controller (not shown).
(47) The combinational logic 206 may be implemented based on simplifications in the formation of the intermediate terms T.sub.1, T.sub.2, and T.sub.3, and their various sums. These simplifications exploit the choice of using =4 as an implicit operand radix and allow for the computation of the intermediate terms T.sub.1, T.sub.2, and T.sub.3 to be implemented with a reduced and simplified set of register transfer level (RTL) operations.
(48) A single quaternary digit [a.sub.k].sub.4 can, in general, be written as a two-bit binary string [b.sub.2k+1b.sub.2k].sub.2 where {b.sub.i} and
={0,1}. Using this definition, various intermediate terms and their sums can be evaluated for different cases of the least significant digit of the squarand, a.sub.0{0, 1, 2, 3}. Term T.sub.1 is independent of the value of a.sub.0 and is always a bit string of length 2n+2 expressed as:
T.sub.1[a.sub.n1a.sub.n2. . . a.sub.2a.sub.110].sub.4=[b.sub.2n1b.sub.2n2b.sub.2n3b.sub.2n4 . . . b.sub.5b.sub.4b.sub.3b.sub.20100].sub.2
Case 1
(49) a.sub.0=[0].sub.4 resulting in the residual b=[2].sub.4, thus T.sub.3=b.sup.2=[10].sub.4=[0100].sub.2. Term T.sub.2 can be expressed as:
(50)
Combining the terms:
(51)
Case 2
(52) a.sub.0=[1].sub.4 resulting in the residual b=[1].sub.4, thus T.sub.3=b.sup.2=[01].sub.4=[0001].sub.2. Term T.sub.2 can be expressed as:
(53)
Combining the terms:
(54)
Case 3
(55) a.sub.0=[2].sub.4 resulting in the residual b=[0].sub.4, thus T.sub.3=b.sup.2=[00].sub.4=[0000].sub.2. Term T.sub.2 can be expressed as:
(56)
Combining the terms:
(57)
Case 4
(58) a.sub.0=[3].sub.4 resulting in the residual b=[1].sub.4, thus T.sub.3=b.sup.2=[01].sub.4=[0001].sub.2. Term T.sub.2 can be expressed as:
(59)
For this case, the sum T.sub.2+T.sub.3 can be formed directly and it is subsequently combined with term T.sub.1 using the addition circuit. T.sub.2+T.sub.3 is formed as:
(60)
Table 3 below contains a summary of the results of the intermediate terms and their various sums in terms of values of the least significant digit of the operand at each iterative step.
(61) TABLE-US-00004 TABLE 3 Radix-4 Optimizations LS Intermediate D(.sub.4, 1) Term Value 0 T.sub.1 + T.sub.2 + T.sub.3 [0...0].sub.2 1 T.sub.1 + T.sub.2 + T.sub.3 [0b.sub.2n1 b.sub.2n2 . . . b.sub.3 b.sub.2 001].sub.2 2 T.sub.1 + T.sub.2 + T.sub.3 [b.sub.2n1 b.sub.2n2 . . . b.sub.3 b.sub.2 0100].sub.2 3 T.sub.1 [b.sub.2n1 b.sub.2n2 . . . b.sub.3 b.sub.2 0100].sub.2 3 T.sub.2 + T.sub.3 [0b.sub.2n1 b.sub.2n2 . . . b.sub.3 b.sub.2 101].sub.2
(62) The combinational logic 206 makes use of the results in Table 3 and outputs the two 2n+2 bit values that are summed in the adder array 208 resulting in T.sub.1+T.sub.2+T.sub.3 (i.e., the combinational logic includes two outputs: one for each input of the adder array). For the cases a.sub.0{0, 1, 2}, T.sub.1+T.sub.2+T.sub.3 is formed directly in the combinational logic 206 and is input to the adder array 208 on the leftmost input bus with the right-most input set to the 2n+2 bit string [00 . . . 00].sub.2. The adder array 208 is used for the case of a.sub.0=3, where the left-most input is the bit string [b.sub.2n1 b.sub.2n2 . . . b.sub.3 b.sub.20100].sub.2 and the right-most input is [0b.sub.2n1 b.sub.2n2 . . . b.sub.3 b.sub.2101].sub.2.
(63) Accumulator 210 consists of an internal accumulator register, an internal adder circuit, and a feedback loop that allows for the internal adder output to be stored in the internal accumulator register. Accumulator 210 can receive the output of the adder array 208 where it is added to the previously stored value in the accumulator register and then stored back into the accumulator register. A right shift register 212 can receive the output of the accumulator. The size of the register 212 may be 2 nm bits where n is the number of radix- digits representing a and m denotes the radix. The final operation of each iterative step of the algorithm is to shift the result register 2m bits to the right and insert the 2m least significant bits of T.sub.1+T.sub.2+T.sub.3 into the most significant positions of the shifted result register. Insertion of the two radix-2.sup.m digits in the most significant portion of the result register instead of performing a multi-bit left shift before adding them to the previously accumulated result allows the algorithm to be implemented without the need for an inclusion of a multi-bit left shift operation or the use of a barrel shifting circuit in a hardware realization. After the iterative steps are completed, the square .sup.2 214 can be output.
(64)
(65)
(66)
(67)
(68) A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made including portions or the entirety of the implementation in software form. Accordingly, other implementations are within the scope of the following claims.