Architecture for small and efficient modular multiplication using carry-save adders

11210067 · 2021-12-28

Assignee

Inventors

Cpc classification

International classification

Abstract

A computer processing system having at least one accelerator operably configured to compute modular multiplication with a modulus of special form and having a systolic carry-save architecture configured to implement Montgomery multiplication and reduction and having multiple processing element types composed of Full Adders and AND gates.

Claims

1. A computer processing system compromising: at least one accelerator operably configured to: compute modular multiplication with a modulus of the form, m=f.Math.2.sup.e−1, wherein f is any integer, e is any integer greater than 2, and m is k bits long; and have a systolic carry-save architecture configured to implement Montgomery multiplication and reduction bit-by-bit, have a processing element for the e least significant bits comprising of one Full Adder and one AND gate, and a processing element for the next significant k−e bits comprising of two Full Adders and two AND gates.

2. The computer processing system according to claim 1, wherein the Montgomery multiplication further comprises: a logical bit, q, computed by one Full Adder, two XOR gates, and three AND gates, wherein the q is operably configured to be stored and implemented in the processing element for the most significant k−e bits.

3. The computer processing system according to claim 2, wherein: the q is operably configured to be stored in the systolic carry-save architecture.

4. The computer processing system according to claim 1, wherein: the modulus is m=2.sup.2163.sup.137−1, m=2.sup.2503.sup.259−1, m=2.sup.3053.sup.192−1, or m=2.sup.3723.sup.239−1.

5. A computer processing system compromising: at least one accelerator operably configured to: compute modular multiplication digit-by-digit with a modulus of the form, m=f.Math.2e−1, wherein f is any integer, e is any integer greater than 2, m is k bits long, and a digit, d, is any integer greater than 1; and have a systolic carry-save architecture configured to implement Montgomery multiplication and reduction digit-by-digit, have a processing element for the CEILING(e/d) least significant digits comprising of d Full Adders and d AND gates, and a processing element for the next significant CEILING((k−e)/d) digits comprising of 2.Math.d Full Adders and 2.Math.d AND gates.

6. The computer processing system according to claim 5, wherein the Montgomery multiplication further comprises: a logical digit, q, computed by d Full Adders, 2.Math.d XOR gates, and 3.Math.d AND gates.

7. The computer processing system according to claim 5, wherein: the modulus is m=2.sup.2163.sup.137−1, m=2.sup.2503.sup.259−1, m=2.sup.3053.sup.192−1, or m=2.sup.3723.sup.239−1.

8. A computer processing system compromising: at least one accelerator operably configured to: compute modular multiplication with a modulus of the form, m=f.Math.2.sup.e+1, wherein f is any integer, e is any integer greater than 2, and m is k bits long; and have a systolic carry-save architecture configured to implement Montgomery multiplication and reduction bit-by-bit, have a processing element for the least significant bit comprising of two Full Adders and one AND gate, have a processing element for the next significant e−1 bits comprising of one Full Adder and one AND gate, and a processing element for the next significant k−e bits comprising of two Full Adders and two AND gates.

9. A computer processing system compromising: at least one accelerator operably configured to: compute modular multiplication digit-by-digit with a modulus of the form, m=f.Math.2.sup.e+1, wherein f is any integer, e is any integer greater than 2, m is k bits long, and a digit, d, is any integer greater than 1; and have a systolic carry-save architecture configured to implement Montgomery multiplication and reduction digit-by-digit, have a processing element for the least significant digit comprising of 2.Math.d Full Adders and d AND gates, have a processing element for the CEILING((e−1)/d) for the next significant digits comprising of d Full Adders and d AND gates, and a processing element for the next significant CEILING((k−e+1)/d) digits comprising of 2.Math.d Full Adders and 2.Math.d AND gates.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and explain various principles and advantages all in accordance with the present invention.

(2) In general, these are drawn in a digital hardware design fashion. Each letter or line represents a single bit, or value between ‘0’ and ‘1’ to represent a digital circuit off or on, respectively. These diagrams compute modular multiplication, or a*b mod m using the Montgomery multiplication method as is shown in Algorithm 1. a represents input 1, b represents input 2, m represents the modulus, and p represents intermediate results. Subscripts s and c refer to carry-save adder notations sum and carry, respectively. q refers to a special quotient value that is unique to Montgomery multiplication. An AND gate is represented by half an ellipsoid, which is set if both inputs are set, similar to a simple multiplication. An XOR gate result is set if exactly one input is set.

(3) FA stands for Full Adder, which in carry-save adder circuits, computes s a XOR b XOR c_in and c_out (a AND b) OR (cin AND (a XOR b)). Here, a, b, and c_in are inputs and s and c_out are outputs. All gates and Full Adder computations can be performed on a variety of software computer processors and hardware transistors.

(4) FIGS. 1a-c are a reference for digital hardware notation. Since there are many ways to trivially rearrange inputs and outputs for a functionally equivalent circuit, we show multiple equivalent representations. The first row shows wire set a that is k total wires arranged from most significant wire (a(k−1)) to least significant wire (a(0)).

(5) The second row shows a concatenation of bits or wires (depending on the application) that represent input a(f:s). Each wire is given a label with respect to the value a. There are f−s wires here, with the order a(f:s), f represents the most significant bit or wire and s represents the least significant bit or wire. In the third row, we use the notation 0(f:s) to represent a f−s set of wires that are all logical value ‘0’. In the fourth row, we represent a concatenation of wire a and wire b with a small concatenation box. In the fifth row, we represent three wires {a,b,c} as a single wire, but this is equivalent the three wires in any orientation. In the sixth row, we perform a subset operation, by taking the smaller or equivalent set of wires a(fs) from the larger or equivalent set of wires a(y,x). In the seventh row, we illustrate the AND operation as either ab or with two inputs a and b pushed through an AND gate. In the eighth row, we illustrate a Full Adder gate with two inputs and two outputs. Since one input is ‘0’, this is functionally equivalent to a Half Adder gate. In the ninth row, we illustrate a Full Adder gate with one input and two outputs, which essentially feeds the value through. In the tenth row, we show a series of AND gates. Each wire of wire set b is pushed through an AND gate with wire a(i). Lastly, in the eleventh row, we show a series of Full Adders with inputs a, b, and c, and outputs g and h. There are k such wires in each input and output and k−1 total Full Adders.

(6) FIG. 2 is a schematic diagram depicting the systolic architecture to perform Montgomery multiplier with digit d=1 over some modulus. This represents the prior approach for carry-save adders. In the systolic architecture on the left side, there are 2k−1 Full Adders and 2k AND gates. The right side shows how to compute the next Montgomery quotient value q.

(7) FIG. 3 is a schematic diagram that depicts the connections for a processing unit for the invention's Montgomery multiplier with modulus=f.Math.2.sup.e−1 and digit d=1. This systolic architecture Montgomery multiplier on the left uses e fewer AND gates and e fewer Full Adders than the prior art. Note that the order of Full Adder, AND gates, or inputs/outputs can be swapped around a variety of ways and still be functionally identical. Additionally, note that designs using specific moduli can further simplify the Full Adders based on bits of the modulus. The right side utilizes two fewer Full Adders and two fewer AND gates than the prior art and has a smaller critical path from input to output.

(8) FIG. 4 is a schematic diagram that depicts the connections for a processing unit for the invention's Montgomery multiplier with modulus=f.Math.2.sup.e+1 and digit d=1. This Montgomery multiplier uses e fewer AND gates and e fewer Full Adders than the prior art. Note that the order of Full Adder, AND gates, or inputs/outputs can be swapped around a variety of ways and still be functionally identical. Additionally, note that designs using specific moduli can further simplify the Full Adders based on bits of the modulus. The next q circuit is the same as FIG. 2.

(9) FIGS. 5a-b are a schematic diagram depicting the systolic architecture to perform Montgomery multiplier with digit d=2 over some modulus. This represents the prior approach for Montgomery multiplication using carry-save adders. Many different arrangements of the Full Adders can be achieved for a functionally equivalent system.

(10) FIGS. 6a-b is a schematic diagram that depicts the connections for a processing unit for the invention's Montgomery multiplier with modulus=f.Math.2.sup.e−1 and digit d=2. This Montgomery multiplier uses 2-e fewer AND gates and 2.Math.e fewer Full Adders than the prior art. Note that the order of Full Adder, AND gates, or inputs/outputs can be swapped around a variety of ways and still be functionally identical. Additionally, note that designs using specific moduli can further simplify the Full Adders based on bits of the modulus.

(11) FIG. 7 is a schematic diagram that depicts the connections for a processing unit for the invention's Montgomery multiplier with modulus=f.Math.2.sup.e+1 and digit d=2. This Montgomery multiplier uses 2e fewer AND gates and 2e fewer Full Adders than the prior art. Note that the order of Full Adder, AND gates, or inputs/outputs can be swapped around a variety of ways and still be functionally identical. Additionally, note that designs using specific moduli can further simplify the Full Adders based on bits of the modulus. The next q circuit is the same as FIGS. 5a-b.

(12) FIGS. 8a-b is a schematic diagram depicting an expanded Montgomery multiplier with digit d=1 as shown in Sutter et al. (supra). This represents the prior art.

(13) FIGS. 9a-b is a schematic diagram that depicts an expanded processing unit for the invention's Montgomery multiplier with modulus=f.Math.2.sup.e−1 and digit d=1. Note that this Montgomery multiplier uses e fewer AND gates and e fewer Full Adders than FIGS. 1a-c in the systolic part. The next_q block also uses 1 fewer AND gate and 1 fewer XOR gate than FIGS. 1a-c, also resulting in a shorter critical path that does not even depend on the previous value of q. The logical bit, q, computed by one Full Adder, two XOR gates, and three AND gates

(14) FIGS. 10a-b is a schematic diagram depicts an expanded processing unit for the invention's Montgomery multiplier with modulus=f.Math.2.sup.e+1 and digit d=1. Note that this Montgomery multiplier uses an additional signal cin, or carry-in, to save e AND gates and e Full Adders than FIGS. 1a-c in the systolic part. The next_q block then swaps 1 XOR gate for 1 Full Adder when compared to FIGS. 1a-c.

DETAILED DESCRIPTION

(15) While the specification concludes with claims defining the features of the invention that are regarded as novel, it is believed that the invention will be better understood from a consideration of the following description in conjunction with the drawing figures, in which like reference numerals are carried forward. It is to be understood that the disclosed embodiments are merely exemplary of the invention, which can be embodied in various forms.

(16) The present invention provides a novel and efficient hardware, system, implementation, and method for efficiently implementing modular multiplication with modulus m of the form f.Math.2.sup.e±1 where f is a constant integer and e is a constant integer greater than 2. The modulus m is k bits long. To illustrate the broadness of the invention, we show different ways to show functionally equivalent circuits in FIGS. 1a-c.

(17) The Montgomery multiplication using carry-save adders for digit d=1 (also called radix in some cases) as was designed in Sutter et al. (supra), is shown in Algorithm 2 as well as FIG. 2. To implement Algorithm 2 in hardware or software, typically a basic logic block is used to show the simplest logical computations necessary.

(18) In FIG. 2, FA stands for Full Adder and computes s a XOR b XOR c_in and c_out (a AND b) OR (cin AND (a XOR b)). Here, a, b, and c_in are inputs and s and c_out are outputs. An AND gate, denoted by the bottom half of an ellipsoid is a two-input gate that outputs ‘1’ only if both inputs are ‘1’, otherwise the output is ‘0’. This AND gate is the same as a one bit multiplication. An XOR gate is a two-input gate that outputs ‘1’ if exactly one input is ‘1’, otherwise the output is ‘0’.

(19) FIG. 2, FIG. 3, FIG. 4, FIGS. 5a-b, FIGS. 6a-b, FIG. 7, FIGS. 8a-b, FIGS. 9a-b, and FIGS. 10a-b are all figures that illustrate different ways to perform modular multiplication in a hardware or software system. This is defined in a systolic architecture, which is composed of a multitude of processing elements that are interconnected. There can be a multitude of processing element compositions based on various levels of optimization. FIG. 2, FIG. 3, FIG. 4, FIGS. 5a-b, FIGS. 6a-b, and FIG. 7 perform Montgomery multiplication using a compressed systolic architecture format. For FIG. 2, FIG. 3, and FIG. 4, the general processing element is composed of two Full Adders and two AND gates. This invention focuses on simplifying a number of these general processing elements to smaller processing elements, notably down to one Full Adder and one AND gate when functionally possible. FIGS. 5a-b, FIGS. 6a-b, and FIG. 7 use a larger processing element as multiple iterations of Algorithm 2 are unrolled to perform the Montgomery multiplication. Again, the invention focuses on simplifying the complexity of these processing elements when functionally possible.

(20) When implementing Algorithm 2 in hardware or software, two intermediate storage units of size k+1 are required with one additional storage unit of size 1 bit to store the precomputed q. The quotient q is precomputed and then the value is used to compute the partial products which is stored in the two intermediate registers. The process is repeated k times. Finally, the two intermediate storage units are added together which is the Montgomery multiplication output. FIGS. 8a-b shows an expanded hardware implementation of the algorithm for odd moduli. The circuit on the right precomputes q while the circuit on the left computes the partial products. The top full adder (aka. the one that computes a.sub.s,k+1 . . . 0 and a.sub.c,k+1 . . . 0) is used to compute pc+ps+x(i).Math.y and the bottom full adder (aka. the one that computes b.sub.s,k+1 . . . 0 and b.sub.c,k+1 . . . 0) is used to add q.Math.m to the result. Full adders with 2 inputs are half adders. This is one method to implement the circuit and is not the only method. For example, the first full adder can be used to compute pc+ps+q.Math.m and the second circuit adds x(i).Math.y.

(21) TABLE-US-00002 Algorithm 2: CSA Montgomery Multiplication (as seen in Sutter et al., supra) Input: x, y, m < 2.sup.K, R = 2.sup.k, m′ = −m.sup.−1 mod m Output: x .Math. y .Math. R.sup.−1 mod m Begin 1. pc = 0; 2. ps = 0; 3. q = x(i) .Math. y(0); 4. for i in 0 . . . k − 1 loop 5.  q.sub.n = ((pc(1:0) + ps(1:0) + x(i) .Math. y(1:0) + q .Math. m(1:0))/2 + x(i + 1) .Math. y(0)) mod 2; 6.  (pc, ps) = (pc + ps + x(i) .Math. y + q .Math. m)/2; 7.  q = q.sub.n; 8. end loop 9. return pc + ps; End

(22) As is shown in Algorithm 2, there are many AND gates to perform x(i).Math.y and q.Math.m. Specifically, since y and m are at most k bits, we must have 2.Math.k AND gates. One simple optimization is to remove a single AND gate for odd moduli as this bit is set, so the q digit can be piped through for this wire in q.Math.m. Note that if the modulus m is constant in an implementation the AND gates for q.Math.m and the input to the Full Adder can be simplified and removed. This patent disclosure focuses on the general case and shape of primes. Any additional simplification of these gates beyond the claims is covered. The Full Adders are used to reduce the number of inputs in Line 6 of Algorithm 2 from 4 inputs to 2 outputs, which requires two Full Adders. This defines the carry-save nature of the Montgomery multiplication algorithm as Algorithm 2 reduces to a partial sum ps and partial carry pc but does not determine the sum of partial results until Line 9 of Algorithm 2.

(23) This invention focuses on special forms of prime moduli that can simplify this computational hierarchy. Notably, when the modulus is of the form f.Math.2.sup.e−1 or f.Math.2.sup.e+1, we can remove many Full Adder and AND gates at the least significant portion of the systolic architecture with special arithmetic tricks targeting that the least significant portion of the modulus will have all bits set.

(24) For a modulus that is k bits long, the prior art implementation utilizes 2.Math.k+3 Full Adders, 2.Math.k+4 AND gates, and 3 XOR gates. For moduli of the form f.Math.2.sup.e−1, our optimized implementation uses 2.Math.k+3−e Full Adders, 2.Math.k+4−e AND gates, and 2 XOR gates. Further, our next_q computation block has a smaller critical path as it requires fewer operations to go from input to output. The next_q computation is no longer dependent on the previous q. For moduli of the form f.Math.2.sup.e+1, our optimized implementation uses 2.Math.k+3−e Full Adders, 2.Math.k+4−e AND gates, and 3 XOR gates. In the systolic architecture, e AND gates and e Full Adders are saved. Considering some applications like SIDH use primes where e is half the bitlength of the prime, these optimizations, save almost 25% of cost to implement this modular multiplication.

(25) When considering this invention over an arbitrary radix or digit d, there are still many advantages with our design. Here, we unroll each loop to perform d iterations at a time. The space savings still exist here. Digit d=1 is shown in FIG. 2, FIG. 3, and FIG. 4, which uses a variety of Full Adders and AND gates. For higher digit sizes, the Full Adder and AND gates must be replicated to account for more bits to add and multiply. Digit d=2 is shown in FIGS. 5a-b, FIGS. 6a-b, and FIG. 7. In the prior art as is shown in FIGS. 5a-b, each increment in digit size leads to 2.Math.k+3 additional Full Adders, 2.Math.k+4 AND gates, and 3 XOR gates. Similar to the above, for moduli of the form f.Math.2.sup.e−1, each increment in digit size accounts for 2.Math.k+3−e Full Adders, 2.Math.k+4−e AND gates, and 2 XOR gates. The critical path for the next_q computation is again smaller. For moduli of the form f.Math.2.sup.e+1, each increment in digit size leads to 2.Math.k+3−e Full Adders, 2.Math.k+4−e AND gates, and 3 XOR gates. Thus, for a digit d, this invention saves de AND gates and de Full Adders.

(26) To illustrate the inventive step here, consider performing modular multiplication over the modulus 62207=3.sup.5.Math.2.sup.8−1. Thus, in this example, f=3.sup.5=243, e=8, and k=16. In the below table, we summarize the complexity to implement this in gate operations using the Montgomery multiplication algorithm and iterating over a d bits at a time. Across the board, the proposed invention requires approximately 25% fewer full adders and AND gates.

(27) TABLE-US-00003 TABLE 1 Gate complexity of prior art Montgomery multiplication over 62207 Digit d #Full Adders #AND #XOR 1 35 36 3 2 70 72 6 4 140 144 12 8 280 288 24

(28) TABLE-US-00004 TABLE 2 Gate complexity of proposed invention implementing Montgomery multiplication over 62207 Digit d #Full Adders #AND #XOR 1 27 28 2 2 54 56 4 4 108 112 8 8 216 224 16

(29) As a further example, consider performing modular multiplication over the modulus 10369=3.sup.4.Math.2.sup.7−1. Thus, in this example, f=3.sup.4=81, e=7, and k=14. In the below table, we summarize the complexity to implement this in gate operations using the Montgomery multiplication algorithm and iterating over a d bits at a time. Similar to before, the proposed invention requires approximately 25% fewer full adders and AND gates.

(30) TABLE-US-00005 TABLE 3 Gate complexity of prior art Montgomery multiplication over 10367 Digit d #Full Adders #AND #XOR 1 31 32 3 2 62 64 6 4 124 128 12 8 248 256 24

(31) TABLE-US-00006 TABLE 4 Gate complexity of proposed invention implementing Montgomery multiplication over 10367 Digit d #Full Adders #AND #XOR 1 24 25 2 2 48 50 4 4 96 100 8 8 192 200 16