EFFICIENT ARCHITECTURE AND METHOD FOR ARITHMETIC COMPUTATIONS IN POST-QUANTUM CRYPTOGRAPHY
20210320796 · 2021-10-14
Assignee
Inventors
Cpc classification
H04L9/3093
ELECTRICITY
H04L9/003
ELECTRICITY
G06F7/726
PHYSICS
H04L9/3073
ELECTRICITY
International classification
H04L9/30
ELECTRICITY
Abstract
A computer processing system for reducing a processing footprint in cryptosystems utilizing quadratic extension field arithmetic such as pairing-based cryptography, elliptic curve cryptography, code-based cryptography and post-quantum elliptic curve cryptography that includes at least one computer processor having a register file with three processor registers operably configured to implement quadratic extension field arithmetic equations in a finite field of F.sub.p.sup.2 and a multiplexer operably configured to selectively shift from each of the three processor registers in sequential order to generate modular additional results and modular multiplication results from the three processor registers.
Claims
1. A computer processing system for reducing a processing footprint in cryptosystems utilizing quadratic extension field arithmetic comprising: at least one computer processor having a register file having three processor registers operably configured to implement quadratic extension field arithmetic equations in a finite field of F.sub.p.sup.2 and a multiplexer operably configured to selectively shift from each of the three processor registers in sequential order to generate modular addition results and modular multiplication results from the three processor registers.
2. The computer processing system according to claim 1, wherein: the register file is of a circular-access register file block.
3. The computer processing system according to claim 1, wherein three processor registers consist essentially of: a first work processor register, a second work processor register, and an accumulator register.
4. The computer processing system according to claim 3, further comprising: a digit-serial adder unit operably coupled to the first work processor register and a constant prime number for which the finite field of F.sub.p.sup.2 is defined; and a modular multiplication unit operably coupled to the second work processor register and the constant prime number for which the finite field of F.sub.p.sup.2 is defined.
5. The computer processing system according to claim 4, wherein: the digit-serial adder unit and modular multiplication unit are single units.
6. The computer processing system according to claim 1, wherein: the quadratic extension field arithmetic equations have an irreducible polynomial of i.sup.2+1.
7. The computer processing system according to claim 1, wherein the register file further comprises: five processor registers operably configured to implement quadratic extension field arithmetic equations in a finite field of F.sub.p.sup.2, the five processor registers consist essentially of a first work processor register, a second work processor register, a third work processor register, a first accumulator register, and a second accumulator register.
8. A computer-implemented method for reducing a processing footprint in cryptosystems utilizing quadratic extension field arithmetic comprising the steps of: providing at least one computer processor having a register file with a first work processor register, a second work processor register, and an accumulator register; initiating, through the least one computer processor, a cryptography session that includes: defining a constant prime number p for which a finite field of F.sub.p.sup.2 is defined; receiving numerical data into the first and second work processor registers sequentially and executing quadratic extension field arithmetic equations in the finite field of F.sub.p.sup.2 using the numerical data to generate arithmetic results from each of the first and second work processor registers; receiving the arithmetic results into the accumulator register and providing the arithmetic results from the accumulator register to the first work processor register to define a circular-access register file block; and outputting the arithmetic results through the first work processor register.
9. The computer-implemented method according to claim 8, further comprising: receiving numerical data into the first and second work processor registers sequentially through a one-way data flow.
10. The computer-implemented method according to claim 8, wherein: the first work processor register includes a digit-serial adder unit operably coupled to the constant prime number for which the finite field of F.sub.p.sup.2 is defined; and the second work processor register includes a modular multiplication unit operably coupled to the constant prime number for which the finite field of F.sub.p.sup.2 is defined.
11. The computer-implemented method according to claim 8, wherein the cryptography session further comprises: sequentially shifting from first and second work processor registers and the accumulator register through a multiplexer.
12. The computer-implemented method according to claim 8, wherein the cryptography session further comprises: choosing a quadratic extension field with an irreducible polynomial of i.sup.2+1 after the constant prime number p has been defined.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0021] The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and explain various principles and advantages all in accordance with the present invention.
[0022]
[0023]
[0024]
DETAILED DESCRIPTION
[0025] While the specification concludes with claims defining the features of the invention that are regarded as novel, it is believed that the invention will be better understood from a consideration of the following description in conjunction with the drawing figures, in which like reference numerals are carried forward. It is to be understood that the disclosed embodiments are merely exemplary of the invention, which can be embodied in various forms.
[0026] The present invention provides a novel and efficient system and method for reducing the processing footprint of arithmetic computations for cryptosystems. More specifically, the system and method is directed toward a cryptosystem having a lightweight accelerator for arithmetic operations in F.sub.p.sup.2 (hereinafter “F.sub.p.sup.2 accelerator”) containing p.sup.2 elements, wherein “p” is a prime number, preferably a constant prime number for which F.sub.p.sup.2 field is defined. The system is lightweight in that it is structurally and operably configured to minimize the number of processor registers, namely intermediate registers, uses a circular work register block, and includes a digit-serial adder and modular multiplier. Specifically, the system is configured with hardware and instructions to target prime fields where extension field arithmetic can be defined over an irreducible polynomial i.sup.2+1, wherein this characteristic is specifically targeted to design the F.sub.p.sup.2 accelerator.
[0027] One embodiment of the present invention is shown schematically through a block diagram in
[0028] As those of skill in the art will appreciate, a register file is an array of processor registers in a central processing unit (CPU). In one embodiment, the processor registers 104, 106, 108 are circuit-based register files and may be implemented by way of fast static random access memories (RAMs) with multiple ports. Such RAMs are distinguished by having dedicated read and write ports, whereas ordinary multi-ported RAMs will usually read and write through the same ports. In other embodiments, the processor registers 104, 106, 108 may be implemented by way of fast dynamic RAMs. The instruction set architecture of a CPU will almost always define a set of registers which are used to stage data between memory and the functional units on the chip. In simpler CPUs, these architectural registers correspond one-for-one to the entries in a physical register file (PRF) within the CPU. More complicated CPUs use register renaming, so that the mapping of which physical entry stores a particular architectural register changes dynamically during execution.
[0029]
[0030] In the embodiment shown in
[0031] More specifically, the cryptography session may include step 306 defining a constant prime number, p, for which a finite field of F.sub.p.sup.2 is defined. The prime number p may be predefined or dynamically and selectively defined by the user and/or the processor. Next, step 308 may include receiving numerical data into the first and second work processor registers 104, 106 sequentially and executing quadratic extension field arithmetic equations (represented schematically in
[0032] As those of skill in the art will appreciate, the multiplexer 110 is a device that selects one of several analog or digital input signals and forwards the selected input into a single line. A multiplexer of 2.sup.n inputs has n select lines, which are used to select which input line to send the output. A multiplexer can also be used to implement Boolean functions of multiple variables.
[0033] The system may also include a digit-serial adder unit 114 operably coupled to the first work processor register 104 and a constant prime number p (depicted in
[0034] The quadratic extension field arithmetic equations operably configured to be carried out by the F.sub.p.sup.2 accelerator can include modular addition, subtraction, squaring, multiplication, and inversion in the finite field F.sub.p.sup.2. More specifically, elements in F.sub.p.sup.2 are represented in the form of A=a.sub.0+ia.sub.1, wherein i is a non-quadratic residue and a.sub.0 and a.sub.1 are elements of F.sub.p. When the irreducible polynomial is selected as i.sup.2+1, where i=√{square root over (−1)} in F.sub.p.sup.2, certain finite field operations can be defined in F.sub.p.sup.2 with specific formulas involving F.sub.p.sup.2 operations:
A+B=a.sub.0+b.sub.0+i(a.sub.1+b.sub.1)
A−B=a.sub.0−b.sub.0+i(a.sub.1−b.sub.1)
A×B=(a.sub.0+a.sub.1)(b.sub.0−b.sub.1)+a.sub.0b.sub.1−a.sub.1b.sub.0+i(a.sub.0b.sub.1+a.sub.1b.sub.0)
A.sup.2=(a.sub.0+a.sub.1)(a.sub.0−a.sub.1)+i2a.sub.0a.sub.1
A.sup.−1=a.sub.0(a.sub.0.sup.2+a.sub.1.sup.2).sup.−1−ia.sub.1(a.sub.0.sup.2+a.sub.1.sup.2).sup.−1
[0035] Thus, an addition and subtraction require 2 F.sub.p additions, a multiplication requires 3 F.sub.p multiplications and 5 F.sub.p additions, a squaring requires 2 F.sub.p multiplications and 3 F.sub.p additions, and an inversion requires 1 F.sub.p inversion, 2 F.sub.p multiplications, 2 F.sub.p squarings, and 2 F.sub.p additions. Table 1, below, exemplifies the F.sub.p.sup.2 accelerator carrying out F.sub.p.sup.2 squaring. More specifically, assume that we are squaring the value A=a.sub.0+ia.sub.1 to get A.sup.2=(a.sub.0+a.sub.1)(a.sub.0−a.sub.1)+i2a.sub.0a.sub.1. This architecture uses an accumulate-based approach. It is also assumed the file register is of a circular register buffer in the order register R.sub.0 to R.sub.1 to R.sub.2 to R.sub.0 and that R.sub.0 and R.sub.1 are work registers. Therefore, A.sup.2 and A×B with the controls shown in Tables 1-2, respectively. Said another way, one register, e.g., register 108 is temporary to accumulate the intermediate results and perform operations sequentially in the work registers, e.g., registers 104, 106.
TABLE-US-00001 TABLE 1 Control R.sub.0 R.sub.1 R.sub.2 Load a.sub.0 a.sub.0 Load a.sub.1 a.sub.1 a.sub.0 Add a.sub.0 + a.sub.1 a.sub.0 Load a.sub.1 a.sub.1 a.sub.0 + a.sub.1 a.sub.0 Shift a.sub.0 a.sub.1 a.sub.0 + a.sub.1 Subtract a.sub.0 − a.sub.1 a.sub.1 a.sub.0 + a.sub.1 Shift a.sub.0 + a.sub.1 a.sub.0 − a.sub.1 a.sub.1 Multiply (a.sub.0 + a.sub.1)(a.sub.0 − a.sub.1) a.sub.0 − a.sub.1 a.sub.1 Store R.sub.0 (a.sub.0 + a.sub.1)(a.sub.0 − a.sub.1) a.sub.0 − a.sub.1 a.sub.1 Shift a.sub.1 (a.sub.0 + a.sub.1)(a.sub.0 − a.sub.1) a.sub.0 − a.sub.1 Load a.sub.0 a.sub.0 a.sub.1 (a.sub.0 + a.sub.1)(a.sub.0 − a.sub.1) Multiply a.sub.0a.sub.1 a.sub.1 (a.sub.0 + a.sub.1)(a.sub.0 − a.sub.1) Copy R.sub.0 a.sub.0a.sub.1 a.sub.0a.sub.1 a.sub.1 Add 2a.sub.0a.sub.1 a.sub.0a.sub.1 a.sub.1 Store R.sub.0 2a.sub.0a.sub.1 a.sub.0a.sub.1 a.sub.1
[0036] Table 2, below, exemplifies the F.sub.p.sup.2 accelerator carrying out F.sub.p.sup.2 multiplication.
TABLE-US-00002 TABLE 2 Control R.sub.0 R.sub.1 R.sub.2 Load a.sub.0 a.sub.0 Load a.sub.1 a.sub.1 a.sub.0 Add a.sub.0 + a.sub.1 a.sub.0 Load b.sub.0 b.sub.0 a.sub.0 + a.sub.1 a.sub.0 Load b.sub.1 b.sub.1 b.sub.0 a.sub.0 + a.sub.1 Subtract b.sub.0 − b.sub.1 b.sub.0 a.sub.0 + a.sub.1 Shift a.sub.0 + a.sub.1 b.sub.0 − b.sub.1 b.sub.0 Multiply t = (a.sub.0 + a.sub.1)(b.sub.0 − b.sub.1) b.sub.0 − b.sub.1 b.sub.0 Store R.sub.0 t b.sub.0 − b.sub.1 b.sub.0 Load a.sub.1 a.sub.1 t b.sub.0 − b.sub.1 Load b.sub.0 b.sub.0 a.sub.1 t Multiply a.sub.1b.sub.0 a.sub.1 t Store R.sub.0 at b.sub.0 a.sub.1b.sub.0 a.sub.1 t Shift t a.sub.1b.sub.0 a.sub.1 Subtract t − a.sub.1b.sub.0 a.sub.1b.sub.0 a.sub.1 Load a.sub.0 a.sub.0 t − a.sub.1b.sub.0 a.sub.1b.sub.0 Load b.sub.1 b.sub.1 a.sub.0 t − a.sub.1b.sub.0 Multiply a.sub.0b.sub.1 a.sub.0 t − a.sub.1b.sub.0 Shift t − a.sub.1b.sub.0 a.sub.0b.sub.1 a.sub.0 Add t + a.sub.0b.sub.1 − a.sub.1b.sub.0 a.sub.0b.sub.1 a.sub.0 Store R.sub.0 t + a.sub.0b.sub.1 − a.sub.1b.sub.0 a.sub.0b.sub.1 a.sub.0 Load a.sub.1b.sub.0 a.sub.1b.sub.0 t + a.sub.0b.sub.1 − a.sub.1b.sub.0 a.sub.0b.sub.1 Shift a.sub.0b.sub.1 a.sub.1b.sub.0 t + a.sub.0b.sub.1 − a.sub.1b.sub.0 Add a.sub.0b.sub.1 + a.sub.1b.sub.0 a.sub.1b.sub.0 t + a.sub.0b.sub.1 − a.sub.1b.sub.0 Store R.sub.0 a.sub.0b.sub.1 + a.sub.1b.sub.0 a.sub.1b.sub.0 t + a.sub.0b.sub.1 − a.sub.1b.sub.0
Table 3, below, depicts operation complexities for F.sub.p.sup.2 arithmetic as compared to known prior art. Specifically, the system and method of the present invention utilizes these multiplication and squaring formulas to minimize their arithmetic complexity. As those of skill in the art will appreciate, F.sub.p multiplication is much more arithmetically intensive than F.sub.p addition.
TABLE-US-00003 TABLE 3 F.sub.p Operation Add Multiplication Inversion F.sub.p.sup.2 Operation (A) (M) (I) Addition (A) 2 0 0 Multiplication (M) (Prior Art) 2 4 0 Multiplication (M) (F.sub.p.sup.2 Accelerator) 5 3 0 Squaring (S) (Prior Art) 2 3 0 Squaring (S) (F.sub.p.sup.2 Accelerator) 3 2 0 Inversion (I) 2 4 1
[0037] One benefit to the architecture depicted in
[0038] Therefore, to minimize the register footprint, the present inventions preferably utilizes two work registers 104, 106 and an accumulator register 108, which is sufficient for all necessary F.sub.p.sup.2 operations. The number of arithmetic operations needed is then minimized by choosing a quadratic extension field with the irreducible polynomial i.sup.2+1. In the resulting architecture, digit-serial addition and modular multiplication units 114, 116 may also be utilized to perform the necessary prime field arithmetic in series, using circular register shifts and loads, implemented through the multiplexer as needed.
[0039] Beneficially, some principal applications of the present invention may include reducing the processing area overhead (i.e., the number of gates in a digital circuit) to perform F.sub.p.sup.2 arithmetic. As such, features such as providing a circular register file, an accumulator-based flow with minimum register complexity, the choice of special form for primes in quadratic extension fields, and small digit-serial arithmetic units to create a separated F.sub.p.sup.2 arithmetic unit generates the advantageous small processing footprint in hardware. Another goal or target of the aforementioned system and method is in post-quantum cryptography use, which also translates to uses in elliptic curve cryptography. Other applications include pairing-based cryptography, elliptic curves with isomorphisms (such as 4Q), code-based cryptography, and lattices. With reference now to
[0040] More specifically, the F.sub.p.sup.2 accelerator architecture 200 utilizes the bit-parallel modular multiplier 202 and a digit-serial adder 204 with a 32-bit adder/subtractor. In this embodiment, two additional work registers 208, 210 (also depicted in
TABLE-US-00004 TABLE 4 Prime #FF #FA #AND #NOT #2:1MUX P.sub.434 2200 900 871 32 5706 2.sup.2163.sup.137 − 1 P.sub.503 2535 1038 1009 32 6603 2.sup.2503.sup.159 − 1 P.sub.546 2792 1124 1095 32 7162 2.sup.2733.sup.177 − 1 P.sub.751 3791 1534 1505 32 9827 2.sup.3723.sup.239 − 1 P.sub.960 4866 1952 1923 32 12544 2.sup.4863.sup.301 − 1 p.sub.m 5m + 2 + 2 2m + d 2m + 3 d 13m + 2d (d − m % d)
[0041] Table 5, below, depicts implementation results of instantiated F.sub.p.sup.2 accelerator on Artix-7 FPGA.
TABLE-US-00005 TABLE 5 Latency (clock cycles) Area F.sub.p.sup.2 Prime #FFs #LUTs F.sub.p add F.sub.p mult. F.sub.p.sup.2 add mult. p.sub.434 2,233 2,618 47 470 94 1,645 p.sub.503 2,583 3,059 53 543 106 1,894 p.sub.751 3,837 4,459 77 807 154 2,806
[0042] The primary source of gates come from the bit-parallel multiplier and there are also many 2:1 multiplexers as a result of interfacing between the registers for various operations. As reflected in Table 5, above, the efficacy of the F.sub.p.sup.2 architecture in an Artix-7 FPGA can be seen. This implementation, however, did not include additional control signals, which would constitute an extremely small additional area overhead. The processing area and time complexities are also shown in Table 5 and it can be seen that the F.sub.p.sup.2 time complexity uses the operation counts in Table 3. The critical path is the digit-serial adder, which was a 32-bit adder in this case. When discussing the complexity of this instantiated F.sub.p.sup.2 accelerator, it is important to note the naive approach. From Table 5, again, it becomes readily apparent that performing extra multiplications becomes very costly.
[0043] Thus, by using the formulas we provide for F.sub.p.sup.2 operations, the aforementioned system and method are saving many cycles by replacing a modular multiplication with a few modular additions or subtractions. In Table 6, below, a timing comparison between the present invention's F.sub.p.sup.2 multiplication and squaring formulas versus naive formulas are shown. It can be seen that F.sub.p.sup.2 multiplication is approximately 20% faster and F.sub.p.sup.2 squaring is 30% faster when compared to naive formulas. However, the choice of modular multiplier and modular adder were combined in three work registers. This adds additional multiplexers for interfacing logic, but also reduces multiplexers that would be used to initialize these registers. The logical flow of the registers in the circular buffer eliminates the need for excess 2:1 multiplexers.
TABLE-US-00006 TABLE 6 Latency (clock cycles) Naïve Arithmetic F.sub.p.sup.2 Arithmetic F.sub.p.sup.2 F.sub.p.sup.2 F.sub.p.sup.2 F.sub.p.sup.2 Prime multiplication squaring multiplication squaring p.sub.434 1,974 1,504 1,645 1,081 p.sub.503 2,278 1,735 1,894 1,245 p.sub.751 3,382 2,575 2,806 1,845
[0044] Table 7, below, depicts the F.sub.p.sup.2 accelerator performing addition operations.
TABLE-US-00007 TABLE 7 Control R.sub.0 ps carry Initial conditions 70 45 0 Initial conditions in binary 8b0100_0110 8b0010_1101 0 Add lowest digit and store to ps 8b0100_0110 8b0010_0011 1 Shift d bits 8b0110_0100 8b0011_0010 1 Add lowest digit with carry to ps 8b0110_0100 8b0011_0111 0 Shift d bits 8b0100_0110 8b0111_0011 0 Final result is ps 70 115 0
[0045] By way of example utilizing the F.sub.p.sup.2 accelerator depicted in
TABLE-US-00008 TABLE 8 Control R.sub.0 = x R.sub.1 = y ps = Sc pc = Ss q Initialize ps and pc to zero 13 5 0 0 0 Step 0 13 5 0 0 1 Step 1 13 5 33 5 0 Step 2 13 5 19 0 0 Step 3 13 5 10 2 1 Step 4 13 5 39 5 0 Step 5 13 5 20 2 0 Step 6 13 5 11 0 1 Final conditions 13 5 38 3 0 Final result is ps + pc = 41 13 5 38 3 0
[0046] Table 8, above, depicts an example of modular multiplication in depth. This performs a bit-parallel modular multiplier based on the Montgomery multiplication algorithm. For this, the system will calculate m smaller modular multiplications for an m-bit prime p. For p=71, we have a 7-bit prime, so we perform 7 smaller modular multiplications for the final result. The idea of a multiplier is that it is based on the carry-save adder technique, where we perform very fast additions in parallel without worrying about carry propagations. Algorithm 1, below, shows the algorithm that the multiplier follows. Table 8 shows a cycle-by-cycle breakdown of what the 7 cycles will be in Table 8. We perform the multiplication with a=13 and b=5, which when using R=2.sup.7, will produce the result c=abR.sup.−1 mod 71=41.
[0047] Algorithm 1—Bit-serial CSA Montgomery multiplication. Sc and Ss refer to the carry and sum bits of the result S, respectfully. Specifically, the input may be a Modulus M, R=2.sup.k>M, operands x, y<M. The output may be z=xyR.sup.−1 mod M. Therefore, an exemplary software-based algorithm to effectuate the same is depicted below:
TABLE-US-00009 1. Sc = 0,S s = 0 2. for i in 0 to (k) − 1 do 3. q = Sc[0] + Ss[0] + x[i] .Math. y[0] mod 2 4. (Sc, Ss) = (Sc[i] + Ss[i] + x[i] .Math. y + q[i] .Math. M)/2 5. end for 6. z = Sc + Ss 7. if z ≥ M, then z = z − M 8. return z = xyR.sup.-1 mod M
[0048] Table 9, below, depicts exemplary results of the F.sub.p.sup.2 accelerator generating computations and depicting what the registers hold when implementing exemplary quadratic extension field arithmetic equations in a finite field of F.sub.p.sup.2. For this example, p=71 and perform an F.sub.p.sup.2 multiplication (with irreducible polynomial i.sup.2+1) between A=5+3i and B=13+9i to produce the result C. What is not shown, for brevity, are the cycle-by-cycle contents of the modular multiplier or addition, as exemplary results of these are depicted in Tables 7 and 8, respectively. The values are converted to the Montgomery form by performing the computation a.sub.jR mod p and b.sub.jR mod p with R=2.sup.7 for j=0; 1. Thus, the inputs in the Montgomery domain are A=1+29i and B=31+16i. The final value for C=36+31i, so the normal representation of this result is C=38+13i, which can be recovered in a simple way by performing the computations C=(5×13-3×9)+(5×9+3×13)=38+13i.
TABLE-US-00010 TABLE 9 Control R.sub.0 R.sub.1 R.sub.2 ps pc Perform a.sub.0 + a.sub.1 mod p (F.sub.p addition), a.sub.0 = 1, a.sub.1 = 29 Load a.sub.0 1 Store R.sub.0 to ps 1 1 Load a.sub.1 29 1 Add 29 1 30 Load p 71 1 30 Copy ps 71 1 30 30 Subtract p 71 1 −41 30 Store mod p res 30 1 −41 30 Perform b.sub.0 − b.sub.1 mod p (F.sub.p subtraction), b.sub.0 = 31, b.sub.1 = 16 Load b.sub.0 31 30 1 −41 30 Store R.sub.0 to ps 31 30 1 31 30 Load b.sub.1 16 31 30 31 30 Subtract 16 31 30 15 30 Load p 71 31 30 15 30 Copy ps 71 31 30 15 15 Add p 71 31 30 86 15 Store mod p res 15 31 30 86 15 Perform t = (a.sub.0 + a.sub.1)(b.sub.0 − b.sub.1) mod p (F.sub.p multiplication) a.sub.0 + a.sub.1 = 30, b.sub.0 − b.sub.1 = 15 Shift 30 15 31 86 15 Multiply 30 15 31 46 3 Store pc 3 15 31 46 3 Add pc + ps 3 15 31 49 3 Load p 71 15 31 49 3 Copy ps 71 15 31 49 49 Sub p 71 15 31 −22 49 Store mod p res 49 15 31 −22 49 Perform a.sub.1b.sub.0 mod p (F.sub.p multiplication), a.sub.1 = 29, b.sub.0 = 31 Load a.sub.1 29 49 31 −22 49 Load b.sub.0 31 29 49 −22 49 Multiply 31 29 49 22 0 Store pc 0 31 49 22 0 Add pc + ps 0 31 49 22 0 Load p 71 31 49 22 0 Copy ps 71 31 49 22 22 Sub p 71 31 49 −49 22 Store mod p res 22 31 49 −49 22 Store R.sub.0 at c.sub.1 22 31 49 −49 22 Perform u = t − a.sub.1b.sub.0 mod p (F.sub.p subtraction), t = 49, a.sub.1b.sub.0 = 22 Store R.sub.0 to ps 22 31 49 22 22 Shift 49 22 31 22 22 Subtract 49 22 31 27 22 Load p 71 22 31 27 22 Copy ps 71 22 31 27 27 Add p 71 22 31 98 27 Store mod p res 27 22 31 98 27 Perform a.sub.0b.sub.1 mod p (F.sub.p multiplication), a.sub.0 = 1, b.sub.1 = 16 Load a.sub.0 1 27 22 98 27 Load b.sub.1 16 1 27 98 27 Multiply 16 1 27 9 0 Store pc 0 1 27 9 0 Add pc + ps 0 1 27 9 0 Load p 71 1 27 9 0 Copy ps 71 1 27 9 9 Sub p 71 1 27 −62 9 Store mod p res 9 1 27 −62 9 Perform c.sub.0 = u + a.sub.0b.sub.1 mod p (F.sub.p addition), u = 27, a.sub.0b.sub.1 = 9 Store R.sub.0 to ps 9 1 27 9 9 Shift 27 9 1 9 9 Add 27 9 1 36 9 Load p 71 27 9 36 9 Copy ps 71 27 9 3 36 Subtract p 71 27 9 −35 36 Store mod p res 36 27 9 −35 36 Store R.sub.0 at c.sub.0 36 27 9 −35 36 Perform c.sub.1 = a.sub.0b.sub.1 + a.sub.1b.sub.0 mod p (F.sub.p addition), a.sub.0b.sub.1 = 9, a.sub.1b.sub.0 = 22 Shift 9 36 27 −35 36 Store R.sub.0 to ps 36 27 9 9 36 Load a.sub.1b.sub.0 22 9 27 9 36 Add 22 9 27 31 36 Load p 71 9 27 31 58 Copy ps 71 9 27 31 31 Subtract p 71 9 27 −40 31 Store mod p res 31 9 27 −40 31 Store R.sub.0 at c.sub.1 31 9 27 −40 31