LOW-LATENCY PIPELINE AND METHOD FOR USE OF A LOW-LATENCY PIPELINE IN HOMOMORPHIC ENCRYPTION
20230216656 · 2023-07-06
Inventors
CPC classification
H04L9/3093
ELECTRICITY
G06F17/16
PHYSICS
International classification
H04L9/00
ELECTRICITY
G06F17/16
PHYSICS
Abstract
A low-latency relinearization process can be performed in an FPGA cluster for accelerating homomorphic encryption. The low-latency process performs an early calculation on the last matrix row to make the summation result available earlier in the relinearization, reducing the waiting of subsequent operations.
Claims
1. A low-latency relinearization method implemented by a field programmable gate array (FPGA) cluster, comprising: receiving at the FPGA cluster a polynomial vector comprising R components; performing modular reduction on the polynomial vector to generate a polynomial matrix comprising R+1 rows and L columns; multiplying the polynomial matrix by a first Keyswitch matrix to generate a first intermediate polynomial matrix; summing a last row of the first intermediate polynomial matrix to generate a first early-summation element and performing a modular reduction on the first early-summation element to generate a first early-summation vector; summing remaining rows of the first intermediate polynomial matrix to generate a first summation vector; subtracting the first early-summation vector from the first summation vector to generate a first subtraction-result vector; multiplying the first subtraction-result vector by an inverse moduli vector to generate a first result vector; and outputting from the FPGA cluster the first result vector.
2. The method of claim 1, wherein multiplying the polynomial matrix by the first Keyswitch matrix to generate the first intermediate polynomial matrix is done so as to generate the last row of the first intermediate polynomial matrix first.
3. The method of claim 1, wherein the first intermediate polynomial matrix is larger than available memory resources and is processed in slices of rows.
4. The method of claim 1, further comprising: multiplying the polynomial matrix by a second Keyswitch matrix to generate a second intermediate polynomial matrix; summing a last row of the second intermediate polynomial matrix to generate a second early-summation element and performing a modular reduction on the second early-summation element to generate a second early-summation vector; summing remaining rows of the second intermediate polynomial matrix to generate a second summation vector; subtracting the second early-summation vector from the second summation vector to generate a second subtraction-result vector; multiplying the second subtraction-result vector by an inverse moduli vector to generate a second result vector; and outputting from the FPGA cluster the second result vector.
5. The method of claim 1, wherein the received polynomial vector is in a number theoretic transform (NTT) domain, the method further comprising: prior to performing modular reduction on the polynomial vector, transforming the NTT domain polynomial vector to the INTT domain; prior to multiplying the polynomial matrix by the first Keyswitch matrix, transforming the INTT domain polynomial matrix to the NTT domain; and prior to performing the modular reduction on the first early-summation element, transforming the NTT domain first early-summation element to the INTT domain.
6. The method of claim 5, wherein the FPGA cluster comprises: a plurality of NTT modules for transforming a vector or matrix from the INTT domain to the NTT domain; a plurality of INTT modules for transforming a vector or matrix from the NTT domain to the INTT domain; and a plurality of multiplication modules for multiplying polynomial vectors in the NTT domain.
7. The method of claim 6, wherein the plurality of NTT modules, INTT modules and multiplication modules are arranged to provide a pipeline for performing the low-latency relinearization method.
8. The method of claim 7, wherein the pipeline is arranged in a plurality of stages comprising: a first stage for transforming the NTT domain polynomial vector to the INTT domain; a second stage for transforming the INTT domain polynomial matrix to the NTT domain; a third stage for multiplying the polynomial matrix in the NTT domain by the first Keyswitch matrix; a fourth stage for transforming the first early-summation element and the summation vector from the NTT domain to the INTT domain; a fifth stage for transforming the first subtraction-result vector from the INTT domain to the NTT domain; and a sixth stage for multiplying the first subtraction-result vector in the NTT domain by the inverse moduli vector.
9. The method of claim 8, wherein: the first stage further performs modular reduction on the INTT domain polynomial vector; the third stage performs the modular addition for each row vector; and the fifth stage performs the modular subtraction.
10. The method of claim 9, wherein the fourth stage further performs modular reduction on the INTT domain first early-summation element.
11. A field programmable gate array (FPGA) cluster for use in homomorphic encryption, comprising: a plurality of FPGAs configured to provide a pipeline for performing a low-latency relinearization method comprising: receiving at the FPGA cluster a polynomial vector comprising R components; performing modular reduction on the polynomial vector to generate a polynomial matrix comprising R+1 rows and R columns; multiplying the polynomial matrix by a first Keyswitch matrix to generate a first intermediate polynomial matrix; summing a last row of the first intermediate polynomial matrix to generate a first early-summation element and performing a modular reduction on the first early-summation element to generate a first early-summation vector; summing remaining rows of the first intermediate polynomial matrix to generate a first summation vector; subtracting the first early-summation vector from the first summation vector to generate a first subtraction-result vector; multiplying the first subtraction-result vector by an inverse moduli vector to generate a first result vector; and outputting from the FPGA cluster the first result vector.
12. The FPGA cluster of claim 11, wherein the intermediate polynomial matrix is larger than available on-chip memory and DSP resources and is processed in slices of rows.
13. The FPGA cluster of claim 11, wherein the received polynomial vector is in a number theoretic transform (NTT) domain, the method further comprising: prior to performing modular reduction on the polynomial vector, transforming the NTT domain polynomial vector to the INTT domain; prior to multiplying the polynomial matrix by the first Keyswitch matrix, transforming the INTT domain polynomial matrix to the NTT domain; and prior to performing the modular reduction on the first early-summation element, transforming the NTT domain first early-summation element to the INTT domain.
14. The FPGA cluster of claim 11, wherein the plurality of FPGAs provide: a plurality of NTT modules for transforming a vector or matrix from the INTT domain to the NTT domain; a plurality of INTT modules for transforming a vector or matrix from the NTT domain to the INTT domain; and a plurality of multiplication modules for multiplying polynomial vectors in the NTT domain.
15. The FPGA cluster of claim 13, wherein the pipeline is arranged in a plurality of stages comprising: a first stage for transforming the NTT domain polynomial vector to the INTT domain; a second stage for transforming the INTT domain polynomial matrix to the NTT domain; a third stage for multiplying the polynomial matrix in the NTT domain by the first Keyswitch matrix; a fourth stage for transforming the first early-summation element from the NTT domain to the INTT domain; a fifth stage for transforming the first subtraction-result vector from the INTT domain to the NTT domain; and a sixth stage for multiplying the first subtraction-result vector in the NTT domain by the inverse moduli vector.
16. The FPGA cluster of claim 15, wherein: the first stage further performs modular reduction on the INTT domain polynomial vector; the third stage performs the modular addition for each row vector; and the fifth stage performs the modular subtraction.
17. The FPGA cluster of claim 16, wherein the fourth stage further performs modular reduction on the INTT domain first early-summation element.
18. The FPGA cluster of claim 11, wherein multiplying the polynomial matrix by the first Keyswitch matrix to generate the first intermediate polynomial matrix is done so as to generate the last row of the first intermediate polynomial matrix first.
19. The FPGA cluster of claim 11, wherein the method provided by the pipeline further comprises: multiplying the polynomial matrix by a second Keyswitch matrix to generate a second intermediate polynomial matrix; summing a last row of the second intermediate polynomial matrix to generate a second early-summation element and performing a modular reduction on the second early-summation element to generate a second early-summation vector; summing remaining rows of the second intermediate polynomial matrix to generate a second summation vector; subtracting the second early-summation vector from the second summation vector to generate a second subtraction-result vector; multiplying the second subtraction-result vector by an inverse moduli vector to generate a second result vector; and outputting from the FPGA cluster the second result vector.
20. The FPGA cluster of claim 11, wherein the received polynomial vector is in a number theoretic transform (NTT) domain, the method further comprising: prior to performing modular reduction on the polynomial vector, transforming the NTT domain polynomial vector to the INTT domain; prior to multiplying the polynomial matrix by the first Keyswitch matrix, transforming the INTT domain polynomial matrix to the NTT domain; and prior to performing the modular reduction on the first early-summation element, transforming the NTT domain first early-summation element to the INTT domain.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0026] Further features and advantages of the present disclosure will become apparent from the following detailed description, taken in combination with the appended drawings.
DETAILED DESCRIPTION
[0031] A homomorphic encryption pipeline may be used to provide the encryption process. The pipeline may be characterized by various performance characteristics, including, for example, throughput and latency. Throughput measures how much data the pipeline can process in a given amount of time, whereas latency measures how long it takes, after providing an input, for the output to become available from the pipeline. Both may be important characteristics, and a trade-off between the two may be required for different applications. That is, in certain applications a higher throughput may be desirable even at the cost of higher latency, while in other applications a lower latency may be desirable even at the cost of lower throughput. The present process provides a low-latency process for the relinearization used in homomorphic encryption. The low-latency process performs an early sum calculation on the last row of a matrix in order to make the result available as the remaining matrix rows are summed. The early sum calculation can significantly reduce the latency while only slightly reducing the throughput.
[0032] The homomorphic encryption process may make use of a relinearization process, which in turn comprises a Keyswitch process and a Residual Number System (RNS) Floor operation. The Keyswitch and RNS Floor algorithm works as follows. An input vector of ciphertext C_i is received, where i ∈ [0, R−1], each C_i is a polynomial of degree N−1, and the coefficient bit width is q_i. R is the number of RNS components. Two Keyswitch keys are used to transform the ciphertext into the constant term and the linear term. The structure of both keys is the same. Each key is represented as a matrix as shown in Table 1, where K_i,j is a polynomial and the subscripts i,j denote RNS component i and modulus P_j, with i ∈ [0, R−1] and j ∈ [0, R].
TABLE 1. The Keyswitch key structure

K_0,0    K_1,0    . . .    K_R−1,0
K_0,1    K_1,1    . . .    K_R−1,1
. . .    . . .    . . .    . . .
K_0,R    K_1,R    . . .    K_R−1,R
[0033] Since the Keyswitch key has the same structure for both sets of keys, only one is shown in Table 1 and in the later explanation of the algorithm. The relinearization process outputs constant-term and linear-term ciphertexts, Const_i and Linear_i, where i ∈ [0, R−1].
[0034] In calculating the constant and linear terms, the input ciphertext vector is expanded to a matrix by performing modular reduction modulo each P_j; the resulting ciphertext matrix is depicted in Table 2.
TABLE 2. The ciphertext matrix structure

C_0,0    C_1,0    . . .    C_R−1,0
C_0,1    C_1,1    . . .    C_R−1,1
. . .    . . .    . . .    . . .
C_0,R    C_1,R    . . .    C_R−1,R
[0035] The elements of the ciphertext matrix are updated by performing a pair-wise multiplication between the ciphertext matrix and the Keyswitch key matrix according to: C_i,j = C_i,j ⊙ K_i,j mod P_j, where ⊙ is the pair-wise multiplication of matrices. Modular addition is then performed for the elements in each row as Sum_j = Σ_{i=0}^{R−1} C_i,j. After the modular addition, a single column vector is generated, as shown in Table 3.
TABLE 3. Column vector resulting from the modular addition over each row

Sum_0
Sum_1
. . .
Sum_R−1
Sum_R
[0036] A modular reduction is performed on the last element of the column vector as Sum_R,j = Sum_R mod P_j, j ∈ [0, R−1]. The output is calculated according to Const_i = Sum_i − Sum_R,i. As can be seen, Sum_R must be available before any element of the output vector can be determined. The present process performs this calculation early so that it is available before the other elements Sum_m, where m ∈ [0, R−1].
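The Keyswitch multiplication, row summation, and early last-row reduction described above can be sketched numerically. In this illustrative sketch each polynomial is collapsed to a single integer so the data flow is visible, and the moduli P, key matrix K, and ciphertext C are arbitrary toy values, not parameters from the disclosure:

```python
# Toy sketch of the row-summation and RNS-floor steps. Polynomials are
# collapsed to single integers; all values are made up for illustration.
R = 3                                  # number of RNS components
P = [17, 19, 23, 29]                   # R + 1 moduli P_0 .. P_R (hypothetical)
C_in = [5, 7, 11]                      # input ciphertext vector, R components

# Expand the vector to an (R+1) x R matrix by modular reduction on each P_j.
# Row j corresponds to modulus P_j, column i to RNS component i.
C = [[C_in[i] % P[j] for i in range(R)] for j in range(R + 1)]

# Hypothetical Keyswitch key matrix with the same (R+1) x R shape.
K = [[3 * j + i + 1 for i in range(R)] for j in range(R + 1)]

# Pair-wise modular multiplication, then modular addition across each row.
Sum = [sum(C[j][i] * K[j][i] for i in range(R)) % P[j] for j in range(R + 1)]

# RNS floor: reduce the LAST row's sum modulo each remaining P_j, then
# subtract. The low-latency process computes Sum[R] first so this vector
# is ready while the other row sums are still being produced.
Sum_R = [Sum[R] % P[j] for j in range(R)]
Const = [(Sum[j] - Sum_R[j]) % P[j] for j in range(R)]
print(Const)                           # -> [8, 18, 2]
```

In hardware, the same dependency holds: every Const element needs Sum[R], which is why computing the last row first shortens the critical path.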
[0038] As part of the relinearization process, modular reduction is performed on a ciphertext vector of polynomials to generate a matrix of polynomials representing a ciphertext. The matrix of polynomials is multiplied by a pair of Keyswitch matrices to generate two intermediate matrices, which are processed in the same manner. The rows of the matrices are summed together to generate respective column vectors and a modular reduction is performed on the last element of each column vector, which is then subtracted from the remaining elements of the column vector. Broadly, this portion of the relinearization process comprises the Keyswitch operation followed by an RNS floor operation. The Keyswitch and RNS floor operations may be defined by:
[0039] Keyswitch:
c_0,i,j = NTT(INTT(c_2,j) mod p_i) ⊙ k_0,i,j,  i ∈ [0, R], j ∈ [0, R−1]  (1)

c_0,i ← Σ_{j=0}^{R−1} c_0,i,j  (2)
[0040] RNS Floor:
c_0,i′ = NTT(INTT(c_0,R) mod p_i)  (3)

out_0,i = (c_0,i − c_0,i′) · (p_R^{−1} mod p_i),  i ∈ [0, R−1]  (4)
[0041] Where:
NTT(.) represents a transformation from the Inverse Number Theoretic Transform (INTT) domain to the Number Theoretic Transform (NTT) domain;
INTT(.) represents a transformation from the NTT domain to the INTT domain;
mod p_i represents the modular reduction using modulus p_i;
c_2,j are the elements of an input vector C_2;
c_0,i,j are the elements of a first one of the two intermediate matrices;
k_0,i,j are the elements of a first one of the two Keyswitch matrices;
c_0,i are the elements of a first one of two intermediate summation vectors;
out_0,i are the elements of a first one of two output vectors; and
R is the number of RNS components of the input vector C_2.
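For concreteness, the NTT(.) and INTT(.) transformations defined above can be illustrated with a naive O(n²) number theoretic transform over a toy prime. The prime p = 17, length n = 4, and root w = 13 are illustrative values chosen for this sketch only; practical homomorphic-encryption implementations use a negacyclic NTT modulo x^N + 1 with far larger parameters:

```python
# Minimal numeric illustration of NTT(.) and INTT(.). All parameters are
# toy values: p = 17, n = 4, and w = 13 (a primitive n-th root of unity
# mod p, since 13^2 = -1 and 13^4 = 1 mod 17).
p, n, w = 17, 4, 13

def ntt(a):
    # Forward transform: evaluate the polynomial at the powers of w.
    return [sum(a[j] * pow(w, i * j, p) for j in range(n)) % p
            for i in range(n)]

def intt(A):
    # Inverse transform: powers of w^-1, scaled by n^-1 mod p.
    w_inv, n_inv = pow(w, -1, p), pow(n, -1, p)
    return [n_inv * sum(A[j] * pow(w_inv, i * j, p) for j in range(n)) % p
            for i in range(n)]

a = [1, 2, 3, 4]
assert intt(ntt(a)) == a       # the round trip recovers the coefficients

# Pointwise (⊙) multiplication in the NTT domain corresponds to cyclic
# convolution of the coefficient vectors, which is why the multiplications
# in equations (1) and (4) are performed in the NTT domain.
b = [1, 1, 0, 0]
prod = intt([x * y % p for x, y in zip(ntt(a), ntt(b))])
print(prod)                    # -> [5, 3, 5, 7], i.e. a * (1 + x) mod (x^4 - 1)
```

The three-argument `pow` with a negative exponent (modular inverse) requires Python 3.8 or later.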
[0042] The Keyswitch operation comprises a modular reduction (expanding the vector of polynomials to a matrix of polynomials), a modular multiplication (between the matrix of polynomials and the Keyswitch key polynomials), and a row summation. The RNS floor operation comprises a modular reduction on the last element of the column vector and a modular subtraction, followed by a modular multiplication.
[0043] As can be seen from equation (4), each element of the output vector requires element c_0,i′ to be available. From equation (3), the elements c_0,i′ are determined from c_0,R, the vector generated from the last element of the summation column vector. The low-latency process first performs an early calculation of this element so that it is available as soon as the remaining elements are determined. As such, the overall latency of the relinearization process is reduced.
[0044] As depicted in
[0046] The low-latency process performs an early calculation on the matrix 214a by summing the elements of the last row to generate a summation term 216a. The term is transformed to the INTT domain 218a, and modular reduction is performed to generate a vector 220a, which is converted back to the NTT domain 222a. By performing the early calculation on the last row of the matrix, instead of waiting to sum the row elements in order, the vector 222a, which is necessary to calculate the output, is available as soon as the other row summations are available.
[0047] The remaining rows, namely the first to second-last rows, are summed after the last row. The summation generates the elements of a summation column vector 224a. Once a summation element is determined, the subtraction 226a between the summation element and the vector from the last-row summation can be performed, and the result 228a multiplied by the inverse moduli 230 to provide the constant output term 230a. It will be appreciated that the processing may be performed in a pipeline so that processing can begin on a second polynomial before processing is complete on a first polynomial. For example, a first polynomial vector may be input and, as soon as the initial INTT is completed for the first input polynomial vector, a second input polynomial vector may be received and the INTT performed. While the second input polynomial vector is being processed by the INTT stage, the first input polynomial vector may be processed by the next pipeline stage, which may be the modular reduction and NTT transformation. Further, the above has described determining the constant-term vector 230a from the intermediate matrix 214a resulting from the first Keyswitch key 212a. The same process is applied to the matrix 214b to generate the linear-term vector 230b. The constant-term and linear-term vectors 230a, 230b are provided as the output from the low-latency relinearization process.
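The stage overlap described above can be sketched with a toy schedule. The six stage labels and the one-interval stage delay below are assumptions for illustration, not figures from the disclosure:

```python
# Toy schedule illustrating pipelining: once input 1 leaves the first
# stage, input 2 can enter it, so the stages overlap across inputs.
# Stage labels and the uniform one-interval delay are illustrative.
stages = ["INTT", "NTT", "KeyMul", "INTT2", "NTT2", "Mul"]
n_inputs = 3

# finish[k][s]: interval at which input k leaves stage s. Each stage
# takes one interval and can hold only one input at a time.
finish = [[0] * len(stages) for _ in range(n_inputs)]
for k in range(n_inputs):
    for s in range(len(stages)):
        ready = finish[k][s - 1] if s else 0   # wait for own previous stage
        free = finish[k - 1][s] if k else 0    # wait for the stage to free up
        finish[k][s] = max(ready, free) + 1

total = finish[-1][-1]
print(total)   # -> 8 intervals (stages + inputs - 1), not 6 * 3 = 18
```

The pipelined total of stages + inputs − 1 intervals, versus stages × inputs when each input is processed to completion before the next starts, is the throughput benefit the paragraph above describes.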
[0049] A difference between the two calculation processes is when the RNS-floor process can be started. In the sheet-based calculation, the RNS-floor can be started only after all sheets, or columns, of the matrix are processed. Additionally, for the sheet-based calculation, the subtraction c_0,i − c_0,i′ has to wait until the INTT(c_0,R) operation is completed. In contrast, the slice-based calculation avoids this waiting time by dealing with "slice R", that is, calculating the summation of the last row of the matrix, first, which enables the INTT(c_0,R) operation to start while the summation of the other slices proceeds at the same time. While this approach reduces the latency, the number of polynomials inside each single slice is one less than that in each sheet; that is, the number of columns in the matrix is one less than the number of rows. Hence, the throughput of the slice-based early calculation is slightly reduced.
[0050] The sheet-based ordered calculation and slice-based early calculation approaches lead to two different pipeline designs as depicted in
[0051] In the sheet-based design, the INTT module in "Stage 1" transforms a single RNS component c_2,j in a given time interval; hence, R time intervals are required to complete the operation. Following the INTT module, R NTT modules are used at "Stage 2" since each sheet has R+1 polynomials, but only R polynomials need to be transformed back to the NTT domain because NTT(INTT(c_2,j) mod p_i) = c_2,j when i == j. However, this optimization introduces data dependency and control-logic complexity. A further R+1 multiplication modules are deployed at "Stage 3" to perform the multiplication with the Keyswitch key, with each multiplication module handling the multiplication with both Keyswitch keys. After completing all the sheets, the INTT module in "Stage 4" calculates INTT(c_0,R) and INTT(c_1,R). Another two NTT modules in "Stage 5" and multiplication modules in "Stage 6" complete the RNS-floor operation. The latency of this pipeline design is 2R+4 and the throughput is 1.
[0052] In the slice-based design, L INTT modules are required for "Stage 1" since a slice contains all the RNS components of the ciphertext. However, since the INTT operation over each RNS component only needs to be done once, the throughput of each INTT module is reduced to 1/(R+1) so that it matches the throughput of the NTT modules in "Stage 2". R NTT modules are used since each slice contains R polynomials. The optimization provided by NTT(INTT(c_2,j) mod p_i) = c_2,j when i == j was not used, in order to avoid the data dependency and control-logic complexity. The NTT modules take R+1 time intervals since there are R+1 slices. Once converted to the NTT domain, the ciphertext matrix and the Keyswitch matrices are multiplied by the "Stage 3" multiplication modules. After the first slice is processed, the two INTT modules in "Stage 4" conduct the INTT(c_0,R) and INTT(c_1,R) operations. "Stage 5" and "Stage 6" are the same as described above. Note that during the remaining R time intervals, the INTT modules are not processing any valid data until the next relinearization. The throughput of this design is slightly reduced to R/(R+1); however, the latency is significantly reduced to R+6.
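The trade-off between the two designs follows directly from the figures given above (sheet-based: latency 2R+4, throughput 1; slice-based: latency R+6, throughput R/(R+1)). The helper functions below simply evaluate those expressions for a few values of R:

```python
# Evaluate the latency/throughput figures stated in the text for the two
# pipeline designs, for a few illustrative RNS component counts R.
def sheet_based(R):
    # Sheet-based ordered calculation: latency 2R + 4, throughput 1.
    return {"latency": 2 * R + 4, "throughput": 1.0}

def slice_based(R):
    # Slice-based early calculation: latency R + 6, throughput R / (R + 1).
    return {"latency": R + 6, "throughput": R / (R + 1)}

for R in (4, 8, 16):
    s, e = sheet_based(R), slice_based(R)
    print(f"R={R}: sheet latency {s['latency']}, "
          f"slice latency {e['latency']}, "
          f"slice throughput {e['throughput']:.3f}")
```

For R = 8, for example, the slice-based design drops the latency from 20 to 14 intervals while giving up only 1/9 of the throughput.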
[0054] The first level 404a of the FPGA cluster acts as an input layer that receives the ciphertext vector from the computing device 402. The first level of the FPGA cluster may be provided by a single FPGA device 406 that is configured to provide INTT functionality 408 of Stage 1 described with reference to
[0055] The second level 404b of the FPGA cluster comprises a plurality of FPGA devices 414a, 414b. Although only two FPGAs are depicted in
[0056] The third level 404c of the cluster comprises a plurality of FPGA devices 420a, 420b. Although only two FPGAs are depicted in
[0057] The fourth level 404d of the cluster comprises a single FPGA device 426, although multiple FPGAs could be used. The FPGA 426 of the fourth level 404d may act as an output layer and communicates the relinearization result, namely the constant term and the linear term, to the computing device 402. Although a single FPGA device is depicted, multiple FPGA devices could be provided in the fourth level 404d, for example one for computing the constant term and one for computing the linear term. The FPGA device 426 comprises RNS Floor functionality 428 for performing the RNS floor operation performed by Stages 4-6 described with reference to
[0058] It will be appreciated that not all of the functionality provided by FPGA devices is depicted in
[0059] It will be appreciated by one of ordinary skill in the art that the system and components shown in
[0060] Numerous additional variations on the methods and apparatus of the various embodiments described above will be apparent to those skilled in the art in view of the above description. Such variations are to be considered within the scope of the disclosure.