SYSTEM AND METHOD FOR PRIVACY-PRESERVING DISTRIBUTED TRAINING OF NEURAL NETWORK MODELS ON DISTRIBUTED DATASETS
20230325529 · 2023-10-12
Inventors
- Sinem Sav (Lausanne, CH)
- Juan Ramon Troncoso-Pastoriza (Lausanne, CH)
- Apostolos Pyrgelis (Bretigny-Sur-Morrens, CH)
- David Froelicher (Lausanne, CH)
- Joao Gomes De Sa E Sousa (Renens, CH)
- Jean-Philippe Bossuat (Lausanne, CH)
- Jean-Pierre Hubaux (ST-Sulpice VD, CH)
Cpc classification
H04L9/3026
ELECTRICITY
H04L9/0819
ELECTRICITY
H04L9/0891
ELECTRICITY
International classification
G06F21/62
PHYSICS
H04L9/00
ELECTRICITY
Abstract
A computer-implemented method and a distributed computer system (100) for privacy-preserving distributed training of a global neural network model on distributed datasets (DS1 to DSn). The system has a plurality of data providers (DP1 to DPn) being communicatively coupled. Each data provider has a respective local training dataset (DS1 to DSn) and a vector of output labels (OL1 to OLn) for training the global model. Further, it has a portion of a cryptographic distributed secret key (SK1 to SKn) and a corresponding collective cryptographic public key (CPK) of a multiparty fully homomorphic encryption scheme, with the weights of the global model being encrypted with the collective public key. Each data provider (DP1) computes and aggregates, for each layer of the global model, encrypted local gradients (LG1) using the respective local training dataset (DS1) and output labels (OL1), with forward pass and backpropagation using stochastic gradient descent. At least one data provider homomorphically combines at least a subset of the current local gradients of at least a subset of the data providers into combined local gradients, and updates the weights of the current global model (GM) based on the combined local gradients.
Claims
1. A computer-implemented method for privacy-preserving distributed training of a neural network model—referred to as global model—on a plurality of local datasets, wherein each data provider of a communicatively coupled plurality of data providers holds its own local dataset, each data provider further having: a vector of output labels, and a portion of a cryptographic distributed secret key and a corresponding collective cryptographic public key of a multiparty fully homomorphic encryption scheme, wherein the cryptographic keys are collectively initialized by the plurality of data providers, and the local datasets and output labels of all data provider systems adhere to the same structure and same data formats and use the same training parameters, and wherein the plurality of all local datasets represents a global training dataset for the global model, with the weights of the global model being initialized, and being encrypted with the collective public key; the method comprising: for a predefined maximum number of global iterations: each data provider of at least a receiver-subset of the plurality of data providers, receiving the current encrypted weights of the global model, and computing and aggregating for each layer of the global model, encrypted local gradients of a respective cost function with respect to the weights by processing each sample of a respective local batch of the local dataset with respective output labels, with forward pass and backpropagation using stochastic gradient descent, wherein the local batch size is the number of data samples processed at each iteration, and providing the current encrypted aggregated local gradients to at least a combiner-subset of the data providers; homomorphically combining at least a subset of the current encrypted aggregated local gradients by at least one of the combiner-subset of data providers into combined aggregated gradients; and updating, by the at least one data provider, the weights of the global model by using averaged combined aggregated gradients wherein averaging is performed with respect to the global batch size being the dot product between the local batch size and the number of data providers.
2. The method of claim 1, further comprising: a particular data provider of the plurality of data providers receiving, from a querying entity, input data encrypted with the collective public key, and receiving a destination public key of a destination entity, the destination entity being the recipient of a prediction result to be provided in response to the input data; in response to the input data, obtaining one or more corresponding encrypted prediction values by applying the encrypted global model to the encrypted input data; and switching the one or more encrypted prediction values to the destination public key, and providing the switched one or more encrypted prediction values to the querying entity so that only the destination entity can decrypt the one or more encrypted prediction values.
3. The method of claim 1, further comprising: switching the updated global model to one or more destination public keys of one or more destination entities, and providing the one or more switched global models to respective destination entities for obtaining the corresponding decrypted global model.
4. The method of claim 1, wherein at least a subset of the local datasets is encrypted with the collective public key, and wherein input data samples and output labels of at least a subset of the data providers are encrypted with the collective public key.
5. The method of claim 1, wherein the method is operated in a synchronous mode with the receiver-subset and the combiner-subset corresponding to the plurality of data providers and all data providers compute their local gradients in a single global iteration loop.
6. The method of claim 1, wherein the method is operated in an asynchronous mode with the receiver-subset corresponding to only a portion of the plurality of data providers which compute their local gradients in a first global iteration loop, and one or more further receiver-subsets compute their local gradients in one or more further global iteration loops until all data providers have contributed their local gradient computations.
7. The method of claim 1, wherein the computation of the forward pass and backpropagation uses a crypto-scheme with an alternating packing strategy by iteratively processing each data sample of a batch, wherein alternating packing combines row-based and column-based packing, by vectorizing rows or columns of the weight matrix and packing the vectorized rows or columns into at least one ciphertext.
8. The method of claim 7, wherein the weight matrix of every fully connected layer in the neural network model is packed following the opposite approach from that used to pack the weights of the previous layer in that the rows or columns packed into at least one ciphertext are aligned with the rows or columns of the following layer for the next layer multiplications in the forward pass and for the alignment of multiplication operations in the backpropagation.
9. The method of claim 1, wherein the multiparty fully homomorphic encryption scheme is a quantum-resistant cryptographic scheme relying on parallel computations, single-instruction-multiple-data operations, and optimized polynomial approximations of the activation functions of the machine learning models.
10. The method of claim 1, wherein homomorphically combining the current aggregated local gradients is performed ascending a tree structure, such that each data provider combines its current local gradients with the current local gradients of its children, and sends the combined result to its parent so that the data provider at the root of the tree structure—the root data provider—combines the combined results and updates the weights of the with the combined local gradients.
11. A distributed computer system for privacy-preserving distributed training of a neural network model—referred to as global model—on a plurality of local datasets, the system comprising: a plurality of data providers being communicatively coupled such that each data provider can exchange information with any other data provider, wherein each data provider holds its own local dataset and a vector of output labels, for training the neural network model using an iterative training algorithm, and has a portion of a cryptographic distributed secret key and a corresponding collective cryptographic public key of a multiparty fully homomorphic encryption scheme, the cryptographic keys being collectively initialized by the plurality of data providers, and the local training datasets of all data provider systems adhering to the same structure and same data formats and using the same training parameters, and the plurality of all local training datasets representing a global training dataset for the global model, with the weights of the global model being initialized and being encrypted with the collective public key; wherein each data provider of at least a receiver-subset of the plurality of data providers is configured to receive the current encrypted weights of the global model, and to compute and aggregate, for each layer of the global model, encrypted local gradients of a respective cost function with respect to the weights by processing each sample of a respective local batch of the local dataset with respective output labels, with forward pass and backpropagation using stochastic gradient descent, wherein the local batch size is the number of data samples processed at each iteration, and to provide the current encrypted aggregated local gradients to at least a combiner-subset of the data providers; and wherein at least one data provider of the combiner-subset is configured to homomorphically combine at least a subset of the current encrypted aggregated local gradients by at least one of the subset of data providers into combined aggregated gradients, and to update the weights of the global model by using averaged combined aggregated gradients wherein averaging is performed with respect to the global batch size being the dot product between the local batch size and the number of data providers.
12. The system of claim 11, wherein a particular data provider of the plurality of data providers is configured: to receive, from a querying entity, input data encrypted with the collective public key, and to receive a destination public key of a destination entity, the destination entity being the recipient of a prediction result to be provided in response to the input data; and in response to the input data, to obtain one or more corresponding encrypted prediction values by applying the encrypted global model to the encrypted input data; and to switch the one or more encrypted prediction values to the destination public key, and to provide the switched one or more encrypted prediction values to the querying entity so that only the destination entity can decrypt the one or more encrypted prediction values.
13. The system of claim 11, wherein: at least a subset of the plurality of data providers is configured: to collectively switch the updated global model to one or more destination public keys of one or more destination entities, and to provide the one or more switched global models to respective destination entities for obtaining the corresponding decrypted global model.
14. The system of claim 11, wherein the computation of the forward pass and backpropagation uses a crypto-scheme with an alternating packing strategy by iteratively processing each data sample of a batch, wherein alternating packing combines row-based and column-based packing, by vectorizing rows or columns of the weight matrix and packing the vectorized rows or columns into at least one ciphertext.
15. The system of claim 14, wherein the weight matrix of every fully connected layer in the neural network model is packed following the opposite approach from that used to pack the weights of the previous layer in that the rows or columns packed into at least one ciphertext are aligned with the rows or columns of the following layer for the next layer multiplications in the forward pass and for the alignment of multiplication operations in the backpropagation.
16. The system of claim 11, wherein the multiparty fully homomorphic encryption scheme is a quantum-resistant cryptographic scheme relying on parallel computations, single-instruction-multiple-data operations, and optimized polynomial approximations of the activation functions of the machine learning models.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0054]
[0055]
[0056]
[0057]
[0058]
[0059]
[0060]
[0061]
[0062]
[0063]
DETAILED DESCRIPTION
[0064]
[0065] The plurality of data providers DP1 to DPn is communicatively coupled such that each data provider can exchange information with any other data provider. In the example, the dashed lines represent physical communication channels between pairs of data providers. In the example, not every data provider is coupled with every other data provider. For example, DP1 is coupled with DP5 but has no direct channel with DP2. However, DP1 can exchange information with DP2 via DP5. The dash-dotted lines represent a tree structure which can be formed when all data providers are connected with each other via respective communication channels for efficiently answering queries, which will be explained later. The double-arrows between the general model GM and each data provider symbolize that each data provider receives 1210 the current encrypted weights of the general model, or can at least access the current general model GM at any time to retrieve such weights.
[0066] In the example, only DP1 and DPn are shown with their internal structure. However, the remaining data providers DP2 to DP6 all have the corresponding internal structure. Each DP* (“*” is used as a placeholder for indices 1 to n) has a respective local training dataset DS* (DS1 to DSn) to contribute to the training of the global model GM. An iterative training algorithm IA is used to compute and aggregate 1220, for each layer of the global model GM, encrypted, local gradients LG* of a respective cost function with respect to the weights by processing each sample of a respective local batch of the corresponding local dataset DS* with respective output labels OL*. The data provider systems DP*and their local training datasets DS* and output labels OL* adhere to the same structure and data formats. Further, the same training parameters are used by all data providers for local gradient computation. The plurality of all local training datasets thereby represents a global training dataset for the global model GM.
[0067] Each data provider has a portion of a cryptographic distributed secret key SK1 to SKn and a corresponding collective cryptographic public key CPK of a multiparty fully homomorphic encryption scheme. The cryptographic keys SK1 to SKn, CPK are collectively initialized 1100 by the plurality of data providers DS1 to DSn. The initialization step 1100 is only performed once and is therefore executed before the actual training of the global model starts. Further, the weights of the global model are initialized and encrypted with the collective public key CPK in this initialization step. This can be done by one of the data providers.
[0068] It is assumed that the data providers are willing to contribute their respective data to train the global model by providing their respective local gradients. It is further assumed that the data providers are all communicatively coupled and organized in a topology that enables efficient execution of the computations. For instance, as depicted in
[0069] Although the data providers wish to collaborate for the execution of machine learning workflows, they do not necessarily trust each other. As a result, they seek to protect the secrecy of their data (used for training and evaluation) and of the collectively learned global model. More formally, the following security properties must hold in a passive-adversary model, where at least one data provider does not collude with the others.: [0070] (a) Data Confidentiality: The training data of each data provider DP.sub.i, i.e., (X.sub.i,y.sub.i) and the querier's evaluation data 160 (X′,⋅) should remain only known to their respective owners. To this end, data confidentiality is satisfied as long as the involved parties (data providers and querying entity) do not obtain any information about other parties' inputs other than what can be deduced from the output of the process of training or evaluating a model. [0071] (b) Model Confidentiality: During the training process, no data provider DP, should gain more information about the model that is being trained than what it can learn from its own input data, i.e., (X.sub.i,y.sub.i). During prediction, the querier should not learn anything more about the model than what it can infer from its input data (X′,⋅) and the corresponding predictions y′.
[0072] The data providers (DPs), each of which owns a part of the global training dataset, locally compute and aggregate 1220 encrypted local gradients LG* which are then homomorphically combined 1300 as illustrated in
[0073]
[0074] Mouchet et al. propose a multiparty version of the Brakerski Fan-Vercauteren (BFV) lattice-based homomorphic cryptosystem and introduce interactive protocols for key generation, decryption, and bootstrapping. In one embodiment, the claimed approach uses an adaptation of this multiparty scheme to the Cheon-Kim-Kim-Song cryptosystem (CKKS) [cf. J. H. Cheon, A. Kim, M. Kim, and Y. Song. Homomorphic encryption for arithmetic of approximate numbers. In Springer International Conference on the Theory and Application of Cryptology and Information Security (ASIACRYPT), 2017] that enables approximate arithmetic, and whose security is based on the ring learning with errors (RLWE) problem [cf. V. Lyubashevsky, C. Peikert, and O. Regev. On ideal lattices and learning with errors over rings. In Springer Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), 2010.]. In the following, the main parameters of the CKKS cryptographic scheme and the multiparty cryptographic operations are described as used in one embodiment. The cited work of Cheon et al. describes the details of the CKKS cryptoscheme and the work of Mouchet et al. provides the complete definition and security of the distributed protocols.
[0075] The CKKS cryptoscheme which is used by the herein disclosed protocols is briefly described in the following. The cyclotomic polynomial ring of dimension N, where N is a power-of-two integer, defines the plaintext and ciphertext space as R.sub.Q.sub.[X]/(X.sup.N+1), with Q.sub.L=Π.sub.0.sup.Lq.sub.i in our case. Each q.sub.i is a unique prime, and Q.sub.L is the ciphertext modulus at an initial level L. Note that a plaintext encodes a vector of up to N/2 values. Below, we introduce the main functions that we use in our system. We denote by c=(c.sub.0,c.sub.1)∈
.sub.Q.sub.
and scale S.sub.p, returns the decoding of p. DDecrypt(c,sk.sub.i): For
and scale S.sub.c, returns the plaintext p
with scale S.sub.c. Enc(pk,
[0076] Turning briefly to
[0077] To summarize the proposed solution to the problem of privacy-preserving distributed learning for a global neural network model, the MapReduce abstraction is used to capture the parallel and repetitive nature of distributed learning tasks. The workflow of the used extended MapReduce abstraction includes the following four phases: the data providers pre-process their data (PREPARE) before they iteratively compute local gradients based on their local training datasets (MAP). Subsequently, they combine the local gradients (COMBINE) and update the global model (REDUCE). The four phases in one embodiment of the extended MapReduce abstraction are illustrated in protocol 1 of
[0078] The PREPARE phase includes lines 1 to 5 of protocol 1 (Collective Training). The data providers P.sub.i collectively agree on the training parameters: the learning parameters: the number of hidden layers (l), the number of neurons (h.sub.j) in each layer j, j 1, 2, . . . , l, the learning rate (η), the number of global iterations (m), the activation functions to be used in each layer (φ( )) and their approximations, and the local batch size (b). Then, the parties generate their secret keys sk.sub.i and collectively generate the public key pk. Subsequently, they optionally perform a collective normalization or standardization on their input data with a secure aggregation protocol described in “D. Froelicher, J. R. Troncoso-Pastoriza, J. S. Sousa, and J. Hubaux. Drynx: Decentralized, secure, verifiable system for statistical queries and machine learning on distributed datasets. IEEE Transactions on Information Forensics and Security, 15:3035-3050, 2020.” Each P.sub.i encodes (packs) its input data samples X.sub.i and output labels y.sub.i as
[0079] The DPs also collectively initialize the cryptographic keys for the distributed CKKS scheme by executing DKeyGen(⋅). Then, they collectively standardize, or normalize, the distributed dataset by obliviously computing the required statistics. The interactive protocols, i.e., Collective Aggregation and Collective Differential Privacy (CDP), and encodings by Froelicher et al. can be adapted to the herein disclosed approach. These encodings define how the operations are locally executed by the DPs on their part of the dataset, so that the encrypted results can be collectively aggregated, obfuscated and combined to obtain the approximate statistics as if it was computed on the whole dataset. For instance, to standardize the dataset, each DP, for each feature, locally computes the sum of its value, the sum of its squared values, and the number of samples. These values are encrypted and then aggregated among all DPs. Then, the DPs collectively and obliviously add randomly sampled noise from a deviation.
[0080] MAP, COMBINE, and REDUCE are repeated m times. During iteration k, in MAP (lines 7 to 10 of protocol 1), each P.sub.i receives the current weights W.sub.1,⋅.sup.k to W.sub.l,⋅.sup.k and performs the local gradient descent computation. In COMBINE (lines 11 to 12 of protocol 1), the local gradient contributions of each P.sub.i, are securely combined. In REDUCE (lines 13 to 15 of protocol 1), P.sub.1 updates the weights of the global model.
[0081] The local gradient descent computation (LGD) in the MAP phase (lines 9 and 10 of protocol 1) is illustrated in more detail in the example protocol 2 of
[0082] In the example protocol 1, in the COMBINE phase (lines 11 and 12 of protocol 1), each party communicates its encrypted local gradients to their parent, and each parent homomorphically sums the received gradients with their own ones. At the end of this phase, P.sub.1 receives the globally aggregated gradients.
[0083] In the example protocol 1, in the REDUCE phase (lines 13 to 15 of protocol 1), P.sub.1 updates the global model weights by using the averaged aggregated gradients. The averaging is done with respect to the global batch size B=b×N.
[0084] Training Termination: the learning process may be stopped after a predefined number of epochs. Other well-known methods may be used instead as discussed further down below.
[0085] At the end of the training phase, the model is kept in an encrypted form such that no individual party or the querier can access the model weights. To enable oblivious inference, the querier encrypts its evaluation data X.sub.q with the parties' collective key. An oblivious inference is equivalent to one forward pass (see Protocol 2), except that the first plaintext multiplication (Mul.sub.pt( ) of L.sub.0 with the first layer weights is substituted with a ciphertext multiplication (Mul.sub.ct( ). At the end of the forward pass, the parties collectively re-encrypt the result with the querier's public key by using the key-switch functionality of the underlying MHE scheme. Thus, only the querier is able to decrypt the prediction results. Note that any party P.sub.i can perform the oblivious inference step, but the collaboration between all the parties is required to perform the distributed bootstrap and key-switch functionalities.
[0086] Example protocol 3 of
[0087] For the efficient computation of the forward pass and backpropagation described in Protocol 2, the packing capabilities of the crypto-scheme enable Single Instruction, Multiple Data (SIMD) operations on ciphertexts. Packing enables coding a vector of values in a ciphertext and to parallelize the computations across its different slots, thus significantly improving the overall performance. Existing packing strategies that are commonly used for machine learning operations on encrypted data (e.g., the row-based or diagonal packing), require a high number of rotations for the execution of the matrix-matrix multiplications and matrix transpose operations, performed during the forward and backward pass of the local gradient descent computation (see Protocol 2).
[0088] The number of rotations has a significant effect on the overall training time of a neural network on encrypted data, as they require costly key-switch operations as discussed in the complexity analysis further down below. As an example, the diagonal approach scales linearly with the size of the weight matrices, when it is used for batch-learning of neural networks, due to the matrix transpose operations in the backpropagation.
[0089] Herein, a different packing approach is used which processes each batch sample one by one, making the execution embarrassingly parallelizable (“embarrassingly parallel problem” is also referred to as perfectly parallel, delightfully parallel or pleasingly parallel problem in literature). This allows to optimize the number of rotations, to eliminate the transpose operation applied to matrices in the backpropagation, and to scale logarithmically with the dimension and number of neurons in each layer. In other words, the herein disclosed alternating packing approach/strategy enables: [0090] a) better parallelization, [0091] b) reduction in the number of used rotations (which are the bottleneck for complexity), [0092] c) avoiding transpose operations in backpropagation (computationally complex operations), and [0093] d) better scaling with the dimension and number of layers (logarithmic vs linear).
[0094] The proposed alternating packing (AP) approach combines row-based and column-based packing, i.e., rows or columns of the matrix are vectorized and packed into one ciphertext. In particular, the weight matrix of every fully connected (FC) layer in the neural network is packed following the opposite approach from that used to pack the weights of the previous layer. With the AP approach, the number of rotations scales logarithmically with the dimension of the matrices, i.e., the number of features (d), and the number of hidden neurons in each layer (h.sub.i). For this, the matrices are padded with zeros to get power-of-two dimensions. In addition, the AP approach reduces the computational cost of transforming the packing between two consecutive layers. Protocol 3 depicted in
[0095] In steps 2-6 of Protocol 3, each party P.sub.i prepares the inputs X.sub.i by replicating and encoding them in vectors. For networks with an odd number of layers, in steps 7-10, each party P.sub.i prepares the labels y.sub.i by introducing a gap of zeros between values. In step 11, each party P.sub.i encodes the labels. “gap” denotes a vector of zeros. “|.|” denotes the size of a vector or the number of rows of a matrix. Replicate(v,k,gap) returns a vector that replicates v, k times with a gap in between each replica. Flatten(W, gap, dim), flattens the rows or columns of a matrix W into a vector and introduces a gap in between each row/column. If a vector is given as input to this function, it places gap in between all of its indices. The argument dim indicates flattening of rows (‘r’) or columns (‘c’), and dim=‘.’ for the case of vector inputs.
[0096] The weight initialization (steps 13 to 28 of Protocol 3) is performed by P.sub.1 by doing row-packing for even layers (steps 16 to 20) and column-packing for odd layers (steps 22 to 26). That is, the rows (or columns) packed into one ciphertext are aligned with the rows (or columns) of the following layer for the next layer multiplications in the forward pass and for the alignment of multiplication operations in the backpropagation, as depicted in the table illustrated by
[0097] Convolutional Layer Packing.
[0098] To optimize the SIMD operations for convolutional (CV) layers, we the nth input sample X.sub.i[n] is decomposed into t smaller matrices that are going to be convoluted with the weight matrix. These decomposed flattened matrices are packed into one ciphertext, with a gap in between each matrix that is defined with respect to the number of neurons in the next layer, similarly to the AP approach. The weight matrix is then replicated t times with the same gap between each replica.
[0099] Protocol 5 of
[0100] Downsampling (Pooling) Layers.
[0101] As there is no weight matrix for downsampling layers, they are not included in the offline packing phase.
[0102] Approximated Activation Functions.
[0103] For the encrypted evaluation of non-linear activation functions, such as Sigmoid or Softmax, least-squares approximations may be used. Further, the optimized polynomial evaluation (cf. D. Froelicher, J. R. Troncoso-Pastoriza, A. Pyrgelis, S. Say, J. S. Sousa, J.-P. Bossuat, and J.-P. Hubaux. Scalable privacy-preserving distributed learning, 2020) that consumes log(d.sub.a+1) levels for an approximation degree d.sub.a. For the piece-wise function ReLU, we approximate the smooth approximation of ReLU, softplus (SmoothReLU), φ(x)=ln(1+e.sup.x) with least-squares. Lastly, derivatives of the approximated functions can be used.
[0104] To achieve better approximation with the lowest possible degree, two approaches may be applied to keep the input range of the activation function as small as possible, by using (i) different weight initialization techniques for different layers (i.e., Xavier or He initialization), and (ii) collective normalization of the data by sharing and collectively aggregating statistics on each party's local data in a privacy-preserving way.
[0105] For the piece-wise function ReLU, the following alternatives may be used: [0106] (i) approximation of square-root for the evaluation of φ(x)=0.5(b+√{square root over (b.sup.2)}) that is equivalent to ReLU, and [0107] (ii) (ii) approximating the smooth approximation of ReLU (SmoothReLU), or softplus, φ(x)=ln(1+e.sup.x), both with least-squares.
The latter achieves a better approximation for a degree d.sub.a=3, whereas the former approximates better the exact ReLU if one increases the multiplicative depth by 1 and uses d.sub.a=7. In our evaluations, we use SmoothReLU for efficiency.
[0108] We note that the derivative of softplus is a sigmoid function, and we evaluate the approximated sigmoid as the derivative, as this achieves better accuracy. Finally, functions can also be approximated by using Chebyshev interpolants, implementing a level- and product-efficient algorithm to evaluate polynomials in standard or Chebyshev basis. The least-squares is the optimal solution for minimizing the squared error over an interval, whereas Chebyshev asymptotically minimizes the maximum error. Hence, Chebyshev is more appropriate for keeping the error bounded throughout the whole interval, but requires a larger degree for a high accuracy approximation.
[0109] Finally, the interval and the degree of the approximations are chosen based on the heuristics on the data distribution in a privacy-preserving way, as described in “E. Hesamifard, H. Takabi, M. Ghasemi, and R. Wright. Privacy-preserving machine learning as a service. Proceedings on Privacy Enhancing Technologies, 2018:123-142, 06 2018.”
[0110] Cryptographic Building Blocks.
[0111] In the following, each cryptographic function used in an example embodiment to enable the privacy-preserving training of NNs with N parties is described. Further, the optimizations employed to avoid costly transpose operations in the encrypted domain is discussed.
[0112] Rotations.
[0113] When relying on packing capabilities, computation of the inner-sum of vector-matrix multiplications and transpose operation implies a restructuring of the vectors, that can only be achieved by applying slot rotations. In the herein disclosed embodiments, two types of rotation functions are used: (i) Rotate For Inner Sum (RIS(c,p,s)) is used to compute the inner-sum of a packed vector c by homomorphically rotating it to the left with RotL(c,p) and by adding it to itself iteratively log 2(s) times, and (ii) Rotate For Replication (RR(c,p,s)) replicates the values in the slots of a ciphertext by rotating the ciphertext to the right with RotR(c,p) and by adding to itself, iteratively log 2(s) times. For both functions, p is multiplied by two at each iteration, thus both yield log 2(s) rotations. As rotations are costly cryptographic functions, and the matrix operations required for NN training require a considerable amount of rotations, the number of executed rotations can be minimized by leveraging a modified bootstrapping operation, that automatically performs some of the required rotations.
[0114] Distributed Bootstrapping with Arbitrary Linear Transformations.
[0115] To execute the high-depth homomorphic operations required for training NNs, bootstrapping is required several times to refresh a ciphertext, depending on the initial level L.
[0116] In one embodiment, a distributed version of bootstrapping is used (cf. C. Mouchet, J. R. Troncoso-Pastoriza, J.-P. Bossuat, and J. P. Hubaux. Multiparty homomorphic encryption: From theory to practice. In Technical Report https://eprint.iacr.org/2020/304, 2019), as it is several orders of magnitude more efficient than the traditional centralized bootstrapping. It is however modified, to leverage on the interaction to automatically perform some of the rotations, or pooling operations, embedded as transforms in the bootstrapping.
[0117] Mouchet et al. replace the expensive bootstrap circuit by a one-round protocol where the parties collectively switch a Brakerski/Fan-Vercauteren (BFV) ciphertext [cf. J. Fan and F. Vercauteren. Somewhat practical fully homomorphic encryption. Cryptology ePrint Archive, Report 2012/144, 2012. https://eprint.iacr.org/2012/144] to secret-shares in .sub.t.sup.N. Since the BFV encoding and decoding algorithms are linear transformations, they can be performed without interaction on a secret-shared plaintext. Despite its properties, the protocol that Mouchet et al. propose for the BFV scheme cannot be directly applied to CKKS, as CKKS is a leveled scheme. The re-encryption process extends the residue number system (RNS) basis from Q to Q.sub.L. Modular reduction of the masks in Q will result in an incorrect encryption. A solution to this limitation is to collectively switch the ciphertext to a secret-shared plaintext with statistical indistinguishability.
[0118]
[0119] Optimization of the Vector-Transpose Matrix Product
[0120] The backpropagation step of the local gradient computation at each party requires several multiplications of a vector (or matrix) with the transposed vector (or matrix) (see Lines 11-13 of Protocol 2). The naïve multiplication of a vector v with a transposed weight matrix W.sup.T that is fully packed in one ciphertext, requires converting W of size g×k, from column-packed to row-packed. This is equivalent to applying a permutation of the plaintext slots, that can be expressed with a plaintext matrix W.sub.gk×gk and homomorphically computed by doing a matrix-vector multiplication. As a result, a naïve multiplication requires √{square root over (g×k)} rotations followed by log 2(k) rotations to obtain the inner sum from the matrix-vector multiplication. Various implementations may be used to reduce the number of rotations when computing the multiplication of a packed matrix (to be transposed) and a vector:
[0121] In one implementation, for the mini-batch gradient descent, no operations are performed on the batch matrix. Instead, each batch sample is processed in parallel, because having separate vectors (instead of a matrix that is packed into one ciphertext) enables to reorder them at a lower cost. This approach translates I matrix transpose operations to be transposes in vectors (the transpose of the vectors representing each layer activations in the backpropagation, see Line-13, Protocol 2).
[0122] Instead of taking the transpose of the weight matrix, one replicates the values in the vector that will be multiplied with the transposed matrix (for the operation in Line-11, Protocol 2), leveraging the gaps between slots with the AP approach. That is, for a vector v of size k and the column-packed matrix W of size g k, v has the form [a, 0, 0, 0 . . . , b, 0, 0, 0, . . . , c, 0, 0, 0, . . . ] with at least k zeros in between the non-zero a, b, c, . . . values (due to Protocol 3). Hence, any resulting ciphertext requiring the transpose of the matrix that will be subsequently multiplied, will also include gaps in between values. We apply RR(v,1,k) that consumes log 2(k) rotations to generate [a, a, a, . . . 0 . . . , b, b, b, . . . , 0 . . . , c, c, c, . . . , 0, . . . ]. Finally, we compute the product P=Mul.sub.ct(v,W) and apply RIS(P,1,g) to get the inner sum with log 2(g) rotations. Further, the performance is optimized by using DBootstrapALT( )(Protocol 4): If the ciphertext before the multiplication must be bootstrapped, we embed the log 2(k) rotations as a linear transformation performed during the bootstrapping.
[0123] Execution Pipeline
[0124] Table 601-602 (concatenation of the sub-tables 601, 602 in
[0125] In the forward pass (steps F1-F10 in sub-tables 601 and 602), each P.sub.i evaluates the two layers for their inputs X.sub.i, by performing ciphertext-plaintext vector-matrix products (in the first layer) and ciphertext-ciphertext vector-matrix products (in all other layers), rescalings and rotations, and evaluating the activation function at each layer. Steps F1-F5 correspond to the first layer, and steps F6-F10 correspond to the second layer. We introduce a distributed bootstrap (DBootstrap(.)) step in F9, right before the last activation function. For networks with more than two layers, this step can be introduced at several layers, depending on the used parameterization and complexity trade-off. In the backward pass (steps B1-B14 in sub-tables 601 and 602), each P.sub.i computes the gradient vectors for the coefficients of the two layers, by using the values computed in the forward pass and their prepared labels y.sub.i. This is done by performing ciphertext-ciphertext vector-matrix products, rescalings, rotations, and a final averaging. Steps B1-B5 correspond to the second (last) layer, and steps B6-B14 correspond to the first layer. We introduce a bootstrap operation at the end of the first layer (DBootstrapALT(.)), in B11. Each forward and backward pass on a layer in the pipeline consumes one Rotate For Inner Sum (RIS(⋅)) and one Rotate For Replication (RR(⋅)) operation, except for the last layer, as the labels are prepared according to the shape of the t-th layer output. In Table 601-602, it is assumed that the initial level L=7. When a bootstrapping function is followed by a masking (that is used to eliminate unnecessary values during multiplications) and/or several rotations, these operations are performed embedded as part of the distributed bootstrapping (DBootstrapALT( ) to minimize their computational cost. The steps B11, B12, and B13 are the operations embedded in the DBootstrapALT( ) The complexity of each cryptographic function is analyzed in the complexity analysis section further down.
[0126] Finally, steps U1-U3 in table 601-602 represent the REDUCE step performed at P.sub.1. The COMBINE step is omitted, as it is composed of just a homomorphic addition of the computed local gradients. After it, P.sub.1 rescales the gradients, adds them to the global model, and bootstraps the result, that is then used for the next iteration.
[0127] Convolutional Layers.
[0128] As the kernel is flattened, replicated, and packed in one ciphertext, a CV layer follows the exact same execution pipeline as an FC layer. However, the number of RIS(⋅) operations for a CV layer is smaller than for an FC layer. That is because the kernel size is usually smaller than the number of neurons in an FC layer. For a kernel of size h=f x f, the inner sum is calculated by ┌log.sub.2(f)┐ rotations. Note that when a CV layer is followed by an FC layer, the output of the i.sup.th CV layer (Li) already gives the flattened version of the matrix in one ciphertext. RR(L.sub.i,1,h.sub.i+1) is applied for the preparation of the next layer multiplication. When a CV layer is followed by a pooling layer, however, the RR(⋅) operation is not needed, as the pooling layer requires a new arrangement of the slots of Li. This costly operation can be avoided by passing Li to DBootstrapALT(.), and by embedding both the pooling and its derivative in DBootstrapALT(.).
[0129] Pooling Layers.
[0130] The system is evaluated based on average pooling as it is the most efficient type of pooling that can be evaluated under encryption. To do so, the modified collective bootstrapping is exploited to perform arbitrary linear transformations. Indeed, the average pooling is a linear function, and so is its derivative (note that this is not the case for the max pooling). Therefore, in the case of a CV layer followed by a pooling layer, DBootstrapALT(⋅) is applied and is used to rearrange the slots and to compute the convolution of the average pooling in the forward pass and its derivative used later in the backward pass. For a h=f×f kernel size, this saves ┌log.sub.2(h)┐ rotations and additions (RIS(⋅)) and one level if masking is needed. For max/min pooling, which are non-linear functions, evaluating these functions by using encrypted arithmetic remains unpractical due to the need of high-precision approximations.
[0131] Complexity Analysis
[0132] Table 700 of
[0133] N, α, L, L.sub.c, d.sub.a stand for the cyclotomic ring size, the number of secondary moduli used during the key-switching, maximum level, current level, and the approximation degree, respectively.
[0134] The communication complexity of the system depends solely on the number of parties (N), the number of total ciphertexts sent in each global iteration (z), and the size of one ciphertext (c). The building blocks that do not require communication are indicated as “-”. In table 700, forward and backward passes represent the per-layer complexity for FC layers, so they are an overestimate for CV layers. The number of multiplications differs in a forward pass and a backward pass, depending on the packing scheme, e.g., if the current layer is row-packed, it requires 1 less Mul.sub.ct( ) in the backward pass, and one has 1 less Mul.sub.pt(⋅) in several layers, depending on the masking requirements. Furthermore, the last layer of forward pass and the first layer of backpropagation take 1 less RR(⋅) operation that is gained from packing the labels in the offline phase, depending on the NN structure (see Protocol 3). Hence, 2 log.sub.2(h.sub.l) rotations per one LGD computation can be saved.
[0135] In the MAP phase, the complexity of the local computations per P.sub.i is provided, depending on the total number of layers I. In the COMBINE phase, each P.sub.i performs an addition for the collective aggregation of the gradients in which the complexity is negligible. To update the weights, REDUCE is done by one party (P.sub.1) and divisions do not consume levels when performed with SetScale(.).
[0136] The complexity of an activation function (φ(⋅)) depends on the approximation degree d.sub.a. The derivative of the activation function (φ(⋅)) has the same complexity as φ(⋅) with degree d.sub.a−1. For the cryptographic primitives represented in table 700, the CKKS variant of the MHE cryptosystem is used, and the dominating terms are reported. The distributed bootstrapping takes 1 round of communication and the size of the communication scales with the number of parties (N) and the size of the ciphertext.
[0137] Parameter Selection
[0138] Firstly, several details are discussed to optimize the number of Res(⋅) operations and give a cost function which is computed by the complexities of each functionality presented in table 700. Finally, relying on this cost function an optimization problem is formulated for choosing the system parameters.
[0139] It is assumed that each multiplication is followed by a Res(⋅) operation. The number of total rescaling operations, however, can be further reduced by checking the scale of the ciphertext. When the initial scale S is chosen such that Q/S=r for a ciphertext modulus Q, the ciphertext is rescaled after r consecutive multiplications. This reduces the level consumption and is integrated into the cost function hereinafter.
[0140] Cryptographic Parameters Optimization.
[0141] The overall complexity of an l-layer MLP aiming to formulate a constrained optimization problem for choosing the cryptographic parameters is defined. Firstly, the total number of bootstrapping operations () required in one forward and backward pass is introduced, depending on the multiplicative depth as
for a ciphertext modulus Q and an initial scale S.
[0142] The number of total bootstrapping operations is calculated by the total number of consumed levels(numerator), the level requiring a bootstrap (L−τ) and r which denotes how many consecutive multiplications are allowed before rescaling (denominator). The initial level of a fresh ciphertext L has an effect on the design of the protocols, as the ciphertext should be bootstrapped before the level L.sub.c reaches a number (L−τ) that is close to zero, where r depends on the security parameters. For a cyclotomic ring size, the initial level of a ciphertext L, and for the fixed neural network parameters such as the number of layers I, the number of neurons in each layer h.sub.1, h.sub.2, . . . , h, and for the number of global iterations m, the overall complexity is defined as
[0143] The complexity of each KS(.) operation depends on the level of the ciphertext that it is performed on (see table 700), but the initial level L is used in the cost function for the sake of clarity. The complexity of Mul.sub.ct, Mul.sub.pt, DB, and KS is defined in 700. Then, the optimization problem for a fixed scale (precision) S and a security level X, which defines the security parameters, can be formulated as
where postQsec(Q,L,λ) gives the necessary cyclotomic ring size, depending on the ciphertext modulus (Q) and on the desired security level (X), according to the homomorphic encryption standard whitepaper [cf. M. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi, J. Hoffstein, K. Laine, K. Lauter, S. Lokam, D. Micciancio, D. Moody, T. Morrison, A. Sahai, and V. Vaikuntanathan. Homomorphic Encryption Security Standard. Technical report, HomomorphicEncryption.org, November 2018].
[0144] Eq. (1) gives the optimal N and L for a given NN structure. Then, each weight matrix is packed into one ciphertext. It is worth mentioning that the solution might give an N that has fewer slots than the required number to pack the big weight matrices in the neural network. In this case, a multi-cipher approach may be used to pack the weight matrix using more than one ciphertext and do the operations in parallel.
[0145] Multi-cipher Approach.
[0146] In the case of a big weight matrix, the flattened weight vector is divided into multiple ciphertexts and the neural network operations are carried out on several ciphertexts in parallel. E.g., for a weight matrix of size 1,024×64 and N/2=4,096 slots, we divide the weight matrix into 1,024×64/4,096=16 ciphers.
[0147]
[0148] In the embodiment of
[0149] In response to the input data, the receiving data provider DP1 obtains 1740 one or more corresponding encrypted prediction values by applying the encrypted global model to the encrypted input data. Finally, the data provider DP1 switches 1760 the one or more encrypted prediction values to the destination public key, and provides 1780 the switched one or more encrypted prediction values to the querying entity QE so that only the destination entity DE can decrypt the one or more encrypted prediction values.
[0150] The input data (X′,⋅) of the querying entity is encrypted with the collective public key pk. Then, a forward pass is executed with the global model neural network, following the Alternate Packing approach: the encrypted X′ is multiplied with the weights of the trained model's first layer W.sub.1,⋅ and processed through the activation function in the first layer φ.sub.1(⋅); the results subsequently pass through the next layers (through W.sub.i, φ.sub.i(⋅), with i=2, . . . , l), until obtaining the encrypted prediction values from the result of the last layer y′ (one prediction per row of X′). The prediction results encrypted under CPK are then switched to the destination public key DPK using DKeySwitch(⋅), so that only the destination entity can decrypt them.
[0151] In more general words, the encrypted model W is used by one DP to perform predictions (y′) on the querier's encrypted evaluation data X and the querying entity cannot learn anything more about the model apart from what it can infer from its input data (X′,⋅) and the corresponding predictions y′.
[0152]