Cryptographic Pseudonym Mapping Method, Computer System, Computer Program And Computer-Readable Medium

Abstract

The invention is a cryptographic pseudonym mapping method for an anonymous data sharing system, the method being adapted for generating a pseudonymised database (DB) from data relating to entities and originating from data sources (DS.sub.i), wherein the data are identified at the data sources (DS.sub.i) by entity identifiers (D) of the respective entities, and wherein the data are identified in the pseudonymised database (DB) by pseudonyms (P) assigned to the respective entity identifiers (D) applying a one-to-one mapping, irrespective of the originating data source. According to the invention, more than one mapper (M.sub.j) is applied, and a respective pseudonym (P) is generated by sequentially performing, in a permutation of the mappers (M.sub.j), a number k of mappings utilizing the mapping cryptographic keys (h.sub.ij) of the mappers (M.sub.j) belonging to the particular data source (DS.sub.i) on each encrypted entity identifier (C.sub.i0) encrypted by the data source (DS.sub.i).

Claims

1. A cryptographic pseudonym mapping method for an anonymous data sharing system, the method being adapted for generating a pseudonymised database (DB) from data relating to entities and originating from data sources (DS.sub.i), wherein the data are identified at the data sources (DS.sub.i) by entity identifiers (D) of the respective entities, and wherein the data are identified in the pseudonymised database (DB) by pseudonyms (P), characterised in that the pseudonyms (P) are assigned to the respective entity identifiers (D) applying a one-to-one mapping, irrespective of the originating data source, by applying more than one, a number k of mappers (M.sub.j), selecting for each data source (DS.sub.i) elements (d.sub.ij, a.sub.ij) in a number equal to the number of mappers (M.sub.j) from a predetermined algebraic structure constituting a multiplicative or an additive cyclic group, of which elements (d.sub.ij, a.sub.ij) one element (d.sub.ij, a.sub.ij) is sent, while being kept secret and kept assigned to the data source (DS.sub.i), to each mapper (M.sub.j), calculating from the plurality of elements (d.sub.ij, a.sub.ij) an inverse cryptographic key of said plurality, and transforming, each entity identifier (D) to be mapped, into a respective encrypted entity identifier (C.sub.i0) by using the inverse cryptographic key as an own secret cryptographic key (e.sub.i) of the data source (DS.sub.i), generating for each mapper (M.sub.j) its mapping cryptographic key (h.sub.ij) corresponding to the data source (DS.sub.i) by using the element (d.sub.ij, a.sub.ij) that was sent to the mapper (M.sub.j) and an element (b.sub.j) selected randomly from the algebraic structure and kept secret by the mapper (M.sub.j), and generating respective pseudonyms (P) by sequentially performing, in a permutation of the mappers (M.sub.j), a number k of mappings utilizing the mapping cryptographic keys (h.sub.ij) of the mappers (M.sub.j) belonging to the particular data source (DS.sub.i) on each encrypted entity identifier (C.sub.i0) encrypted by the data source (DS.sub.i).

2. The method of claim 1, characterised by applying an algebraic structure constituting a multiplicative cyclic group, wherein values are represented by residue classes modulo N, for which algebraic structure constants N=p.Math.q and φ(N)=(p−1).Math.(q−1) are predetermined, where p and q are randomly selected prime numbers, and φ(N) is the value of the Euler function obtained for N, for generating the own cryptographic key (e.sub.i) of each data source (DS.sub.i), randomly selected factors that are relatively primes to the modulus φ(N) are generated as elements (d.sub.ij) in a number corresponding to the number of the mappers (M.sub.j), the multiplicative inverse for φ(N) taken as a modulus of the modulo product (d.sub.i) of the randomly selected factors is obtained in a manner known per se, and said multiplicative inverse value is chosen as the own cryptographic key (e.sub.i) of the data source (DS.sub.i), for which value the formula e.sub.id.sub.i ≡1 mod φ(N) holds true, the elements (d.sub.ij) are sent encrypted to the mappers (M.sub.j), the mappers (M.sub.j) are applied for decrypting their respective own elements (d.sub.ij), and the mapping cryptographic key (h.sub.ij) of each mapper (M.sub.j) corresponding to the data source (DS.sub.i) is generated applying the formula h.sub.ij=d.sub.ijb.sub.j mod φ(N), where the randomly selected secret element (b.sub.j) is relatively prime to (φ(N), the encrypted entity identifier (C.sub.i0) is computed by the data source (DS.sub.i) utilizing the formula C.sub.i0 D.sup.e.sup.i mod N, and the sequential mappings of the mappers (M.sub.j) are performed, from (s=0) to (s=k−1) applying the formula C.sub.i,s+1.sup.(j)=C.sub.i,s.sup.h.sup.ij mod N, where P=C.sub.ik,

3. The method of claim 2, characterised in that the randomly selected prime numbers p and q can be represented utilizing half the number of bits of a chosen key size.

4. The method of claim 2, characterised in that the encrypted entity identifier (C.sub.i0) is shared with the mappers (M.sub.j) by writing it into a database that operates according to a protocol verified by third parties and provides decentralized authenticity.

5. The method of claim 4, characterised in that a blockchain database is applied as the database providing decentralized authenticity.

6. The method of claim 2, characterised in that the constants N and φ(N) of the algebraic structure are generated by a key manager (KM), of which φ(N) is kept secret, and the own cryptographic key (e.sub.i) of the data source (DS.sub.i) and the elements (d.sub.ij) corresponding thereto are generated by the key manager (KM) and are sent encrypted to the data source (DS.sub.i) and to the mappers (M.sub.j), wherein a prime number is chosen as randomly selected secret element (b.sub.j).

7. The method of claim 1, characterised by applying an algebraic structure constituting an additive cyclic group, wherein values are represented by points of elliptic curves defined over a number field of residue classes modulo p, where p is a prime number, for which algebraic structure the following constants are predetermined: parameters A, B of the formula y.sup.2=x.sup.3+Ax+B mod p defining the points of an elliptic curve defined over the residue classes of the prime number p, and a point G of the curve that has an order q that is greater than the number of entity identifiers (D), for generating the own cryptographic key (e.sub.i) of each data source (DS.sub.i), elements (a.sub.ij) are selected from the residue classes mod q as elements (d.sub.ij) corresponding in number to the number of the mappers (M.sub.j), and the sum (a.sub.i) of the elements is chosen as the own cryptographic key (e.sub.i) of the data source (DS.sub.i), for which the formula e.sub.i=a.sub.i=Σ.sub.j=1.sup.ka.sub.ij, the elements (a.sub.ij) are sent encrypted to the mappers (M.sub.j), the mappers (M.sub.j) are applied for decrypting their respective own elements (d.sub.ij), and the mapping cryptographic key (h.sub.ij) of each mapper (M.sub.j) corresponding to the data source (DS.sub.i) is generated applying the formula h.sub.ij=a.sub.ij+b.sub.j, where the randomly selected secret element (b.sub.j) is a value of the residue classes mod q and is different from zero and one, the encrypted entity identifier (C.sub.i0) is computed by the data source (DS.sub.i) utilizing the formula C.sub.i0 M⊕a.sub.iG, where operator ⊕ is the sum of the points of the elliptic curve, and the sequential mappings of the mappers (M.sub.j) are performed, from (s=0) to (s=k−1) applying the formula C.sub.i,s+1.sup.(j)=C.sub.i,s ⊖h.sub.ijG, where A⊖B=A⊕(—B) and P=C.sub.ik.

8. The method of claim 7, characterised in that the encrypted entity identifier (C.sub.i0) is shared with the mappers (M.sub.j) by writing it into a database that operates according to a protocol verified by third parties and provides decentralized authenticity.

9. The method of claim 8, characterised in that a blockchain database is applied as the database providing decentralized authenticity.

10. The method of claim 7, characterised in that the constants of the algebraic structure are generated by a key manager (KM), and the own cryptographic key (e.sub.i) of the data source (DS.sub.i) and the elements (a.sub.ij) corresponding thereto are generated by the key manager (KM) and are sent encrypted to the data source (DS.sub.i) and to the mappers (M.sub.j).

11. The method of claim 1, characterised in that the mappers (M.sub.j) constitute a decentralized network.

12. A computer system for cryptographic pseudonymisation, the system comprising: data sources (DS.sub.i) comprising data relating to entities, the data being identified at the data sources (DS.sub.i) by entity identifiers (D) of the entities, and a pseudonymised database (DB), in which the data are identified by pseudonyms (P), characterised in that the pseudonyms (P) are assigned to the respective entity identifiers (D) applying a one-to-one mapping, irrespective of the originating data source, and the system further comprises more than one, a number k of mappers (M.sub.j), a module adapted for selecting for each data source (DS.sub.i) elements (d.sub.ij, a.sub.ij) in a number of equal to the number of mappers (M.sub.j) from a predetermined algebraic structure constituting a multiplicative or an additive cyclic group, a module adapted for sending to each mapper (M.sub.j) one of the elements (d.sub.ij, a.sub.ij), the sending module being configured to send the element (d.sub.ij, a.sub.ij) by keeping it secret and keeping it assigned to the data source (DS.sub.i), a module adapted for calculating from the plurality of elements (d.sub.ij, a.sub.ij) an inverse cryptographic key of said plurality, a module for transforming each entity identifier (D) to be mapped into a respective encrypted entity identifier (C.sub.i0) utilizing the inverse cryptographic key as the own secret cryptographic key (e.sub.i) of the data source (DS.sub.i), a module adapted for generating, for each mapper (M.sub.j), a mapping cryptographic key (h.sub.ij) thereof corresponding to the data source (DS.sub.i) utilizing the element (d.sub.ij, a.sub.ij) that was sent to the mapper (M.sub.j) and an element (b.sub.j) selected randomly from the algebraic structure and kept secret by the mapper (M.sub.j), and a module adapted for generating a respective pseudonym (P) for each encrypted entity identifier (C.sub.i0) encrypted by the data source (DS.sub.i) by sequentially performing, in a permutation of the mappers (M.sub.j), a number k of mappings utilizing the mapping cryptographic keys (h.sub.ij) of the mappers (M.sub.j) corresponding to the particular data source (DS.sub.i).

13. The computer system of claim 12, characterised in that it comprises modules adapted for performing a cryptographic pseudonym mapping method for an anonymous data sharing system, the method being adapted for generating a pseudonymised database (DB) from data relating to entities and originating from data sources (DS.sub.i), wherein the data are identified at the data sources (DS.sub.i) by entity identifiers (D) of the respective entities, and wherein the data are identified in the pseudonymised database (DB) by pseudonyms (P), characterised in that the pseudonyms (P) are assigned to the respective entity identifiers (D) applying a one-to-one mapping, irrespective of the originating data source, by applying more than one, a number k of mappers (M.sub.j), selecting for each data source (DS.sub.i) elements (d.sub.ij, a.sub.ij) in a number equal to the number of mappers (M.sub.j) from a predetermined algebraic structure constituting a multiplicative or an additive cyclic group, of which elements (d.sub.ij, a.sub.ij) one element (d.sub.ij, a.sub.ij) is sent, while being kept secret and kept assigned to the data source (DS.sub.i), to each mapper (M.sub.j), calculating from the plurality of elements (d.sub.ij, a.sub.ij) an inverse cryptographic key of said plurality, and transforming, each entity identifier (D) to be mapped, into a respective encrypted entity identifier (C.sub.i0) by using the inverse cryptographic key as an own secret cryptographic key (e.sub.i) of the data source (DS.sub.i), generating for each mapper (M.sub.j) its mapping cryptographic key (h.sub.ij) corresponding to the data source (DS.sub.i) by using the element (d.sub.ij, a.sub.ij) that was sent to the mapper (M.sub.j) and an element (b.sub.j) selected randomly from the algebraic structure and kept secret by the mapper (M.sub.j), and generating respective pseudonyms (P) by sequentially performing, in a permutation of the mappers (M.sub.j), a number k of mappings utilizing the mapping cryptographic keys (h.sub.ij) of the mappers (M.sub.j) belonging to the particular data source (DS.sub.i) on each encrypted entity identifier (C.sub.i0) encrypted by the data source (DS.sub.i).

14. The computer system of claim 13, characterised in that, for sharing the encrypted entity identifier (C.sub.i0) with the mappers (M.sub.j), it comprises a database that operates according to a protocol verified by third parties and provides decentralized authenticity.

15. The computer system of claim 14, characterised in that a blockchain database is applied as the database providing decentralized authenticity.

16. The computer system of claim 12, characterised in that it comprises a key manager (KM) adapted for generating the constants of the algebraic structure and the own cryptographic key (e.sub.i) of the data source (DS.sub.i) and the elements (d.sub.ij, a.sub.ij) corresponding thereto.

17. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method of claim 1.

18. A computer-readable medium adapted for storing the computer program of claim 17.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0028] Preferred embodiments of the invention are described hereinafter by way of example with reference to the following drawings, where

[0029] FIG. 1 is a schematic diagram of a solution according to the invention implemented applying a key manager,

[0030] FIG. 2 is a schematic diagram of a solution according to the invention implemented without a key manager, and

[0031] FIG. 3 is a table schematically illustrating an exemplary pseudonym mapping process.

MODES FOR CARRYING OUT THE INVENTION

[0032] According to the invention it has been recognised that the characteristics of algebraic structures constituting multiplicative or additive cyclic groups can be preferably utilized to fulfil the objects of the invention. Two types of solutions based on such algebraic structures are described below in more detail, but, according to the invention, other such algebraic structures that provide the arithmetic required for the operation of the invention can also be applied. Of the exemplary algebraic structures, a solution involving residue classes modulo N (where N is a positive integer) is first described in detail, followed by describing, in relation to the former, a solution involving points of elliptic curves defined over the number field of residue classes modulo p (where p is a prime).

[0033] The entity identifiers and corresponding data are stored in databases at mutually independent data sources, and, after pseudonymisation according to the invention, the data, together with the pseudonyms generated from the entity identifiers, are stored, assigned to each other, in a central pseudonymised database. Complying with the conditions of the object set for the invention, for a given entity identifier, the relationship between unencrypted data and the pseudonym cannot be affected by the origin of the data (i.e. what data source it came from). However, the process of mappings, i.e. the operations performed at the particular stages, are unique for each data source, that is, different cryptographic keys (for example, modular exponents) have to be used for performing the same mapping. Apparently, the range of a mapping preceding another one in the sequence cannot be greater than the domain of the latter. For residue classes this implies that the value of the modulus cannot be decreased during the process. Since the order of the mappings depends on the mapped data and on the applied key, i.e. is different each time, this condition can only be fulfilled by applying a constant modulus. Therefore, the implementation of a data gathering system necessarily begins with selecting an appropriate modulus. This is carried out in practice by the provider of the data gathering service, or the data gathering community first deciding upon the bit length of the applicable keys. Then, two such prime numbers are selected of which the product (applied as the modulus) can be represented using the given number of bits. The entity or entities generating the keys (for example the key manager or the data sources) have to know the modulus N and also its value φ(N) given by the Euler function, or in other words, its totient value. The value N of the modulus has to be known by all participants performing mappings. If the representation size of the entity identifiers to be mapped is significantly smaller than the key size, some kind of padding method is preferably applied. This method has to be deterministic in the sense that every data source has to receive the same value such that the pseudonym is also deterministic, irrespective of the data source. The basic data of the mapping are therefore N and φ(N).

[0034] According to the invention, by random selection it is meant that the implementation of the method is not dependent on which particular elements of the given set are chosen. Accordingly, random selection is meant to include also quasi-random or pseudo-random selection, as well as all such selection methods (even according to rules unknown to an observer) wherein the selection appears to be random to the outside observer. If the set constitutes an algebraic structure, then, if it has a null element and/or a unit element, then it/they are not regarded as randomly selected. Also, in the case of residue classes, the selection of non-relatively prime values is avoided. However, for cryptography considerations it is worth selecting values for which the bit length of their representation fills up all the available space.

[0035] As can be seen in FIGS. 1 and 2, entity identifiers D and attributes (the latter are not shown in the diagram) describing the entities or their behaviour are stored by data sources DS.sub.i in their respective own databases. The relationship between a particular entity and the other entities can be regarded as a characteristics of the given entity. Thus, in such a relation the entity identifiers of the other entities are regarded as attributes (for example, B is a client of A, in which case B is an attribute and an entity identifier at the same time). Because the aim of the technical solution according to the invention is supporting anonymity, such data can also be regarded as an entity identifier.

[0036] The attributes related to the entity identifiers D are preferably passed on by the data sources DS.sub.i as unencrypted data, while the entity identifiers D are encrypted by the data sources DS.sub.i utilizing their own cryptographic keys. The resulting cipher is sent to the entities adapted to perform the mapping to the pseudonym P, i.e. to the mappers M.sub.j. At the same time, an assignment between the unencrypted data and the cipher, i.e. the encrypted entity identifier is maintained, because the database required for data analysis can only be loaded with useful information in such a manner.

[0037] For the security of the pseudonym mapping it is crucial that no entity is able to carry out the operation by itself, i.e. no one is able to generate a pseudonym from unencrypted data. This is possible only if no entity possesses the value of the below described exponent b that, together with the modulus N known by everyone, is sufficient for mapping an unencrypted value to a pseudonym: P≡D.sup.b mod N.

[0038] The above condition can be fulfilled only in case b is not computed. Because we are dealing with an exponent of modular exponentiation over residue classes, the operation can also be performed utilizing the multiplicative factors of b. If b=b.sub.1.Math.b.sub.2.Math. . . . .Math.b.sub.k, then

P≡D.sup.b.sup.1.sup..Math.b.sup.2.sup..Math. . . . .Math.b.sup.k.sup.mod φ(N)mod N≡D.sup.b mod N

If the system comprises a number k of mappers, then each may generate a factor for b that is relatively prime to φ(N). If φ(N) is not known to them, then they choose a prime number. Thereby, they can carry out the above mapping only collectively. In order to do that, it is not necessary to share the factors b.sub.j among them. What is required is that each mapper performs a modular exponentiation exactly once (in an arbitrary order), utilizing its own factor:

[00001] $P \equiv {({({(D_{i}^{b_{j 1}})}^{b_{j 2}})}^{.Math.})}^{b_{jk}} \mod N \equiv D^{b_{j 1} .Math. b_{j 2} .Math. .Math. .Math. b_{jk}} \mod N$

where the indices j.sub.p ∈{1 . . . k} stand for an arbitrary (arbitrary-order) permutation of the factors of b.

[0039] The cryptographic keys e.sub.i of the data sources DS.sub.i and the decryption factors are generated as follows. The pseudonym has to be computed by the mappers M.sub.j not from unencrypted data, but from the cipher computed by the data sources DS.sub.i. If the product b of the exponent factors was available, then the cipher

C.sub.i≡D.sup.e.sup.i mod N

received from the i-th data source would be used above for the following calculation:

P≡((C.sub.i).sup.d.sup.i mod N).sup.b mod N

Because according to a basic idea of the invention more than one (i.e. a number k of) mappers are applied, the same method is applied as for the exponent b: let us generate d.sub.i as a modular product having a number of factors equalling the number k of the mappers M.sub.j (since modular exponents are applied, here the modulus is φ(N)). In the case of the i-th data source, key generation begins with generating the factors of the exponent d.sub.i: d.sub.i=d.sub.i1.Math.d.sub.i2.Math. . . . .Math.d.sub.ik (the first index identifies the cryptographic key of the data source, and the second identifies the mapper) that are randomly selected by the data source and that each are relatively primes to φ(N). Then the extended Euclidean algorithm is applied for computing e.sub.i, for which the formula e.sub.id.sub.i ≡1 mod φ(N) will hold true, i.e. e.sub.i will be the inverse cryptographic key of the product. The number of the elements d.sub.ij or factors, equals the number k of mappers, so they have to be passed on—applying any known method—to the mappers M.sub.j in encrypted form. Utilizing an element b.sub.j randomly selected from the algebraic structure and kept secret, each mapper computes the pseudonym mapping exponent h.sub.ij ≡b.sub.j.Math.d.sub.ij mod φ(N) corresponding to the i-th cryptographic key, i.e. the mapping cryptographic key h.sub.ij of the mapper M.sub.j corresponding to the data source DS.sub.i. Since φ(N) is unknown to the mappers M.sub.j, they cannot perform normalization according to the modulus φ(N). As a result of this, the (maximum) size of the exponent will be twice the key size (because it is obtained as the product of two numbers that can each be represented utilizing the given key size), which does not pose any practical problems, because it represents the same residue class as would have been the result of normalization.

[0040] As an initial step of mapping the pseudonym P, the cipher C.sub.i ≡D.sup.e.sup.i mod N is passed on by the i-th data source to the mapper M.sub.j. The mapping process starts at this step. Before the first computational step, the initial value is the cipher of the data source that is denoted in the index C.sub.i0=C.sub.i. which corresponds to the encrypted entity identifier C.sub.i0 of the entity identifier D to be mapped. When, following the k-th step, all the mappers have already executed the modular exponentiation operation C.sub.i,s+1=C.sub.i,s.sup.h.sup.ij mod N applying its own exponent, the pseudonym C.sub.ik=P is obtained, because

[00002] ${({({(C_{i}^{h_{i 1}})}^{h_{i 2}})}^{.Math.})}^{h_{ik}} \mod N \equiv C_{i}^{Π_{j = 1}^{k} h_{ij} \mod φ (N)} \mod N \equiv C_{i}^{h} \mod N \equiv P$

where the order of the exponents h.sub.ij is arbitrary. Of course, in a concrete implementation the sequence order has to be determined somehow; it can be random, quasi-random, or deterministic. The pseudonym P is therefore generated by sequentially performing, on each encrypted entity identifier C.sub.i0 encrypted by the data sources DS.sub.i, a number k of mappings in a permutation of the mappers M.sub.j utilizing the mapping cryptographic keys h.sub.ij of the mappers M.sub.j corresponding to the data sources DS.sub.i. The solution according to the above formula is also preferable because the representation size does not increase in the course of the calculations, since the result of the exponentiation is normalised by way of the modular operation.

[0041] Therefore, such a key system was provided above that also fulfils requirements (3) and (4) above set for pseudonym generation, because for carrying out a mapping the cooperation of a data source and all of the mappers is required. For the same reason it is also impossible to mount an undetected rainbow-table attack, because it is sufficient if one of the mappers detects the initiation of hundreds of millions or billions of mapping processes. In such a case, the mappers not engaged in the cracking operation deny to perform mappings of the messages encrypted with the key having the given index.

[0042] To provide a concrete implementation of the above described idea, the roles of and the mode of cooperation between the different entities, i.e. the data sources DS.sub.i, the mappers M.sub.j and the optionally included key manager KM have to be established.

[0043] With the solution based on residue classes, that is, in the case wherein the entity identifiers D and the pseudonyms P are represented by residue classes modulo N, φ(N) has to be kept secret, because an entity possessing it can compute the inverse exponents, i.e. the inverse keys. In this solution, however, the inverse of the particular exponent factors does not yield the inverse of the encryption mapping, so it does not pose a danger if it is computed. The application of a key manager KM is therefore optional for implementing the system.

[0044] An exemplary solution including a key manager KM that can be seen in FIG. 1 comprises the following steps: [0045] (1) the constants of the mapping are generated by the key manager KM: N=p.Math.q, φ(N)=(p−1).Math.(q−1) (both factors are randomly selected prime numbers that can preferably be represented in half the bit length of the chosen key size), of which N is made public to the other entities and φ(N) is kept secret; [0046] (2) the secret elements b.sub.j, i.e. in this embodiment, the exponents, are generated by the mappers M.sub.j—because the elements have to be relatively prime to φ(N) that is unknown to the mappers M.sub.j, randomly chosen primes are selected, for which the condition is trivially fulfilled; [0047] (3) upon the request of a data source DS.sub.i, the key manager generates a cryptographic key e.sub.i with an identifier i for the applicant, and also generates the corresponding elements d.sub.i1, . . . , d.sub.ik, i.e. the exponent factors of the mapping for the number k of mappers M.sub.j according to the conditions specified in the above description; the keys are identified by a respective index to keep them assigned to the data sources; the data have to be kept secret so they have to be passed on in an encrypted manner; [0048] (4) the values h.sub.ij ≡d.sub.ij.Math.b.sub.j are computed by the mappers M.sub.j and are kept secret; because φ(N)− is not known to them, this exponent can be represented in double the bit length of the key size because this way the residue class cannot be represented applying the least positive integer; [0049] (5) the entity identifier D, i.e. the unencrypted data is mapped by the data source into the cipher C.sub.i0 ≡D.sup.e.sup.i mod N and is sent to the mappers M.sub.j; [0050] (6) in an agreed-upon order, the value C.sub.i,s+1=C.sub.i,s.sup.h.sup.ij mod N is computed, from (s=0) to (s=k−1), by the mappers M.sub.j such that the result C.sub.ik of the last mapping will become the pseudonym P.

[0051] In the above described embodiment, the data are encrypted by the data sources DS.sub.i applying respective own secret cryptographic keys e.sub.i identified by the index i, where a data source DS.sub.i can have an arbitrary number of keys that are mapped into the pseudonym P by the cooperation of a plurality (a number k) of mappers M.sub.j identified by the index j (j∈{1 . . . k}).

[0052] It is particularly preferable to choose prime numbers as the values p and q, because in that case the number of relative primes is known (it is (p−1).Math.(q−1)).

[0053] The exemplary implementation without a key manager that can be seen in FIG. 2 comprises the following steps (here the tasks of the key manager are preferably performed by the data source DS.sub.i). [0054] (1) the constants of the pseudonym mapping are generated either by a provider, or, consensually, by the participating entities: N=p.Math.q, φ(N)=(p−1).Math.(q−1) (both factors are randomly selected prime numbers that can preferably be represented in half the bit length of the chosen key size), it is not necessary to keep these data secret; [0055] (2) the secret elements b.sub.j, i.e. the exponents, are selected randomly by the mappers M.sub.j, because the exponents are relatively prime to φ(N) that is known to the mappers, in this case it is not necessary to select prime numbers; [0056] (3) each data source DS.sub.i is adapted for generating an arbitrary number of keys; in this case it is expedient to choose the hash code of the elements or factors of the cipher exponent e.sub.i and the elements d.sub.i1, . . . , d.sub.ik of the decryption exponent as the key identifier (after generating the factors) so that no central coordination for ensuring uniqueness is necessary; the key is generated as follows: the elements d.sub.i1, . . . , d.sub.ik or factors—which are relatively prime to φ(N), and of which the modular product d.sub.i calculated for the modulus φ(N), as a value derived from the aggregate of the elements, will be the exponent d.sub.i of the decryption key—are selected randomly by the data source; this is followed by finding the multiplicative inverse of d.sub.i for the same modulus, whereby e.sub.i is obtained. The elements or factors d.sub.i1, . . . , d.sub.ik are sent in encrypted form (for example, are uploaded to a blockchain) to the mapper M.sub.j corresponding to the particular index; if a blockchain is applied, the cryptographic key can be the public key of the mapper's wallet; [0057] (4) the value h.sub.ij ≡d.sub.ij.Math.b.sub.j mod φ(N) is computed by the mappers M.sub.j and is kept secret; [0058] (5) the entity identifier D, i.e. the unencrypted data is mapped by the data source DS.sub.i into the cipher C.sub.i0≡D.sup.e.sup.i mod N and is shared with the mappers M.sub.j, preferably by writing it into a database that operates according to a protocol verified by third parties and provides decentralized authenticity (practically, into a blockchain); [0059] (6) A) in an agreed-upon order, the mappers compute the value C.sub.i,s+1.sup.(j)=C.sub.i,s.sup.h.sup.ij mod N, where i denotes the cryptographic key of the data source, s the step number of the mapping, and j the identifier of the mapper; the result C.sub.ik of the last mapping will become the pseudonym P; if the mappers are the nodes of a “permissioned” blockchain, then the result of the mapper j that is entitled to close the following block according to the blockchain consensus protocol is written into the next block (C.sub.i,s+1=C.sub.i,s+1.sup.(j)); no further operation is performed by this node in the further course of the mapping (the exponent factor h.sub.ij can be use utilized exactly once by each entity in the mapping sequence); thereby, one less of the values C.sub.i,s+1.sup.(j) are generated in each step, and only one in the last step; [0060] B) in the case of a public blockchain, the results C.sub.i,s+1.sup.(j) are uploaded by the mappers into the blockchain during each transaction, followed by continuing the computation utilizing a value C.sub.i,s+1 according to a predetermined rule; such a rule can for example be that the modulo-N sum of the values C.sub.i,s+1.sup.(j) are computed, and then the value C.sub.i,s+1 applied for the further computation will be the value that is arithmetically closest thereto; in such a case, in order that each exponent factor h.sub.ij is utilized only once for computing the result, the mapper whose value has already been selected does not take part in further computations; the result C.sub.ik of the last mapping will become the pseudonym P; if a public blockchain is applied, it has to be guaranteed that the data source that is in possession of the unencrypted data is not able to track the chain of computations so that it cannot connect the data with the computed pseudonym.

[0061] During computing the pseudonym P, the values C.sub.i,s+1.sup.(j)=C.sub.i,s.sup.h.sup.ij mod N are computed by the mappers M.sub.j, from which the value C.sub.i,s+1, to be utilized as the input value of the subsequent computation step, is chosen by a program (for example, a blockchain smart contract) operating according to a verified protocol utilizing a deterministic method. Each mapping exponent has to be used only once in a mapping, with only those mappers M.sub.j performing a calculation in the following steps of which the result has not yet been selected (as of the current state of the process) as the input of the subsequent mapping. With a number k of mappers, the process is concluded by computing the pseudonym P in the k-th step (P≡C.sub.ik).

[0062] FIG. 3 illustrates how the mapping is performed according to a permutation with a given order; the steps and other information shown in the diagram are to be interpreted as per the description above.

[0063] In FIGS. 1 and 2, therefore, two conceptually different solutions are illustrated. [0064] (1) A key manager KM and a plurality of mappers M.sub.j are applied, wherein the key applied for mapping the unencrypted data into the pseudonym P is generated utilizing exponent factors, the key manager KM generating the required pair of cryptographic keys such that the exponent of the decryption key is generated factor-by-factor. [0065] (2) The cryptographic keys are generated by the data sources DS.sub.i such that the exponent of the decryption key is generated factor-by-factor. The key applied for mapping the unencrypted data into the pseudonym P is generated by the plurality of mappers M.sub.j applying exponent factors, there is no key manager.

[0066] As it was mentioned in the introduction, pseudonym mapping can also be performed applying points of elliptic curves (see for example the Wikipedia article “Elliptic curve”) defined over the number field of residue classes modulo p (where p is a prime). In this context, let the algebraic structure be the set of points satisfying the equation y.sup.2=x.sup.3+Ax+B mod p, where x, y, A and B are the residue classes of the prime number p. First, the unencrypted entity identifier m has to be assigned to a point of the curve. Let us choose a point G of the curve having an order q that is sufficiently great that the points of the message space can be assigned to the points generated by G applying a one-to-one mapping. (For all points of the curve there is a number q being the number of additions to itself of the point required for reaching the point O at infinity. The smallest of such numbers q gives the order of the point.) To achieve that, for example the following method can be applied (Aritro Sengupta, Utpal Kumar Ray: Message mapping and reverse mapping in elliptic curve cryptosystem (2016)). At the low order digits the binary representation of D is complemented by 8 bits. In the above defined formula of the curve, x is substituted with the value thus obtained. If no solution exists for y, then the value of x is increased by one. If a solution does exist, then a point M of the finite algebraic structure defined by the curve has been obtained. The description related to the specification of the objects above is applied here such that this point is projected by the i-th data source DS.sub.i to another point C.sub.i of the curves applying its own cryptographic key, followed by it being projected by the mappers to the point P utilized as a pseudonym such that the different ciphers C.sub.i are assigned to the same point P if and only if the point M was identical.

[0067] Because the solution based on algebraic structures forming an additive cyclic group operates in an analogous manner to the solution based on a multiplicative cyclic group, it is not shown separately. The references shown in the figures can be substituted, where needed, with the corresponding operations and references included in the following description. The values x, y, A, B and p adapted to define the algebraic structure are defined by the entity providing the pseudonym mapping service that also selects the point G with a known order greater than the multiplicity of the message space. The entity then shares the data with the data sources and the mappers. A respective secret key b.sub.j, j=1 . . . k is chosen randomly by each of the number k of mappers from the residue classes of mod q, selecting values different from 1 and 0. The sum of these values is denoted by b=Σ.sub.j=1.sup.kb.sub.j.

[0068] For data provision, as many elements a.sub.ij (i.e., numbers) as the number k of mappers are randomly selected from the residue classes of q by the i-th data source, the sum a.sub.i of the elements will be the own cryptographic key e.sub.i=a.sub.i=Σ.sub.j=1.sup.ka.sub.ij thereof. This key is passed on in an encrypted form to the mapper with the appropriate index, and the latter then computes the mapping key corresponding to the data source applying the formula h.sub.ij=a.sub.ij+b.sub.j. In the case of a blockchain system, the public portion of the signing key of the mapper can be utilized for the encryption.

[0069] After that, the encryption operation is performed by the data source DS.sub.i by adding two points: C.sub.i0=M⊕a.sub.iG, where the operator ⊕ denotes the addition of two points of the curve, and scalar multiplication denotes repeated addition. The above process carried out on residue classes is modified only in that the below described operation is performed on the points of the curve. In the s-th step the following operation is performed by the mapper with the index j on the data originating from the i-th data source: C.sub.i,s+1.sup.(j)=C.sub.i,s⊕(−h.sub.ijG), where the unary operator “−” denotes the reflection of a curve point over the x axis. The operation ⊕ utilizing such values are hereinafter denoted with the operator ⊖. Thus, performing a complete sequence of mappings, the pseudonym is obtained as a result of the following operations:

[00003] $P = M \oplus a_{i} G ⊖ h_{i 1} G ⊖ .Math. ⊖ h_{i k} G = M \oplus a_{i} G ⊖ (a_{i 1} + b_{1} + .Math. + a_{i k} + b_{k}) G = M \oplus (a_{i} - ({.Math.}_{j = 1}^{k} a_{ij} + {.Math.}_{j = 1}^{k} b_{k})) G = M \oplus (a_{i} - a_{i}) G ⊖ b G = M ⊖ b G$

[0070] Thus, the same entity identifier D is sent by each data source as a different cipher, but finally it is assigned to the same pseudonym P. Optionally, the x coordinate of the point P can also be applied as the pseudonym.

[0071] For computing the pseudonym P, the values C.sub.i,s+1.sup.(j)=C.sub.i,s⊖h.sub.ijG are therefore computed by the mappers M.sub.j from (s=0) to (s=k−1), where A⊖B=A⊕(−B), from which the value C.sub.i,s+1, to be utilized as the input value of the subsequent computation step, is selected by a program (for example, a blockchain smart contract) operating according to a verified protocol utilizing a deterministic method. Each mapping key has to be used only once in a mapping. In the next step, only those mappers M.sub.j perform a calculation of which the result has not yet been selected (as of the current state of the process) as the input of the subsequent mapping. With a number k of mappers, the process ends by computing the pseudonym P in the k-th step (P=C.sub.ik).

[0072] If a key manager KM is to be utilized, then this entity is applied for generating, i.e. for randomly selecting, the addends of a.sub.i.

[0073] Therefore, in order to ensure that possessing any component of the system is not sufficient to allow for deciphering the relationship between the pseudonym P and the entity identifier D, the following data conversion is performed by the pseudonym mapping system according to the invention: [0074] producing from data that are available at data sources DS.sub.i and that are suitable for identifying persons, things or other entities by a characteristic name, i.e. from the entity identifier D, [0075] such pseudonymised data, in which the entity identifiers D are replaced by a pseudonym P assigned thereto in a one-to-one manner independent of the cryptographic key e.sub.i utilized by the data source DS.sub.i,
such that [0076] as many elements d.sub.ij, a.sub.ij are chosen randomly, by the data sources DS.sub.i applying an encryption means/module, from an algebraic structure forming a multiplicative or additive cyclic group utilized by the cryptographic algorithm, as the number of the mappers M.sub.j (preferably, here the following functionality is implemented by the encryption means/module: it generates a random number which can be mapped into the key space applying a suitable mapping in order to select from among the elements of the key space with a near-uniform probability, i.e. randomly), of which elements the inverse cryptographic key is computed—depending on the structure, utilizing their product or their sum—and is utilized as their own, unique encryption key e.sub.i for encrypting their respective entity identifiers D, and [0077] these ciphers (or encrypted data) are mapped into respective pseudonyms by a plurality of encryption means called mappers M.sub.j, the mapping being performed by the mappers M.sub.j in a centralized system or in a decentralized (peer-to-peer, e.g. blockchain) network applying their respective own unique mapping cryptographic keys h.sub.ij, executing their own operation in an arbitrary sequence order,
such that [0078] for computing each unique mapping cryptographic key h.sub.ij, a single one of the elements d.sub.ij passed on by the data source DS.sub.i, and a data element that is randomly selected by the mapper itself from the applied algebraic structure and is kept secret, are applied.

[0079] The computer system for cryptographic pseudonymisation according to the invention comprises [0080] data sources DS.sub.i containing data related to entities, the data being identified at the data sources DS.sub.i by the entity identifiers D of the entities, [0081] a pseudonymised database DB wherein the data are identified by respective pseudonyms P assigned, in a one-to-one manner, to each of the entity identifiers D, [0082] a number k (i.e. more than one) of mappers M.sub.j, [0083] optionally, a key manager KM, and [0084] modules implementing the above described functions and/or entities, which modules can be hardware, software, or combined hardware-software modules.

[0085] The key manager KM is preferably an apparatus comprising a processor adapted for executing a program and memory adapted for providing data writing, storage, and read-out functions. The program run on the apparatus is adapted to generate the data required for executing the mappings, for example the modular exponent adapted for generating a pseudonym from unencrypted data and the totient value of the modulus. The apparatus is adapted for storing these values such that they cannot be accessed by anybody else, but it can still be capable of performing computations utilizing them. In addition to that, it is also capable of computing modular exponent key pairs applying the above described process, for example the extended Euclidean algorithm, and of passing on the encrypted exponent to the data source over a secure data channel and computing the exponent applied for pseudonym mapping, which latter it can also pass on to the entity performing the mapping over a secure data channel. All these requirements are fulfilled for example by the above-mentioned Trusted Platform Module (TPM) circuits.

[0086] The mapper M.sub.j is preferably an apparatus that is adapted for reading any input parameters of modular exponentiation (base, exponent, modulus), as well as executing the operation and making the result available for readout. The mapper apparatus has to comprise a module adapted for random number generation. Such a module can for example be implemented as a general-purpose computer or microcontroller. TPM circuits also fulfil all the above listed requirements.

[0087] Another aspect of the invention is a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the method according to the invention. The invention further relates to a computer-readable medium adapted for storing the above-mentioned computer program.

[0088] The invention can be applied for various purposes; one of these being the analysis of loyalty card purchase databases involving multiple stores. Let us assume that a company engaged in business analysis and market research activities prepares an analysis of typical customer behaviour in retail stores, which is then purchased by its clients. The analysis is aimed at defining customer groups based on characteristics like the products purchased, the frequency of purchases, the relationship between type and location of stores, the season of year, and the products purchased, etc.

[0089] In order to prepare the analysis, the company needs data. In addition to using publicly available statistical data, such companies often seek to motivate retailer chains and individual stores into cooperating with them. To facilitate that, they for example share part of their research results with the retailers so that they can improve the efficacy of their advertising and improve their selection of products. In many store purchase transactions, none of the characteristics of the customer are known. Although the data included in the receipt can be utilized, the only extra information it provides compared to product sale statistics is that it includes information on products sold during a single purchase transaction and the exact time and date thereof. At the same time, the stores can also offer loyalty card programs. Customers are offered various discounts for taking part in such programs. In the case of such purchases, personal information on the customer and other data thereof relevant for analytic purposes are known. Such data have been passed on (in varying detail) to market research companies by some of the stores (data sources), however, due to a change in legislation related to protecting personal data, this practice will soon end. So, the most important product of the market research company, the “retail market report” has become jeopardized. The regulation on personal data protection makes the above business impossible, although analysing the behaviour of customer groups does not require the possession of concrete personal data of any of the customers.

[0090] If those pieces of data that are applicable for personal identification are simply removed from the data passed on by the stores (except for, possibly, sex, age and postcode) then more valuable results can be obtained compared to those based on purchase receipts, but the information related to particular purchases of a given (anonymous) person at a given store is lost, although possessing and processing such information is not legally prohibited. The stores have therefore committed to use a made-up identifier, i.e. a pseudonym for the identification of the purchases of a given customer. This further improves analysability, but this way a customer who made purchases in different stores will be treated as multiple different persons if the mode of pseudonymization is not uniform.

[0091] The idea may arise that a mapping implemented utilizing a so-called “salted” cryptographic hash function can be applied to the personal data (such as name, sex, birth date, and postcode), but certain lawyers representing the stores may reject this option because the resulting hash data can be connected, by the entity performing the data analysis, to the personal data simply by registering itself as a store and compiling a rainbow table for example from the electoral register. The invention provides a solution to this problem. The implementation of the solution according to the invention can comprise a server software component that allows that the data sources DS.sub.i generate and store their key on their own computer by visiting a web page (after authentication). The computations to performed by the mappers M.sub.j and the program supporting communication with the blockchain system can be written for a cloud environment. The service can be activated at various different cloud service providers such that its operation cannot be affected (except for starting and stopping it) by any of the entities; this setup can preferably also be audited. Utilizing the client software belonging to the web page, the distribution of the key factors is passed on by the stores to the mappers Mi, the stores then uploading the data to the blockchain (after encrypting them utilizing the key stored at them), where the pseudonym is generated as a result of the mapping sequence.

[0092] Thereby, the ciphers generated individually by the different stores are mapped into the same value by the entire computational chain. Also, the application of blockchain technology makes it impossible to compile a rainbow table that would be applicable for restoring the relationship between the unencrypted data and the pseudonym P.

[0093] Thus, the analyses can be applied for picking out customers who typically make their purchases in a given store but usually buy a particular product somewhere else, or on certain days do their shopping at a different location shortly after store closure. These are valuable pieces of information that can support business decisions. For example, it is preferable to stock another brand of a particular product, or to close an hour later on Fridays.

LIST OF REFERENCE SIGNS

[0094] D entity identifier [0095] P pseudonym [0096] DS.sub.i data sources [0097] M.sub.j mappers [0098] KM key manager [0099] d.sub.ij elements (inverse key factors in a multiplicative structure) [0100] d.sub.i modular product of elements [0101] e.sub.i cryptographic keys (in a multiplicative structure) [0102] a.sub.ij elements (cryptographic key addends in an additive algebraic structure) [0103] a.sub.i sum [0104] b.sub.j secret element (factor or addend of own mapping cryptographic key) [0105] C.sub.i0 encrypted entity identifier [0106] h.sub.ij mapping cryptographic keys [0107] DB database

Cryptographic Pseudonym Mapping Method, Computer System, Computer Program And Computer-Readable Medium

Assignee

Inventors

Cpc classification

Classification Explorer

H04L9/0866

ELECTRICITY

Classification Explorer

H04L9/3239

ELECTRICITY

Classification Explorer

H04L9/3066

ELECTRICITY

Classification Explorer

H04L9/0825

ELECTRICITY

Classification Explorer

H04L9/0894

ELECTRICITY

Classification Explorer

H04L9/302

ELECTRICITY

Classification Explorer

H04L2209/42

ELECTRICITY

Classification Explorer

H04L9/50

ELECTRICITY

Classification Explorer

H04L2209/76

ELECTRICITY

International classification

Classification Explorer

H04L9/30

ELECTRICITY

Classification Explorer

H04L9/00

ELECTRICITY

Classification Explorer

H04L9/08

ELECTRICITY

Abstract

Claims

Description