METHOD FOR EVALUATING THE RISK OF RE-IDENTIFICATION OF ANONYMISED DATA

20230367901 · 2023-11-16

    Inventors

    Cpc classification

    International classification

    Abstract

    (EN) The method of the invention provides a protection rate (txP2) representative of the risk of re-identification of data. In the case of a distance-based correspondence-seeking attack, the method comprises the steps of: a) linking an original dataset (EDO) comprising a plurality of original individuals (IO) with an anonymised dataset (EDA) comprising a plurality of anonymised individuals (IA); b) transforming (PCA, MCA, FAMD) the original individuals and the anonymous individuals in a Euclidean space; c) identifying for each original individual, one or more nearest anonymous individuals based on a distance, by a method referred to as the “k-NN” method; and d) calculating the protection rate, being a mean number (Nm) of anonymous individuals, nearest to a considered original individual (IO.sub.i), who are not a valid anonymous individual corresponding to the original individual considered, the nearest anonymous individuals being those identified in step c) and having a distance (dy) relative to the considered original individual less than the distance between the considered original individual and the valid anonymous individual.

    Claims

    2. The method according to claim 1, characterized in that said distance is a Euclidean distance.

    3. The method according to claim 1, characterized in that the transformation of step b) is carried out by means of a factor analysis method (PCA, MCA, FAMD) and/or using an “autoencoder” artificial neural network.

    4. The method according to claim 3, characterized in that the said factor analysis method is a “Principal Component Analysis” (PCA) when said individuals (IO, IA) comprise continuous variables, a “Multiple Correspondence Analysis” method (MCA) when said individuals (IO, IA) comprise qualitative variables, or a “Factor Analysis of Mixed Data” (FAMD) method when said individuals (IO, IA) comprise “continuous/qualitative” type variables.

    5. A data anonymization computer system (SAD) comprising a data storage device (SD) that stores program instructions (TEM) for implementing the method according to claim 1.

    6. A computer program product comprising a medium wherein program instructions (TEM) are saved that are readable by a processor for implementing the method according to claim 1.

    Description

    [0023] Other advantages and features of the present invention will become more clearly apparent on reading the following description of several particular embodiments with reference to the appended drawings, in which:

    [0024] FIG. 1 is a flowchart showing a particular embodiment of the method according to the invention.

    [0025] FIG. 2 shows an illustrative diagram relating to the embodiment of the method according to the invention of FIG. 1.

    [0026] FIG. 3 shows an example of a general architecture of a data anonymization computer system wherein the method according to the invention is implemented.

    [0027] In the following description, for purposes of explanation and not of limitation, specific details are provided to allow an understanding of the described technology. It will be obvious to the person skilled in the art that other embodiments may be put into practice outside the specific details described below. In other cases, the detailed descriptions of well-known methods, techniques, etc. are omitted so as not to complicate the description with unnecessary details.

    [0028] The evaluation of the risk of re-identification requires comparing a set of original data formed from so-called original individuals to a set of anonymized data formed from individuals said to be anonymous. The individuals are typically data records. Each anonymous individual of the anonymized dataset represents an anonymized version of a corresponding original individual. A pair consisting of an original individual and a corresponding anonymous individual is referred to as an “original/anonymous pair”. The risk of re-identification is the risk that an attacker is able to link an original individual to its anonymized record, i.e. the corresponding anonymous individual, thus forming a valid original/anonymous pair.

    [0029] The method according to the invention for evaluating the data re-identification risk provides a metric, based on an individual-centric approach, which makes it possible to quantify the risk of re-identification of an item of personal data during a distance-based correspondence-seeking attack.

    [0030] Referring to FIGS. 1 and 2, one particular embodiment, designated MR2, of the method of the invention, will now be described, which has interesting applicability in the context of a distance-based correspondence-seeking attack. This particular embodiment MR2 is constructed with a decidedly different approach compared to the methods known from the prior art, by establishing a protection rate which is based on the evaluation of a density of the presence of anonymous individuals in the immediate environment of the original individuals.

    [0031] As shown in FIG. 1, this embodiment MR2 comprises five steps S2-1 to S2-5.

    [0032] The first step S2-1 performs a junction processing of the data. The first step S2-1 is a step of joining the data. In step S2-1, an original dataset EDO comprising a plurality of original individuals 10 is linked to an anonymized dataset EDA comprising a plurality of anonymous individuals IA. The anonymized data EDA are those provided by an anonymization process that has processed the original data EDO corresponding thereto.

    [0033] The second step S2-2 carries out a transformation processing of the individuals IO and IA in a Euclidean space. According to the invention, different transformation methods can be used. Typically, but not exclusively, a factor analysis method or an artificial neural network called an “auto-encoder” may be used to convert the individuals 10 and IA into coordinates in Euclidean space.

    [0034] Different factor analysis methods may be used depending on the type of data.

    [0035] Thus, Principal Component Analysis, or “PCA”, will typically be used when the variables are continuous. Multiple Correspondence Analysis, or “MCA”, will typically be used if the variables are qualitative. Factor Analysis of Mixed Data, or “FAMD”, will typically be used if the variables are mixed, that is, of the continuous type and qualitative type.

    [0036] In the embodiment described here, a factor analysis method is used in step S2-2. In this step S2-2, significant variance axes are identified in the datasets by a multivariate data analysis. These significant variance axes determine the axes of Euclidean space on which the individuals 10 and IA are projected.

    [0037] The transformation of the individuals 10 and IA in Euclidean space makes it possible to calculate mathematical distances between the individuals, from their coordinates. The method of the invention provides for a preferred use of a Euclidean distance as a mathematical distance. However, it will be noted

    [0038] that using other, different mathematical distances, such as a Manhattan distance, a Mahalanobis distance, or other distance falls within the vision of the present invention.

    [0039] The third step S2-3 is a step of calculating mathematical distance, such as a Euclidean distance. In this step S2-3, as illustrated in FIG. 2 wherein the original individuals 10 and the anonymous individuals IA are represented respectively by black circles and white circles, in a Euclidean space having axes A1 and A2, for each original individual IO.sub.i the mathematical distance d.sub.ii that separates it from the anonymous individual IAi with which it forms a valid origin/anonymous pair (IO.sub.i, IA.sub.i) is calculated.

    [0040] The fourth step S2-4 is a step of counting, for each original individual Io.sub.i, the number N.sub.j of non-valid anonymous individuals IA.sub.j separated from the original individual IO.sub.i by a mathematical distance d.sub.ii which is less than the distance d.sub.ii calculated in step S2-3. The “k-Nearest Neighbors” method, also known as k-NN, is used here to identify for each original individual one or more nearest anonymous individuals based on a mathematical distance, such as a Euclidean distance.

    [0041] In this step S2-4, as illustrated in FIG. 2, the number N.sub.j of non-valid anonymous individuals IA.sub.j present in the zone contained in the circle with radius d.sub.ii centered on the original individual IO.sub.i is therefore counted.

    [0042] The higher the number N.sub.j is, the better-protected the original individual IO.sub.i is against re-identification. Indeed, since the N.sub.j non-valid anonymous individuals IA.sub.j are closer, in terms of mathematical distance, to the original individual IO.sub.i than the valid anonymous individual IA.sub.i, a distance-based attack will be based on the preferred selection of one of the N.sub.j non-valid anonymous individuals IA.sub.j as being the corresponding anonymous individual. The number N.sub.j represents the number of possible matches for the attacker before selecting the valid anonymous individual IA.sub.i.

    [0043] The fifth step S2-5 is a step of calculating the protection rate of the data against re-identification, referred to herein as txP2, for the dataset considered. The protection rate txP2 is here calculated as being a median number Nm of non-valid anonymous individuals IA.sub.j present around an original individual in the considered dataset.

    [0044] By way of example, let us now consider the case of an attacker who is in possession of a dataset containing anonymous data (individuals IA) of 100 people, including a considered person i. The attacker is also in possession of the original data (individual IO.sub.i) of the considered person i. The attacker attempts to prove that the original data (individual IO.sub.i) of the considered person i is part of the anonymized cohort. In order to re-identify the valid original/anonymous pair (IO.sub.i, IA.sub.i), the attacker must match the individuals, and for this purpose, uses a mathematical distance between them, such as a Euclidean distance. If, for example, the data protection rate is txP2=7 for this dataset, this means that the attacker will then be in a situation, as shown in FIG. 2, wherein it will have on average N.sub.j=7 non-valid anonymous individuals IA.sub.j who are closer than the valid anonymous individual IA.sub.i and are potentially selectable. Thus, the denser the environment of the original individual IO.sub.i, with many non-valid anonymous individuals IA.sub.j, the more difficult it will be to re-identify that individual IO.sub.i.

    [0045] By way of example, FIG. 3 shows a general architecture of a data anonymization computer system SAD wherein the method according to the invention for evaluating the re-identification risk is implemented.

    [0046] The system SAD is installed here in a local computer system DSL and comprises two software modules MAD and TEM. The software modules MAD and TEM are hosted in data storage devices SD, such as a memory and/or hard drive, of the local computer system DSL. The local computer system DSL also hosts an original database BDO wherein original data DO are stored, and an anonymized database BDA wherein anonymized data DA are stored.

    [0047] The software module MAD implements a data anonymization process that processes the original data DO and outputs the anonymized data DA.

    [0048] The software module MET implements the method according to the invention for evaluating the risk of re-identification of the data. The software module MET receives, as input, original data DO and anonymized DA data and provides, as output, a rate TP of protection against the risk of re-identification. The implementation of the method according to the invention is ensured by the execution of code instructions from the software module MET by a processor (not shown) of the local computer system DSL. The protection rate TP provided by the software module MET provides a measurement of the performance of the data anonymization process implemented by the software module MAD.

    [0049] Of course, the invention is not limited to the embodiments which have been described here by way of illustration. The person skilled in the art, depending on the applications of the invention, can provide different modifications and variants that fall within the scope of protection of the invention. 1. A computer-implemented data processing method for evaluating an anonymized data-re-identification risk, said method providing a protection rate (txP2) representative of said re-identification risk in the case of a distance-based correspondence-seeking attack, said method comprising the steps of a) linking an original dataset (EDO) comprising a plurality of original individuals (IO) with an anonymized dataset (EDA) comprising a plurality of anonymous individuals (IA), said anonymous individuals (IA) being produced by a process of anonymizing said original individuals (IO); b) transforming (PCA, MCA, FAMD) said original individuals (IO) and said anonymous individuals (IA) in a Euclidean space (A1, A2), said original individuals (IO) and anonymous individuals (IA) being represented by coordinates in said Euclidean space (A1, A2); c) identifying for each said original individual, one or more said anonymous individuals based on a distance, by a method known as the “k-NN” method, and d) calculating said protection rate (txP2) as a mean number (Nm) of said anonymous individuals (IA.sub.j) nearest to a said original individual (IO.sub.i) which are not a valid anonymous individual (IA.sub.i) corresponding to said original individual (IO.sub.i), said nearest anonymous individuals (IA.sub.j) being those identified in step c) and having a distance (d.sub.ij) relative to said original individual (IO.sub.i) less than the distance (d.sub.ii) between said original individual (IO.sub.i) and said valid anonymous individual (IA.sub.i).