Method for anonymizing personal information in big data and combining anonymized data
11501020 · 2022-11-15
Assignee
Inventors
Cpc classification
G06F16/2465
PHYSICS
G06F21/6254
PHYSICS
International classification
G06F21/62
PHYSICS
G06F16/2458
PHYSICS
Abstract
A method for anonymizing and combining personal information in big data, by which big data is anonymized to be freely distributed to an external system without fear of personal information leakage and distributed data can be combined with each other is proposed.
Claims
1. A method for processing big data, the method being performed in a data server having a communication unit, a processing unit, and a storage unit to anonymize personal information in the big data, the method comprising: forming a synchronization dictionary by converting synchronization target personal identification attribute values of an original personal identification data set into surrogate values substituting for the attribute values; forming a synchronization table having information on a starting value and an end value of each of groups by grouping the surrogate values of the synchronization dictionary in a unit of at least one consecutive k (k≥1); performing cell sectionalization of forming cell sections having information on a new starting value and a new end value by applying an error value to the information on the starting value and the end value of each of the groups of the synchronization table corresponding to the respective personal identification attribute value, for each of the personal identification attribute values of the original personal identification data set; and changing the personal identification attribute values of the original personal identification data set to correspond to section values of cells and assigning each of the changed attribute values as a synchronization attribute value of an anonymous data set, wherein the cell sectionalization includes setting a first value, which is obtained by subtracting a first arbitrary error value between a minimum error value and a maximum error value from a minimum value in the respective group, as a start value of a cell, and setting a second value, which is obtained by adding a second arbitrary error value between the minimum error value and the maximum error value to a maximum value in the respective group, as an end value of the cell, for each of the groups of the synchronization table.
2. The method of claim 1, wherein the synchronization dictionary is formed by applying a predefined function to the synchronization attribute value.
3. The method of claim 1, wherein the synchronization attribute value is formed as one data value obtained by concatenating the start value and the end value of the cell.
4. The method of claim 1, wherein the synchronization dictionary is formed by obtaining union of individual synchronization dictionaries that are formed by selecting a same original personal identification attribute from a plurality of mutually different original data sets as a synchronization target attribute.
5. The method of claim 1, wherein the synchronization dictionary is formed by using a part of the synchronization attribute value.
6. The method of claim 1, wherein, when ranges of section values of synchronization attributes generated for at least two personal identification attribute values overlap each other in a case where the synchronization attributes are generated, the synchronization attributes are regenerated so that a new error value is applied.
7. The method of claim 1, wherein, with reference to cell section information of a synchronization map including synchronization attributes of a plurality of anonymous data sets, it is determined whether cell sections corresponding to the synchronization attributes of records of the plurality of anonymous data sets overlap each other, and when the cell sections of the synchronization attributes overlap each other by a predetermined value or more, the records are determined to correspond to a same person so that the records are subject to an anonymous matching combination operation.
8. The method of claim 7, wherein it is determined whether cells corresponding to two anonymous identifiers (a, b) overlap each other by an anonymous matching operation defined as follows: anonymous matching operation a #b=(true, C(a, b)) or (false, 0), wherein anonymous identifier a=(a.s, a.e), where a.s is a cell start value of a, and a.e is a cell end value of a, wherein anonymous identifier b=(b.s, b.e), where b.s is a cell start value of b, and b.e is a cell end value of b, and wherein C(a, b)=a probability value that records a and b exist in an overlapping section.
9. The method of claim 8, wherein an entire domain of synchronization attribute values is discretized in a unit of a fixed size to generate an empty anonymous identifier bucket in each of discretization sections, so that the anonymous matching operation is performed for anonymous identifiers commonly belonging to the bucket.
10. The method of claim 7, wherein the anonymous matching combination operation is performed in a unit of a record pair of two original data sets.
11. The method of claim 7, wherein anonymous identifiers of personal identifiers of two original data sets are collected to perform the anonymous matching combination operation on an anonymous identifier set of two anonymous data sets.
Description
DESCRIPTION OF DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
(23)
(24)
(25)
(26)
MODE FOR INVENTION
Best Mode
(27) Hereinafter, a method for anonymizing personal information in big data and a method for combining anonymized data according to the present invention will be described in detail.
(28) According to the present invention, in order to anonymously combine and use a plurality of original personal identification data sets, each of the original data sets may be anonymized to generate an anonymous data set.
(29) In this case, instead of a combination target personal identification attribute of the original personal identification data set, a synchronization attribute used when combining the anonymous data sets may be assigned to each of the anonymous data sets.
(30) In the present invention, the ‘synchronization attribute’ refers to an attribute assigned to an anonymous data set by anonymizing a personal identification attribute value in an original data set in order to combine anonymous data sets with each other, and a ‘synchronization operation’ refers to an operation of combining records of a data set based on the synchronization attribute assigned to the anonymous data set.
(31) The synchronization attribute may be generated through: generating a synchronization dictionary by converting synchronization target personal identification attribute values of an original personal identification data set into surrogate values that may substitute for the attribute values; forming a synchronization table having a starting value and an end value of each of groups by grouping the values of the synchronization dictionary; a cell sectionalization step of applying an error value to the starting value and the end value of each of the groups of the synchronization table corresponding to an original personal identification attribute value to define an application result as a cell; and changing the personal identification attribute values of the original personal identification data set to correspond to section values of cells, respectively, so as to assign the changed value as a synchronization attribute value of an anonymous data set.
(32) One or more synchronization attributes in an anonymous data set are collectively defined as a ‘synchronization map’ of the data set.
(33) A synchronization operation refers to an operation of combining anonymous data through the synchronization map, which is a set of synchronization attributes, in which the anonymous data may be combined based on a cell corresponding to each synchronization attribute value by comparing cell sections of two synchronization maps generated from the same original personal identification attribute.
(34) Hereinafter, a synchronization dictionary generation step, a cell sectionalization step, a synchronization attribute generation step by using a synchronization dictionary and cell sectionalization, and combination of the anonymous data sets generated as described above, that is, the synchronization operation will be sequentially described.
(35) The synchronization dictionary may be classified into an individual synchronization dictionary, a local synchronization dictionary that is union of individual synchronization dictionaries, and a global synchronization dictionary created such that all possible values for a corresponding personal identification attribute are configured as a domain.
(36) A step of generating the individual synchronization dictionary will be described as follows.
(37) First, one original personal identification attribute existing in an original data set may be selected as a synchronization target attribute.
(38) Since a person may be identified when the synchronization target attribute is used as a personal identification attribute value, the synchronization target attribute may be converted into a surrogate value substituting for the personal identification attribute value by applying a predefined eigenfunction for each type of synchronization target attributes having different meanings such as a resident registration number, a mobile phone number, and an email address.
(39) The same eigenfunction may be applied to the same personal identification attribute, and each eigenfunction is preferably a one-to-one correspondence function while being a function having a range that is a sortable value similarly to the original personal identification attribute value.
(40) The personal identification attribute value may be modified by applying an eigenfunction determined for each type of personal identification attributes such as a resident registration number, a mobile phone number, and an email address, so that the personal identification attribute may be removed while being converted into a sortable attribute value.
(41) The converted attribute is defined as a synchronization dictionary for a corresponding original personal identification attribute.
(42)
(43) When any data set has the resident registration number attribute, the same eigenfunction f (resident registration number) may be applied to the corresponding original personal identification attribute.
(44) As illustrated in
(45) The eigenfunction refers to a function of mapping individual values existing in a domain of original personal identification attribute values such as a resident registration number to a sortable range domain corresponding to the individual values one-to-one, and the eigenfunction may be defined in various ways.
(46) In particular, when a hashing scheme is applied, a sorting order of the original personal identification attribute values may be randomly changed, so that personal re-identification may become impossible to the maximum extent. When the hashing scheme is applied, in order to minimize hash collisions, it is preferable to set as many home addresses of a hash function as possible.
(47) The local synchronization dictionary refers to union of all individual synchronization dictionaries generated by selecting the same original personal identification attribute existing in multiple mutually different original data sets as a synchronization target attribute.
(48) For example, when both an original personal identification data set A and an original personal identification data set B have resident registration number attributes within a safe space, and an individual synchronization dictionary is generated for each of the resident registration number attributes, as illustrated in
(49) The global synchronization dictionary refers to a dictionary generated such that all possible values for a corresponding personal identification attribute are configured as a domain.
(50) For example, a global synchronization dictionary for resident registration numbers refers to a synchronization dictionary generated from a data set including data corresponding to resident registration numbers of all Korean citizens.
(51) When all values of resident registration numbers of citizens exist in a single safe space, the global synchronization dictionary for the resident registration number may be generated by generating an individual synchronization dictionary for resident registration numbers by applying an eigenfunction for the resident registration numbers to the corresponding data set.
(52) When union of individual synchronization dictionaries is continuously obtained in a local synchronization dictionary for resident registration numbers, a result may converge into the global synchronization dictionary.
(53) Meanwhile, since the synchronization target attribute is the original personal personal identification attribute when the synchronization dictionary is generated, a person and an attribute value may correspond to each other one-to-one as illustrated in
(54) In order to fundamentally eliminate the risk, the original personally identifiable attribute may be partially used. When only a part of the original personal identification attribute value is used without using the whole original personal identification attribute value, multiple persons may correspond to one value, so that the re-identification risk may be eliminated even for data before generation of the synchronization attribute.
(55) For example, as illustrated in
(56) However, synchronization combination accuracy may be slightly lower as compared with the scheme using the whole 13 digits.
(57) As a scheme for compensating for the synchronization combination accuracy, there is a scheme of constructing a synchronization map by generating two or more synchronization attributes that are generated by partially using the original attributes.
(58) For example, when only 10 digits out of 13 digits of a resident registration number are selected, the synchronization map may be constructed by using two attributes: a synchronization attribute generated by selecting only the first 10 digits; and a synchronization attribute generated by selecting only the last 10 digits. Accordingly, as the number of synchronization attributes of the synchronization map increases, the synchronization combination accuracy may be improved.
(59) Next, the cell sectionaliiation step will be described.
(60) Cell sectionalization refers to an operation of grouping synchronization dictionary values in a given synchronization dictionary domain and applying an error to each group to define the group as a cell. The specific procedure of the cell sectionaliiation will be described as follows.
(61) 1) A synchronization table having information on a starting value and an end value of each of groups is formed by grouping given synchronization dictionary values in a unit of consecutive k (where k denotes a cell fixed frequency and is a natural number greater than or equal to 1).
(62) 2) For each of the groups of the synchronization table, a value obtained by subtracting an arbitrary error value between a minimum error value e.sub.min and a maximum error value e.sub.max from a minimum value in the group is determined as a starting value of a corresponding cell.
(63) 3) For each of the groups of the synchronization table, a value obtained by adding an arbitrary error value between the minimum error value e.sub.min and the maximum error value e.sub.max from a maximum value in the group is determined as an end value of the corresponding cell.
(64) The above steps will be described with an example as follows.
(65) First, the groups may be generated in the unit of consecutive k among synchronization map values.
(66) Each of the groups is defined as one cell, and a cell start value and a cell end value of each cell may be determined as follows.
V.sub.si=(minimum value of synchronization surrogate attribute within group)−(random number between e.sub.min and e.sub.max)
V.sub.fi=(maximum value of synchronization surrogate attribute within group)+(random number between e.sub.min and e.sub.max)
(67) (where e.sub.min is a minimum error value, and e.sub.max is a maximum error value)
(68) First, a group may be generated by two consecutive values among synchronization dictionary values, and a cell start value and a cell end value may be determined by adding an arbitrary error value between the minimum and maximum errors given to each group.
(69) As described above, a cell refers to a unit for sectionalizing a synchronization dictionary domain, and has a starting value and an end value, so that the cell is defined to include a predetermined number or more of synchronization dictionary values between the starting value and the end value. The number of synchronization dictionary values existing between the start value and the end value of the cell is referred to as a cell frequency.
(70) The start value and the end value of the cell may not be equal to the synchronization dictionary value existing in the synchronization dictionary, the start value of the cell refers to a value less than the smallest synchronization dictionary value in the cell by an arbitrary value e.sub.1, and the end value of the cell refers to a value greater than the largest synchronization dictionary value in the cell by an arbitrary value e.sub.2, in which e.sub.1 and e.sub.2 refer to error values and ensure re-identification impossibility of the synchronization dictionary value included in the cell.
(71) The cell may be defined as a formula as follows.
(72) 1) In a domain Dom(D) of a synchronization dictionary D, an i.sup.th cell C.sub.i is defined as the following attribute.
Ci=(V.sub.si,V.sub.fi,θ.sub.i),V.sub.fi,V.sub.fi∈Dom(D),V.sub.si>V.sub.fi,θ.sub.i>=1
(73) (where V.sub.si is a start value of the i.sup.th cell, V.sub.fi is an end value of the i.sup.th cell, and θ.sub.i is an i.sup.th cell frequency)
(74) In addition, θ.sub.i is the number of mutually different synchronization dictionary values that is greater than V.sub.si and less than V.sub.fi, and is not defined for a cell without a synchronization dictionary value so that a cell frequency value is a natural number greater than or equal to 1.
(75) 2) A size of the cell C.sub.i is defined as follows.
size(C.sub.i)=V.sub.fi−V.sub.si
(76)
(77)
(78) In the grouping process, as k increases, the cell size may become larger, and the error may be increased accordingly. When the cell sectionalization is performed, it is preferable to prevent the error from being increased by setting a maximum cell size C.sub.max so as to exclude a value from the grouping when the cell size becomes larger than the maximum cell size in a case of performing the grouping in the unit of consecutive k.
(79) An upper portion of
(80) The synchronization attribute refers to an attribute formed based on section information of the cell formed as described above, that is, the starting value and the end value obtained by the cell sectionalization, and may be an attribute assigned to an anonymous data set so as to be used to combine anonymous data.
(81) The synchronization target attribute value, which is the original personal identification attribute, may have a very high personal re-identification risk in that a person and an attribute value correspond to each other one-to-one, and the synchronization dictionary value may also have a high personal re-identification risk because the eigenfunction is a one-to-one correspondence function.
(82) Meanwhile, the synchronization attribute has an attribute value in which all personal identification attribute values of an original are removed, error values are randomly added, and thus there is no possibility of personal re-identification, so that an anonymous data set to which the synchronization attribute is assigned to substitute for an original personal identification attribute may be freely distributed to an outside without fear of leakage of personal information, and a plurality of distributed anonymous data sets may be combined with each other based on the synchronization attribute so as to be used for statistical analysis and the like.
(83) As described above, the synchronization dictionary applied to the generation of the synchronization attribute may be classified into the individual synchronization dictionary and the local/global synchronization dictionaries.
(84) When the local synchronization dictionary is applied, the synchronization attribute may be generated after performing the cell sectionalization based on a new local synchronization dictionary (union) in which an individual synchronization dictionary extracted from a synchronization target attribute of a corresponding data set is included in an existing local synchronization dictionary having the same type of attribute.
(85) When the local synchronization dictionary is applied, sectionalization may be performed based on more synchronization dictionary values as compared with the individual synchronization dictionary in the cell sectionalization process, so that error values added when defining each cell may be reduced. In addition, an effective frequency of each cell may be reduced while ensuring re-identification impossibility, so that the combination accuracy may be increased.
(86)
(87) The scheme of using the global synchronization dictionary is the same as the scheme of using the local synchronization dictionary. However, since the number of mutually different dictionary values of the global synchronization dictionary may be much greater than the number of mutually different dictionary values of the local synchronization dictionary, a refined definition of a cell may be provided, so that a size of the error may be reduced when the global synchronization dictionary is used.
(88) As previously defined, the synchronization attribute refers to an attribute used when combining the anonymous data sets instead of the combination target personal identification attribute of the original personal identification data set. Information on start and end values of the sectionalized cell may be used when combining the anonymous data sets, and the information on the start and end values of the sectionalized cell may be stored in the synchronization attribute.
(89) However, when storing a synchronization attribute value of the anonymous data set, it is preferable to combine the start value and the end value to store a combination result as a single data value, and separate the information on the start value and end value from the data value when combining the anonymous data sets to use a separation result for the combination, rather than storing each of the start and end values of a cell section as a data value of the synchronization attribute.
(90) For example, a value obtained by concatenating the start value and the end value, which are two values of the cell section, may be applied as the synchronization attribute value. The synchronization attribute may have information on the start value and the end value of the sectionalized cell, and may have any form that allows the start and end values to be separated.
(91) The ‘synchronization attribute having the value obtained by concatenating the start value and the end value of the cell section’ exemplified above is the most typical form of the synchronization attribute and thus referred to as an ‘anonymous identifier’. Hereinafter, in the description related to the synchronization attribute, for convenience of description, the ‘anonymous identifier’ will be described as an example, in which the ‘anonymous identifier’ is a synonym concept substituting for the ‘synchronization attribute’.
(92)
(93) Meanwhile, the anonymous identifier, that is, the synchronization attribute may be generated as a single anonymous identifier that is generated by allowing only one personal identification attribute value to correspond to one cell, and multiple anonymous identifiers that are generated by allowing a plurality of personal identification attribute values to correspond to one cell.
(94) As illustrated in
(95) Meanwhile, as illustrated in
(96) As a k value increases, anonymity of the k-multiple anonymous identifiers may become higher. In addition, the single anonymous identifier may correspond to a 1-multiple anonymous identifier, and may have anonymity lower than the anonymity of the multiple anonymous identifiers.
(97) In the above description, for convenience of description and understanding, a case where all personal identifiers in the original data set are different from each other has been described as an example. However, the anonymous identifier of the present invention may be naturally applied even when one personal identifier is exhibited multiple times in the original data set.
(98) When one personal identifier is exhibited multiple times in the original data set, the cell sectionalization may be separately performed with reference to the synchronization table for each duplicate records of the personal identifier in the original data set. Accordingly, mutually different error values may be added to synchronization table values corresponding to the same personal identifier, so that a plurality of anonymous identifier values different from each other may exist for the same person.
(99) Meanwhile, a process of generating an anonymous identifier may be classified, depending on an operation environment thereof, into an arrangement generation scheme of generating anonymous identifiers for a plurality of personal identifiers at once as illustrated in
(100) Next, the synchronization operation will be described in detail.
(101) When two anonymous data sets generated in mutually different safe spaces and distributed to the outside have the same synchronization attribute (anonymous identifier), an operation of combining the two data sets based on a common anonymous identifier is defined as a ‘synchronization operation’ or ‘anonymous matching’.
(102) One or more synchronization attributed in an anonymous data set are collectively defined as a ‘synchronization map’ of the data set. The synchronization map existing in the anonymous data set may include cell section information to which an arbitrary error is added, not the original personal identification attribute value, so that there is no personal identifier value serving as a reference when combining personal records.
(103) In a case of a general personal identifier that allows only one value for one person, such as the resident registration number, two identifiers may be compared to each other by an existing numeric or character comparison operator to determine whether the two identifiers have the same value without defining a separate operator for a matching operation to determine whether the two identifiers correspond the same person.
(104) However, the anonymous identifier according to the present invention may have a plurality of anonymous identifier values different from each other even for the same person, so that the same person may not be found by simply matching numbers or characters.
(105) Therefore, according to the present invention, in order to determine whether two anonymous identifiers generated by applying the same eigenfunction f for the same personal identifier domain correspond to the same person, an ‘anonymous matching operator’ defined as follows may be applied, in which the anonymous matching operator is denoted by ‘#’.
(106) Anonymous matching operation a #b=(true, C(a, b)) or (false, 0) Anonymous identifier a=(a.s, a.e), where a.s is a cell start value of a, and a.e is a cell end value of a Anonymous identifier b=(b.s, b.e), where b.s is a cell start value of b, and b.e is a cell end value of b C(a, b)=synchronization validity reliability
(107) In other words, the anonymous matching operator may process as a matching success when there is an overlapping portion in cells of the two anonymous identifiers, and may provide a synchronization validity reliability value representing an overlapping degree of the two cells (validity reliability that the anonymous identifiers a and b represent the same person) as a result of an anonymous combination.
(108) If there is no overlapping area in cell sections of the anonymous identifiers a and b, a result value of the anonymous matching may be false, and at this time, the synchronization validity reliability value may be 0.
(109) Meanwhile, when there is an overlapping area in the cell sections of the anonymous identifiers a and b, a result may be true, which means that probability of the two anonymous identifiers representing the same person may become higher as the synchronization validity reliability value increases.
(110) The synchronization validity reliability value may be a value greater than 0 and less than 1.
(111) A user may preset a minimum synchronization validity reliability threshold C.sub.min, so that an anonymous matching success result having reliability less than the threshold may be regarded as a failure so as to be excluded from the combination.
(112) As described above, the synchronization validity reliability refers to an indicator representing combination validity of a record pair existing in the two anonymous data sets, that is, an indicator representing reliability of each combination. The synchronization validity reliability may be calculated as follows.
(113) Assuming that there are synchronization maps M.sub.A and M.sub.B generated for the same original identifier from two original data sets A and B, and there are a cell C.sub.i.sup.MA of M.sub.A (1≤i≤n is the number of cells of M.sub.A) and a cell C.sub.j.sup.MB of M.sub.B (1≤j≤m, M is the number of cells of M.sub.B), the following process may be performed for C.sub.i.
(114) 1) Search for the cell C.sub.j.sup.MB having a section overlapping a section of the cell C.sub.i.sup.MA.
(115) 2) Perform record combination by obtaining a product set of records corresponding to the two cells for the cell C.sub.j.sup.MB having the section overlapping the section of the cell C.sub.i.sup.MA.
(116) 3) Designate synchronization validity reliability as a reliability value for a product set combination result record by calculating the synchronization validity reliability.
(117) The synchronization validity reliability w(C.sub.i.sup.MA, C.sub.j.sup.MB) of the cell C.sub.i.sup.MA and the cell C.sub.j.sup.MB may be calculated as follows.
w(C.sub.i.sup.MA,C.sub.j.sup.MB)=(size(C.sub.i.sup.MA∩C.sub.j.sup.MB)/size(C.sub.i.sup.MA))*(size(C.sub.i.sup.MA∩,C.sub.j.sup.MB)/size(C.sub.j.sup.MA))
(118) In order for the combination of the record of the cell C.sub.i.sup.MA and the record of the cell C.sub.j.sup.MB to be an actual valid record combination, an actual record of the same person has to exist within the same section, so that two records may exist in an overlapping section as illustrated in
(119)
(120) The synchronization validity reliability of the two overlapping sections in synchronization attributes of the two data sets of
w(C.sub.1.sup.M1,C.sub.2.sup.M2)=(size(C.sub.1.sup.M1∩C.sub.1.sup.M2)/size(C.sub.i.sup.M1))*(size(C.sub.1.sup.M1∩,C.sub.1.sup.M2)/size(C.sub.2.sup.M2))=(20/40)*(20/30)=1/3=0.333
w(C.sub.1.sup.M1,C.sub.2.sup.M2)=(size(C.sub.1.sup.M1∩C.sub.2.sup.M2)/size(C.sub.i.sup.M1))*(size(C.sub.1.sup.M1∩,C.sub.2.sup.M2)/size(C.sub.2.sup.M2))=(10/40)*(10/30)=1/12=0.083
(121) As described above, the synchronization operation (anonymous matching) may include an operation of searching whether a person having a specific anonymous identifier exists in a given synchronization map (anonymous identifier set), and an operation of connecting the records of two anonymous identifier sets to the same person.
(122) In this case, a scheme of comparing ranges of all anonymous identifier pairs, that is, a simple range comparison scheme may also be possible, but the simple range comparison scheme may be very inefficient. Therefore, in order to effectively perform an anonymous identifier matching operation, it is preferable to apply a discretization anonymous matching scheme that will be described as follows.
(123)
(124) In this case, when both A and b are single anonymous identifiers, a personal identifier surrogate value corresponding to a cell may always exist within a range (s, s+e.sub.max) from a cell start value s to the maximum error e.sub.max. Therefore, for each anonymous identifier a, in the anonymous identifier set A, a.sub.i may be inserted into buckets from a discretization section including a value of a start point a.sub.is of the cell of the anonymous identifier a.sub.i to a discretization section including a value of a.sub.i+e.sub.1, that is, in consecutive discretization sections corresponding to a section (a.sub.i.s, e.sub.max).
(125) Thereafter, for a single anonymous identifier b that is an anonymous combined target, while searching for a bucket of the discretization section from a discretization section of a start point b.s of the cell of the anonymous identifier b, an operation of a.sub.i#b may be performed when a.sub.i exists in the bucket to return a result of the operation of a.sub.i#b as a result of the anonymous matching operation A #b, and may be terminated.
(126) If there is no a.sub.i in the bucket of the discretization section, the above operation may be repeatedly performed by sequentially inspecting the buckets until the discretization section of b.s+e.sub.max, that is, in the discretization section corresponding to a range of (b.s, b.s+e.sub.max). If there is no a.sub.i in all the buckets until the last discretization section, the anonymous matching operation A #b may return false and may be terminated.
(127) Meanwhile, in a case of anonymous matching among k-multiple anonymous identifiers (k>1), multiple personal identifier surrogate values may exist in one cell, and a range of the multiple personal identifier surrogate values may be all sections of the cell.
(128) Therefore, in a case where A and b are k-multiple anonymous identifiers (k>1), first, for each anonymous identifier a.sub.i in the anonymous identifier set A, a.sub.i may be inserted into the buckets in all discretization sections from the discretization section including the value of the start point a.sub.i.s of the cell of a.sub.i to a discretization section corresponding to an end point a.sub.i.e of the cell of a.sub.i.
(129) Thereafter, for the anonymous identifier b, while searching for the bucket of the discretization section from the discretization section of the start point b.s of the cell of b, the operation of a.sub.i#b may be performed when a.sub.i exists in the bucket to return the result of the operation of a.sub.i#b as the result of A #b, and may be terminated. In addition, if there is no bucket having a.sub.i until a discretization section corresponding to b.e, the anonymous matching operation A #b may return a false and may be terminated.
(130) Although
(131) For example, only discretization sections from the cell start value to the maximum error e.sub.max may be compared when both A and B are single anonymous identifiers, and all discretization sections corresponding to an entire cell section may be compared in the case of k-anonymous identifiers (k>1).
(132) First, anonymous identifiers a.sub.i may be sequentially inserted into the bucket in all discretization sections corresponding to a search range of each of the anonymous identifiers a.sub.i of A. In the same way, for each anonymous identifier b.sub.j of an anonymous identifier set B, when the anonymous identifier a.sub.i of A exists in a bucket of each section from a discretization section corresponding to a start point of a search range of b.sub.j, a result of performing an operation of a.sub.i#b.sub.j may be returned, and the same operation may be performed for the next anonymous identifier b.sub.j+1 of the anonymous identifier set B. When an anonymous identifier no longer exists in B, A #B may be terminated.
(133) Meanwhile, as a scheme of combining two anonymous identifier sets, a unit combination scheme, an arrangement combination scheme, and a mixed combination scheme may be applied according to an operation environment.
(134) As illustrated in
(135) The above scheme may be suitable for a data stream environment where new records are constantly generated, and may be a scheme of generating a combined result data stream by generating anonymous identifiers one by one for each newly generated record through the unit generation scheme, and performing the anonymous matching operation (a #b) on the anonymous identifiers a and b for each record of two anonymous data streams generated accordingly.
(136) As illustrated in
(137) Finally, as illustrated in
(138) As described above, the anonymous matching operator may be applied to determine whether two anonymous identifiers correspond to the same person. Depending on an amount of errors added in the process of generating the anonymous identifier, anonymous identifiers generated for the same person may be determined not to be for the same person, or different persons may be incorrectly determined as the same person.
(139) Since a frequency of occurrence of such errors depends on an maximum error value of random errors, errors in a result of anonymous matching between the anonymous identifiers may be minimized when errors in generating anonymous identifiers are minimized.
(140) Therefore, when an anonymous identifier is generated for the original data set, it is preferable to verify anonymous identifier generation conformity, and to regenerate the anonymous identifiers to satisfy minimum conformity if necessary.
(141) Hereinafter, a scheme of determining generation conformity of an anonymous identifier and a scheme of regenerating the anonymous identifier will be described.
(142) First, a scheme of determining generation conformity of an anonymous identifier when all personal identifiers in the original data set are different from each other will be described.
(143) In a case where an anonymous identifier is generated from an original personal identifier data set D, in which the total number of records in the data set D is defined as |D|, the number of mutually different personal identifiers in the data set D is defined as |P.sub.D|, and a personalization ratio PR(D) of D is defined as |P.sub.D|/|D|, the personalization ratio PR(D) of D may be 1 when all the personal identifiers in an original data set D have mutually different values.
(144) After an anonymous identifier is generated for each personal identifier of the original, in order to determine the generation conformity, anonymous identifier groups corresponding to personal identifiers may be searched for the generated anonymous data.
(145) An anonymous identifier corresponding to each personal identifier may be found by performing information-aggregating anonymous data combination that will be described below.
(146) Anonymous identifiers in which a result of the anonymous matching operation (#) is determined to be true may be grouped into one cluster based on the above-described minimum synchronization validity reliability threshold C.sub.min. In other words, a condition that all results of anonymous matching between all anonymous identifier pairs in one cluster are true for the threshold C.sub.min may be satisfied.
(147) The one cluster grouped as described above may correspond to anonymous identifiers corresponding to one personal identifier in the original. After all clusters are obtained, a cluster set G(D) may be arranged according to a size of each of the clusters.
(148) In an anonymous identifier cluster set G(D) generated for the original data set D, when number of clusters having one anonymous identifier=|G1| number of clusters having two anonymous identifiers=|G2| number of clusters having three anonymous identifiers=|G3|
(149) . . . , and number of clusters having v anonymous identifiers=|Gv|,
(150) the total cluster number |G(D)| in G(D) may be |G1|+|G2|+|G3|+ . . . +|Gv|, and the total generated anonymous identifier number |A| may be |G1|+2|G2|+3|G3|+ . . . +v|Gv|.
(151) Based on the minimum synchronization validity reliability threshold C.sub.min, when an operation of generating an anonymous identifier set A(P.sub.D) corresponding to a personal identifier set P.sub.D existing in the data set D is defined as M(P.sub.D, A(P.sub.D)), and a degree at which an anonymous identifier corresponds one-to-one with one personal identifier is defined as anonymization conformity EA(M(P.sub.D, A(P.sub.D))) of the operation M(P.sub.D, A(P.sub.D)), the anonymization conformity may be |G1|/|P.sub.D|.
(152) If |G1|=|P.sub.D|, the anonymization conformity may be 1, which represents that the generation conformity of the anonymous identifier is perfectly ensured.
(153) When the anonymization conformity is less than 1, it may mean that ranges of anonymous identifiers generated for personal identifiers of at least two persons overlap each other so that the anonymous identifier has been generated to incorrectly correspond to the personal identifier, and an error rate may be 1−EA(M(P.sub.D, A(P.sub.D))).
(154) The anonymous identifiers generated as errors may be all anonymous identifiers existing in the cluster having at least two single anonymous identifiers. Therefore, the total number T.sub.A of single anonymous identifiers generated as errors may be |G(D)|−|G1|, and an error rate ET.sub.A may be (|G(D)|−|G1|)/|G(D)|.
(155) When a result of an anonymous identifier generation operation M(P.sub.D, A(P.sub.D)) is less than or equal to a minimum allowable threshold EA.sub.min of the anonymization conformity, the anonymous identifiers generated as errors may be regenerated and corrected as follows.
(156) In the previous example, since the anonymous identifiers existing in clusters G2 to Gv are generated incorrectly, a new anonymous identifier may be regenerated for the personal identifiers corresponding to the anonymous identifiers of the clusters, such that the anonymous identifier may be regenerated until a discretization section of a newly generated anonymous identifier cell does not overlap a range of the cluster G1 that is correctly generated.
(157) In order to reduce the number of repetitions, the maximum error value e may be reduced to perform a regeneration operation.
(158) In general, after one anonymous identifier is left in the cluster of Gk (1<k≤v) and the remaining k−1 anonymous identifiers are deleted, while a new anonymous identifier are generated for each of k−1 personal identifiers corresponding to the deleted anonymous identifiers, the generation may be repeatedly performed until the new anonymous identifier does not belong to the existing cluster and forms an independent cluster, such that the anonymization conformity EA((P.sub.D,A(P.sub.D))) may be reverified whenever a new cluster is generated, and the generation may be terminated when the minimum conformity is satisfied.
(159) Next, a process of determining conformity and regenerating an anonymous identifier when not all personal identifiers existing in the original data set are mutually different values, and one personal identifier is exhibited multiple times in the original data set will be described.
(160) When one personal identifier is exhibited multiple times in the original data set, an anonymous identifier has to be generated for the same person by the number of duplicates of the personal identifier, so that personal identifier duplication characteristics of the original data set may be extracted as follows. Total number of records in the original data set D=|D| number of records in which the personal identifier is duplicated one time=P1 number of records in which the personal identifier is duplicated two times=P2 number of records in which the personal identifier is duplicated three times=P3
(161) . . . , and number of records in which the personal identifier is duplicated w times=Pw
(162) In this case, the total person number |P.sub.D| may be P1+P2+ . . . +Pw, the total record number |D| may be P1+2P2+3P3+ . . . +wPw, and the generated single anonymous identifiers may ensure perfect conformity when the following conditions are satisfied.
|PD|=|G1|+|G2|+|G3|+ . . . +|Gv|
(163) where P1=|G1|, P2=|G2|, . . . , and Pw=|Gv|
(164) When the anonymization conformity is less than 1, the total number of anonymous identifiers generated as errors may be (P1-|G1|)+(P2−|G2|)+ . . . +(Pv−|Gv|).
(165) The scheme of regenerating and correcting the anonymous identifiers generated as the errors may be similar to the scheme described above, but the correction may be performed for each cluster as follows.
(166) Each cluster may be inspected to determine whether the cluster includes only anonymous identifiers corresponding to the personal identifier of the same person, and when anonymous identifiers corresponding to other persons exist in the cluster, all the anonymous identifiers corresponding to other persons may be deleted.
(167) A new anonymous identifier may be generated for each personal identifier corresponding to the deleted anonymous identifiers, such that the new anonymous identifier may be inspected to determine whether the new anonymous identifier is included in a cluster of a corresponding person, and the generation may be repeatedly performed until the new anonymous identifier is included in the cluster of the corresponding person.
(168) In order to reduce the number of repetitions, the maximum error value e may be reduced.
(169) In this way, the new anonymous identifier may be regenerated for erroneous personal identifiers until a condition of the minimum allowable threshold EA.sub.min of the anonymization conformity is satisfied.
(170) Next, the scheme of determining the conformity and regenerating the anonymous identifier in a process of generating k-multiple anonymous identifiers (k>1) for the personal identifier of the original data set will be described.
(171) When all personal identifiers in the original data set are |P.sub.D| mutually different personal identifiers (|P.sub.D|/|D|=1), in a case where the individual synchronization dictionary is used, the anonymous identifier may be generated by using only the personal identifiers existing in the original data set, so that one k-anonymous identifier may correspond to k personal identifiers, and |P.sub.D|/k mutually different k-multiple anonymous identifiers may be generated.
(172) When the number of anonymous identifiers that are actually generated is smaller or larger than the above number, it means that an error has occurred, so that the regeneration may be performed to allow the number of anonymous identifiers that are actually generated to correspond to a correct number.
(173) In order to verify the conformity between the k-multiple anonymous identifiers generated when all the personal identifiers in the original data set have mutually different values, the following characteristic information may be extracted from the original data set.
(174) For the anonymous identifier cluster set G(D) generated for the original data set D, number of clusters having one anonymous identifier=|G1| number of clusters having two anonymous identifiers=|G2| number of clusters having three anonymous identifiers=|G3|
(175) . . . , and number of clusters having v anonymous identifiers=|Gv|
(176) In this case, the total cluster number IG(D)I in G(D) may be |G1|+|G2|+|G3|+ . . . +|Gm|, and the total generated anonymous identifier number |A| may be |G1|+2|G2|+3|G3|+ . . . +v|Gv|.
(177) When k=2, only the cluster G2 may be 2-multiple anonymous identifiers that are correctly generated, and the k-multiple anonymous identifiers correctly generated for a given value of k may be multiple anonymous identifiers existing in clusters Gk.
(178) In other words, anonymous identifiers correctly generated by a k-multiple anonymous identifier generation operation M(PD, A(PD)) may be identifiers existing in the cluster Gk, and all anonymous identifiers existing in the remaining clusters may be identifiers that are incorrectly generated.
(179) Based on the minimum synchronization validity reliability threshold C.sub.min, when an operation of generating a k-multiple anonymous identifier set Ak(P.sub.D) corresponding to the personal identifier set P.sub.D existing in the data set D is defined as M(P.sub.D, Ak(P.sub.D)), and a degree at which one of the k-multiple anonymous identifiers (k>1) makes 1:k correspondence with personal identifiers of k persons, that is, the anonymization conformity is defined as EA(M(P.sub.D, Ak(P.sub.D))), the anonymization conformity may be |Gk|/|PD|.
(180) If |Gk|=|PD|/k, the anonymization conformity may be 1, which represents that the generation conformity of the anonymous identifier is perfectly ensured.
(181) When the anonymization conformity is less than 1, it may mean that the k-multiple anonymous identifiers have been incorrectly generated to correspond to personal identifiers of persons that are less than or greater than k, and an error rate may be 1−EA(M(P.sub.D, Ak(P.sub.D))).
(182) In the previous example, clusters G1, . . . , and G(k−1) may correspond to clusters generated such that the numbers of corresponding personal identifiers are smaller by (k−1), (k−2), . . . , and 1, respectively, whereas clusters G(k+1), . . . , and Gv may correspond to clusters generated such that the numbers of corresponding personal identifiers are larger by 1, 2, . . . , and (v−k), respectively.
(183) Therefore, the total number T.sub.A of k-multiple anonymous identifiers generated as errors may be |G(D)|−|Gk|, and the error rate ET.sub.A may be (|G(D)|−|Gk|)/|G(D)|.
(184) The k-multiple anonymous identifiers generated as errors may be regenerated and corrected as follows.
(185) For the clusters G(k+1) to Gv to which more personal identifiers correspond, the anonymous identifiers may be deleted while leaving only k anonymous identifiers. New k-multiple anonymous identifiers may be generated for the personal identifiers corresponding to the deleted k-multiple anonymous identifiers, respectively.
(186) A discretization section of a newly generated k-multiple anonymous identifier cell may be added to the clusters G1, G2, and G(k−1), and the generation may be repeatedly performed until the number of anonymous identifiers of each of the clusters reaches k.
(187) In order to reduce the number of repetitions, the maximum error value e may be reduced to perform the regeneration operation.
(188) In this way, the new anonymous identifiers may be regenerated for the erroneous personal identifiers until the condition of the minimum allowable threshold EA.sub.min of the anonymization conformity is satisfied.
(189) Next, the process of determining the conformity and regenerating the anonymous identifier when not all the personal identifiers existing in the original data set are mutually different values, and one personal identifier is exhibited multiple times in the original data set will be described.
(190) When one personal identifier is exhibited multiple times in the original data set, an anonymous identifier has to be generated for the same person by the number of duplicates of the personal identifier, so that the personal identifier duplication characteristics of the original data set may be extracted as follows. Total number of records in the original data set D=|D| number of records in which the personal identifier is duplicated one time=P1 number of records in which the personal identifier is duplicated two times=P2 number of records in which the personal identifier is duplicated three times=P3
(191) . . . , and number of records in which the personal identifier is duplicated w times=Pw
(192) In this case, the total person number |P.sub.D| may be P1+P2++Pw, the total record number |D| may be P1+2P2+3P3+ . . . +wPw, and the generated k-multiple anonymous identifiers may have anonymization conformity of 1 when the following conditions are satisfied so that the perfect conformity may be ensured.
|PD|/k=|G1|+|G2|+|G3|+ . . . +|Gv|
(193) where P1=|G1|, P2=|G2|, . . . , and Pw=|Gv|
(194) When the anonymization conformity is less than 1, the scheme of regenerating and correcting the anonymous identifiers generated as the errors may be similar to the scheme described above, but the correction may be performed for each cluster as follows.
(195) Each cluster may be inspected to determine whether the cluster includes only k-multiple anonymous identifiers corresponding to the personal identifiers of the same k persons, and all the anonymous identifiers that do not correspond to the same k persons may be deleted. A new k-multiple anonymous identifier may be generated for all the personal identifiers corresponding to the deleted k-multiple anonymous identifiers, such that the new k-multiple anonymous identifier may be inspected to determine whether the new k-multiple anonymous identifier is included in the existing cluster, and the generation may be repeatedly performed until the new anonymous identifier is included in the existing cluster.
(196) In order to reduce the number of repetitions, the maximum error value e may be reduced.
(197) In this way, the k-multiple anonymous identifier may be regenerated for the erroneous personal identifiers until the condition of the minimum allowable threshold EA.sub.min of the anonymization conformity is satisfied.
(198) The anonymized data according to the present invention as described above may be used by an information-transferring anonymous data combination utilization scheme, an information-combining anonymous data combination utilization scheme, an information-aggregating anonymous data combination utilization scheme, or the like according to a scheme of combining and utilizing the anonymized data.
(199) The information-transferring anonymous data combination utilization scheme refers to a scheme of anonymizing required information in an information production area and transferring the anonymized information to an information consumption area as anonymous data for use.
(200) The information (P, Up) produced in the information production area refers to personalized information Up produced for a personal identifier P. Assuming that the information Up is general information other than personal information, when the information (P, Up) is produced in the production area, an anonymous identifier AP may be generated for the personal identifier P in a safe space of the production area and converted into (anonymous identifier AP, production information Up).
(201) The information anonymized as described above may be transferred to an outside of the safe space, so the information may be transferred to the information consumption area.
(202) In the information consumption area, received anonymous information may be collected in chronological order.
(203) When a request for using personalized information of a personal identifier Q occurs in the information consumption area, an anonymous identifier AQ for the personal identifier Q may be independently generated, and anonymous matching may be performed with anonymous identifiers of a personalized information DB that are received and collected in the information production area, so that production information corresponding to the same person may be utilized.
(204) The information (AP, Up) of the APs that succeeded in the anonymous matching with the anonymous identifier AQ for the personal identifier Q refers to information produced for the personal identifier Q=P in the information production area, and the information (AP, Up) may be used for a person Q in the information consumption area. In this case, an anonymous identifier generation scheme, an anonymous identifier matching scheme, and an anonymous data combination scheme may be performed through a combination of various schemes described above.
(205) The information production area and the information consumption area may be directly connected to each other, but several stages of independent personal information management areas may be interposed between the information production area and the information consumption area. In each intermediate area, the anonymous matching for the same personal identifier may be performed in a relay manner, and a result of the anonymous matching may be finally transferred to the information consumption area.
(206) As illustrated in
(207) An information combination scheme may be classified into a serial combination scheme and a parallel combination scheme. According to the serial combination scheme, two among several information production areas R1, R2, . . . , and Rn may be connected to each other as a pair ((R1, R2).fwdarw.(R2, R3).fwdarw.(R3, R4).fwdarw. . . . .fwdarw.(Rn−1, Rn)), so that as described above in the information-transferring combination utilization scheme, anonymous data combination may be repeatedly performed for each area pair, and the personalization information in which all data are combined may be generated in the last area Rn when the anonymous data combination is performed for the last pair.
(208) According to the parallel combination scheme, an area Rp where final combination data is collected may be designated for all the information production areas R1, . . . , and Rn, so that the anonymous data combination may be performed between area pairs (R1, Rp), (R2, Pp), . . . , and (Rn, Rp) to store a final result of the anonymous data combination in the area Rp.
(209) As illustrated in
(210) When there is an anonymous data set D generated in the information production area R, a scheme similar to the discretization anonymous matching scheme described above may be used find the anonymous identifiers of the same person.
(211) A personal identifier domain may be divided into preset discretization sections, and the generated information (AP, Up) may be stored in a bucket in a discretization section corresponding to a discretization section of each anonymous identifier AP of the anonymous data set produced in R. After the above operation is performed for all anonymous identifiers in R, when at least two anonymous identifiers, for example, a, b, and c exist in the bucket of a corresponding section from a first discretization section, anonymous matching operations a #b, a #c, and b #c may be performed for each pair, so that a separate list may be constructed by collecting anonymous identifiers of a person, which have the same result of the anonymous matching operations. In this case, it may be determined as the same person based on a synchronization validity minimum reliability threshold set by a user.
(212) In this way, when the anonymous matching between anonymous identifiers in buckets of all discretization sections are performed, the anonymous identifier and the production information generated for each of individual persons in the information production area may be recognized as a separate list. Subsequently, when a required aggregation operation is performed on the production information Up, an aggregation result for each of the individual persons may be obtained from the anonymous data set, and the anonymous data may be transferred to other information consumption areas.