Method for detecting document similarity

Abstract

This invention is related to a document similarity detection method in which non-prefix-free (NPF) coding of the input documents as a transform is used to provide the privacy. The method includes the following steps, encoding the symbols of the documents with non-prefix-free coding scheme, obtaining the transformed documents, calculating the similarity score of the transformed documents with the normalized compression distance similarity metric.

Claims

1. A method for detecting document similarity, comprising the following steps: encoding a plurality of symbols for a plurality of documents with a non-prefix-free coding scheme, obtaining a plurality of transformed documents, wherein a non-prefix-free (NPF) coding of a plurality of input documents as a transform is used to provide privacy, wherein the plurality of transformed documents are the plurality of documents being encoded with the non-prefix-free coding scheme, calculating a similarity score of the plurality of transformed documents by a normalized compression distance similarity metric; wherein the plurality of symbols of the plurality of documents are replaced with a plurality of variable bit-length codewords, and a first variable bit-length codeword of the plurality of variable bit-length codewords is a prefix of a second variable bit-length codeword of the plurality of variable bit-length codewords, wherein the similarity score is below a certain threshold, then the similarity score implies a plurality of original documents are related since a similarity between the plurality of transformed documents reflect the similarity of the plurality of original documents, determining a number of the plurality of symbols in the plurality of documents, the plurality of symbols in the plurality of documents are written by a same alphabet, preparing a set of sequence numbers randomly, a number of the set of sequence numbers is equal to the number of the plurality of symbols in the plurality of documents, adding an integer to all numbers of the set of sequence number, the integer is equal to or greater than 1, calculating a binary representation bit sequence of each number in the set of sequence number, deleting a leftmost bit from the each of the plurality of variable bit-length codewords, the leftmost bit is set to 1, obtaining a codewords set.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 is a graph showing the correlation between the similarity scores obtained between the NPF-encoded and plain documents.

(2) FIG. 2 is a graph showing the difference between a and a obtained in an experimental result using the inventive method.

(3) FIG. 3 is another graph showing the comparison of similarity scores obtained between a sample document and the rest of the corpus 3 in plain (public) format and NPF-encoded (private) format.

(4) FIG. 4 is a schematic flow diagram illustrating an embodiment of the method of the invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

(5) Let A=a.sub.1a.sub.2 . . . a.sub.p and B=b.sub.1b.sub.2 . . . b.sub.q represent two documents of respectively p and q symbols long, where each character a.sub.i and b.sub.j is drawn from the alphabet ={.sub.1, .sub.2 . . . .sub.6} for valid 1ip and 1jq. The similarity score computed with the method S( ) is denoted by S(A,B).

(6) The normalized compression distance similarity metric used in this method is based on the fact that if two documents are similar then each of them should be able to be compressed well with the model extracted from the other. In other words, when one of them is concatenated to the end of the other document, and send to a compressor, the compression ratio obtained should be reflecting their similarity since combined similar documents would result better compression then

(7) the unrelated pairs. Normalizing the compression ratio by respecting possible different lengths of the documents, the NCD similarity score is represented by the formula

(8) $NCD (A, B) = \frac{C (AB) - \min (C (A), C (B))}{\max (C (A), C (B))}$

(9) Here, C( ) is an arbitrary compressor, and AB is the concatenation of the documents A and B. The score is between 0 and 1. The similarity increases towards 0 and decreases towards 1.

(10) In this method the documents are encoded with a non-prex-free coding scheme. Assume W={w.sub.1, w.sub.2 . . . w.sub.6} is a codeword set generated for the alphabet ={.sub.1, .sub.2 . . . .sub.6}. Each w.sub.i is a bit sequence of arbitrary length, which may be a prefix of another codeword w.sub.j from W. Given a sequence T=t.sub.1t.sub.2 . . . t.sub.n for t.sub.i, the NPF: .fwdarw.W coding replaces all occurrences of .sub.i with its corresponding w.sub.i on T. T=NONPREFIXFREE ={E, R, F, N, I, O, P, X} W={01, 0, 111, 001, 010, 1111, 00, 1} NPF(T)=0011111001000111010111100101

Example. 1: An Example Non-Prefix-Free Coding of the Sequence T=NONPREFIXFREE

(11) Example 1 demonstrates a sample NPF coding on a given T=NONPREFIXFREE.

(12) The codewords in W are not prefix free. For example, the codeword 1 assigned to symbol X is a prefix of the codewords 111, 1111 assigned to F and O respectively. Other instances of that non-prefix-free property appear on W. The encoded bit stream for the sample T is shown. Notice that the codeword boundaries cannot be determined on the encoded stream due to the non-prefix-free property. The same stream can have many other parses according to the codeword set W, e.g., the initial bits 0011111 can also represent the sequence PFXX, and so on. Due to the lack of unique decodability, the non-prefix-free codes are not much interesting in terms of data compression.

(13) Variable-length codes can also be used for security purposes. It had been previously shown that the prefix-free codes such as the Huffman codes are difficult to cryptanalyze, and it is hard to extract the original sequence from the prefix-free encoded version. The situation worsens in case of non-prefix-free codes since there are a lot less restrictions to be used in a ciphertext-only attack.

(14) Despite the difficulty for extracting the original sequence, the NPF coding scheme actually preserves the syntactic structures of the documents. The portions that are equal to each other in distinct documents are mapped to the same bit stream. Thus, the shared information content is preserved after the NPF transform. For that reason the NCD distance between the documents are preserved in their NPF transformed versions. More formally, |S(A,B)S(NPF(A),NPF(B))| is expected to be small.

(15) The Method

(16) Assume Alice and Bob own the document A and the document B respectively. We assume both documents have the same alphabet . They would like to know how similar the documents are without revealing the contents to each other. To achieve this goal, they encode their documents with the same NPF codeword set W, that they had previously agreed on.

(17) The codeword set W for the NPF coding can be chosen arbitrarily since the aim

(18) here is not efficiency in data representation, but private data comparison. The important thing is to let codewords to be prefixes of others. With this in mind, the codeword set W for a given alphabet can be prepared as follows.

(19) The sequence X=<1, 2 . . . > is shuffled by the help of a pseudo-random number generator PRNG to get a random permutation of numbers from 1 to . The PRNG is first initialized with an arbitrarily chosen seed. For each i=1 to in order, the values at x.sub.i and x.sub.r are exchanged, where r is the random number generated by the PRNG in between 1 and . The procedure is repeated K times for a large K, e.g., 1000 times in the experiments conducted. At the end a random permutation X=<x.sub.1, x.sub.2, . . . , x.sub.>, for x.sub.i{1, 2, 3 . . . } and x.sub.ix.sub.j unless i=j, is prepared.

(20) The codewords in W are generated by setting each w.sub.i to the minimal binary representation of (x.sub.i+1). The minimal binary representation bit sequence for an integer i is the binary string of length log.sub.2i without the leftmost bit set to 1, e.g., the minimal binary representation of 5=(101).sub.2 is 01 that is of length 2=log.sub.25 For example, assuming an alphabet size =8, and the random permutation vector as X=<2, 5, 3, 6, 7, 1, 4, 8>, the codewords set W is generated as W={1, 10, 00, 11, 000, 0, 01, 001} by the described procedure.

(21) Notice that by using the same seed for the PRNG initialization and repeating the shuffling operation for the same K times, both parties will have the same permutation, and thus, the same W.

(22) Once Alice and Bob encoded their documents with the same non-prefix-free codeword set W, they are ready to compute the similarity score of the transformed

(23) documents. Since extracting the original documents from their NPF encoded transforms is hard due to the lack of codeword boundaries on the transformed bit sequences, they can exchange the documents and compute the similarity with the NCD metric described above. If the similarity score is below a certain threshold, then this implies the original documents are related since the similarity between the transformed documents reflect the similarity of the originals.

(24) However, in this scenario, Alice and Bob may not want to exchange their documents even they are NPF encoded. That is because, although practically the privacy of the documents can be assumed to be established, the parties may be skeptical about the provable security of the system. Such a case may be expected when Alice and Bob are agents of intelligence services, who would like to check whether both sides have similar knowledge on an issue. On the other hand, it is not very likely if Alice and Bob are the program committee chairs of related conferences, who would like to examine whether there exist simultaneous submissions.

(25) Anyway, if there exist a concern of that type, both parties grant a third trusted party to access the NPF encoded documents, and that third party performs the NCD computation. Notice that since that trusted third party has zero knowledge about the NPF codeword set, and the documents, it is much harder for him or her to try a ciphertext only attack.

Experimental Results

(26) We have collected a corpus of 100 documents that are the daily columns of 5 distinct Turkish journalists. Each author is represented by 20 files, which makes a total of 100 documents. Each document is around 5 KB long in size.

(27) Our main curiosity in the experiments is to observe how much does the similarity in the private domain shown by

(28) NCD(NPF(A),NPF(B)) will be close to similarity in the public domain as shown by

(29) NCD(A,B).

(30) With that purpose we have computed the similarity scores between each possible pair of documents. Afterwards, we have encoded each document with the proposed NPF coding scheme by selecting a random seed, and using the PRNG of the C programming language (rand( ) command) to produce the random numbers required. We assumed the 256-symbol ASCII codes as the source alphabet of the documents. The NPF encoded document corpus was created in this way.

(31) There is a small nuance here in terms of practical usage. While saving the encoded documents we have represented each bit with a byte. The reason for that depends on the working mechanisms of the daily compressors, which are byte oriented. The compressors that are also used in the NCD software, assumes an input sequence as a byte stream instead of a bit stream. Thus, we preferred to spend one byte per bit in the NPF encoded sequences not to loose any precision in the compression phase of the similarity calculation with the NCD.

(32) FIG. 1 represents the linear correlation between the similarity scores obtained between the NPF encoded and plain documents. The similarity scores are between 0 (high similarity) and 1 (low similarity). The graph simply marks (, ) tuples, where NCD(NPF(A),NPF(B)) and NCD(A,B) which is of number

(33) $(\begin{matrix} 100 \\ 2 \end{matrix}) = 4950.$
The FIG. 2, shows the difference between and . It is observed that this difference is around 0.02 on the average, with some worst cases appearing around 0.04.

(34) FIG. 3 depicts the comparison of similarity scores obtained between a sample document and the rest of the corpus 3 in plain (public) format and NPF encoded (private) format. It is observed that the similarity scores between the NPF encoded documents are around 0.02 points better than the plain formats. This is because, besides the preserved real syntactic similarities, the NPF coding might also introduce some pseudo similarities also stemming from the variable length codes. For instance, a bit sequence that appears on both encoded documents does not imply that the documents on that section share a common phrase. However, the reverse is true that if the documents share a phrase, their NPF encoded transforms will also generate the same bit sequence there. Because of that reason the NPF encoded documents usually report a slightly better similarity score, which is the main source of that 0.02 points deviation in between a and a.

(35) We have presented a novel privacy preserving document similarity detection method. The privacy in the proposed method is based on the lack of unique decodability in non-prefix-free codes, which is different from the solutions offered to date that rely on mostly the homomorphic schemes and multi-party secure computation methods. While preserving the privacy, the NPF coding does not alter the patterns in the source files since it simply replaces symbols of the files with variable bit-length codewords. Thus, the regularities in between the documents are kept in their transformed versions also. This property led us to use the normalized compression distance (NCD) as a very suitable tool to be used for the similarity detection. The NCD similarity of two sequences is a value between 0 and 1. The experiments conducted on 100-files corpus revealed that the deviation between the NPF transformed and normal plain comparisons is around 0.02 points, which is quite acceptable in practical usage. Despite the NCD similarity metric, it might be possible to integrate the same idea of NPF coding transform to the other proposed similarity evaluations such as the ones based on n-gram symbol counts.

Method for detecting document similarity

Assignee

Inventors

Cpc classification

Classification Explorer

H03M7/30

ELECTRICITY

Classification Explorer

G06F40/12

PHYSICS

Classification Explorer

G06V30/418

PHYSICS

Classification Explorer

H03M7/4037

ELECTRICITY

Classification Explorer

H03M7/14

ELECTRICITY

Classification Explorer

G06F40/194

PHYSICS

Classification Explorer

G06F18/22

PHYSICS

Classification Explorer

G06F21/6245

PHYSICS

International classification

Classification Explorer

G06F40/194

PHYSICS

Classification Explorer

H03M7/14

ELECTRICITY

Classification Explorer

G06K9/62

PHYSICS

Classification Explorer

G06K9/00

PHYSICS

Classification Explorer

G06F40/12

PHYSICS

Classification Explorer

H03M7/30

ELECTRICITY

Classification Explorer

G06F21/62

PHYSICS

Classification Explorer

H03M7/40

ELECTRICITY

Abstract

Claims

Description