Method for information retrieval in an encrypted corpus stored on a server

11308233 · 2022-04-19

Assignee

Inventors

Cpc classification

International classification

Abstract

A method for information retrieval in an encrypted corpus stored on a server, from a digital request calculated on a customer device, containing a sequence of terms, includes the following steps: encryption of the request on a customer computer device and transmission of same to a database management server; and homomorphic calculation, on the server, of the encrypted response to the encrypted request recorded on the server. The method further comprises an additional requesting step performed on the customer device; and presentation of the result in an ordered form of the documents, in application of the processing of the previous step. The present disclosure also relates to a method for preparing a requestable base and to a method for information retrieval in an encrypted corpus.

Claims

1. A method for information retrieval in an encrypted corpus stored on a server, from a text request calculated on a customer computer device, including a succession of terms, the method comprising: encrypting the text request on the customer computer device; transmitting the encrypted text request to the server; performing homomorphic calculation on the server of an encrypted response to the encrypted text request recorded on the server; transmitting the encrypted response to the customer computer device; and decrypting the encrypted response on the customer computer device and extracting document identifiers, wherein decrypting the encrypted response comprises: a) calculating, on the customer computer device, when introducing a new requestable document, for each document belonging to the encrypted corpus, a first table and a second table different from the first table; the first table comprising, for each indexed term of the new requestable document, a number of occurrences of the indexed term in the new requestable document, and the second table indicating indices of each indexed term that is present in the new requestable document; b) encrypting the new requestable document and the second table, as well as encrypting, by a homomorphic encryption method, the first table, and transmitting the encrypted new requestable document, the encrypted second table, and the encrypted first table to the server for recording in a storage space, dedicated to a user or to a group of users; c) creating or updating an index on the customer computer device, the index associated with the user, for all documents accessible by the user, with the index being constituted by a table indicating for each indexed term a number of documents containing the indexed term; d) requesting, comprising: encrypting, on the customer computer device, the text request constituted by the succession of terms, by a homomorphic encryption belonging to the same cryptosystem as the encryption applied to the first table; transmitting the text request thus encrypted to the server for carrying out the homomorphic encryption and transmitting the encrypted response to the customer computer device and decrypting the encrypted response by the customer computer device; aggregating, by the customer computer device, the document identifiers of data contained in the encrypted response and in the index recorded on the customer computer device; and presenting a result in an orderly form of the documents on the customer computer device.

2. The method of claim 1, further comprising recreating, on the customer computer device, the index from encrypted information stored in the storage space of the server assigned to the user.

3. The method of claim 1, wherein operations performed on the server are implemented in parallel.

4. The method of claim 3, wherein the server is part of a cloud platform.

5. A method for information retrieval in an encrypted corpus stored on a server, from a text request calculated on a customer computing device, the method comprising: encrypting the text request including a succession of terms on the customer computing device, the encrypting performed using a homomorphic encryption belonging to a same cryptosystem as an encryption applied to a term frequency table; transmitting the encrypted text request to the server; performing homomorphic calculation, on the server, of an encrypted response to the encrypted text request; transmitting the encrypted response to the customer computing device; decrypting, on the customer computing device, the encrypted response and extracting document identifiers; aggregating, on the customer computing device when introducing a new requestable document, the document identifiers of data contained in the encrypted response and in an index recorded on the customer computing device, the index associated with a user, for all documents accessible by the user, with the index being constituted by a table indicating for each indexed term a number of documents containing the indexed term, the aggregating including creating a first table and a second table different from the first table, the first table comprising, for each indexed term of the new requestable document, a number of occurrences of the indexed term in the new requestable document, the second table indicating indices of each indexed term that is present in the new requestable document; encrypting the new requestable document and the second table, as well as encrypting, by a homomorphic encryption method, the first table, and transmitting the encrypted document, the encrypted second table, and the encrypted first table to the server for recording in a storage space, dedicated to a user or to a group of users; and presenting a result in an ordered form of documents on the customer computing device.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) The present disclosure will be best understood when reading the following description that relates to a non-restrictive exemplary embodiment, while referring to the appended drawings, wherein:

(2) FIG. 1 is a schematic view of a computer system according to the present disclosure, and

(3) FIG. 2 represents a schematic view of data flows between the various IT resources.

DETAILED DESCRIPTION

(4) Hardware Architecture

(5) FIG. 1 shows a schematic view of the hardware architecture of the present disclosure;

(6) It includes a customer computing device (1) connected to a server (2) by a computer network, for example, the Internet.

(7) The server (2) is associated with a memory device (3) for the recording of a database. The server (2) has a processor for performing digital processing.

(8) The server (2) and the memory devices (3) are in a particular example constituted by a set of distributed resources, for example, of the “cloud” type.

(9) Functional Architecture

(10) FIG. 2 shows an exemplary functional architecture.

(11) The customer computing device 1 provides initial processing of a document i consisting of a digital file 9 recorded in a working memory.

(12) Optionally, each term of the document is pre-processed by known means such as “stemming,” “stop list” (deletion of current words) and any other usual linguistic processing).

(13) Preparation of the Requestable Encrypted Files.

(14) The initial processing is divided into three tasks.

(15) The first task is to apply encryption to the document i with a known cryptographic method, for example, symmetric AES encryption and records an encrypted version (10) of this document on the customer device, and optionally on the server (2) or a third-party storage service. The corpus of thus defined encrypted documents constitutes the document base (32).

(16) A second task, performed in parallel or sequentially, consists in calculating an index of occurrences of the terms present in the digital file 9, and in recording a table TF.sub.i (14) of occurrences, in the form of a list of terms w.sub.j present in the document i, each of the terms w.sub.j in this list being associated with a number corresponding to the occurrence tfi.sub.j of the term w.sub.j in the document i.

(17) The table TF.sub.i (14) is therefore of the {[w.sub.j; tf.sub.ij]}.sub.j type; for a document i.

(18) A third task, performed in parallel or sequentially, consists in calculating a table Δdf.sub.i 15 corresponding, for each term w.sub.j, to the presence or not of the term in the document. This table Δdf.sub.i (15) is therefore of the {[W.sub.j|tf.sub.ij>0]}.sub.j type.

(19) The table TF.sub.i (14) is then encrypted using a homomorphic encryption method, for example, according to a method described in article Zhou, H., & Women, G. (février 2014). Efficient homomorphic encryption on integer vectors and its applications. In Information Theory and Applications Workshop (ITA), 2014 (pp. 1-9). IEEE.

(20) The result of this encryption of the table TF.sub.i (14) is a set of encrypted data (11). Each set of encrypted data (11) is transmitted by the customer computing device (1) to the server (2).

(21) The grouping of the encrypted data (11) constitutes an encrypted database (30) of all the {TF.sub.i}.sub.i.

(22) At the same time or sequentially, the table Δdf.sub.i (15) is encrypted using a known method, for example, AES and transmitted to the server (2) to record an encrypted file (12) on the server (2).

(23) All the encrypted files (12) recorded on the server form a database (31).

(24) Each encrypted file (12) recorded on the server (2) makes it possible to reconstitute a table df_A 13 by decryption with an algorithm inverse to the one used for the above-mentioned encryption.

(25) This table df_A (13) is calculated only on the customer computing device 1, from: Either all the encrypted tables of database 31 recorded on the server (2), after the transmission thereof on the customer computing device (1); or directly by updating a table df_A (13) locally recorded on the customer computing device 1, the update being performed each time a table Δdf.sub.i (15) is added.

(26) This data preparation step leads to the recording, on the server, of data that are not directly requestable and that do not reveal meaningful information about the content or the documents, especially in the event of an attack on the server or a malicious action by a privileged user.

(27) Requesting

(28) Requesting is performed by sending a text request (20) formed by a combination of words from the customer computing device (1).

(29) Optionally, this text request (20) is pre-processed by known means of “stemming,” “stop list” (deletion of current words) and any other usual linguistic processing.

(30) The text request (20) is encrypted using the same homomorphic encryption method as that used for encrypting the table TF.sub.i (14=to obtain an encrypted request (21).

(31) The encrypted request (21) is transmitted to the server (2) that records to make a request (40).

(32) By applying a homomorphic calculation on the data in the encrypted database (30) and the request (40), the server (2) calculates an encrypted response (41). The calculations performed on the server (2) may be implemented in a parallel and/or a distributed manner.

(33) This processing consists in calculating, in the encrypted domain, the number of occurrences of each term q.sub.k of the request (40) for each known document i.

(34) For each of the k terms q.sub.k and for each document i, the values tf.sub.i,j are counted for the cases where q.sub.k corresponds to a term w.sub.j, from the encrypted database (30) of tables {[w.sub.j; tf.sub.ij}.sub.i and in the encrypted space, without decrypting the variables w.sub.j, q.sub.k and tf.sub.i,j.

(35) All these counts constitute an encrypted response (41) that is transmitted to the customer computing device (1) and records it locally as an encrypted response (50).

(36) The customer is then able to decrypt the encrypted response (50) to calculate a decrypted response (51).

(37) Finally, the customer can combine the encrypted response 51 and the table df_A (13) to calculate a score TF-IDF (52) according to a known method.

(38) This score TF-IDF (52) constitutes a classification key for the documents i in the order of relevance to the text request (20).

(39) Optionally, the customer computing device 1 presents the results as a search engine and allows the user to find the corresponding record.