Full-text fuzzy search method for similar-form Chinese characters in ciphertext domain

11537626 · 2022-12-27

Abstract

The invention discloses a full-text fuzzy search method for similar-form Chinese characters in a ciphertext domain. The method realizes a fuzzy search in the Chinese ciphertext domain based on a symmetric searchable encryption scheme and an inverted index structure, supports a fuzzy search on Chinese characters having similar glyphs while the data remains encrypted, ensures that search results are ordered, and supports a multi-keyword logical connection fuzzy search. The invention uses a distributed search engine Lucene and a Chinese word segmentator IKAnalyzer to perform full-text word segmentation on a document and constructs a plaintext inverted index comprising similar-form Chinese characters by means of an established similar-form character library of 3,755 commonly used Chinese characters. Considering the security of the inverted index structure, each keyword in the plaintext inverted index and its corresponding document number are constructed in an encrypted chain form, and a B+ tree structure is used to speed up the search. The invention realizes a fuzzy search in a Chinese full-text ciphertext domain on a semi-trusted cloud server without false detection and missed detection.

Claims

1. A full-text fuzzy search method for similar-form Chinese characters in a ciphertext domain, the method performed by a system and comprising the following steps: S1.1, establishing a unique identifier set FILE(flie.sub.1, flie.sub.2, . . . flie.sub.n) of documents to be uploaded, where n represents the number of documents to be uploaded; S1.2, using a distributed search engine Lucene and a Chinese word segmentator IKAnalyzer to perform a full-text segmentation and a filtering on an uploaded document set, generating a plaintext inverted index EnIndex.sub.file=(w′.sub.1, w′.sub.2, . . . w′.sub.p) of the document set, where p is a length of the inverted index; S1.3, establishing a commonly used Chinese character dictionary based on N commonly used Chinese characters, and expanding it into a similar-form Chinese character dictionary library DICT based on commonly used similar-form Chinese characters, where N is a positive integer; S1.4, traversing a keyword w′ in the inverted index, when the keyword has a similar-form word, expanding the w′ to w=(w′, w.sup.1,w.sup.2, . . . ,w.sup.m) by using DICT, where (w.sup.1, w.sup.2, . . . , w.sup.m) is a set of similar-form words of w′, m represents the number of similar-form words of w; when w′ does not have a similar-form word, then w=w′; S1.5, updating the plaintext inverted index as EnIndex.sub.file=(w.sub.1, w.sub.2, . . . w.sub.p); S2, using a random number generator to establish a searchable encryption key K.sub.index=(K.sub.1, K.sub.2) and a symmetric encryption key K.sub.enc according to a security parameter k; using the searchable encryption key K.sub.index to encrypt the inverted index and construct a ciphertext index, and using the symmetric encryption key K.sub.enc to encrypt documents to be uploaded; S3, dividing the searchable encryption key K.sub.index into (K.sub.u, P.sub.u), sending a key group K.sub.u to an authorized user, and using P.sub.u as a server verification parameter to complete user authorization; S4, generating a search trapdoor when the user takes the key group K.sub.u and a keyword w to be searched as input, and submitting the search trapdoor to a cloud server; the cloud server verifying the trapdoor by encrypting the ciphertext index, and returning to the user a document sequence corresponding to matched encrypted documents and fuzzy keywords, when similar-form words of the search keyword are included in the document set, a document containing the search keywords is ranked before a document with its similar-form words in search results.

2. The full-text fuzzy search method for similar-form Chinese characters in the ciphertext domain according to claim 1, wherein the process of step S2 is as follows: S2.1, using the random number generator, according to the security parameter k, randomly generating a k-bit searchable encryption key K.sub.index=(K.sub.1, K.sub.2) and the symmetric encryption key K.sub.enc locally; S2.2, encrypting the inverted index EnIndex.sub.file=(w.sub.1,w.sub.2, . . . ,w.sub.p) as an index keyword by using K.sub.index, and the encryption of the index uses a chain structure:
w.fwdarw.Enc(flie.sub.1).fwdarw.Enc(flie.sub.2).fwdarw. . . . .fwdarw.Enc(flie.sub.x), when w=(w′, w.sup.1,w.sup.2, . . . ,w.sup.m) is a set of multiple similar-form words, for each similar-form word, firstly linking the document corresponding to the word, and then sequentially linking documents corresponding to other words, which ultimately generating an encrypted ciphertext index for all index keywords; S2.3, performing symmetric encryption operation on all the documents to be uploaded by using a symmetric encryption algorithm, and a symmetric encryption key is K.sub.enc, using the unique identifier set FILE(flie.sub.1,flie.sub.2, . . . ,flie.sub.n) to correspond to a ciphertext document, and then constructing a B+tree as an index into the unique identifier of the ciphertext document.

3. The full-text fuzzy search method for similar-form Chinese characters in the ciphertext domain according to claim 1, wherein the process of step S3 is as follows: S3.1, dividing K.sub.index into the key group K.sub.u and the server verification parameter P.sub.u by an exclusive OR operation; S3.2, sending the key group K.sub.u to the authorized user, and sending the server verification parameter P.sub.u to the cloud server, in order to verify the correctness of the search trapdoor; S3.3, when the data owner revokes the authority, requesting the server to delete the authentication parameter P.sub.u, at this time the search trapdoor generated using the key group K.sub.u whose authorization is revoked will be invalidated.

4. The full-text fuzzy search method for similar-form Chinese characters in the ciphertext domain according to claim 1, wherein the process of step S4 is as follows: S4.1, generating the search trapdoor by using the key group K.sub.u and the search keyword w, and submitting the trapdoor to the cloud server; S4.2, the cloud server using the verification parameter P.sub.u to XOR the search trapdoor, the XOR result matching the searchable ciphertext index, and the matching result is calculated to obtain a ciphertext unique identifier set (flie.sub.1,flie.sub.2, . . . , flie.sub.i), where i represents the number of files corresponding to the keyword, searches for the specified identifier in the B+tree to obtain an encrypted document set, and returns the encrypted document to the authorized user.

5. The full-text fuzzy search method for similar-form Chinese characters in the ciphertext domain according to claim 1, wherein the number of commonly used Chinese characters N is 3755.

Description

DESCRIPTION OF FIGURES

(1) FIG. 1 is a flowchart showing a full-text fuzzy search method for similar-form Chinese characters in the ciphertext domain disclosed in the present invention.

(2) FIG. 2 is a schematic diagram of an application structure showing a full-text fuzzy search method for similar-form Chinese characters in the ciphertext domain disclosed in the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

(3) In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are a part, but not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative efforts are within the scope of the present invention.

EMBODIMENTS

(4) This embodiment discloses a full-text fuzzy search method for similar-form Chinese characters in the ciphertext domain. It proposes a full-text fuzzy search scheme for Chinese characters in the ciphertext domain on a semi-trusted cloud server, based on symmetric searchable encryption and a constructed similar-form character library. Over a non-secure channel, the scheme supports ciphertext search based on the symmetric searchable encryption scheme, order-preserving fuzzy keyword search over Chinese ciphertext, and multi-keyword Chinese ciphertext search.

(5) The first part is initialization. The data owner constructs a dictionary of similar-form Chinese characters and generates the keys needed for initialization, then establishes a plaintext inverted index for the documents to be uploaded to the cloud service, reconstructs the plaintext inverted index using the similar-form character dictionary, and finally encrypts the plaintext inverted index and the set of documents to be uploaded, and uploads the encrypted index and the encrypted document set to the semi-trusted cloud server.

(6) The second part is the search phase. Authorized users generate search trapdoors from their authorized key groups and the search keywords w, and upload the search trapdoors to the cloud server as part of their query requests. The cloud server performs calculation operations on the trapdoors, matches and iterates the calculated result against the encrypted index, and finally returns the obtained document set to the requesting user.

(7) Finally, the user uses the document key to decrypt the returned ciphertext documents and recover the plaintext content.

(8) As shown in FIG. 1, a full-text fuzzy search method for similar-form Chinese characters in the ciphertext domain specifically comprises:

(9) S1, generate the inverted index, use the distributed search engine Lucene and the Chinese word segmentator IKAnalyzer to perform full-text segmentation on the uploaded document set, obtain the plaintext inverted index of the set of documents to be uploaded, construct the similar-form Chinese character dictionary library by analysing the commonly used Chinese characters, and use the similar-form Chinese character dictionary library to expand the plaintext inverted index of the set of documents to be uploaded;

(10) S2, data encryption: given the security parameter k, the data owner establishes the searchable encryption key K.sub.index=(K.sub.1,K.sub.2) and the symmetric encryption key K.sub.enc according to k, uses the searchable encryption key K.sub.index to encrypt the inverted index obtained in step S1 and construct the ciphertext index, and uses the symmetric encryption key K.sub.enc to encrypt the documents to be uploaded;

(11) S3, user authorization, the data owner divides the searchable encryption key K.sub.index into (K.sub.u, P.sub.u), sends K.sub.u to authorized users, and uses P.sub.u as a server verification parameter to complete user authorization;

(12) S4, search documents: the user takes the key group K.sub.u and the keyword w to be searched as input, generates the search trapdoor, and submits the search trapdoor to the cloud server; the cloud server verifies the trapdoor by encrypting the ciphertext index, and returns to the user the document sequence corresponding to matched encrypted documents and fuzzy keywords; if similar-form words of the search keyword are included in the document set, a document containing the search keyword itself is ranked before a document containing only its similar-form words in the search results.

(13) The process of step S1 is as follows:

(14) S11, establish the unique identifier set FILE(flie.sub.1, flie.sub.2, . . . , flie.sub.n) of documents to be uploaded, where n represents the number of documents to be uploaded;

(15) S12, use the distributed search engine Lucene and the Chinese word segmentator IKAnalyzer to perform full-text segmentation and filtering on the uploaded document set; the word segmentation result is (w.sub.1′, w.sub.2′, . . . , w.sub.p′), where p is the length of the inverted index, and the plaintext inverted index of the document set is EnIndex.sub.file=(w.sub.1′, w.sub.2′, . . . , w.sub.p′);

(16) S13, collect 3755 commonly used Chinese characters, establish a commonly used Chinese character dictionary, and expand it into a similar-form Chinese character dictionary library DICT by collecting and analysing the commonly used similar-form Chinese characters;

(17) S14, traverse each keyword w′ in the inverted index; if the keyword has a similar-form word, expand w′ to w=(w′, w.sup.1, w.sup.2, . . . , w.sup.m) by using DICT, where (w.sup.1, w.sup.2, . . . , w.sup.m) is the set of similar-form words of w′ and m represents the number of similar-form words of w′; if w′ does not have a similar-form word, then w=w′;

(18) S15, update the plaintext inverted index as EnIndex.sub.file=(w.sub.1, w.sub.2, . . . , w.sub.p).
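Steps S11 to S15 can be illustrated with a minimal Python sketch. It is not the Lucene/IKAnalyzer pipeline described above: segmentation is assumed to have happened already, so the input is a mapping from document identifier to a token list. The three-entry DICT, the function names, and the rule that a similar-form word is obtained by swapping exactly one character for a similar-form character are all illustrative assumptions; the text does not specify how similar-form words are derived from the character dictionary.

```python
from collections import defaultdict

# Hypothetical miniature similar-form dictionary (DICT).
# The library described in the text covers 3,755 commonly used characters.
DICT = {
    "未": ["末"],
    "末": ["未"],
    "己": ["已", "巳"],
}


def build_inverted_index(tokenized_docs):
    """S12: map each keyword w' to the sorted identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def similar_forms(keyword, dictionary):
    """Assumed rule: swap one character at a time for each of its similar-form characters."""
    variants = []
    for pos, ch in enumerate(keyword):
        for alt in dictionary.get(ch, []):
            variants.append(keyword[:pos] + alt + keyword[pos + 1:])
    return variants


def expand_index(index, dictionary):
    """S14/S15: expand each w' to the tuple w = (w', w^1, ..., w^m); w = (w',) if no variant exists."""
    return {w: (w, *similar_forms(w, dictionary)) for w in index}
```

For example, with `{"file1": ["未来", "加密"], "file2": ["加密"]}` as input, the keyword 未来 expands to the group (未来, 末来), while 加密 has no similar-form entry in this toy dictionary and stays a one-element group.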

(19) The process of step S2 is as follows:

(20) S21, according to the security parameter k, the data owner randomly generates the k-bit searchable encryption key K.sub.index=(K.sub.1,K.sub.2) and the symmetric encryption key K.sub.enc locally;

(21) S22, encrypt the inverted index EnIndex.sub.file=(w.sub.1, w.sub.2, . . . , w.sub.p) generated in step S1 as the index keyword by using K.sub.index; the encryption of the index uses the chain structure w.fwdarw.Enc(flie.sub.1).fwdarw.Enc(flie.sub.2).fwdarw. . . . .fwdarw.Enc(flie.sub.x); when w=(w′, w.sup.1, w.sup.2, . . . , w.sup.m) is a set of multiple similar-form words, for each similar-form word, first link the documents corresponding to that word, and then sequentially link the documents corresponding to the other words, which ultimately generates the encrypted ciphertext index for all index keywords;

(22) S23, perform a symmetric encryption operation on all the documents to be uploaded by using the symmetric encryption algorithm with the symmetric encryption key K.sub.enc, use the unique identifier set FILE(flie.sub.1, flie.sub.2, . . . , flie.sub.n) to correspond to the ciphertext documents, and then construct a B+ tree as the index into the unique identifiers of the ciphertext documents.
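The chain ordering in S22 is what places exact matches ahead of similar-form matches, and a short Python sketch may make that concrete. Everything here is an assumption made for illustration: the text does not name the cipher or the keyed function, so HMAC-SHA256 stands in for the keyword label, `toy_enc` is a toy XOR pad that is not a secure cipher (a real deployment would use an authenticated cipher), K.sub.1 is assumed to key the labels and K.sub.2 the chain nodes, and the B+ tree of S23 is not modelled.

```python
import hashlib
import hmac
import secrets


def keygen(k_bits=256):
    """S21: draw K_index = (K_1, K_2) and K_enc from a cryptographic random source."""
    n = k_bits // 8
    k_index = (secrets.token_bytes(n), secrets.token_bytes(n))
    k_enc = secrets.token_bytes(n)
    return k_index, k_enc


def toy_enc(key, file_id):
    """Toy stand-in for Enc(): XOR with an HMAC-derived pad. Illustration only, not secure."""
    data = file_id.encode("utf-8")
    if len(data) > 32:
        raise ValueError("toy cipher handles identifiers of at most 32 bytes")
    nonce = secrets.token_bytes(16)
    pad = hmac.new(key, nonce, hashlib.sha256).digest()
    return nonce + bytes(a ^ b for a, b in zip(data, pad))


def toy_dec(key, blob):
    """Inverse of toy_enc."""
    nonce, body = blob[:16], blob[16:]
    pad = hmac.new(key, nonce, hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(body, pad)).decode("utf-8")


def chain_order(word, group, postings):
    """Documents of `word` first, then documents of the other similar-form words, no repeats."""
    ordered = list(postings.get(word, []))
    for other in group:
        if other == word:
            continue
        for doc_id in postings.get(other, []):
            if doc_id not in ordered:
                ordered.append(doc_id)
    return ordered


def build_cipher_index(groups, postings, k_index):
    """S22: one encrypted chain per word; label from K_1, chain nodes encrypted under K_2."""
    k1, k2 = k_index
    cipher_index = {}
    for group in groups:
        for word in group:
            label = hmac.new(k1, word.encode("utf-8"), hashlib.sha256).digest()
            chain = chain_order(word, group, postings)
            cipher_index[label] = [toy_enc(k2, doc_id) for doc_id in chain]
    return cipher_index
```

With postings 未来 → (file1, file3) and 末来 → (file2, file3), the chain for 未来 is file1, file3, file2, while the chain for 末来 is file2, file3, file1. Each word's own documents lead its chain, which is the ranking property stated in S4.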

(23) The process of step S3 is as follows:

(24) S31, the data owner divides K.sub.index into a user key group K.sub.u and a server verification parameter P.sub.u by an exclusive OR operation;

(25) S32, the data owner sends the user key group K.sub.u to an authorized user, the authorized user generates the search trapdoor by using the key group K.sub.u and the search keyword w, and the data owner sends the server verification parameter P.sub.u to the cloud server, in order to verify the correctness of the user's search trapdoor;

(26) S33, when the data owner revokes the authority, the data owner requests the server to delete the verification parameter P.sub.u; from that point on, any search trapdoor generated using the revoked user key group K.sub.u is invalid.
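The exclusive OR split in S31 and the revocation in S33 can be sketched as follows. This is a minimal illustration under stated assumptions: K.sub.index is treated as a single byte string (for instance K.sub.1 and K.sub.2 concatenated), K.sub.u is drawn at random and P.sub.u is derived from it, and the `Server` class with its per-user table is hypothetical. The text states only that the split uses XOR and that deleting P.sub.u revokes the user.

```python
import secrets


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def split_key(k_index):
    """S31: choose K_u at random and set P_u so that K_u XOR P_u == K_index."""
    k_u = secrets.token_bytes(len(k_index))
    p_u = xor_bytes(k_index, k_u)
    return k_u, p_u


class Server:
    """Hypothetical server-side table of verification parameters, one per user."""

    def __init__(self):
        self.params = {}

    def authorize(self, user_id, p_u):
        """S32: store the verification parameter received from the data owner."""
        self.params[user_id] = p_u

    def revoke(self, user_id):
        """S33: delete P_u; trapdoors from this user can no longer be verified."""
        self.params.pop(user_id, None)

    def is_authorized(self, user_id):
        return user_id in self.params
```

Because K.sub.u is uniformly random, neither share alone reveals K.sub.index, and a fresh split can be issued per user so that revoking one user does not disturb the others.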

(27) The process of step S4 is as follows:

(28) S41, the authorized user generates the search trapdoor by using the key group K.sub.u and the search keyword w, and submits the trapdoor to the cloud server;

(29) S42, the cloud server XORs the search trapdoor with the verification parameter P.sub.u and matches the XOR result against the searchable ciphertext index; from the matching result it computes the ciphertext unique identifier set (flie.sub.1, flie.sub.2, . . . , flie.sub.i), where i represents the number of files corresponding to the keyword, searches for the specified identifiers in the B+ tree to obtain the encrypted document set, and returns the encrypted documents to the authorized user.
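The XOR check in S41 and S42 can be illustrated with a deliberately simplified Python sketch. It shows only the algebra that lets the server recover an index label from a trapdoor and P.sub.u, since K.sub.u XOR P.sub.u equals K.sub.index; it is not a secure construction and is not claimed to be the scheme's actual trapdoor function, which the text does not spell out. The label formula H(w) XOR K.sub.index, the function names, the plain dictionary standing in for the B+ tree, and the unencrypted identifiers in the index are all assumptions.

```python
import hashlib
import secrets


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def h(word):
    """Hash a keyword to 32 bytes."""
    return hashlib.sha256(word.encode("utf-8")).digest()


def index_label(k_index, word):
    """Assumed label stored in the index by the data owner: H(w) XOR K_index."""
    return xor_bytes(h(word), k_index)


def trapdoor(k_u, word):
    """S41: the user only holds K_u, so the trapdoor is H(w) XOR K_u."""
    return xor_bytes(h(word), k_u)


def server_search(p_u, td, cipher_index, store):
    """S42: XOR the trapdoor with P_u, match the index, then fetch the documents."""
    if p_u is None:  # P_u deleted: the user has been revoked
        return []
    label = xor_bytes(td, p_u)  # equals H(w) XOR K_index for a valid K_u
    doc_ids = cipher_index.get(label, [])
    return [store[doc_id] for doc_id in doc_ids]  # dict lookup stands in for the B+ tree
```

A trapdoor built from the wrong key group XORs to a label that is absent from the index, so the server returns an empty result, the same outcome as for a keyword that occurs in no document.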

(30) FIG. 2 is a schematic diagram of the application structure of the full-text fuzzy search method for similar-form Chinese characters in the ciphertext domain disclosed in the present invention. As shown in FIG. 2:

(31) The data owner generates the dictionary of similar-form Chinese characters used in the scheme; this dictionary determines the accuracy of the full-text fuzzy query for similar-form Chinese characters in the ciphertext domain. The data owner extracts the full-text keywords of each document in the plaintext set and, according to the similar-form Chinese character dictionary, performs similar-form fuzzy processing on each keyword of each document; symmetrically encrypts the document set to be uploaded; generates the encrypted ciphertext index from the fuzzy keywords and the corresponding ciphertext documents; and uploads the encrypted document set and the ciphertext index to the cloud server;

(32) The authorized user, when searching documents, encrypts the keyword or keywords to be searched with the authorized user's key group to generate a search trapdoor, and sends the trapdoor to the cloud server; during the search phase, the cloud server performs a checking computation on the trapdoor and returns the corresponding matched encrypted document set; if no document corresponds to the keyword, or the authorized user's key group is incorrect, the server returns nothing; finally, the authorized user downloads the matched ciphertext document set and decrypts it into the plaintext document set by using the document decryption key;

(33) The cloud server stores the ciphertext documents and the encrypted ciphertext index uploaded by the data owner; during the search phase, it obtains the trapdoor from the authorized user, computes the transformation, iterates the transformed result over the ciphertext index, and stores the unique identifier of each matching document in the output set; it then transmits all ciphertext documents corresponding to those unique identifiers to the authorized user, and sends no response if the output set is empty.

(34) In summary, the present invention mainly comprises generating the similar-form Chinese character dictionary, full-text segmentation of documents, document keyword extension, document encryption, and completing the fuzzy search in the ciphertext domain. In the initialization process, the data owner constructs the similar-form Chinese character dictionary library by collecting the common similar-form Chinese characters, then establishes the plaintext inverted index for the documents to be uploaded to the cloud service, reconstructs the plaintext inverted index by using the similar-form word dictionary library, uses the random number generator to generate the keys required for initialization, and finally encrypts the plaintext inverted index and the document set to be uploaded, and uploads the encrypted index and the encrypted document set to the semi-trusted cloud server. The authorized user generates search trapdoors from the authorized keys and the search keywords w; when the user requests a query, the search trapdoors are uploaded to the cloud server. The cloud server performs calculation operations on the trapdoors, matches and iterates the calculation results against the encrypted index, and returns the document set of the search results to the requesting user. Finally, the user decrypts the documents using the document key.

(35) The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment. Any other changes, modifications, substitutions, combinations, and simplifications made without departing from the spirit and principle of the present invention shall be regarded as equivalent substitutions and are included in the protection scope of the present invention.