DOCUMENT BASED QUERY AND INFORMATION RETRIEVAL SYSTEMS AND METHODS

20170322930 · 2017-11-09

Inventors

Jacob Michael Drew (Dallas, TX, US)

Abstract

Disclosed herein are systems and methods for document based query and information retrieval which rapidly locate similar documents within a document corpus, providing a document based search result to the search initiator including one or more estimated measures of similarity for each search result item and appropriate search result document metadata. After providing document based similarity approximation search results, the system also rapidly retrieves and determines more accurate measures of similarity, including the relevant document terms and term statistics used to determine an exact measure of similarity, between the document based query document term collection and individual search result document term collections using one or more computing devices, that are application and platform independent, participating in a distributed multicore processing environment. One or more web clients transmit document based query and information retrieval requests to one or more restful services which provide the document based search results to the search initiator via stateless HTTP responses and requests. Dimensionality reduction techniques are used to limit the total number of similarity approximations and document term data similarity calculations performed during both nearest neighbor pre-processing and document based searches. The systems and methods disclosed include document based query and information retrieval embodiments providing document search results to the search initiator which include the details supporting exactly how two documents are, in fact, similar using a given particular document based query and a specific measure for document based similarity.

Claims

1. A method for identifying documents within a document corpus empirically determined to be similar to a document of a document-based search query and for identifying the empirically determined similarities, the method comprising: accessing a plurality of related documents in a document corpus; creating a document fingerprint for each stored document by: extracting document data from each stored document's content; making empirical measurements on the extracted document data for each stored document, the empirical measurements representing unique characteristics of each stored document's content; receiving a document-based search query from a user for identifying one or more documents from the document corpus empirically determined to be similar to the document of the received document-based search query; determining one or more documents from the document corpus to be empirically similar to the document of the received document-based search query based on exactly matching one or more of the empirical measurements in the document fingerprints of documents in the document corpus with corresponding one or more empirical measurements made for a document fingerprint created for the document of the document-based search query; providing a document-based search result to the user, the search result comprising documents from the document corpus determined to be similar to the document of the document-based search query based on the exact matching one or more empirical measurements of the document fingerprints; and identifying to the user the exactly matched one or more empirical measurements and associated extracted data from the document fingerprint of the document of the document-based search query and the document fingerprint of each document comprising the search result determined to be similar.

2. A method in accordance with claim 1, further comprising creating the document fingerprint for the document of the received document-based search query upon receipt of the search query.

3. A method in accordance with claim 1, wherein the document of the received document-based search query is selected from the document corpus by the user.

4. A method in accordance with claim 3, wherein each document in the document corpus comprises a unique document identifier, wherein the received document-based search query comprises receiving one of said unique document identifiers from the user, and wherein providing a document-based search result comprises providing a list of said unique document identifiers corresponding to the documents comprising the document-based search results.

5. A method in accordance with claim 1, wherein the made empirical measurements comprising the document fingerprint for each document include lossy compression and/or dimensionality reduction representing each document's unique characteristics within a collection of numbers or bits.

6. A method in accordance with claim 5, wherein determining the one or more documents from the document corpus to be empirically similar to the document of the received document-based search query further comprises generating a score corresponding to a degree of the empirically determined similarity.

7. A method in accordance with claim 5, wherein the dimensionality reduction comprises performing one or more hashing functions on each document's unique characteristics to generate a corresponding hash value from each of the performed one or more hashing functions.

8. A method in accordance with claim 7, wherein one or more hashing functions comprises MinHashing and/or Locality Sensitive Hashing.

9. A method in accordance with claim 5, wherein the dimensionality reduction comprises one or more processes selected from the group consisting of: Random Sampling; Principal Component Analysis; Kernel Principal Component Analysis; Linear Discriminant Analysis; Quadratic Discriminant Analysis; Generalized Discriminant Analysis; Spectral Methods for Dimensionality Reduction; Bit sampling for Hamming distance; and Random Projection Dimensionality Reduction.

10. A method in accordance with claim 1, further comprising storing the extracted document data and the empirical measurements for each stored document in a repository.

11. A method in accordance with claim 10, further comprising empirically determining similarities between two or more of the stored documents of the document corpus based on matching one or more of the empirical measurements in the document fingerprint(s) of one or more of the stored documents with corresponding one or more empirical measurements in the document fingerprint(s) for another one or more of the stored documents prior to receiving the search query.

12. A method in accordance with claim 11, further comprising storing document neighbor data regarding the empirically determined similar two or more stored documents and the matched one or more empirical measurements used to determine their similarity.

13. A method in accordance with claim 1, wherein the extracted document data is data regarding one or more selected from the group consisting of: character(s) occurring in a document; term(s) occurring in a document; collection(s) of terms occurring in a document; metadata of a document; metadata occurring in a document; metadata occurring in the document corpus; metadata occurring in a document with regards to all documents in the document corpus; and statistics regarding one or more of character(s) occurring in a document, term(s) occurring in a document, collection(s) of terms occurring in a document, metadata of a document; metadata occurring in a document; metadata occurring in the document corpus and metadata occurring in a document with regards to all documents in the document corpus.

14. A method in accordance with claim 1, wherein identifying to the user the exactly matched one or more empirical measurements and associated extracted data further comprises presenting to the user the document of the document-based search query and one or more of each document comprising the search result, each said presented document having visual indicators illustrating one or more of the exactly matched one or more empirical measurements and associated extracted data therein.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0049] For a more complete understanding of the present invention and its advantages, reference is now made to the following description and the accompanying drawings, in which:

[0050] FIG. 1 illustrates a block diagram of a Web Based Document Query and Information Retrieval System;

[0051] FIG. 2 illustrates a block diagram of a Document Processing System;

[0052] FIG. 3 illustrates a block diagram of a Document Similarity Pre-Processing System;

[0053] FIG. 4 illustrates a block diagram of a Document Fingerprinting System;

[0054] FIG. 5 illustrates a block diagram of a Known Document Neighbors Request;

[0055] FIG. 6 illustrates a block diagram of a Known Document Terms Request;

[0056] FIG. 7 illustrates a block diagram of a Known Document Terms Request and Match;

[0057] FIG. 8 illustrates a block diagram of a Known Document Terms Match Request (Server and Client);

[0058] FIG. 9 illustrates a block diagram of a Known Document Content Request; and

[0059] FIG. 10 illustrates a block diagram of an Unknown Document Terms Request and Match.

DETAILED DESCRIPTION

[0060] The following detailed description includes exemplary embodiments of the invention disclosed and reference is made to the accompanying figures that form a part hereof. The figures are shown to illustrate specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized, and that structural changes and modifications may be made without departing from the scope of the present invention.

[0061] FIG. 1, a Web Based Document Query and Information Retrieval System, is indicated generally by reference 10. The Web Based Document Query and Information Retrieval System 10 illustrates a hardware and operating environment in which this invention's embodiments may be practiced. The following description of FIG. 1 provides a brief and more general description of the computing environment and hardware suitable for this invention's implementation.

[0062] While not required, embodiments of this invention are described in the context of program instructions which are executed by a computer. This may include program modules executed on a personal computer, server, mainframe, or other suitable computing device. Program modules may include objects, data structures, libraries, packages, or other components necessary to perform computing tasks, support abstract data types, or any other number of functions required for the given computing task at hand.

[0063] Individuals skilled in the art will appreciate that embodiments of this invention may be practiced using other computer, system, hardware, and network configurations, including hand-held devices, multiprocessor systems, embedded systems, programmable electronics, personal computers, laptops, minicomputers, mainframe computers, and other computing devices. System 10 embodiments might also be practiced in a distributed computing and/or web based cloud computing environment where remote processing tasks are performed on devices and systems located in any number of geographic locations linked through a communications network.

[0064] The system 10 includes one or more Document Processing 30 system instances operable to process one or more sets of Input Data 20. For convenience, any number of individual sets of input data are indicated generally by reference ID1 0,1,2-ID1 N-1. Input Data 20 may comprise any type of digital data conducive to content extraction in the form of text data, including all text characters supported by ASCII, Unicode, EBCDIC, UTF-8, UTF-16, or any other form of character encoding in which text data may exist. Likewise, Input Data 20 documents may include any form of text and/or digital content. Certain embodiments, for example, may process individual ID1 0,1,2-ID1 N-1 input dataset instances including web pages, news articles, programs, spreadsheets, database tables, patent documents, business documents, or any other digital content which a search initiator might wish to use for the purposes of document based query and information retrieval. The Document Processing 30 system is fully illustrated in FIG. 2 and described in detail within the sections of the detailed description referencing FIG. 2.

[0065] Now referring to FIG. 1, multiple individual instances of the Document Processing 30 system are referenced by DP 0, 1, 2-DP N-1. In some System 10 embodiments, a plurality of Document Processing 30 system instances may be used to receive Input Data 20 instances ID1 0,1,2-ID1 N-1. Likewise, Document Processing 30 system instances DP 0, 1, 2-DP N-1 may execute concurrently as multiple processes, as a single processing instance operating on multiple processing cores in parallel, or any combination of these two. Furthermore, Document Processing system instances DP 0, 1, 2-DP N-1 may execute or operate on multiple computing devices in any number of processing configurations suitable to meet the performance, redundancy, or scalability needs of a particular invention embodiment.

[0066] Upon receiving input data instances ID1 0,1,2-ID1 N-1, one or more Document Processing 30 system instances extract displayed text from each piece of digital content received. The Document Processing 30 system typically includes processing steps which populate portions of one or more Document Repository 40 data collections DR 0, 1, 2-DR N-1, including Document Metadata 50, Document Content 60, and Term Repository 70, including Term Statistics 80 and Term Data 90.

[0067] In certain invention embodiments, Document Repository 40 data structures may be maintained within one or more relational database management systems, NOSQL database systems, in-memory database systems, or a collection of hash tables or other data structures common to one skilled in the art. In yet another embodiment, one or more file servers supported by a disk drive or solid state storage media, network attached storage, direct attached storage, storage area networks, or any combination thereof may be used to support portions of Document Repository 40.

[0068] In one embodiment supporting the US patent corpora, a file server may maintain folders referenced by a known document key, typically a patent number, which contains data collections and entries for Document Content 60, and Term Repository 70, including Term Statistics 80 and Term Data 90. In this embodiment, patent text, extracted term data collections, and term statistics may be directly accessed from a plurality of files maintained on the file server simply by resolving the correct file server path using an appropriate document key and file name for each required repository file.
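The key-based path resolution described above can be sketched in Python as follows. This is a minimal illustration only: the folder layout, the file names in `REPOSITORY_FILES`, and the function name are hypothetical and are not specified by the disclosure.

```python
from pathlib import PurePosixPath

# Hypothetical file names for the repository entries kept in each document folder.
REPOSITORY_FILES = {
    "content": "content.txt",
    "term_data": "term_data.json",
    "term_statistics": "term_statistics.json",
}


def resolve_repository_path(root: str, document_key: str, entry: str) -> PurePosixPath:
    """Build <root>/<document_key>/<file name> for one repository entry."""
    if entry not in REPOSITORY_FILES:
        raise KeyError(f"unknown repository entry: {entry}")
    return PurePosixPath(root) / document_key / REPOSITORY_FILES[entry]
```

Because the path is derived entirely from the document key and the entry name, no index lookup is needed to locate a repository file in such a layout.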

[0069] The Document Repository 40 may also comprise a Score Repository 100 which may include Document Neighbors 120 and Document Fingerprints 110 data structures. The Score Repository 100 is typically populated and updated by one or more Similarity Pre-processing 85 instances SP 0, 1, 2-SP N-1 which may execute concurrently as multiple processes, as a single processing instance operating on multiple processing cores in parallel, or any combination of these two. Furthermore, Similarity Pre-processing 85 system instances SP 0, 1, 2-SP N-1 may execute or operate on multiple computing devices in any number of processing configurations suitable to meet the performance, redundancy, or scalability needs of a particular invention embodiment.

[0070] Similarity Pre-processing 85 instances receive unprocessed input from Term Repository 70's Term Data 90 collections, processing each unique term entry consistent with Similarity Pre-processing 85 processing requirements. Similarity Pre-processing 85 instances generate a document sketch using each Term Data 90 collection and place the document sketches in the Document Fingerprints 110 data structure. Certain embodiments may include an optional document banding processing step adding a plurality of band values to each document sketch.

[0071] Preferred embodiments of the Document Fingerprints 110 data structure are used for the purposes of dimensionality reduction during nearest neighbor searches. This processing is referenced in FIG. 3's Create Document Fingerprints 320 process and explained in great detail within the FIG. 4 Document Fingerprinting System 430. Such searches may occur during various stages within the Web Based Document Query and Information Retrieval System 10. In one example embodiment, dimensionality reduction via band values is used to exponentially reduce the total number of similarity comparisons while populating the Document Fingerprints 110 data structure. In yet another example, the FIG. 10 Unknown Document Terms Request and Match 700 generates an unknown document sketch to exponentially reduce the number of similarity comparisons required to perform unknown document based search and information retrieval operations.

[0072] Each Term Data 90 collection processed during Similarity Pre-processing 85 creates an entry within the Document Neighbors 120 repository. The Document Neighbors 120 repository contains a document key entry for each known document within the system which includes the top n most similar document references collection as the value for each entry. The Document Neighbors 120 data structure rapidly services requests for both known and unknown document keys and document sketches. In some embodiments, known document keys are provided as illustrated in the FIG. 5 Known Document Neighbors Request 490. During such a transaction, the similar document references collection is directly accessed via the known document key, rapidly providing access to search result candidates for the known document key provided. In other embodiments, optional document sketch band values within the Document Neighbors 120 data structure are matched to one or more bands within a known or unknown document sketch to dramatically reduce the number of similarity comparisons needed to identify document candidates for similarity score calculations and eventual nearest neighbor candidates.
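One way to picture a neighbors repository of this kind is a key-value mapping from each known document key to its top n most similar document references. The Python sketch below assumes an in-memory dictionary and uses hypothetical names and document keys; it is not the disclosed implementation.

```python
import heapq
from typing import Dict, List, Tuple

Neighbor = Tuple[str, float]  # (document key, similarity score)


def top_n_neighbors(scores: Dict[str, float], n: int) -> List[Neighbor]:
    """Keep the n highest-scoring candidates, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Hypothetical repository: one entry per known document key.
document_neighbors: Dict[str, List[Neighbor]] = {}


def store_neighbors(document_key: str, scores: Dict[str, float], n: int = 10) -> None:
    """Create or replace the entry for one known document."""
    document_neighbors[document_key] = top_n_neighbors(scores, n)


def known_document_neighbors(document_key: str) -> List[Neighbor]:
    """Direct lookup by known document key; no similarity work at request time."""
    return document_neighbors.get(document_key, [])
```

With entries precomputed during pre-processing, a known-document request reduces to a single key lookup.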

[0073] In the preferred embodiment, Document Repository 40 is accessible to one or more Restful Service 130 instances RS 0, 1, 2-RS N-1 exposing the six exemplary restful requests illustrated in FIG. 5-FIG. 10 to any number of Web Clients 140 instances WC 0, 1, 2-WC N-1. Restful Services 130 are accessible via the HTTP or HTTPS protocols providing stateless transaction oriented services to a web client, web application or both. In other embodiments, Document Repository 40 may be accessible to one or more computer applications servicing requests to an application user via a local communication network, virtual private network, or simply on a local computing device.

[0074] In one System 10 embodiment supporting the US patent corpora, the search initiator selects or uploads a known or unknown patent document via the Web Client 140. In turn, Web Client 140 provides an HTTP request, including the appropriate document key and/or sub keys, to the Restful Services 130 via the appropriate services URL. Restful Services 130 provides an HTTP response which includes content representative of the HTTP request initiated by the Web Client 140 instance.

[0075] FIG. 2 refers to an exemplary and detailed illustration of a Document Processing System embodiment generally indicated by reference 150. Multiple individual instances of the FIG. 1 Document Processing 30 and the detailed FIG. 2 Document Processing System 150 are referenced by DP 0, 1, 2-DP N-1. One or more Document Processing System 150 instances may operate within the context of a Web Based Document Query and Information Retrieval System 10 embodiment.

[0076] Document Processing System 150 instances typically generate term data collections for each input data instance ID1 0,1,2-ID1 N-1 received, and term data collections are typically partitioned by a known document key within the Term Data 90 data structure. Document Content Extraction 160 extracts appropriate document content for further downstream processing. In an embodiment supporting the US patent corpora, for example, patent text HTML and relevant patent metadata are extracted from the raw patent XML data provided by the USPTO bulk data system during the Document Content Extraction 160 processing step. In addition, individual patent documents may be separated into individual input data instances ID1 0,1,2-ID1 N-1 during the Document Content Extraction 160 processing step. Document Content Extraction 160 populates other patent metadata, images, and text into Document Metadata 50 and Document Content 60.

[0077] The Rendered Text Extraction 170 processing step transforms targeted document content into displayed or human readable text. For instance, an embodiment processing web pages may remove HTML tags and other non-relevant data from text prior to further downstream text processing operations using a Rendering Service 180 process.

[0078] In a preferred embodiment supporting the US patent corpora, HTML formatted patent text and claims are extracted from the raw XML data contained in a USPTO bulk data file during the Document Content Extraction 160 processing step. Next, Rendered Text Extraction 170 executes one or more “headless” browser instances in-memory during the Rendering Service 180 process to transform the patent's HTML to human readable patent text. The patent specific HTML is saved to disk by Rendered Text Extraction 170 in the Document Content 60 repository, and a headless browser instance is directed to render and transform the HTML file using a file:// URL. Once the browser renders the HTML, the transformed patent text is then saved for further downstream document pre-processing in the Document Content 60 repository.

[0079] In other Rendering Service 180 embodiments, patent specific HTML may be passed in memory directly to the headless browser using no file, and only the transformed browser text is saved. While the preferred embodiment may use browser rendering to remove HTML tags and HTML specific entity definitions (such as &amp;, &lt;, &gt;, etc.) from patent specific HTML, other embodiments may simply use regular expressions, lookup tables, HTML processing packages such as HTML Agility Pack, or any combination of the aforementioned tools or other similar tools to process and convert HTML specific input data to text. Yet other embodiments may receive text documents as input data and require no HTML specific or other specific input data transformation processing at all to extract the document's text.
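As a rough illustration of the regular-expression and lookup-table alternative mentioned above, the following Python sketch strips tags with a regular expression and decodes entities with the standard library's `html.unescape`. It is an assumption-laden simplification (it does not handle scripts, comments, or malformed markup) and is not a substitute for browser rendering; the function name is hypothetical.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")
WHITESPACE_PATTERN = re.compile(r"\s+")


def html_to_text(markup: str) -> str:
    """Strip tags, decode HTML entities, and collapse whitespace."""
    # Remove tags first so that decoded entities such as &lt; are not
    # mistaken for markup afterwards.
    without_tags = TAG_PATTERN.sub(" ", markup)
    decoded = html.unescape(without_tags)
    return WHITESPACE_PATTERN.sub(" ", decoded).strip()
```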

[0080] Term Extraction 190 receives document text as input from the Rendered Text Extraction 170 process and applies any number of term processing techniques to the raw document text received. Terms produced during Term Extraction 190 processing may include extractions of any arrangements or portions of document characters and/or words including Paragraph Detection 200, Sentence Detection 210, Tokenization Processing 220, N-gram Production 230, Natural Language Processing 240, Noun Phrase Production 250, K-mer Production 260 (chunks of overlapping document characters produced using a “sliding window”), or any other arrangement of document characters conceived by the application programmer or data scientist to be used for subsequent document similarity comparison or any aspect of document based query and information retrieval processing.

[0081] Individual extractions of any unique arrangements or portions of document characters and/or words during Term Extraction 190 processing are referred to as “Terms” in the context of this invention. The Term Extraction 190 process may use any combination of the aforementioned text processing techniques to generate document terms and document term collections. In a preferred embodiment supporting the US patent corpora, transformed patent text input data is processed by one or more Term Extraction 190 processing instances. Sentence detection is performed on transformed patent text dividing the text into sentences. Patent sentences are stored within the Document Content 60 repository and partitioned by patent and claims specific text. Next, any number of additional term collections are generated including N-gram Production 230 processes which create one or more term collections comprising n-grams of varying lengths. In some embodiments, Natural Language Processing 240 may be performed to tag each sentence's part of speech for the subsequent identification of noun phrases within a Noun Phrase Production 250 process.
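The tokenization, n-gram, and k-mer steps named above can be sketched in Python as follows. The tokenization rule (lowercased runs of letters and digits) and the function names are assumptions made for illustration; the disclosure leaves the specific techniques open.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kmers(text: str, k: int) -> List[str]:
    """Return overlapping character chunks of length k (a sliding window)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]
```

For example, `word_ngrams(tokenize("a method comprising"), 2)` yields the two bigrams "a method" and "method comprising".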

[0082] Any number of programming languages, libraries, packages, or custom code modules or combination thereof may be used to perform Paragraph Detection 200, Sentence Detection 210, Tokenization Processing 220, N-gram Production 230, Natural Language Processing 240, Noun Phrase Production 250, or K-mer Production 260 such as using NLP.net or Sharp.net within the C# programming language. Yet other embodiments may use the Python programming language in combination with OpenNLP, while other embodiments may include original custom processing methods in any programming language to perform such activities. Furthermore, all models used within Natural Language Processing 240 such as tokenization, sentence boundary detection, part of speech tagging and others may be generated using a common document corpus such as the Brown Corpus or others. However, the preferred embodiment will use custom models within Natural Language Processing 240 which are appropriate for the type of documents being processed. For example, embodiments processing US patents may use a sentence boundary detection model which is specific to sentences contained within US patent documents.

[0083] Term Frequency 270 processing tracks and updates the number of times each unique term within a term data collection occurs during Term Extraction 190 processing. After Term Extraction 190 processing is completed for a particular document, each term data collection is saved by the Save Term Data Collections 280 process, which may comprise serializing term data collection objects to disk within the Term Data 90 repository, or inserting term data collections into a relational database table or some other data store. In some embodiments, the Save Term Data Collections 280 processing step operates concurrently with Term Extraction 190 processing, saving each Term within the appropriate Term Data 90 repository as it is encountered during Term Extraction 190 processing.
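A minimal Python sketch of per-document term frequency counting and serialization is given below. JSON is used here only as one possible serialization format, and the function names are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def term_frequencies(terms: Iterable[str]) -> Dict[str, int]:
    """Count how many times each unique term occurs in one term data collection."""
    return dict(Counter(terms))


def serialize_term_data(frequencies: Dict[str, int]) -> str:
    """Serialize a term data collection to text (JSON chosen for illustration)."""
    return json.dumps(frequencies, sort_keys=True)
```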

[0084] In the preferred embodiment, Term Statistics 80 is populated with relevant statistics related to the entire document corpora. The Term Statistics 80 data structure is typically a collection of key-value pairs contained within a database, hash table, NOSQL key-value data store, or data structure which is appropriate to one skilled in the art. Term Statistics 80 data typically comprises corpora level data such as term document frequency (i.e. the number of documents a term appears in), unique term totals, corpora total document counts, and any other summary level statistics relevant to subsequent similarity calculations or required for a particular embodiment's practice. Upon completion of Term Extraction 190 instance processing, each document's one or more term data collections are placed within the Term Data 90 repository and associated with the appropriate document key and/or sub keys. A Term Stats Processing 290 instance processes term data collections updating the appropriate entries within Term Statistics 80 repository.

[0085] In accordance with one invention embodiment, Term Stats Processing 290 updates Term Document Frequency entries within the Term Statistics 80 repository for each unique term encountered within each unique document. Other invention embodiments may track and update any number of other summary level statistics within the Term Statistics 80 repository about the document corpora, terms contained within the document corpora, or any other statistics conceived by the application programmer to support document based query and information retrieval.
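The term document frequency update described above can be sketched in Python as follows, assuming each document's terms are available as an iterable keyed by document key. The names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def document_frequencies(term_collections: Mapping[str, Iterable[str]]) -> Dict[str, int]:
    """Map each term to the number of documents it appears in."""
    counts: Counter = Counter()
    for terms in term_collections.values():
        # A term counts once per document, however often it repeats there.
        counts.update(set(terms))
    return dict(counts)
```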

[0086] Some embodiments may maintain multiple separate or partitioned Term Statistics 80 repository data structures. For the purposes of scalability and performance, Term Statistics 80 repository partitions may reside on multiple database tables or servers for a very large document corpus. However, other embodiments may maintain multiple Term Statistics 80 repository data structures to house multiple Term Document Frequency collections or statistics which are required for a particular document based query and information retrieval process. For example, multiple Term Statistics 80 repository data structures may be used to maintain separate term document frequency collections for both patent text and patent claims specific text in an embodiment supporting the US patent corpora.

[0087] Once the Term Statistics 80 repository has been updated, Term Data Stats Update 300 processing may perform additional pre-processing update operations on the term data collections to update unique terms with document corpora specific term level statistics. For example, unique terms within a term data collection may be updated to include the Inverse Document Frequency (IDF) or the Term Frequency-Inverse Document Frequency (TF-IDF), which may only be calculated using details about the entire document corpora which are unknown during Term Extraction 190 processing.
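The dependency on corpus-level statistics can be seen in a short Python sketch. It assumes one common weighting, idf = ln(N / df) with tf taken as the raw count; the disclosure does not fix a particular formula, and the function names are hypothetical.

```python
import math
from typing import Dict, Mapping


def inverse_document_frequency(document_frequency: int, total_documents: int) -> float:
    """idf = ln(N / df); one common form, assumed here for illustration."""
    return math.log(total_documents / document_frequency)


def tf_idf(
    term_counts: Mapping[str, int],
    document_frequencies: Mapping[str, int],
    total_documents: int,
) -> Dict[str, float]:
    """Weight each term's raw count by its inverse document frequency."""
    return {
        term: count * inverse_document_frequency(document_frequencies[term], total_documents)
        for term, count in term_counts.items()
    }
```

Both `document_frequencies` and `total_documents` are only known once the whole corpus has been processed, which is why this update runs after term statistics processing rather than during term extraction.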

[0088] FIG. 3 refers to an exemplary and detailed illustration of a Document Similarity Pre-Processing System embodiment generally indicated by reference 310. One or more Document Similarity Pre-Processing System 310 instances, referenced as SP 0, 1, 2-SP N-1, may operate within the context of a Web Based Document Query and Information Retrieval System 10. Document Similarity Pre-Processing System 310 instances are also referenced within FIG. 1 and FIG. 3 as Similarity Pre-Processing 85, SP 0, 1, 2-SP N-1.

[0089] The Term Repository 70 provides both Term Statistics 80 and Term Data 90 data structures as input data for Similarity Pre-Processing 85 instances. The Create Document Fingerprints 320 processing step receives as input document specific term data collections from the Term Data 90 data structure and possibly other data from the Term Statistics 80 repository as necessary. Detailed processing steps for the Create Document Fingerprints 320 process are disclosed within FIG. 4. One or more instances of the Create Document Fingerprints 320 process generate document sketches using one or more document specific term data collections. The document sketches are a form of lossy compression and/or dimensionality reduction which succinctly represents a single document's unique characteristics within a collection of numbers or bits which is considerably smaller than the number of entries within the input term document collection. Completed fingerprints for each document are placed into the Document Fingerprints 110 repository during the Create Document Fingerprints 320 step. In some instances, however, retaining a document's sketch may not be required. For example, certain embodiments performing unknown document searches may opt not to store the sketches for each unknown document search.

[0090] The Neighbor Identification 330 process receives as input document specific term data collections from the Term Data 90 data structure and possibly other data from the Term Statistics 80 repository as necessary, performing Neighbor Candidates Processing 340 steps which utilize document sketches and optional band values within the Document Fingerprints 110 repository to perform rapid location of each known document's nearest neighbors. A document's nearest neighbors collection comprises other document references within the known documents corpora which are considered to be highly similar.

[0091] In embodiments which contain smaller total numbers of known documents or which require the highest accuracy in similarity estimates for document neighbors, the Create Document Fingerprints 320 step and the Document Fingerprints 110 repository might be eliminated altogether. However, in embodiments containing very large numbers of known documents or requiring very rapid searches in, or population of, Document Neighbors 120, multiple layers or levels of document sketches might exist within each Document Neighbors 120 entry.

[0092] In accordance with one embodiment of the invention supporting the US patent corpora, both minhashing and locality sensitive hashing operations are utilized to create each Document Neighbors 120 entry. The minhashing process described in detail within FIG. 4 is used to perform a first dimensionality reduction. Locality sensitive hashing is also used to generate one or more band values, such that similar documents are “hashed into” and receive a similar band value. These band values are used as a second dimensionality reduction to further reduce complexity during nearest neighbor searches.
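
By way of non-limiting illustration, the effect of band values on candidate selection can be estimated with the standard locality sensitive hashing analysis. The sketch below assumes, beyond what is stated above, that each band value is computed from a fixed number of minhash values, that two documents become candidates when at least one band value matches exactly, and that each minhash position agrees with probability equal to the documents' Jaccard similarity. The function name and its parameters are hypothetical.

```python
def candidate_probability(similarity: float, rows_per_band: int, bands: int) -> float:
    """Probability that two documents share at least one band value.

    Assumes each minhash position agrees independently with probability
    `similarity`, so an entire band agrees with probability
    similarity ** rows_per_band.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if rows_per_band <= 0 or bands <= 0:
        raise ValueError("rows_per_band and bands must be positive")
    band_agrees = similarity ** rows_per_band
    # Complement of "every band disagrees".
    return 1.0 - (1.0 - band_agrees) ** bands
```

Under these assumptions, a configuration with few, wide bands acts as a strict filter: only highly similar document pairs are likely to share a band value, which keeps the number of signature comparisons small.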

[0093] When Neighbor Candidates Processing 340 determines which documents' pairwise similarity should be calculated, only documents with an acceptable level of matching band values are selected in Stage 1 of Neighbor Candidates Processing 340. Next, in Stage 2, minhash signatures for documents with sufficient matching band values are compared to approximate each document's pairwise similarity. Finally, the top n most similar document approximations may be selected for the top n most similar document references collection and placed directly in the Document Neighbors 120 repository.
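
By way of non-limiting illustration, the two stages described above can be sketched as follows. The sketch assumes that fingerprints are held in plain in-memory dictionaries keyed by document key, that an "acceptable level" of matching band values means a minimum count of position-wise equal band values, and that pairwise similarity is approximated as the fraction of agreeing minhash positions. All names are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List, Sequence, Tuple


def estimate_similarity(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length minhash signatures agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def top_n_neighbors(
    query_key: str,
    signatures: Dict[str, Sequence[int]],
    bands: Dict[str, Sequence[int]],
    n: int = 10,
    min_matching_bands: int = 1,
) -> List[Tuple[str, float]]:
    """Return up to n (document key, approximate similarity) pairs."""
    # Stage 1: keep only documents sharing enough band values with the query.
    query_bands = bands[query_key]
    candidates = []
    for key, doc_bands in bands.items():
        if key == query_key:
            continue
        shared = sum(1 for q, d in zip(query_bands, doc_bands) if q == d)
        if shared >= min_matching_bands:
            candidates.append(key)

    # Stage 2: approximate pairwise similarity from the minhash signatures.
    query_signature = signatures[query_key]
    scored = [(key, estimate_similarity(query_signature, signatures[key]))
              for key in candidates]

    # Highest similarity first; ties broken by key for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:n]
```

A production embodiment would more likely index documents by band value rather than scan every document, but the scan keeps the two stages easy to follow.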

[0094] In certain embodiments, Neighbor Candidates Processing 340 may provide each top n most similar document references collection to Comparison Term Data 350 for further processing. In such embodiments, Comparison Term Data 350 retrieves each document reference's term data collection and provides it to the Similarity Engine 360, where one or more of Jaccard Similarity 370, Weighted Jaccard Similarity 380, TFIDF Cosine Similarity 390, IDF Similarity 400, TFIDF Similarity 410, Weighted TFIDF Similarity 420, or any other similarity metric conceived by one skilled in the art for comparing term document collections for similarity while practicing this invention, may be calculated. Document Neighbors 120 top n most similar document references collections are then updated with the final similarity values determined by Similarity Engine 360 processing. In some embodiments, only similarity approximations are generated from document sketches during similarity pre-processing. In those embodiments, actual similarities are determined only during actual match requests, by Restful Service 130 or Web Client 140 processing.
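
By way of non-limiting illustration, three of the similarity measures named above can be sketched using their common textbook definitions. The disclosure does not give formulas in this passage, so the definitions below are assumptions: Jaccard similarity over term sets, weighted Jaccard similarity as the sum of per-term minimum weights over the sum of per-term maximum weights, and cosine similarity over term weight vectors such as TFIDF scores. The function names and the dictionary representation of a term data collection are hypothetical.

```python
import math
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_jaccard(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Sum of per-term minimum weights over sum of per-term maximum weights."""
    terms = set(a) | set(b)
    numerator = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    denominator = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return numerator / denominator if denominator else 0.0


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Each function operates on full term data rather than on sketches, which is why such calculations are reserved for the small top n collection rather than the whole corpus.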

[0095] FIG. 4, a Document Fingerprinting System, is indicated generally by reference 430. For convenience, any number of individual instances of a Document Fingerprinting System 430's Create Document Fingerprints 320 process are indicated generally by reference CDF 0, 1, 2-CDF N-1. Create Document Fingerprints 320 processing typically performs dimensionality reduction for the purposes of nearest neighbor search. The Term Repository 70 provides term data collections from the Term Data 90 data structure as input for Create Document Fingerprints 320 processing. In certain instances, such as the Unknown Document Terms Request and Match 700 which is illustrated in FIG. 10, a Document Processing 30 instance may also provide term data collections as input to Create Document Fingerprints 320. Fingerprint Producer 440 includes a Skip Dups Set 450 and a plurality of Random Hashing Functions 460. For convenience, individual hashing function instances are referenced by RHF 0, 1, 2-RHF N-1. The Fingerprint Producer 440 process performs dimensionality reduction on individual term data collections by passing each term within the term data collection through each of the Random Hashing Functions 460, generating a hash value from each hashing function instance RHF 0, 1, 2-RHF N-1.

[0096] In one example embodiment supporting the US patent corpora, Random Hashing Functions 460 includes 100 random hashing functions RHF 0, 1, 2-RHF 99. Each term within a term document collection for a single patent application is first checked against the Skip Dups Set 450 to ensure the term has not been previously hashed for the same document. In the preferred embodiment, term document collections will only contain unique terms and the Skip Dups Set 450 will not be required. Each unique term within the term data collection is passed through the 100 random hashing functions RHF 0, 1, 2-RHF 99 while the Fingerprint Producer 440 retains only the minimum hash value produced by each of RHF 0, 1, 2-RHF 99, resulting in a minhash signature containing 100 integers. In the preferred embodiment, multiple term data collections may be processed in parallel on multiple processors and/or computing devices. Minhash Collections 470 may be a thread-safe collection capable of maintaining multiple minhash signatures for multiple term data collections simultaneously. Some embodiments may use optimal thread-safe, lock free, in-memory data structures for the Minhash Collections 470 while other embodiments may use locking strategies with more common data structures to house Minhash Collections 470.
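
By way of non-limiting illustration, the minhash signature generation described above can be sketched as follows. The disclosure does not specify how the random hashing functions are constructed, so the sketch assumes a common choice: each function has the form (a * x + b) mod p for randomly drawn a and b and a fixed prime p, where x is a stable 64-bit integer derived from the term. The names make_hash_functions, term_to_int, and minhash_signature are hypothetical.

```python
import hashlib
import random
from typing import Iterable, List, Tuple

PRIME = (1 << 61) - 1  # a Mersenne prime used as the modulus (assumed choice)


def make_hash_functions(count: int = 100, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw `count` (a, b) coefficient pairs; a fixed seed keeps them reproducible."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def term_to_int(term: str) -> int:
    """Map a term to a stable 64-bit integer (stable across runs and processes)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(terms: Iterable[str],
                      hash_functions: List[Tuple[int, int]]) -> List[int]:
    """Keep, for each hash function, the minimum hash value over all terms."""
    signature = [PRIME] * len(hash_functions)  # PRIME exceeds every hash value
    seen = set()  # plays the role of a skip-duplicates set
    for term in terms:
        if term in seen:
            continue
        seen.add(term)
        x = term_to_int(term)
        for i, (a, b) in enumerate(hash_functions):
            value = (a * x + b) % PRIME
            if value < signature[i]:
                signature[i] = value
    return signature
```

The same coefficient pairs must be used for every document, since signatures are only comparable position by position when they come from identical hash functions.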

[0097] In certain embodiments, each completed minhash signature is accessed from the Minhash Collections 470 data structure by the Optional Banding Process 480. Additional banding values are added to each completed minhash signature by generating a single hash from all of the minhash values within a particular band to produce a single band value for each band. For example, the minimum hash values generated by RHF 0, 1, 2-RHF 99 may be divided into 5 bands of 20 minhash values each. For the first band, a single hash value is produced using the minhash values from RHF 0-RHF 19 to produce the value for band 1. The final minhash signature now contains 100 minhash values plus 5 additional banding values for a total of 105 values. Final minhash signatures are placed into the Document Fingerprints 110 repository by the Create Document Fingerprints 320 process.
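
By way of non-limiting illustration, the banding step described above can be sketched as follows. The choice of hash used to combine the minhash values of a band is not stated in the disclosure; the sketch assumes a 64-bit BLAKE2b digest over a textual encoding of the values. The names band_values and fingerprint are hypothetical.

```python
import hashlib
from typing import List, Sequence


def band_values(signature: Sequence[int], bands: int = 5) -> List[int]:
    """Hash each consecutive slice of the signature into one band value."""
    if bands <= 0 or len(signature) % bands != 0:
        raise ValueError("signature length must be a positive multiple of bands")
    rows = len(signature) // bands  # e.g. 100 values / 5 bands = 20 per band
    values = []
    for band_index in range(bands):
        chunk = signature[band_index * rows:(band_index + 1) * rows]
        data = ",".join(str(v) for v in chunk).encode("ascii")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        values.append(int.from_bytes(digest, "big"))
    return values


def fingerprint(signature: Sequence[int], bands: int = 5) -> List[int]:
    """Minhash values followed by their band values (100 + 5 = 105 in the example)."""
    return list(signature) + band_values(signature, bands)
```

Two signatures that agree on every minhash value of a band receive the same band value, so band values can serve as lookup keys for candidate documents.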

[0098] The exemplary Web Based Document Query and Information Retrieval System 10 illustrated in FIG. 1 supports 6 primary restful requests in the service of web based document queries and information retrieval. Each of these restful requests is provided by one or more Restful Services 130. Typically, one or more Web Clients 140 represent a search initiator performing one or more aspects of web based document queries and information retrieval. Each of the 6 primary restful requests is facilitated via stateless HTTP requests from Web Clients 140, which are provided HTTP responses from the Restful Services 130.

[0099] FIG. 5, a Known Document Neighbors Request, is indicated generally by reference 490. The Known Document Neighbors Request 490 begins with an HTTP request generated by a Web Client 140. The Restful Service 130 receives an HTTP request containing Document Key Request 500 including a known document key. In the preferred embodiment supporting the US patent corpora, the known document key may represent a known US patent number and an optional sub-key indicating whether to perform a patent text, claims specific text, or patent full text document based query. The Document Key Request 500 processing retrieves the known document key's top n most similar document references collection from the Document Neighbors 120 repository. Nearest Neighbors 510 initially includes all document references contained within the top n most similar document references collection, but may perform Optional Document Filtering 520 to remove any documents which are not relevant for the search. Optional Document Filtering 520 may utilize additional data provided within the Document Key Request 500 in combination with the Document Metadata 50 to filter and reduce records included in the document key's top n most similar document references collection. Finally, a metadata based search result is included in Metadata Response 530, which is provided in the HTTP response generated by Restful Services 130 and transmitted back to the search initiator's Web Client 140.
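
By way of non-limiting illustration, the server-side handling of a Known Document Neighbors Request can be sketched independently of any particular web framework. The sketch assumes that the neighbors repository and the metadata repository are in-memory dictionaries and that optional filtering is expressed as a predicate over a metadata record. The function name, the response layout, and the metadata fields used in the usage note are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Metadata = Dict[str, Any]


def known_document_neighbors(
    document_key: str,
    neighbors: Dict[str, List[Tuple[str, float]]],
    metadata: Dict[str, Metadata],
    keep: Optional[Callable[[Metadata], bool]] = None,
) -> Dict[str, Any]:
    """Build a metadata based search result for a known document key."""
    # Look up the precomputed top n references; unknown keys yield no results.
    references = neighbors.get(document_key, [])
    results = []
    for reference_key, score in references:
        record = metadata.get(reference_key)
        if record is None:
            continue  # no metadata available for this reference
        if keep is not None and not keep(record):
            continue  # optional document filtering
        results.append({"key": reference_key, "similarity": score, **record})
    return {"query": document_key, "results": results}
```

For example, a caller might pass keep=lambda record: record["year"] >= 2005 to restrict the result to documents from 2005 onward, assuming the metadata records carry a year field. Because every input arrives with the request or comes from a repository, the handler holds no per-client state between calls.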

[0100] In some embodiments, document based query search results may include very limited document metadata such as the document key (a US patent application number in some embodiments, for example), the document title, author, inventor, assignee, and a brief selection of the document text. However, other embodiments may return a much more robust document based search result including estimated or actual document similarity scores, pre-fetched document term sets for some or all of the documents within a search result, and other document content such as a pdf or other files containing all or portions of the document's text.

[0101] FIG. 6, a Known Document Terms Request, is indicated generally by reference 540. The Known Document Terms Request 540 begins with an HTTP request generated by a Web Client 140. The Restful Service 130 receives an HTTP request containing Document Key Term Request 550 including a known document key. In the preferred embodiment supporting the US patent corpora, the known document key may represent a known US patent number and an optional sub-key indicating whether to perform a patent text, claims specific text, or patent full text document based query. The Document Key Term Request 550 processing retrieves the term data collection for the provided known document key from the Term Data 90 repository.

[0102] Document Terms Collection 560 initially includes all document terms contained within the document terms collection, but may perform Optional Term Filtering 570 to remove any terms which are not relevant for the request. For example, terms below a certain TFIDF score within the terms collection may be filtered. Optional Term Filtering 570 may utilize additional data provided within the Document Key Term Request 550 in combination with additional data within the term document collection to filter and reduce term records included in the document key's term data collection. Finally, the appropriate term data collection is provided in Terms Response 580, which is included in the HTTP response generated by Restful Services 130 and transmitted back to the search initiator's Web Client 140.

[0103] FIG. 7, a Known Document Terms Request and Match, is indicated generally by reference 590. The Known Document Terms Request and Match 590 begins with an HTTP request generated by a Web Client 140. The Restful Service 130 receives an HTTP request containing Match Terms Request 600 including at least two known document keys. Next, a server-side Document Key Term Request 550 is performed for each of the requested known document keys. Match Term Operations 610 receives at least two term data collections for at least two known document keys. The term data collections are matched or intersected to obtain all matching terms between the two term document collections. As previously described, certain embodiments may include additional matching operation steps during the Match Term Operations 610 process.

[0104] Optional Term Filtering 570 may utilize additional data provided within the Match Terms Request 600 in combination with additional data within the term document collection to filter and reduce term records included in the matched term data collection resulting from Match Term Operations 610. Finally, the appropriate term data collection is provided in Matched Terms Result 620, which is included in the HTTP response generated by Restful Services 130 and transmitted back to the search initiator's Web Client 140.
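
By way of non-limiting illustration, the match and optional filtering steps described above can be sketched as follows. The sketch assumes each term data collection is a dictionary mapping a term to its TFIDF score and that filtering removes matched terms whose score falls below a threshold in either document. The function name and the min_tfidf parameter are hypothetical.

```python
from typing import Dict, Tuple


def match_terms(
    terms_a: Dict[str, float],
    terms_b: Dict[str, float],
    min_tfidf: float = 0.0,
) -> Dict[str, Tuple[float, float]]:
    """Intersect two term collections, keeping each document's score per term."""
    matched = {}
    for term in terms_a.keys() & terms_b.keys():
        scores = (terms_a[term], terms_b[term])
        # Optional term filtering: drop terms that are weak in either document.
        if min(scores) >= min_tfidf:
            matched[term] = scores
    return matched
```

Returning both scores for every matched term gives the search initiator the supporting detail for how the two documents are similar, rather than a bare similarity value.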

[0105] FIG. 8, a Known Document Terms Match Request, is indicated on the server side generally by reference 630 and on the client side generally by reference 640. The Known Document Terms Match Request 630 begins with an HTTP request generated by a Web Client 140. The Restful Service 130 receives an HTTP request containing Match Terms Request 600 including at least two term document collections as input. The Known Document Terms Match Request 640 begins at Web Client 140 with a Match Terms Request 600 including at least two term document collections as input.

[0106] In the Known Document Terms Match Request 640 embodiment, the Web Client 140 is assumed to already have at least two term document collections locally residing within the Web Client 140. For example, perhaps one user has reviewed term document collections for two patents of interest and now wants to compare them. The Known Document Terms Match Request 640 presents functionality to perform this match operation on the client side.

[0107] Match Term Operations 610 receives at least two term data collections for at least two known document keys. The term data collections are matched or intersected to obtain all matching terms between the two term document collections. As previously described, certain embodiments may include additional matching operation steps during the Match Term Operations 610 process. Optional Term Filtering 570 may utilize additional data provided within the Match Terms Request 600 in combination with additional data within the term document collection to filter and reduce term records included in the matched term data collection resulting from Match Term Operations 610. Finally, the appropriate term data collection is provided in Matched Terms Result 620, which is included in the HTTP response generated by Restful Services 130 and transmitted back to the search initiator's Web Client 140.

[0108] FIG. 9, a Known Document Content Request, is indicated generally by reference 650. The Known Document Content Request 650 begins with an HTTP request generated by a Web Client 140. The Restful Service 130 receives an HTTP request containing Document Content Request 660 including a known document key. In the preferred embodiment supporting the US patent corpora, the known document key may represent a known US patent number and an optional sub-key indicating whether to perform a patent text, claims specific text, or patent full text document based query. The Document Content Request 660 processing retrieves the appropriate Document Content 60 repository file, which simply requires resolving the appropriate file path for the required document content file, such as: //US123456/ContentRepository/patentUS123456.pdf.

[0109] Once the document content is retrieved, it could be further subset during Content Subset Result 670 processing or post-processed during Optional Post Processing 680 as part of content processing operations occurring on either the client or server side. In one example embodiment, patent claims text retrieved from the content repository is tagged by a natural language processing engine to extract and provide all noun phrases contained within the claims text. In yet another embodiment, a patent's pdf formatted document is modified to highlight a set of matching terms provided as input with the content request. In this embodiment, these terms may or may not have originated from the same patent included in the content request. Finally, the Document Content Result 690 is included in the HTTP response generated by Restful Services 130 and transmitted back to the search initiator's Web Client 140.
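
By way of non-limiting illustration, the term highlighting form of post-processing can be sketched on plain text. This is a simplification: the embodiment above modifies a pdf formatted document, whereas the sketch assumes the content is already available as a text string and marks each match with a pair of delimiter strings. The function name and the default delimiters are hypothetical.

```python
import re
from typing import Iterable


def highlight_terms(text: str, terms: Iterable[str],
                    left: str = "[[", right: str = "]]") -> str:
    """Wrap every whole-word occurrence of any term in the given delimiters."""
    # Longest terms first so a multi-word term wins over a term it contains.
    unique = sorted({t for t in terms if t}, key=lambda t: (-len(t), t))
    if not unique:
        return text
    alternatives = "|".join(re.escape(t) for t in unique)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: left + match.group(0) + right, text)
```

The terms passed in need not come from the document being highlighted, consistent with the embodiment in which matching terms originate from a different patent.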

[0110] FIG. 10, an Unknown Document Terms Request and Match, is indicated generally by reference 700. The Unknown Document Terms Request and Match 700 begins with an HTTP request generated by a Web Client 140. The Restful Service 130 receives an HTTP request containing Match Terms Request (No Key) 710 including either unknown document text or an unknown document term data collection extracted on the client side. In the case of unknown document text, Document Processing 30 is performed to generate the unknown document term data collection. Next, the Create Document Fingerprints 320 process performs dimensionality reduction, including optional Document Band Assignment 720 and Optional Band Filtering 730 as previously described, prior to performing Neighbor Candidates Processing 340 to identify the unknown document's top n most similar document references collection.

[0111] In certain embodiments, Comparison Term Data 350 processing in combination with Similarity Engine 360 is utilized to determine the exact document similarities for each document within the top n most similar document references collection. However, other embodiments may choose to provide the similarity approximations generated during Neighbor Candidates Processing 340 within the top n most similar document references collection. In such embodiments, exact similarity values are only determined when the Web Clients 140 choose specifically to compare two documents' term document collections.

[0112] In one example embodiment, upon closer review of patents returned with a particular document based query and search result using only similarity approximations, the search initiator decides to compare her document based query specifically to search result item one. At this time, Web Client 140 requests a Known Document Terms Match Request 630 or 640, which obtains the matching terms between both documents and an exact measure of similarity at that time. Finally, after Neighbor Candidates Processing 340 has identified the unknown document's top n most similar document references collection, the Matched Terms Result 620 is included in the HTTP response generated by Restful Services 130 and transmitted back to the search initiator's Web Client 140.

[0113] The previous descriptions, for the purposes of explanation, have been detailed with reference to specific embodiments of the invention. However, the illustrative details are not intended to be exhaustive or to limit the invention in any way to only the details which have been disclosed. A myriad of changes, alterations, transformations, and modifications may be suggested to one skilled in the art, and it is intended that the present invention encompass such changes, alterations, transformations, and modifications as fall within the scope of the appended claims. The embodiments were selected and explained to best embody the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with changes, alterations, transformations, and modifications as are suited to the particular use contemplated.

CPC classification: G06F16/338; G06F2216/11

International classification: G06F17/30