Patent classifications
G06F16/325
Computer-implemented system and method for identifying near duplicate documents
A computer-implemented system and method for identifying near duplicate documents is provided. A set of documents is obtained and each document is divided into segments. Each of the segments is hashed. A segment identification and sequence order is assigned to each of the hashed segments. The sequence order is based on an order in which the segments occur in one such document. The segments are compared based on the segment identification and those documents with at least two matching segments are identified. The sequence orders of the matching segments are compared and based on the comparison, a determination is made that the identified documents share a relative sequence of the matching segments. The identified documents are designated as near duplicate documents.
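The segment-hash comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: segments are taken to be period-delimited sentences, SHA-1 digests serve as segment identifications, and "sharing a relative sequence" is interpreted as the matching segments appearing in the same order in both documents. The function names `segment_hashes` and `near_duplicates` are hypothetical.

```python
import hashlib
from itertools import combinations


def segment_hashes(text):
    """Split text into sentence-like segments (assumption) and hash each one.

    Returns (segment_id, sequence_order) pairs, in document order.
    """
    segments = [s.strip().lower() for s in text.split(".") if s.strip()]
    return [
        (hashlib.sha1(s.encode("utf-8")).hexdigest(), order)
        for order, s in enumerate(segments)
    ]


def near_duplicates(docs, min_matches=2):
    """Return sorted pairs of document ids that share at least `min_matches`
    segments appearing in the same relative order in both documents."""
    hashed = {doc_id: segment_hashes(text) for doc_id, text in docs.items()}
    pairs = []
    for a, b in combinations(sorted(hashed), 2):
        # First position of every segment id in document b.
        first_pos_b = {}
        for seg_id, order in hashed[b]:
            first_pos_b.setdefault(seg_id, order)
        # Walk document a in order and record where each match sits in b.
        seen = set()
        positions_in_b = []
        for seg_id, _ in hashed[a]:
            if seg_id in first_pos_b and seg_id not in seen:
                seen.add(seg_id)
                positions_in_b.append(first_pos_b[seg_id])
        same_order = positions_in_b == sorted(positions_in_b)
        if len(positions_in_b) >= min_matches and same_order:
            pairs.append((a, b))
    return pairs


docs = {
    "d1": "Alpha beta. Gamma delta. Epsilon.",
    "d2": "Intro. Alpha beta. Something else. Gamma delta.",
    "d3": "Gamma delta. Alpha beta.",
}
print(near_duplicates(docs))
```

Here `d1` and `d2` share two segments in the same order, while `d3` contains the same two segments in reversed order and is therefore not paired with either.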
METHOD FOR DIALOGUE PROCESSING, ELECTRONIC DEVICE AND STORAGE MEDIUM
A method for dialogue processing, an electronic device and a storage medium are provided. The specific technical solution includes: obtaining a dialogue history; selecting a target machine from a plurality of machines; inputting the dialogue history into a trained dialogue model in the target machine to generate a response to the dialogue history, in which the dialogue model comprises a common parameter and a specific parameter, and different machines correspond to the same common parameter.
COMMENT MANAGEMENT METHOD, SERVER AND READABLE STORAGE MEDIUM
A comment management method applied to a server is provided. The method includes detecting comment parameters of an article. Whether or not to activate a comment management mechanism for the article is determined according to the comment parameters of the article. All comments of the article are managed by activating the comment management mechanism. Once the comment management mechanism is activated, it is canceled when its activation duration reaches a preset duration.
Searching for a hash string stored in an indexed array
Systems and methods for searching for hash strings stored in an indexed array are described. A method includes receiving a hash string. The method includes determining a first set of index values for the indexed array that correspond to a first stored value matching a first portion of the hash string and, if a match between the first stored value and the first portion of the hash string is found, determining a second set of index values for the indexed array that correspond to a second stored value matching a second portion of the hash string. Upon finding a match for both the first stored value and the second stored value, the method includes comparing the hash string to each of the hash strings in the indexed array having an index value common to both the first set of index values and the second set of index values.
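A minimal sketch of the two-portion lookup described above, assuming the "first portion" and "second portion" are fixed-length, consecutive prefixes of the hash string and that the per-portion lookup structures are plain dictionaries. The class name `HashStringIndex` and the parameter `portion_len` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class HashStringIndex:
    """Indexed array of hash strings with lookups on two leading portions."""

    def __init__(self, hash_strings, portion_len=4):
        self.hashes = list(hash_strings)  # the indexed array
        self.portion_len = portion_len
        self.first = defaultdict(set)     # first portion  -> index values
        self.second = defaultdict(set)    # second portion -> index values
        for i, h in enumerate(self.hashes):
            self.first[h[:portion_len]].add(i)
            self.second[h[portion_len:2 * portion_len]].add(i)

    def find(self, query):
        """Return sorted index values whose stored hash string equals `query`."""
        n = self.portion_len
        first_set = self.first.get(query[:n])
        if not first_set:
            return []  # no match on the first portion: stop early
        second_set = self.second.get(query[n:2 * n])
        if not second_set:
            return []
        # Full comparison only for indices common to both sets.
        candidates = first_set & second_set
        return sorted(i for i in candidates if self.hashes[i] == query)


index = HashStringIndex(
    ["abcd1234ffff", "abcd9999ffff", "zzzz1234ffff", "abcd1234eeee"]
)
print(index.find("abcd1234eeee"))
```

The full string comparison runs only on the intersection of the two index sets, which is the step that keeps the number of expensive comparisons small.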
Scale-out indexing for a distributed search engine
Methods and systems for data indexing are disclosed. According to some embodiments, an index is split into a number of slots based on a slot power value. Each of the slots is assigned a slot number. A first subset of the slots is allocated to a first shard mapped to the index. A second subset of the slots is allocated to a second shard mapped to the index. The first subset and the second subset are respectively allocated to the first shard and the second shard based on a shard-slot mapping.
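The slot allocation above might look like the following sketch. The abstract does not say how the slot power value determines the slot count, so this example assumes 2**slot_power slots; it also assumes contiguous slot ranges per shard and CRC32-based routing of keys to slots. All function names are hypothetical.

```python
import zlib


def build_shard_slot_mapping(slot_power, shard_names):
    """Allocate slots 0 .. 2**slot_power - 1 to shards in contiguous ranges.

    Assumption: the slot count is 2**slot_power.
    """
    num_slots = 1 << slot_power
    per_shard = -(-num_slots // len(shard_names))  # ceiling division
    return {slot: shard_names[slot // per_shard] for slot in range(num_slots)}


def slot_for_key(key, slot_power):
    """Route a document key to a slot number (CRC32 is an assumed choice)."""
    return zlib.crc32(key.encode("utf-8")) & ((1 << slot_power) - 1)


def shard_for_key(key, slot_power, mapping):
    """Resolve a key to its shard through the shard-slot mapping."""
    return mapping[slot_for_key(key, slot_power)]


mapping = build_shard_slot_mapping(4, ["shard-0", "shard-1"])
print(shard_for_key("doc-42", 4, mapping))
```

Because keys resolve to slots rather than directly to shards, scaling out only requires reassigning some slot numbers to a new shard in the mapping; the key-to-slot function stays unchanged.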
Revealing content reuse using fine analysis
Systems and methods for managing content provenance are provided. A network system accesses a document of a plurality of documents to be analyzed. The network system extracts text fragments from the document including a first fragment and a second fragment. A determination is made whether each of the text fragments matches an entry in a hash table. Based on a first fragment not matching any entries in the hash table, the network system creates a new entry in the hash table, whereby the first fragment is used to generate a key in the hash table. Based on a second fragment matching an entry of the hash table, the network system associates the document with a key of the matching entry in the hash table, whereby the associating comprises updating the hash table with an identifier of the document.
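The hash-table bookkeeping described above can be illustrated with a short sketch. It assumes fragments are non-empty lines of text, that keys are SHA-256 digests of whitespace- and case-normalized fragments, and that each table entry holds a set of document identifiers. The names `fragment_key`, `extract_fragments`, and `index_document` are hypothetical.

```python
import hashlib


def fragment_key(fragment):
    """Generate a hash-table key from a fragment (normalization is assumed)."""
    normalized = " ".join(fragment.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def extract_fragments(text):
    """Treat each non-empty line as one text fragment (assumption)."""
    return [line for line in text.split("\n") if line.strip()]


def index_document(table, doc_id, text):
    """Update `table` (key -> set of document ids) with one document.

    Returns the keys of fragments already seen in other documents.
    """
    reused = []
    for fragment in extract_fragments(text):
        key = fragment_key(fragment)
        if key in table:
            # Matching entry: associate this document with the existing key.
            if doc_id not in table[key]:
                reused.append(key)
            table[key].add(doc_id)
        else:
            # No match: create a new entry keyed by the fragment's hash.
            table[key] = {doc_id}
    return reused


table = {}
index_document(table, "a", "Hello world\nUnique to A")
reused = index_document(table, "b", "hello   WORLD\nUnique to B")
print(len(reused), sorted(table[reused[0]]))
```

After both documents are indexed, the entry for the shared fragment lists both document identifiers, which is the record of reuse.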
Method and system for classification of unstructured data items
Methods, computer program products, and computer systems for the classification of unstructured data items are disclosed. Such methods, computer program products, and computer systems include ingesting an item into a classification engine, performing term processing on one or more terms of the item, and processing a relational similarity index. The classification engine is implemented in the computer system. The relational similarity index represents a similarity of the item to a reference item, and the relational similarity index is determined using the one or more terms.
MASK-AUGMENTED INVERTED INDEX
The embodiments disclosed herein are related to a computing system for generating a mask-augmented inverted index. The mask-augmented inverted index is structured to allow phrase query searching while minimizing the amount of computing system processing and memory resources needed to generate the mask-augmented inverted index. In one embodiment, a first token is mapped to a first listing of documents that include the first token. A first mask is included that comprises a probabilistic representation of a set of integers corresponding to one or more locations of the first token in each of the individual documents of the first listing. A second mask is included that comprises a probabilistic representation of a set of integers that indicate a positional relationship between the first token and one or more other tokens in each of the individual documents of the first listing.
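To give a feel for how a positional mask can support phrase queries, here is a deliberately simplified sketch. It is not the two-mask structure described above: it keeps only a single 64-bit mask per token and document, setting bit `position % 64` for each occurrence, and checks adjacency by rotating masks. The mask width, the modulo scheme, and all names are assumptions. Because positions wrap modulo 64, the result is probabilistic: it can report false positives but will not miss a true phrase occurrence.

```python
from collections import defaultdict

MASK_BITS = 64
FULL_MASK = (1 << MASK_BITS) - 1


def rotate_right(mask, shift):
    """Rotate a MASK_BITS-wide bitmask right by `shift` bits."""
    shift %= MASK_BITS
    return ((mask >> shift) | (mask << (MASK_BITS - shift))) & FULL_MASK


def build_index(docs):
    """Map token -> {doc_id: position mask}; bit (pos % MASK_BITS) is set."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            bit = 1 << (pos % MASK_BITS)
            index[token][doc_id] = index[token].get(doc_id, 0) | bit
    return index


def phrase_candidates(index, phrase):
    """Return doc ids whose masks are consistent with the phrase occurring."""
    tokens = phrase.lower().split()
    if not tokens:
        return []
    postings = [index.get(t, {}) for t in tokens]
    common = set(postings[0])
    for posting in postings[1:]:
        common &= set(posting)
    hits = []
    for doc_id in sorted(common):
        acc = postings[0][doc_id]
        for offset, posting in enumerate(postings[1:], start=1):
            # Align the later token's positions with the first token's.
            acc &= rotate_right(posting[doc_id], offset)
        if acc:
            hits.append(doc_id)
    return hits


docs = {
    "d1": "the quick brown fox",
    "d2": "brown the quick fox",
    "d3": "quick fox brown",
}
index = build_index(docs)
print(phrase_candidates(index, "quick brown"))
```

All three documents contain both tokens, but only in `d1` does "brown" sit one position after "quick", so only `d1` survives the mask intersection.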
HASHING ELECTRONIC RECORDS
Provided is a method, computer program product, and system for hashing electronic health records. A processor may collect a set of electronic health records (EHRs). The processor may perform an encounter analysis on the set of EHRs to determine a set of attributes associated with the set of EHRs. The processor may hash the set of attributes to generate one or more hashing indexes that correspond to the set of EHRs. The processor may store the one or more hashing indexes in a list used for document retrieval.
SEARCH SYSTEM FOR PROVIDING SEARCH RESULTS USING QUERY UNDERSTANDING AND SEMANTIC BINARY SIGNATURES
Technology for the improved processing of search queries is provided. In one embodiment, methods may return semantically relevant search results for a search query. During offline pre-computation, an inventory semantic index may be generated and may include inventory binary hashing signatures that are associated with inventory listings, such as goods or services for sale, and the index may be partitioned by categories and shards. When a search query is received, relevant categories are determined using a relevant category recognition service, and a search query binary hashing signature may be generated for the search query. The relevant categories are searched to determine Hamming distances between the inventory binary hashing signatures and the search query binary hashing signature, where the Hamming distance indicates semantic relevance.
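The Hamming-distance lookup over a category-partitioned index can be sketched as below. The signatures here are small hand-picked integers rather than the output of any real semantic hashing model, the relevant categories are passed in directly instead of coming from a recognition service, and the `max_distance` cutoff is an assumed parameter. The function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer signatures differ."""
    return bin(a ^ b).count("1")


def search(query_signature, relevant_categories, index, max_distance=3):
    """Return listing ids from the relevant categories, nearest first.

    `index` maps category -> {listing_id: binary signature}.
    """
    scored = []
    for category in relevant_categories:
        for listing_id, signature in index.get(category, {}).items():
            distance = hamming_distance(query_signature, signature)
            if distance <= max_distance:
                scored.append((distance, listing_id))
    return [listing_id for _, listing_id in sorted(scored)]


index = {
    "shoes": {"s1": 0b10110010, "s2": 0b10110011, "s3": 0b01001101},
    "hats": {"h1": 0b10110010},
}
print(search(0b10110010, ["shoes"], index))
```

Only the `shoes` partition is scanned; `s3` differs from the query in all eight bits and falls outside the cutoff, and `h1` is never examined because its category was not selected.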