Patent classifications
G06F16/325
Jaccard similarity estimation of weighted samples: scaling and randomized rounding sample selection with circular smearing
The disclosed systems and methods include pre-calculation, per object, of feature bin values for identifying close matches between objects, such as text documents, that have numerous weighted features, such as word sequences of a specific length. Predetermined feature weights are scaled by two or more selected adjacent scaling factors and randomly rounded. The expanded set of weighted features of an object is then min-hashed into a predetermined number of feature bins and circularly smeared across successive feature bins, so that each bin receives a value even when no feature hashes directly into it. The completed pre-calculated sets of feature bin values for each scaling of the object, together with the scaling factor, are stored for use in comparing sampled features of the object with sampled features of other objects by calculating an estimated Jaccard similarity index.
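The scale-round-minhash-smear pipeline described in this abstract can be sketched roughly as follows. This is a toy illustration, not the patented implementation: the bin count, hash function, and all names are assumptions.

```python
import hashlib
import random

def feature_hash(feature, copy, seed=0):
    """Deterministic 64-bit hash of a (feature, copy-index) pair."""
    digest = hashlib.blake2b(f"{seed}:{feature}:{copy}".encode(),
                             digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(weighted_features, scale, num_bins=64, seed=0):
    """Min-hash sketch of scaled, randomly rounded weighted features,
    with empty bins filled by circular smearing. Assumes at least one
    feature survives rounding."""
    rng = random.Random(seed)
    bins = [None] * num_bins
    for feature, weight in weighted_features.items():
        w = weight * scale
        # Randomized rounding: keep ceil(w) copies with probability frac(w),
        # floor(w) copies otherwise.
        copies = int(w) + (1 if rng.random() < w - int(w) else 0)
        for c in range(copies):
            hv = feature_hash(feature, c, seed)
            b = hv % num_bins
            if bins[b] is None or hv < bins[b]:
                bins[b] = hv
    # Circular smearing: an empty bin borrows the value of the next
    # non-empty bin, wrapping around the end of the bin array.
    for i in range(num_bins):
        j = i
        while bins[j % num_bins] is None:
            j += 1
        bins[i] = bins[j % num_bins]
    return bins

def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of agreeing bins estimates the Jaccard index."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)
```

Two objects sketched with the same scaling factor can then be compared bin-by-bin, which is far cheaper than comparing their full weighted feature sets.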
System and method for fingerprinting-based conversation threading
Systems, methods, and computer-readable media for staging a corpus of electronic communication documents for analysis, such as, for example, via a content analysis platform. The staging may include a staging platform accessing the corpus of electronic communication documents. For each electronic communication document within the corpus, the staging platform may generate a fingerprint based upon the output of a hash function executed upon a set of characteristics corresponding to each segment within the electronic communication document. The staging platform may analyze the generated fingerprints to generate a plurality of threaded conversations that exclude electronic communication documents conveying no new information. The systems and methods may also include detecting and flagging any segments within an electronic communication document that may have been mutated by its author.
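A minimal sketch of the segment-fingerprinting idea: hash each normalized segment of a message, and drop from the thread any message whose segment hashes are all already known. The normalization, hash choice, and subset test here are assumptions for illustration, not the claimed method.

```python
import hashlib

def segment_fingerprints(message):
    """Hash each segment (here: blank-line-separated block) of a
    message after simple normalization."""
    segments = [s.strip().lower() for s in message.split("\n\n") if s.strip()]
    return {hashlib.sha1(s.encode()).hexdigest() for s in segments}

def thread_without_redundant(messages):
    """Keep only messages that add at least one previously unseen
    segment; the rest convey no new information."""
    seen, kept = set(), []
    for msg in messages:
        fp = segment_fingerprints(msg)
        if fp - seen:  # message contributes new content
            kept.append(msg)
        seen |= fp
    return kept
```

A real conversation-threading platform would segment on quoted-reply boundaries and compare richer per-segment characteristics, but the containment test is the same shape.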
WIDE AND DEEP NETWORK FOR LANGUAGE DETECTION USING HASH EMBEDDINGS
Techniques disclosed herein relate generally to language detection. In one particular aspect, a method is provided that includes obtaining a sequence of n-grams of a textual unit; using an embedding layer to obtain an ordered plurality of embedding vectors for the sequence of n-grams; using a deep network to obtain an encoded vector that is based on the ordered plurality of embedding vectors; and using a classifier to obtain a language prediction for the textual unit that is based on the encoded vector. The deep network includes an attention mechanism, and using the embedding layer to obtain the ordered plurality of embedding vectors comprises, for each n-gram in the sequence of n-grams: obtaining hash values for the n-gram; based on the hash values, selecting component vectors from among a plurality of component vectors; and obtaining an embedding vector for the n-gram that is based on the component vectors.
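The hash-embedding step can be sketched as below: each n-gram is hashed several times, each hash selects one row from a shared table of component vectors, and the selected rows are combined into the n-gram's embedding. This is a simplified illustration; published hash-embedding schemes typically also learn per-hash importance weights, which are omitted here, and all names are assumptions.

```python
import hashlib
import numpy as np

def stable_hash(text, salt):
    """Process-independent hash (Python's built-in hash() is salted
    per process, so it is unsuitable for reproducible embeddings)."""
    digest = hashlib.blake2b(f"{salt}:{text}".encode(),
                             digest_size=8).digest()
    return int.from_bytes(digest, "big")

def hash_embedding(ngram, component_table, num_hashes=2):
    """Select one component vector per hash function and sum them.
    component_table: (num_components, dim) array of trainable vectors."""
    num_components, _ = component_table.shape
    indices = [stable_hash(ngram, k) % num_components
               for k in range(num_hashes)]
    return component_table[indices].sum(axis=0)

def embed_sequence(ngrams, component_table, num_hashes=2):
    """Ordered embedding vectors for a sequence of n-grams, ready to
    feed into a deep network with attention."""
    return np.stack([hash_embedding(g, component_table, num_hashes)
                     for g in ngrams])
```

The appeal of this construction is that the component table is far smaller than a one-row-per-n-gram vocabulary while still giving distinct n-grams mostly distinct embeddings.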
Knowledge graph data structures and uses thereof
Methods and systems are disclosed for generating and using a knowledge graph. In an aspect, the knowledge graph may be generated based on data fields for one or more datasets associated with one or more parameters extracted from a group of chart data structures. In another aspect, a query dataset may be analyzed, and one or more query data fields may be extracted from the query dataset. The one or more query data fields may be compared to a knowledge graph stored in a graph database to determine one or more result data fields. A context may be determined for each of the one or more result data fields, and an associated data set may be determined. Based on the context, each of the associated data sets may be scored, and a recommended analysis may be presented to a user.
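The query-matching and scoring steps can be illustrated with a toy field-to-dataset index; the data structures and the overlap-based score below are assumptions for illustration, not the disclosed knowledge-graph schema.

```python
def recommend_datasets(query_fields, field_index, dataset_fields):
    """field_index: data field -> set of dataset ids containing it
    (a flattened stand-in for the knowledge graph's edges).
    dataset_fields: dataset id -> set of its data fields (its context).
    Returns candidate dataset ids, scored by overlap with the query."""
    candidates = set()
    for field in query_fields:
        candidates |= field_index.get(field, set())
    return sorted(
        candidates,
        key=lambda ds: len(dataset_fields[ds] & set(query_fields)),
        reverse=True,
    )
```

A recommended analysis would then be presented from the top-scoring dataset.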
Efficient search for combinations of matching entities given constraints
Methods, systems, and computer-readable storage media for receiving a set of inference results generated by an ML model, the inference results including a set of query entities and a set of target entities, each query entity having one or more target entities matched to it by the ML model. The set of inference results is processed to generate a set of matched sub-sets of target entities by executing a search over the target entities based on constraints. For each problem in a set of problems, the problem is provided as a tuple including an index value representative of a target entity in the set of target entities and a value associated with the query entity, the value including a constraint relative to the query entity. At least one task is executed in response to one or more matched sub-sets in the set of matched sub-sets.
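One concrete instance of such a constrained search, using a sum constraint as the example (the choice of constraint, the brute-force strategy, and the size cap are all assumptions for illustration): given the target entities matched to a query entity, find the sub-sets whose values satisfy the query entity's constraint.

```python
from itertools import combinations

def matched_subsets(candidate_values, target_sum, max_size=4):
    """Return index lists of sub-sets of candidate target-entity
    values that sum to the query entity's constraint value."""
    results = []
    for r in range(1, max_size + 1):
        for combo in combinations(enumerate(candidate_values), r):
            if abs(sum(v for _, v in combo) - target_sum) < 1e-9:
                results.append([i for i, _ in combo])
    return results
```

A production search would prune this exponential enumeration using the constraints, but the shape of the problem, candidate indices plus a constraint value per query entity, is the same.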
DATA STRUCTURES FOR EFFICIENT STORAGE AND UPDATING OF PARAGRAPH VECTORS
Systems and methods involving data structures for efficient management of paragraph vectors for textual searching are described. A database may contain records, each associated with an identifier and including a text string and timestamp. A look-up table may contain entries for text strings from the records, each entry associating: a paragraph vector for a respective unique text string, a hash of the respective unique text string, and a set of identifiers of records containing the respective unique text string. A server may receive from a client device an input string, compute a hash of the input string, and determine matching table entries, each containing a hash identical to that of the input string, or a paragraph vector similar to one calculated for the input string. A prioritized list of identifiers from the matching entries may be determined based on timestamps, and the prioritized list may be returned to the client.
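The look-up table's two matching paths, exact match via the hash of the string, and fallback via paragraph-vector similarity, can be sketched as below. Class and method names are assumptions, and the timestamp-based prioritization of the returned identifiers is omitted for brevity.

```python
import hashlib
import numpy as np

def text_hash(text):
    return hashlib.sha256(text.encode()).hexdigest()

class ParagraphVectorTable:
    """Look-up table keyed by the hash of each unique text string."""

    def __init__(self):
        self.entries = {}  # hash -> [paragraph vector, set of record ids]

    def add(self, text, vector, record_id):
        entry = self.entries.setdefault(text_hash(text), [vector, set()])
        entry[1].add(record_id)

    def lookup(self, text, vector, threshold=0.9):
        """Exact match via the hash; otherwise fall back to cosine
        similarity of paragraph vectors against every stored entry."""
        key = text_hash(text)
        if key in self.entries:
            return set(self.entries[key][1])
        matches = set()
        for stored_vec, ids in self.entries.values():
            sim = float(stored_vec @ vector /
                        (np.linalg.norm(stored_vec) * np.linalg.norm(vector)))
            if sim >= threshold:
                matches |= ids
        return matches
```

Deduplicating by hash means each unique string's paragraph vector is stored and updated once, however many records contain it.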
Data driven relational algorithm formation for execution against big data
Techniques are described herein for creating an algorithm for batch-mode processing against big data. The techniques involve receiving one or more user commands from a set number of commands that correspond one-to-one with a set number of low-level database operations. In a preferred embodiment, the set of database operations includes only FILTERS, SORTS, AGGREGATES, and JOINS. In the algorithm-formation process, database operations are performed on a sample population of records. The user drills down to a set of useful records by performing database operations against the results of previous database operations. While the database cluster is receiving operations, the system tracks them in a dependency graph. The chains selected within the dependency graph indicate which operations are used to create the algorithm. To generate the algorithm, the database cluster reverse-engineers the logic for performing those operations against big data.
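The dependency-graph step can be sketched as below: each operation records its parent operations, and walking parent links backward from a selected result recovers, in execution order, exactly the chain needed to rebuild that result against the full dataset. Function and operation names are illustrative assumptions.

```python
def operation_chain(result_op, parents):
    """Recover, in execution order, the operations that produced a
    selected result. parents: op name -> list of ops it depends on."""
    ordered, seen = [], set()

    def visit(op):
        if op in seen:
            return
        seen.add(op)
        for parent in parents.get(op, []):
            visit(parent)
        ordered.append(op)  # post-order emits dependencies first

    visit(result_op)
    return ordered
```

Operations the user tried but abandoned never appear in the selected chain, so the generated batch algorithm replays only the useful work.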
TARGETED DOCUMENT ASSIGNMENTS IN AN ELECTRONIC DISCOVERY SYSTEM
Embodiments of the invention relate to systems, methods, and computer program products for improved electronic discovery. More specifically, embodiments relate to computer program products for targeted document review assignments by determining concept-related data groupings within the overall corpus of data associated with a case and assembling the targeted document review assignments based on the concept-related data groupings. As such, document reviewers are presented with assignments that have highly conceptually-related documents, which results in further efficiency in the review process.
SYSTEMS AND METHODS FOR PROVIDING A VISUALIZABLE RESULTS LIST
Systems and methods for displaying a visualizable results list in response to an electronic search request are disclosed. A method includes accessing metadata for each of a plurality of search results that result from a search query, annotating one or more locations in each search result with first and second indicators for each of one or more grouped search terms in first and second units based on the metadata, and displaying a visualizable results list that includes the plurality of search results and a corresponding hit pattern for each search result. The hit pattern includes the first indicator and the second indicator.
Dynamic document clustering and keyword extraction
Systems, methods and apparatuses are disclosed to cluster a plurality of documents located in any number of local and/or remote systems and applications. Preprocessed text is generated for each document, and a hash and a feature vector are determined based on the preprocessed text. A set of clusters is retrieved, wherein each cluster is associated with a hash list and a cumulative feature vector. Each of the documents may then be associated with a cluster by comparing the hash of the document to the hash lists of the clusters and/or by determining similarities between the feature vector of the document and the cumulative feature vectors of the clusters.
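The two-stage cluster assignment, hash match first, feature-vector similarity second, can be sketched as below. The cluster representation, hash function, and similarity threshold are assumptions for illustration.

```python
import hashlib
import numpy as np

def assign(doc_text, doc_vec, clusters, threshold=0.8):
    """Assign a document to a cluster by exact hash match, else by
    cosine similarity to the cluster's cumulative feature vector,
    else by starting a new cluster. clusters: list of dicts with
    'hashes' (set of content hashes) and 'cum_vec' (summed vectors)."""
    doc_hash = hashlib.sha1(doc_text.encode()).hexdigest()
    best, best_sim = None, threshold
    for c in clusters:
        if doc_hash in c["hashes"]:      # duplicate content: done
            best = c
            break
        centroid = c["cum_vec"] / max(np.linalg.norm(c["cum_vec"]), 1e-12)
        sim = float(doc_vec @ centroid / max(np.linalg.norm(doc_vec), 1e-12))
        if sim > best_sim:
            best, best_sim = c, sim
    if best is None:
        best = {"hashes": set(), "cum_vec": np.zeros_like(doc_vec)}
        clusters.append(best)
    best["hashes"].add(doc_hash)
    best["cum_vec"] = best["cum_vec"] + doc_vec
    return best
```

Keeping a cumulative vector per cluster makes assignment incremental: the centroid is recomputed by one division rather than by re-reading every member document.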