Patent classifications
G06F40/149
Encoding of data formatted in human-readable text according to schema into binary
Data is organized in a hierarchical data tree having nodes, and is formatted in human-readable data according to a schema. The data is canonically ordered in correspondence with a canonical ordering of a schema dictionary generated from the schema. The canonically ordered data is encoded into binary, including for each node, removing a label of the node, and adding a sequence number of the node corresponding to the canonical ordering, in binary.
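The encoding step described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the function names `build_schema_dictionary` and `encode_node`, the use of alphabetical order as the canonical ordering, one-byte sequence numbers, and the `0x00`/`0x01` leaf/container markers.

```python
import struct


def build_schema_dictionary(schema_labels):
    """Map each schema label to a sequence number.

    Assumption: alphabetical order stands in for the canonical ordering.
    """
    return {label: seq for seq, label in enumerate(sorted(schema_labels))}


def encode_node(label, value, dictionary):
    """Encode one node, replacing its text label with a sequence number.

    Assumption: sequence numbers fit in one byte (at most 256 labels).
    """
    out = bytearray([dictionary[label]])
    if isinstance(value, dict):
        # Container node: marker 0x01, child count, children in canonical order.
        children = sorted(value.items(), key=lambda kv: dictionary[kv[0]])
        out += bytes([0x01, len(children)])
        for child_label, child_value in children:
            out += encode_node(child_label, child_value, dictionary)
    else:
        # Leaf node: marker 0x00, 2-byte big-endian length, UTF-8 payload.
        payload = str(value).encode("utf-8")
        out += bytes([0x00]) + struct.pack(">H", len(payload)) + payload
    return bytes(out)


schema_dict = build_schema_dictionary(["interface", "name", "mtu"])
tree = {"name": "eth0", "mtu": 1500}
encoded = encode_node("interface", tree, schema_dict)
```

In this sketch the labels `interface`, `name`, and `mtu` never appear in the output, and two inputs that differ only in key order produce the same bytes, because children are sorted by sequence number before encoding.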
EVENT DETECTION BASED ON TEXT STREAMS
A text stream source is accessed that includes a plurality of text content items. Unique word groupings are determined for the plurality of text content items. A burst detection algorithm is executed to determine word groupings that are currently bursting and that started within a specified time period. From the bursting word groupings, an issue is determined by identifying a set of texts forming at least one clique.
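A rough sketch of this flow follows. The abstract does not specify the burst detection algorithm or how the clique graph is built, so the choices below are assumptions: word groupings are unordered word pairs, a grouping "bursts" when its count in the current window exceeds a multiple of its count in the past window, texts are linked when they share a bursting grouping, and cliques are found with a plain Bron-Kerbosch search. All function names and thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations


def word_groupings(text):
    """Unique unordered word pairs in a text (an assumed notion of 'grouping')."""
    words = sorted(set(text.lower().split()))
    return set(combinations(words, 2))


def bursting_groupings(past_texts, current_texts, ratio=3.0, min_count=2):
    """Groupings whose current count is at least min_count and exceeds ratio * past count."""
    past = Counter(g for t in past_texts for g in word_groupings(t))
    now = Counter(g for t in current_texts for g in word_groupings(t))
    return {g for g, c in now.items() if c >= min_count and c > ratio * past[g]}


def maximal_cliques(adj):
    """Bron-Kerbosch without pivoting; adj maps node -> set of neighbours."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def detect_issues(past_texts, current_texts, min_size=3):
    """Return index sets of current texts that form cliques over shared bursting groupings."""
    bursting = bursting_groupings(past_texts, current_texts)
    hits = {i: word_groupings(t) & bursting for i, t in enumerate(current_texts)}
    adj = {i: set() for i in hits}
    for i, j in combinations(hits, 2):
        if hits[i] & hits[j]:
            adj[i].add(j)
            adj[j].add(i)
    return [sorted(c) for c in maximal_cliques(adj) if len(c) >= min_size]


past = ["login works fine", "checkout page slow"]
current = [
    "payment gateway timeout",
    "timeout on payment gateway again",
    "payment gateway timeout error",
    "login works fine",
]
issues = detect_issues(past, current)  # -> [[0, 1, 2]]
```

Here the three payment texts share bursting word pairs and form one clique, while the unrelated login text is left out.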
Method and system for extraction of relevant sections from plurality of documents
Embodiments of the present disclosure implement a method of extracting relevant sections from a plurality of documents by: (a) receiving an input document from a user; (b) converting the input document to a standard text file; (c) classifying the standard text file to obtain a labelled text file associated with at least one cluster from a plurality of clusters; (d) extracting, from the labelled text file, a plurality of relevant entities associated with at least one cluster in the plurality of clusters; (e) annotating the standard text file with the extracted plurality of relevant entities to obtain an annotated enriched text file; (f) identifying a plurality of section boundaries to obtain sectioned data; and (g) extracting relevant sections of the plurality of documents based on a plurality of relationships associated with the set of relevant entities.
STREAMING CONTEXTUAL UNIDIRECTIONAL MODELS
Streaming machine learning unidirectional models is facilitated by the use of embedding vectors. Processing blocks in the models apply embedding vectors as input. The embedding vectors utilize context of future data (e.g., data that is temporally offset into the future within a data stream) to improve the accuracy of the outputs generated by the processing blocks. The embedding vectors cause a temporal shift between the outputs of the processing blocks and the inputs to which the outputs correspond. This temporal shift enables the processing blocks to apply the embedding vector inputs from processing blocks that are associated with future data.
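The temporal shift described here can be shown with a toy sketch. It is not the patented model: the "embedding" is replaced by a simple mean of the buffered future frames, frames are plain floats, and the names `stream_with_lookahead` and `_combine` and the `lookahead` parameter are hypothetical. The point is only that the output for frame t is emitted after frame t + lookahead has arrived.

```python
from collections import deque


def _combine(window):
    """Pop the oldest frame and add a stand-in 'future context' value.

    Assumption: the mean of the remaining buffered frames plays the role
    of an embedding that carries future context.
    """
    current = window.popleft()
    future = sum(window) / len(window) if window else 0.0
    return current + future


def stream_with_lookahead(frames, lookahead=2):
    """Yield (input_index, output); each output lags its input by `lookahead` steps."""
    window = deque()
    emitted = 0
    for frame in frames:
        window.append(frame)
        if len(window) > lookahead:
            yield emitted, _combine(window)
            emitted += 1
    # End of stream: flush the buffered frames with whatever context is left.
    while window:
        yield emitted, _combine(window)
        emitted += 1


outputs = list(stream_with_lookahead([1.0, 2.0, 3.0, 4.0], lookahead=2))
# -> [(0, 3.5), (1, 5.5), (2, 7.0), (3, 4.0)]
```

The first output appears only once the third frame has been read, which is the delay between inputs and their corresponding outputs that the abstract attributes to the use of future context.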
SYSTEM AND METHOD FOR OBTAINING DOCUMENTS FROM A COMPOSITE FILE
A system for obtaining documents from a composite file comprising a stream of multiple pages is provided. The system may comprise one or more processors configured to receive the composite file comprising the multiple pages and split the composite file to obtain individual pages of the composite file, wherein an image of each of the individual pages, and an image vector for each of the individual pages derived from the image of the respective page, may be obtained. The processor may further obtain the text present in each of the individual pages and a text vector for each of the individual pages from the text of the respective page. The processor may further determine a continuity pattern between consecutive pages based on the image vectors and the text vectors of the consecutive pages, and may categorize the consecutive pages as belonging to the same document if the determined continuity pattern indicates that they belong to the same document.
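The grouping of consecutive pages can be sketched as follows. The abstract does not say how the continuity pattern is computed, so this example assumes a stand-in: an equally weighted average of the cosine similarities of the image vectors and the text vectors, compared against a fixed threshold. The function names, the 0.5 weights, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_documents(image_vecs, text_vecs, threshold=0.8):
    """Group page indices into documents based on consecutive-page continuity.

    Assumption: continuity is the mean of image and text cosine similarity.
    """
    if not image_vecs:
        return []
    documents = [[0]]
    for i in range(1, len(image_vecs)):
        score = 0.5 * cosine(image_vecs[i - 1], image_vecs[i]) \
            + 0.5 * cosine(text_vecs[i - 1], text_vecs[i])
        if score >= threshold:
            documents[-1].append(i)  # same document as the previous page
        else:
            documents.append([i])    # a new document starts at this page
    return documents


image_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text_vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
docs = split_documents(image_vecs, text_vecs)  # -> [[0, 1], [2]]
```

A production system would more plausibly learn the continuity decision from labelled page pairs; the fixed threshold is used here only to keep the example self-contained.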
METHODS, SYSTEMS, AND STORAGE MEDIA FOR AUTOMATICALLY IDENTIFYING RELEVANT CHEMICAL COMPOUNDS IN PATENT DOCUMENTS
Methods, systems, and non-transitory media for training a chemical entity recognition system to extract chemical compounds from a patent document and determine a relevance of the chemical compounds to the patent document are disclosed. A method includes obtaining patent documents from patent databases, normalizing each patent document into a unified format, and generating a chemical patent corpus. The chemical patent corpus includes chemical entities, each having relevancy annotations that indicate a relevance to the patent document from which the chemical entity is extracted. The method further includes providing the chemical patent corpus to the chemical entity recognition system, which tags the one or more chemical entities in a corresponding normalized patent document, extracts additional chemical entities, assigns a confidence score to each additional chemical entity, and labels each additional chemical entity as relevant or irrelevant to an associated patent document based on information contained in the chemical patent corpus.