G06F16/3346

PRACTICAL SUPERVISED CLASSIFICATION OF DATA SETS

The present invention relates to information retrieval. In order to facilitate a search and identification of documents, there is provided a computer-implemented method for training a classifier model for data classification in response to a search query. The computer-implemented method comprises: a) obtaining a dataset that comprises a seed set of labeled data representing a training dataset; b) training the classifier model by using the training dataset to fit parameters of the classifier model; c) evaluating a quality of the classifier model using a test dataset that comprises unlabeled data from the obtained dataset to generate a classifier confidence score indicative of a probability of correctness of the classifier model working on the test dataset; d) determining a global risk value of misclassification and a reward value based on the classifier confidence score on the test dataset; e) iteratively updating the parameters of the classifier model and performing steps b) to d) until the global risk value falls within a predetermined risk limit value or an expected reward value is reached.
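
Steps b) to e) can be sketched as a loop that retrains and re-scores until a stopping condition is met. The sketch below is illustrative only and is not the claimed method: the 1-D logistic model, the use of mean max(p, 1 - p) as the classifier confidence score, the definitions risk = 1 - confidence and reward = confidence, and all function names and default limits are assumptions introduced here.

```python
import math

def train(seed, w, b, lr=0.5, epochs=200):
    """Step b): fit a 1-D logistic model to labeled (x, y) pairs by gradient descent."""
    for _ in range(epochs):
        for x, y in seed:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def confidence(w, b, unlabeled):
    """Step c): mean of max(p, 1 - p) over unlabeled points (assumed score)."""
    total = 0.0
    for x in unlabeled:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        total += max(p, 1.0 - p)
    return total / len(unlabeled)

def fit_until_safe(seed, unlabeled, risk_limit=0.1, expected_reward=0.95, max_rounds=50):
    """Step e): repeat b) to d) until the risk limit or the expected reward is reached."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(max_rounds):
        w, b = train(seed, w, b)              # continue from the current parameters
        score = confidence(w, b, unlabeled)
        risk, reward = 1.0 - score, score     # step d): assumed definitions
        history.append(risk)
        if risk <= risk_limit or reward >= expected_reward:
            break
    return w, b, history

# Hypothetical toy data: a labeled seed set and unlabeled test points.
seed = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
unlabeled = [-1.5, -0.5, 0.5, 1.5]
w, b, history = fit_until_safe(seed, unlabeled)
print(len(history), history[-1])
```

The `max_rounds` cap is a safeguard added for the sketch so that the loop terminates even if neither condition is ever met.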

BUILDING DIMENSIONAL HIERARCHIES FROM FLAT DEFINITIONS AND PRE-EXISTING STRUCTURES
20170255655 · 2017-09-07

Techniques are disclosed for generating an organized hierarchy from a set of related data. A request is received to generate an organized hierarchy from a data set. The data set includes labels and contextual cues associated with each of the labels. For each label, one or more candidate labels are identified based on a probabilistic model generated from a plurality of known ontological hierarchies. The label is matched and assigned to one of the candidate labels. Hierarchy paths associated with the assigned candidate labels are joined with one another to build the organized hierarchy.
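
The final joining step can be illustrated with a small sketch. It assumes that each assigned candidate label already comes with a hierarchy path given as a list of labels from root to leaf; the function name, the nested-dictionary representation, and the sample paths are hypothetical. The probabilistic matching of labels to candidates is not modelled here.

```python
def join_paths(paths):
    """Merge root-to-leaf label paths into one nested-dict hierarchy."""
    root = {}
    for path in paths:
        node = root
        for label in path:
            # Reuse the child if it already exists, otherwise create it.
            node = node.setdefault(label, {})
    return root

# Hypothetical hierarchy paths for three assigned labels.
paths = [
    ["Geography", "Country", "State"],
    ["Geography", "Country", "City"],
    ["Time", "Year"],
]
hierarchy = join_paths(paths)
print(hierarchy)
```

Paths that share a prefix, such as the two `Geography` paths, end up under a single shared branch.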

GRAPH MODEL BUILD AND SCORING ENGINE

Embodiments are directed to a method for accelerating machine learning using a plurality of graphics processing units (GPUs), involving receiving data for a graph to generate a plurality of random samples, and distributing the random samples across a plurality of GPUs. The method may comprise determining a plurality of communities from the random samples using unsupervised learning performed by each GPU. A plurality of sample groups may be generated from the communities and may be distributed across the GPUs, wherein each GPU merges communities in each sample group by converging to an optimal degree of similarity. In addition, the method may also comprise generating from the merged communities a plurality of subgraphs, dividing each subgraph into a plurality of overlapping clusters, distributing the plurality of overlapping clusters across the plurality of GPUs, and scoring each cluster in the plurality of overlapping clusters to train an AI model.

Search system and corresponding method

There is provided a search system comprising a statistical model trained on text associated with a piece of content. The text associated with the piece of content is drawn from a plurality of different data sources. The system is configured to receive text input and generate using the statistical model an estimate of the likelihood that the piece of content is relevant given the text input. A corresponding method is also provided.

Machine learning classifiers

In an implementation, a non-transitory machine-readable storage medium stores instructions that, when executed by a processor, cause the processor to allocate classifier data structures to persistent memory, read a number of categories from a set of training data, and populate the classifier data structures with training data including training-based category and word probabilities calculated based on the training data.
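
As an illustration of what category and word probabilities calculated from training data can look like, the sketch below builds the two tables in the style of a naive Bayes text classifier. The abstract does not name the classifier type, so the naive Bayes framing, the Laplace smoothing, the whitespace tokenization, and all names are assumptions. Allocation to persistent memory is not modelled; the tables are ordinary in-memory dictionaries.

```python
from collections import Counter, defaultdict

def build_tables(training_data):
    """Return (category priors, per-category word probabilities).

    training_data is a list of (category, text) pairs. Word probabilities
    use add-one (Laplace) smoothing over the shared vocabulary.
    """
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, text in training_data:
        category_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocab.add(word)

    total_docs = sum(category_counts.values())
    priors = {c: n / total_docs for c, n in category_counts.items()}

    word_probs = {}
    for c in category_counts:
        denom = sum(word_counts[c].values()) + len(vocab)
        word_probs[c] = {w: (word_counts[c][w] + 1) / denom for w in vocab}
    return priors, word_probs

# Hypothetical training set with two categories.
data = [("spam", "buy now"), ("spam", "buy cheap"), ("ham", "hello friend")]
priors, word_probs = build_tables(data)
print(priors)
```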

Keyword extraction method and apparatus, storage medium, and electronic apparatus

A keyword extraction method is provided. In the method, a candidate keyword set in a target text is obtained by processing circuitry of a server. An extraction degree of the candidate keyword is determined by the processing circuitry based on subject similarity and a text conversion frequency of a candidate keyword in the candidate keyword set. The subject similarity is between the candidate keyword and the target text. The extraction degree indicates a probability at which the candidate keyword used as a keyword matching the target text is extracted. The keyword is extracted by the processing circuitry from the candidate keyword set according to the extraction degree.

LONG-TAIL KEYWORD IDENTIFICATION METHOD, KEYWORD SEARCH METHOD, AND COMPUTER APPARATUS
20220207064 · 2022-06-30

The present invention relates to a long-tail keyword identification method, a keyword search method, and a computer apparatus. The long-tail keyword identification method includes: S101, receiving a search keyword, and identifying the number of atomic keywords included in the search keyword by means of a historical lexical database, wherein the historical lexical database comprises multiple atomic keywords and a weight value of each of the atomic keywords; and S102, if the search keyword comprises at least two atomic keywords, treating the search keyword as a combined keyword, and calculating a long-tail weight value for the combined keyword according to the weight values of all the atomic keywords in the combined keyword. The present invention effectively identifies a long-tail keyword, and calculates a long-tail weight value of the long-tail keyword, enhancing the accuracy of hitting a target in a searching process.
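
Steps S101 and S102 can be sketched as follows. The abstract does not say how atomic keywords are segmented or how their weights are combined, so the whitespace tokenization and the product of weights used below are placeholders chosen for illustration, and the lexicon contents and function name are hypothetical.

```python
import math

def long_tail_weight(search_keyword, lexicon):
    """Return a long-tail weight for a combined keyword, or None otherwise.

    lexicon maps atomic keywords to weight values (the historical lexical
    database in the abstract).
    """
    # S101: identify the atomic keywords contained in the search keyword.
    atoms = [t for t in search_keyword.lower().split() if t in lexicon]
    # S102: only a keyword with at least two atomic keywords is "combined".
    if len(atoms) < 2:
        return None
    # Placeholder combination rule (assumption): product of atomic weights.
    return math.prod(lexicon[a] for a in atoms)

# Hypothetical lexicon.
lexicon = {"red": 0.5, "shoes": 0.4, "running": 0.25}
print(long_tail_weight("red running shoes", lexicon))
```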

METHOD AND SYSTEM FOR MACHINE READING COMPREHENSION
20220198149 · 2022-06-23

A method for machine reading comprehension comprises obtaining question text and article text associated with the question text, generating first knowledge text corresponding to the question text and second knowledge text corresponding to the article text according to a knowledge set, encoding the question text and the article text to generate an original target text code, encoding the first knowledge text and the second knowledge text to generate a knowledge text code, performing a fusion operation on the original target text code and the knowledge text code to introduce part of knowledge in the knowledge set into the original target text code to generate a strengthened target text code, obtaining an answer corresponding to the question text based on the strengthened target text code, and outputting the answer.

METHOD AND SYSTEM FOR UNSTRUCTURED INFORMATION ANALYSIS USING A PIPELINE OF ML ALGORITHMS

A system and a method for increasing the classification confidence obtained by one or more machine-learning-based algorithms, with less dependence on large sets of training data, by analyzing unstructured information using an unstructured analysis pipeline comprising a probabilistic network such as a Bayesian network. The probabilistic network may comprise nodes associated with elements and cues defined by experts, and may require fewer labelled data samples to train. The confidence level of the elements may be determined by machine learning and unstructured analysis methods and processed by the probabilistic network to estimate the confidence for a characterization quantity.

DISTRIBUTED MACHINE LEARNING HYPERPARAMETER OPTIMIZATION

Disclosed embodiments include a distributed hyperparameter (HP) tuning system, which includes a manager and a plurality of trainers. The manager continuously estimates HP sets for a machine learning (ML) model and distributes each HP set to respective trainers. Each trainer obtains a respective HP set and trains a local version of the ML model using the respective HP set. Each trainer determines a performance value for the HP set used to train its local version of the ML model, and sends the performance value and the HP set to the manager. The manager estimates a new HP set from the HP sets received from the trainers. The HP set estimation continues until convergence takes place. Other embodiments may be described and/or claimed.
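
The manager/trainer exchange can be sketched with a thread pool standing in for the distributed trainers. This is a minimal illustration under stated assumptions, not the disclosed system: the single hyperparameter `lr`, the synthetic performance function, the random-perturbation search, and the convergence rule (the search radius shrinking below `tol`) are all invented for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def trainer(hp):
    """Hypothetical trainer: 'trains' a local model and reports (performance, hp)."""
    # Stand-in for real training: performance peaks at lr = 0.1.
    performance = -(hp["lr"] - 0.1) ** 2
    return performance, hp

def manager(n_trainers=4, max_rounds=100, tol=1e-3, seed=0):
    """Propose HP sets, collect trainer results, and stop on convergence."""
    rng = random.Random(seed)
    best_perf, best_hp = trainer({"lr": 0.5})   # initial HP set
    spread = 0.5                                # current search radius
    with ThreadPoolExecutor(max_workers=n_trainers) as pool:
        for _ in range(max_rounds):
            if spread < tol:                    # assumed convergence rule
                break
            # Estimate one new HP set per trainer around the best so far.
            candidates = [
                {"lr": best_hp["lr"] + rng.uniform(-spread, spread)}
                for _ in range(n_trainers)
            ]
            # Each trainer returns its performance value and its HP set.
            for perf, hp in pool.map(trainer, candidates):
                if perf > best_perf:
                    best_perf, best_hp = perf, hp
            spread *= 0.7
    return best_hp, best_perf

best_hp, best_perf = manager()
print(best_hp, best_perf)
```

Because `pool.map` returns results in submission order and the random generator is seeded, repeated runs with the same seed give the same result.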