Patent classifications
G06F16/3343
Database management for sound-based identifiers
Described herein are techniques, devices, and systems for database management using sound-based identifiers. The sound-based identifiers can be encoded based on text-based identifiers input into one or more databases. The sound-based identifiers can be preprocessed and encoded by encoding the text-based identifiers with a double metaphone algorithm. First sound-based identifiers can be sorted in a cluster associated with a node of a hybrid prefix tree list, based on a longest common prefix of the group. The first sound-based identifiers can be re-encoded as second sound-based identifiers and organized into sub-clusters associated with nodes, based on characters of the second sound-based identifiers positioned after characters associated with the clusters. The re-encoded sound-based identifiers can be determined based on metadata. A query can be received and utilized to identify a re-encoded sound-based identifier. Data associated with the re-encoded sound-based identifier can be transmitted based on the query.
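As a rough illustration of the clustering step described above, the following sketch groups already-encoded identifiers by a shared prefix and then splits each cluster on the character that follows that prefix. It makes several assumptions: the code strings are hypothetical placeholders rather than real double metaphone output, the function names and the `key_len` parameter are invented for this example, and the re-encoding step into second sound-based identifiers is not modeled.

```python
import os
from collections import defaultdict

def cluster_by_prefix(codes, key_len=2):
    """Group codes by their first key_len characters (an assumed seed),
    then label each group with the longest common prefix of its members."""
    groups = defaultdict(list)
    for code in codes:
        groups[code[:key_len]].append(code)
    clusters = {}
    for members in groups.values():
        prefix = os.path.commonprefix(members)
        clusters[prefix] = sorted(members)
    return clusters

def sub_cluster(prefix, members):
    """Split a cluster on the single character that follows its prefix."""
    subs = defaultdict(list)
    for code in members:
        subs[code[len(prefix):len(prefix) + 1]].append(code)
    return dict(subs)

# Hypothetical sound-based codes, not produced by a real encoder.
codes = ["SM0", "SMT", "SMTS", "JNS", "JNSN"]
clusters = cluster_by_prefix(codes)
print(clusters)
print(sub_cluster("SM", clusters["SM"]))
```

A dictionary keyed by prefix stands in for the nodes of the hybrid prefix tree list here; a fuller version would nest the sub-clusters under their parent node.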
SEMANTIC DUPLICATE NORMALIZATION AND STANDARDIZATION
Systems, methods, and computer-readable media are disclosed for list attribute normalization and standardization for creation of a controlled vocabulary. A vocabulary set comprising a plurality of vocabulary terms may be received. For each vocabulary term, semantic duplicates may be identified. The semantic duplicates may be identified by analyzing semantics, syntactics, or phonetics of the vocabulary terms. Semantic chains may be formed from each vocabulary term and the corresponding semantic duplicates. The terms in each semantic chain may be ranked to determine a most probable vocabulary term. The most probable vocabulary term may then replace the semantic chain. The most probable vocabulary terms across all semantic chains from the vocabulary set may form the controlled vocabulary.
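The chain-and-rank idea above can be sketched in a few lines. This is only an illustration under stated assumptions: duplicates are detected by a purely syntactic normalization (lowercasing and dropping non-alphanumeric characters), and ranking is by raw frequency. The abstract does not specify either choice, and the function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

def normalize(term):
    """Assumed duplicate key: lowercase with non-alphanumerics removed."""
    return re.sub(r"[^a-z0-9]", "", term.lower())

def controlled_vocabulary(terms):
    """Form one chain per normalized key and keep its most frequent spelling."""
    chains = defaultdict(Counter)
    for term in terms:
        chains[normalize(term)][term] += 1
    return sorted(counter.most_common(1)[0][0] for counter in chains.values())

terms = ["e-mail", "Email", "email", "email", "Phone", "phone", "phone"]
vocabulary = controlled_vocabulary(terms)
print(vocabulary)
```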
Method and system for retrieval based on an inexact full-text search
The disclosed search engine and search engine system apply a variety of criteria to find specific information within a full-text dataset, even when its user cannot recall the exact text or exact spelling of the desired information. The criteria cause retrieval of text not only when the sequence of words in the query matches the sequence of words in the text identically but also when the difference between sequences is that one of the words in a sequence is missing, added, replaced, or replaced with a synonymous word or a word having phonetic similarity. Another criterion is that the two words in the query sequence differ in order from the two words in the text sequence to be retrieved. The above criteria are applied after stop list words are disregarded. The search engine accordingly enables a user to find text more easily in large full-text datasets by inexact text searching.
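A minimal sketch of the matching criteria above, comparing a query against a single candidate phrase. It assumes a small hypothetical stop list, treats any single-word replacement as a match, and restricts the order criterion to two adjacent words. Synonym and phonetic checks are not modeled, nor is scanning within a larger text.

```python
STOP_WORDS = {"the", "a", "an", "of", "in"}  # hypothetical stop list

def content_words(text):
    """Lowercase, split on whitespace, and drop stop list words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def inexact_match(query, text):
    """True if the word sequences are identical, or differ by one missing,
    added, or replaced word, or by two adjacent words swapped in order."""
    q, t = content_words(query), content_words(text)
    if q == t:
        return True
    if len(q) == len(t):
        diff = [i for i in range(len(q)) if q[i] != t[i]]
        if len(diff) == 1:
            return True  # exactly one word replaced
        return (len(diff) == 2 and diff[1] == diff[0] + 1
                and q[diff[0]] == t[diff[1]] and q[diff[1]] == t[diff[0]])
    if abs(len(q) - len(t)) == 1:
        short, long_ = sorted((q, t), key=len)
        # Removing one word from the longer sequence must yield the shorter.
        return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))
    return False

print(inexact_match("quick brown fox", "brown quick fox"))
```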
Text encoding issue detection
Method and apparatus for detecting text encoding errors caused by previously encoding an electronic document in multiple encoding formats. Non-word portions are removed from the electronic document. Embodiments determine whether words in the electronic document are likely to contain one or more text encoding errors by dividing a first word into n-grams of length 2 or more. For each of the n-grams, a database is queried to determine a respective probability of the n-gram appearing in each of a plurality of recognized languages, and upon determining that the determined probabilities of two consecutive n-grams are each less than a predefined threshold probability, the first word is added to a list of words that likely contain text encoding errors. A confidence level that the first word includes the one or more text encoding errors is calculated, based on a lowest determined probability for the n-grams for the first word.
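The n-gram check above might look roughly like the sketch below. The probability table is a hypothetical in-memory stand-in for the database, an n-gram counts as improbable only if it falls below the threshold in every language, and the confidence formula (one minus the lowest probability) is an assumption; the abstract says only that confidence is based on the lowest probability.

```python
def ngrams(word, n=2):
    """Split a word into overlapping n-grams of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def likely_encoding_error(word, ngram_probs, threshold=0.01):
    """ngram_probs maps n-gram -> {language: probability} (hypothetical table).
    Flags the word if two consecutive n-grams are improbable in all languages."""
    best = [max(ngram_probs.get(g, {}).values(), default=0.0)
            for g in ngrams(word)]
    flagged = any(a < threshold and b < threshold
                  for a, b in zip(best, best[1:]))
    confidence = 1.0 - min(best) if flagged else 0.0  # assumed formula
    return flagged, confidence

# Made-up probabilities for illustration only.
probs = {"ca": {"en": 0.3}, "af": {"en": 0.2}, "fe": {"en": 0.25}}
garbled = "caf\u00c3\u00a9"  # mis-decoded text as it might appear in a document
print(likely_encoding_error(garbled, probs))
print(likely_encoding_error("cafe", probs))
```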
Prescan device activation prevention
A method and system for improving audio detection is provided. The method includes receiving activation term data and text data of a multimedia file. The text data is analyzed and potential phonetic matches between a set of terms and the activation term are determined. An audio portion of the multimedia file is analyzed with respect to the potential phonetic matches and a resulting subset of terms is determined. A term is selected from the subset and flagged. An associated control action for preventing an automated device from being enabled is generated and stored. Presentation of the flagged term is detected within the multimedia file being presented and the control action is executed such that the automated device remains in the deactivated state.
Query disambiguation using environmental audio
One embodiment provides a method, including: capturing, using at least one sensor of an information handling device, environmental audio; receiving, at an audio capture device associated with the information handling device, a user query, wherein the user query comprises at least one deictic term; disambiguating, using a processor and by analyzing the captured environmental audio, the user query; and providing, based on the disambiguating, a response to the user query. Other aspects are described and claimed.
Method and device for matching speech with text, and computer-readable storage medium
Embodiments of a method and device for matching a speech with a text, and a computer-readable storage medium are provided. The method can include: acquiring a speech identification text by identifying a received speech signal; comparing the speech identification text with multiple candidate texts in a first matching mode to determine a first matching text; and comparing phonetic symbols of the speech identification text with phonetic symbols of the multiple candidate texts in a second matching mode to determine a second matching text, in a case that no first matching text is determined.
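The two-stage fallback described above can be sketched as follows. The pronunciation lexicon, the phoneme notation, and the function names are all hypothetical; a real system would obtain phonetic symbols from a pronunciation dictionary or grapheme-to-phoneme model, and may use a fuzzier comparison than the equality used here.

```python
# Hypothetical word -> phonetic symbol lexicon.
PHONETIC = {"write": "r ay t", "right": "r ay t", "left": "l eh f t"}

def phonetic_symbols(text):
    """Map each word to its phonetic symbols, keeping unknown words as-is."""
    return " ".join(PHONETIC.get(w, w) for w in text.lower().split())

def match_speech(recognized, candidates):
    """First mode: compare text. Second mode, only if the first fails:
    compare phonetic symbols."""
    for candidate in candidates:
        if candidate.lower() == recognized.lower():
            return candidate, "text"
    target = phonetic_symbols(recognized)
    for candidate in candidates:
        if phonetic_symbols(candidate) == target:
            return candidate, "phonetic"
    return None, None

commands = ["turn left", "turn right"]
print(match_speech("turn write", commands))
```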
Identification of users across multiple platforms
A computer system creates a plurality of indexes from a first plurality of records, wherein each index corresponds to an attribute of a plurality of attributes. The computer system detects a record of a second plurality of records, wherein the record includes a value corresponding to each of the plurality of attributes. The computer system determines a first set of values from a first index of the plurality of indexes that corresponds to a first attribute. The computer system determines a plurality of individual similarity scores for the first set of values by utilizing a similarity function. The computer system determines an overall similarity score for each record of at least a portion of the first plurality of records and based on the overall similarity scores, determines a record of the first plurality of records that corresponds to the record of the second plurality of records.
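A small sketch of the index-and-score flow above, under stated assumptions: each index maps an attribute value to the set of record IDs holding it, the similarity function is `difflib.SequenceMatcher.ratio` from the Python standard library, and the overall score is the unweighted mean of the per-attribute scores. The abstract does not name a similarity function or an aggregation rule, and the sample records are invented.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def build_indexes(records, attributes):
    """One index per attribute: value -> set of record IDs with that value."""
    indexes = {a: defaultdict(set) for a in attributes}
    for rec_id, rec in records.items():
        for a in attributes:
            indexes[a][rec[a]].add(rec_id)
    return indexes

def best_match(new_record, indexes):
    """Score every indexed value against the new record, then pick the
    record with the highest mean similarity across attributes."""
    totals = defaultdict(float)
    for attr, index in indexes.items():
        for value, rec_ids in index.items():
            score = SequenceMatcher(None, new_record[attr], value).ratio()
            for rec_id in rec_ids:
                totals[rec_id] += score
    return max(totals, key=lambda r: totals[r] / len(indexes))

# Invented records from a first platform.
records = {
    1: {"name": "jon smith", "email": "jsmith@example.com"},
    2: {"name": "mary jones", "email": "mj@example.com"},
}
indexes = build_indexes(records, ["name", "email"])
new_record = {"name": "john smith", "email": "j.smith@example.com"}
print(best_match(new_record, indexes))
```

Scoring through the indexes means each distinct value is compared once, even when many records share it.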
SIMILARITY PROCESSING METHOD, APPARATUS, SERVER AND STORAGE MEDIUM
The present application discloses a similarity processing method, an apparatus, a server and a storage medium, and relates to the fields of information processing and natural language processing. The specific implementation solution is as follows: acquiring a first character string and a second character string; determining a pronunciation pattern similarity and a character pattern similarity between the first character string and the second character string; and determining a comprehensive similarity between the first character string and the second character string, based on the pronunciation pattern similarity and the character pattern similarity.
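One way to picture the combination step above is a weighted average of the two similarities. Everything specific here is an assumption: the `pronounce` callable is a toy stand-in for a real pronunciation mapping, character-sequence similarity from `difflib` stands in for character pattern similarity, and the equal weighting is arbitrary.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Sequence similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()

def comprehensive_similarity(s1, s2, pronounce, weight=0.5):
    """Blend pronunciation similarity and character similarity."""
    sound = ratio(pronounce(s1), pronounce(s2))
    shape = ratio(s1, s2)
    return weight * sound + (1 - weight) * shape

def toy_pronounce(s):
    """Hypothetical pronunciation mapping, for illustration only."""
    return s.lower().replace("ph", "f").replace("ck", "k")

print(comprehensive_similarity("phish", "fish", toy_pronounce))
```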