Patent classifications
G10L15/10
Robust audio identification with interference cancellation
Audio distortion compensation methods to improve the accuracy and efficiency of audio content identification are described. The methods are also applicable to speech recognition. Methods to detect interference from speakers and other sources, and distortion of the audio caused by the environment and devices, are discussed. Additional methods to detect distortion of the content after performing search and correlation are illustrated. The causes of actual distortion at each client are measured, registered, and learnt to generate rules for determining likely distortion and interference sources. The learnt rules are applied at the client, and, depending on the compute resources available, detected likely distortions are compensated or heavily distorted sections are ignored, either at the audio level or at the signature and feature level. Further methods to subtract the likely distortions in the query, both at the audio level and after processing at the signature and feature level, are described.
Voice alignment method and apparatus
Example methods and apparatus for providing voice alignment are described. One example method includes: obtaining an original voice and a test voice, where the test voice is a voice generated after the original voice is transmitted over a communications network; performing loss detection and/or discontinuity detection on the test voice, where the loss detection is used to determine whether the test voice has a voice loss compared with the original voice, and the discontinuity detection is used to determine whether the test voice has a voice discontinuity compared with the original voice; and aligning the test voice with the original voice based on a result of the loss detection and/or the discontinuity detection, to obtain an aligned original voice and an aligned test voice, where the result of the loss detection and/or the discontinuity detection is used to indicate a manner of aligning the test voice with the original voice.
Method for searching for contents having same voice as voice of target speaker, and apparatus for executing same
A method for searching, from among a plurality of contents, for content having the same voice as the voice of a target speaker includes: extracting a feature vector corresponding to the voice of the target speaker; repeatedly selecting a subset of speakers from a training dataset a predetermined number of times; generating a linear discriminant analysis (LDA) transformation matrix for each of the selected subsets of speakers; projecting the extracted speaker feature vector onto each selected subset of speakers using the corresponding LDA transformation matrix; assigning, to each projection region of the extracted speaker feature vector, a value corresponding to a nearby speaker class among the corresponding subset of speakers; generating a hash value corresponding to the extracted feature vector based on the assigned values; and searching the contents for content having a hash value similar to the generated hash value.
INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND NON-TRANSITORY COMPUTER-READABLE RECORDING MEDIUM STORING INFORMATION PROCESSING PROGRAM FOR SELECTING SET VALUE USED TO EXECUTE FUNCTION
The processor of an information processing apparatus serves, by executing an information processing program, as: a function determiner; a morpheme analyzer configured to analyze a message input by a user into morphemes; a word detector configured to detect, from the message analyzed into morphemes by the morpheme analyzer, a predetermined time-representing word indicating temporal nearness or farness and a predetermined keyword which is modified by the time-representing word and which indicates settings associated with the function; a setting selector configured to select a newest set value when the word detector has detected the time-representing word indicating temporal nearness, and to select a set value used when the user used the function in the past when the word detector has detected the time-representing word indicating temporal farness; and a function executor configured to execute the function determined by the function determiner using the set value selected by the setting selector.
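The setting-selector behaviour described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the patented implementation: the word lists, the whitespace-style tokenizer standing in for a real morpheme analyzer, the function name `select_set_value`, and the choice of the most recently used past value for the "temporal farness" case.

```python
import re

# Hypothetical vocabularies; the abstract only says they are "predetermined".
NEAR_WORDS = {"latest", "newest", "recent"}
FAR_WORDS = {"previous", "earlier", "old"}
SETTING_KEYWORDS = {"setting", "settings"}


def select_set_value(message, newest_value, past_values):
    """Pick a set value based on a time word that modifies a settings keyword.

    A simple regex tokenizer stands in for morpheme analysis (assumption).
    Returns None when no time word directly precedes a settings keyword.
    """
    tokens = re.findall(r"[a-z']+", message.lower())
    for modifier, head in zip(tokens, tokens[1:]):
        if head not in SETTING_KEYWORDS:
            continue
        if modifier in NEAR_WORDS:
            # Temporal nearness: use the newest set value.
            return newest_value
        if modifier in FAR_WORDS and past_values:
            # Temporal farness: use a value from a past use of the function
            # (here, the most recent past use; this choice is an assumption).
            return past_values[-1]
    return None


newest = {"dpi": 600}
past = [{"dpi": 150}, {"dpi": 300}]
near_choice = select_set_value("Copy this with the latest settings", newest, past)
far_choice = select_set_value("Copy this with the previous settings", newest, past)
```

In this sketch `near_choice` is the newest value and `far_choice` is the value from the most recent past use; a message with no time word before a settings keyword yields `None`.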
SPEECH RECOGNITION DEVICE AND OPERATING METHOD THEREOF
Provided are a method and device for speech recognition. The speech recognition method includes: receiving a speech signal generated by an utterance of a user; identifying a named entity from the received speech signal; determining a speech signal portion, which corresponds to the identified named entity, from the received speech signal; generating a first acoustic embedding vector corresponding to the speech signal portion, based on an acoustic embedding model; determining a second acoustic embedding vector that is one of a plurality of acoustic embedding vectors corresponding to a plurality of named entities included in an acoustic embedding database (DB), based on distances between the plurality of acoustic embedding vectors and the first acoustic embedding vector; determining a corrected named entity corresponding to the second acoustic embedding vector; and providing a result of speech recognition with respect to the speech signal, based on the corrected named entity.
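The distance-based correction step in this abstract can be sketched as a nearest-neighbour lookup. This is a minimal illustration under stated assumptions: the embedding vectors are made-up three-dimensional values, Euclidean distance is one possible choice of distance (the abstract does not name one), and `correct_named_entity` and the database contents are hypothetical names.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correct_named_entity(first_embedding, embedding_db):
    """Return the named entity whose stored embedding is closest to the query.

    embedding_db maps named-entity strings to acoustic embedding vectors.
    """
    return min(embedding_db, key=lambda name: euclidean(first_embedding, embedding_db[name]))


# Toy database of acoustic embeddings (values are illustrative only).
embedding_db = {
    "Beethoven": [0.9, 0.1, 0.0],
    "Bach": [0.1, 0.8, 0.2],
    "Brahms": [0.0, 0.2, 0.9],
}

# Embedding of the speech portion that was recognized as a named entity.
first_embedding = [0.8, 0.2, 0.1]
corrected = correct_named_entity(first_embedding, embedding_db)
```

Here `corrected` is the entry with the smallest distance to the query vector, which plays the role of the "second acoustic embedding vector" in the abstract. A real system would obtain the vectors from an acoustic embedding model and would likely use an indexed search for large databases.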
Systems and methods related to automated transcription of voice communications
A method for selectively transcribing voice communications that includes: receiving keywords; receiving an audio stream of audio data of speech; searching the audio stream to detect occurrences of the keywords (keyword detections) and recording, for each keyword detection, parameter data that includes a location of the keyword within the audio stream; generating one or more cumulative datasets for one or more portions of the audio stream, each of which includes the parameter data for the keyword detections occurring therein; for each of the one or more portions of the audio stream, calculating a transcription favorableness score by inputting the corresponding one of the one or more cumulative datasets into an algorithm; and determining whether to transcribe each of the one or more portions of the audio stream by comparing the corresponding transcription favorableness score against a predetermined threshold.
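The score-and-threshold decision in this abstract can be sketched as follows. The abstract does not specify the scoring algorithm, so the formula below (confidence-weighted keyword detections per minute of audio) is purely an assumed stand-in, as are the `Detection` fields, the function names, and the default threshold.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Parameter data recorded for one keyword detection (fields are assumptions)."""
    keyword: str
    location_s: float   # position of the keyword within the audio stream, in seconds
    confidence: float   # detector confidence in [0, 1]


def favorableness(detections, portion_len_s):
    """Hypothetical score: summed detection confidence per minute of audio."""
    if portion_len_s <= 0:
        return 0.0
    return sum(d.confidence for d in detections) / (portion_len_s / 60.0)


def should_transcribe(detections, portion_len_s, threshold=1.0):
    """Transcribe the portion only if its score reaches the threshold."""
    return favorableness(detections, portion_len_s) >= threshold


# Cumulative dataset for a dense 60-second portion.
busy = [Detection("refund", 4.2, 0.9), Detection("cancel", 31.0, 0.8)]
# Cumulative dataset for a sparse 120-second portion.
quiet = [Detection("refund", 77.5, 0.5)]

transcribe_busy = should_transcribe(busy, 60.0)
transcribe_quiet = should_transcribe(quiet, 120.0)
```

With these illustrative numbers the dense portion clears the threshold and the sparse one does not. Any real scoring function could be swapped in for `favorableness` without changing the decision step.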
SOUND SIGNAL DATABASE GENERATION APPARATUS, SOUND SIGNAL SEARCH APPARATUS, SOUND SIGNAL DATABASE GENERATION METHOD, SOUND SIGNAL SEARCH METHOD, DATABASE GENERATION APPARATUS, DATA SEARCH APPARATUS, DATABASE GENERATION METHOD, DATA SEARCH METHOD, AND PROGRAM
Provided are database generation techniques that can accurately and efficiently generate a database usable in text-based sound signal search. A sound signal database generation apparatus includes: a latent variable generation unit that generates, from a sound signal, a latent variable corresponding to the sound signal using a sound signal encoder; a data generation unit that generates a natural language representation corresponding to the sound signal from the latent variable and a condition concerning an index for the natural language representation using a natural language representation decoder; and a sound signal database generation unit that generates a record including the sound signal and the natural language representation corresponding to the sound signal, and generates a sound signal database made up of such records.