G10L2015/022

Real-Time Accent Conversion Model
20220358903 · 2022-11-10 ·

Techniques for real-time accent conversion are described herein. An example computing device receives an indication of a first accent and a second accent. The computing device further receives, via at least one microphone, speech content having the first accent. The computing device is configured to derive, using a first machine-learning algorithm trained with audio data including the first accent, a linguistic representation of the received speech content having the first accent. The computing device is configured to, based on the derived linguistic representation, synthesize, using a second machine-learning algorithm trained with (i) audio data including the first accent and (ii) audio data including the second accent, audio data representative of the received speech content having the second accent. The computing device is configured to convert the synthesized audio data into a synthesized version of the received speech content having the second accent.
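
The two-stage structure described above can be sketched in miniature. The sketch below uses toy stand-ins for the two trained models (hypothetical token tables rather than learned networks): stage one maps first-accent acoustic tokens to an accent-neutral linguistic representation, and stage two maps that representation to second-accent tokens.

```python
# Stage 1 stand-in: the model trained on first-accent audio.
# All token tables here are hypothetical, for illustration only.
FIRST_ACCENT_TO_PHONEME = {"toh": "t", "mah": "m", "ey": "e"}

# Stage 2 stand-in: the model trained on audio from both accents.
PHONEME_TO_SECOND_ACCENT = {"t": "tuh", "m": "muh", "e": "ee"}

def derive_linguistic_representation(tokens):
    """Stage 1: accented speech -> accent-independent phoneme sequence."""
    return [FIRST_ACCENT_TO_PHONEME[t] for t in tokens]

def synthesize_second_accent(phonemes):
    """Stage 2: phoneme sequence -> acoustic tokens in the second accent."""
    return [PHONEME_TO_SECOND_ACCENT[p] for p in phonemes]

def convert_accent(tokens):
    return synthesize_second_accent(derive_linguistic_representation(tokens))

print(convert_accent(["toh", "mah", "ey"]))  # ['tuh', 'muh', 'ee']
```

The point of the intermediate representation is that it is accent-neutral: the second stage never sees first-accent acoustics, only phonemes.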

Speech recognition method and apparatus
11664020 · 2023-05-30 ·

A speech recognition method comprises: generating, based on a preset speech knowledge source, a search space that comprises preset client information and is used for decoding a speech signal; extracting a characteristic vector sequence of a to-be-recognized speech signal; calculating a probability that each characteristic vector corresponds to each basic unit of the search space; and executing a decoding operation in the search space by using the probabilities as input to obtain a word sequence corresponding to the characteristic vector sequence.
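
The decoding step lends itself to a small illustration. The following is a minimal Viterbi sketch over a hypothetical search space of four basic units spelling two words; the per-frame probability dicts stand in for the characteristic-vector scores, and none of the names come from the patent.

```python
import math

# Hypothetical search space: allowed transitions between basic units
# and the word each complete unit path spells. This is the smallest
# graph that exercises the decoder, not the patent's search space.
START = {"h", "y"}
TRANSITIONS = {"h": ["h", "i"], "i": ["i"], "y": ["y", "o"], "o": ["o"]}
WORDS = {("h", "i"): "hi", ("y", "o"): "yo"}

def viterbi(frame_probs):
    """frame_probs: one dict per frame mapping each basic unit to the
    probability that the frame's characteristic vector matches it."""
    best = {u: (math.log(frame_probs[0][u]), [u]) for u in START}
    for probs in frame_probs[1:]:
        nxt = {}
        for u, (score, path) in best.items():
            for v in TRANSITIONS[u]:
                cand = score + math.log(probs[v])
                if v not in nxt or cand > nxt[v][0]:
                    nxt[v] = (cand, path + [v])
        best = nxt
    _, path = max(best.values())
    # collapse repeated units, then look up the word the path spells
    collapsed = tuple(u for i, u in enumerate(path) if i == 0 or u != path[i - 1])
    return WORDS.get(collapsed)
```

For example, three frames scoring "h" then "i" highest decode to "hi"; the log domain keeps long products of probabilities from underflowing.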

Sound sample verification for generating sound detection model

A method for verifying at least one sound sample to be used in generating a sound detection model in an electronic device includes receiving a first sound sample; extracting a first acoustic feature from the first sound sample; receiving a second sound sample; extracting a second acoustic feature from the second sound sample; and determining whether the second acoustic feature is similar to the first acoustic feature.
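
A minimal sketch of that verification flow, assuming a hypothetical band-energy feature in place of a real acoustic feature (an actual system would use something like MFCCs) and cosine similarity as the similarity test:

```python
import math

def acoustic_feature(samples, n_bands=4):
    """Hypothetical stand-in feature: mean absolute amplitude over
    equal-length chunks of the sample."""
    chunk = max(1, len(samples) // n_bands)
    return [sum(abs(s) for s in samples[i * chunk:(i + 1) * chunk]) / chunk
            for i in range(n_bands)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_similar(first_sample, second_sample, threshold=0.9):
    """Accept the second sample only if its feature is close enough
    to the first sample's feature."""
    return cosine_similarity(acoustic_feature(first_sample),
                             acoustic_feature(second_sample)) >= threshold
```

The threshold value is illustrative; in practice it would be tuned so that only samples of the same target sound pass verification.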

Systems and methods for an automatic language characteristic recognition system

In some embodiments, a method of creating an automatic language characteristic recognition system is provided. The method can include receiving a plurality of audio recordings. The method also can include segmenting each of the plurality of audio recordings to create a plurality of audio segments for each audio recording. The method additionally can include clustering each audio segment of the plurality of audio segments according to audio characteristics of each audio segment to form a plurality of audio segment clusters. Other embodiments are provided.
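
The clustering step can be illustrated with a toy example. The sketch below assumes each audio segment has been reduced to a single hypothetical characteristic value (say, mean pitch) and groups the values with 1-D k-means; the actual features and clustering method are not specified in the abstract.

```python
def kmeans_1d(values, k=2, iters=20):
    """Cluster scalar segment characteristics into k groups (k >= 2)."""
    # initialise centroids spread across the sorted value range
    srt = sorted(values)
    centroids = [srt[i * (len(srt) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # assign each segment's characteristic to its nearest centroid
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        # move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters
```

With mean-pitch values from two speakers, say around 105 Hz and around 305 Hz, the two groups separate after the first iteration.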

METHODS AND SYSTEMS FOR IDENTIFYING KEYWORDS IN SPEECH SIGNAL
20170301341 · 2017-10-19 ·

The disclosed embodiments relate to a method of keyword recognition in a speech signal. The method includes determining a first likelihood score and a second likelihood score of one or more features of a frame of said speech signal being associated with one or more states in a first model and one or more states in a second model, respectively. The one or more states in the first model correspond to one or more tied triphone states, and the one or more states in the second model correspond to one or more monophone states of a keyword to be recognized in the speech signal. The method further includes determining a third likelihood score based on the first likelihood score and the second likelihood score. The first likelihood score and the third likelihood score are utilizable to determine the presence of the keyword in the speech signal.
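
The abstract does not fix how the third score is computed from the first two; a log-likelihood ratio is one common, illustrative choice. A minimal sketch with toy per-frame likelihoods:

```python
import math

def keyword_score(triphone_likelihoods, monophone_likelihoods):
    """Illustrative third score: mean log-ratio of the first (tied
    triphone) model's per-frame likelihoods to the second (monophone)
    model's. This combination rule is an assumption, not the patent's."""
    ratios = [math.log(t / m)
              for t, m in zip(triphone_likelihoods, monophone_likelihoods)]
    return sum(ratios) / len(ratios)

def keyword_present(tri, mono, threshold=0.5):
    # declare the keyword present when the combined score clears a
    # (hypothetical) decision threshold
    return keyword_score(tri, mono) >= threshold
```

Frames where the triphone model consistently outscores the monophone model push the ratio up, signalling the keyword; comparable scores keep it near zero.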

METHOD AND APPARATUS FOR EXEMPLARY MORPHING COMPUTER SYSTEM
20170249953 · 2017-08-31 ·

A method and apparatus for reducing the size of databases required for recorded speech data are provided.

Monotone speech detection

Examples of the present disclosure describe systems and methods for detecting monotone speech. In aspects, audio data provided by a user may be received by a device. Pitch values may be calculated and/or extracted from the audio data. The non-zero pitch values may be divided into clusters. For each cluster, a Pitch Variation Quotient (PVQ) value may be calculated. The weighted average of PVQ values across the clusters may be calculated and compared to a threshold for determining monotone speech. Based on the comparison, the audio data may be classified as monotone or non-monotone and an indication of the classification may be provided to the user in real-time via a user interface. Upon the completion of the audio session in which the audio data is received, feedback for the audio data may be provided to the user via the user interface.
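
A minimal sketch of the PVQ computation, assuming PVQ is the coefficient of variation of pitch (standard deviation over mean) and using fixed-size chunks of consecutive voiced frames as a stand-in for the clustering step:

```python
import statistics

def pvq(pitches):
    """Pitch Variation Quotient, taken here as the coefficient of
    variation: standard deviation of pitch over mean pitch."""
    return statistics.pstdev(pitches) / statistics.fmean(pitches)

def is_monotone(pitch_values, chunk=5, threshold=0.05):
    voiced = [p for p in pitch_values if p > 0]  # zero pitch = unvoiced frame
    # stand-in clustering: fixed-size chunks of consecutive voiced frames
    clusters = [voiced[i:i + chunk] for i in range(0, len(voiced), chunk)]
    clusters = [c for c in clusters if len(c) > 1]
    total = sum(len(c) for c in clusters)
    # size-weighted average of per-cluster PVQ values
    weighted = sum(len(c) * pvq(c) for c in clusters) / total
    return weighted < threshold
```

The threshold is illustrative: a flat pitch track yields a PVQ near zero and is flagged monotone, while natural pitch movement of tens of hertz easily exceeds it.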

Training acoustic models using connectionist temporal classification
11341958 · 2022-05-24 ·

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
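
A CTC acoustic model's per-frame outputs include a "blank" label alongside the ordinary labels. As an illustration of how such output becomes a transcription (standard greedy CTC decoding, not the patent's training procedure): take the best label per frame, collapse adjacent repeats, then drop blanks.

```python
BLANK = "-"

def ctc_greedy_decode(frame_label_probs):
    """Greedy CTC decoding: best label per frame, collapse adjacent
    repeats, then remove blanks."""
    best = [max(dist, key=dist.get) for dist in frame_label_probs]
    collapsed = [l for i, l in enumerate(best) if i == 0 or l != best[i - 1]]
    return "".join(l for l in collapsed if l != BLANK)
```

A blank between two identical labels is what lets CTC emit genuine double letters: the frame sequence h, h, -, e, l, -, l, o decodes to "hello", where the blank separates the two l's.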

Training acoustic models using connectionist temporal classification
11769493 · 2023-09-26 ·

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training acoustic models and using the trained acoustic models. A connectionist temporal classification (CTC) acoustic model is accessed, the CTC acoustic model having been trained using a context-dependent state inventory generated from approximate phonetic alignments determined by another CTC acoustic model trained without fixed alignment targets. Audio data for a portion of an utterance is received. Input data corresponding to the received audio data is provided to the accessed CTC acoustic model. Data indicating a transcription for the utterance is generated based on output that the accessed CTC acoustic model produced in response to the input data. The data indicating the transcription is provided as output of an automated speech recognition service.
