Patent classifications
G10L2025/906
REAL-TIME PITCH TRACKING BY DETECTION OF GLOTTAL EXCITATION EPOCHS IN SPEECH SIGNAL USING HILBERT ENVELOPE
A technique, suitable for real-time processing, is disclosed for pitch tracking by detecting glottal excitation epochs in a speech signal. It uses the Hilbert envelope to enhance the saliency of the glottal excitation epochs and to reduce the ripples caused by the vocal tract filter. The processing comprises dynamic range compression, calculation of the Hilbert envelope, and epoch marking. The Hilbert envelope is calculated from the output of an FIR-filter-based Hilbert transformer and the delay-compensated input signal. Epoch marking uses a dynamic peak detector with fast rise and slow fall, together with nonlinear smoothing, to further enhance the saliency of the epochs; this is followed by a differentiator or a Teager energy operator, and then by amplitude-duration thresholding. The technique is intended for use in speech codecs, voice conversion, speech and speaker recognition, diagnosis of voice disorders, speech training aids, and other applications involving pitch estimation.
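The envelope and epoch-marking stages can be illustrated with a minimal pure-Python sketch. It assumes a 31-tap Hamming-windowed ideal Hilbert transformer, a peak detector whose slow fall is a fixed exponential decay, a first-difference differentiator, and, as the amplitude-duration rule, a threshold relative to the largest difference plus a minimum gap between epochs. The dynamic range compression and nonlinear smoothing steps are omitted. All function names and parameter values (`decay`, `rel_thresh`, `min_gap`) are illustrative assumptions, not taken from the patent.

```python
import math

def fir_hilbert(num_taps=31):
    """Hamming-windowed ideal Hilbert transformer (odd length, antisymmetric FIR)."""
    m = num_taps // 2
    h = []
    for k in range(num_taps):
        n = k - m
        ideal = 0.0 if n % 2 == 0 else 2.0 / (math.pi * n)
        h.append(ideal * (0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))))
    return h

def hilbert_envelope(x, num_taps=31):
    """Causal envelope; output sample i refers to input sample i - num_taps // 2."""
    h, m = fir_hilbert(num_taps), num_taps // 2
    env = []
    for i in range(len(x)):
        # Hilbert transformer output (delayed by m samples)
        y = sum(h[k] * x[i - k] for k in range(num_taps) if i - k >= 0)
        # input delayed by the same m samples (delay compensation)
        xd = x[i - m] if i >= m else 0.0
        env.append(math.hypot(xd, y))
    return env

def peak_detect(env, decay=0.99):
    """Fast rise (jump to any larger input), slow exponential fall."""
    out, p = [], 0.0
    for v in env:
        p = v if v > p else p * decay
        out.append(p)
    return out

def teager(x):
    """Discrete Teager energy operator x[n]^2 - x[n-1]x[n+1] (alternative to differencing)."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def mark_epochs(x, decay=0.99, rel_thresh=0.3, min_gap=40):
    """Return epoch indices (delayed by 15 samples for the default 31-tap transformer)."""
    p = peak_detect(hilbert_envelope(x), decay)
    d = [0.0] + [p[i] - p[i - 1] for i in range(1, len(p))]  # differentiator
    thr = rel_thresh * max(d)                                 # amplitude threshold
    epochs, last = [], -min_gap
    for i, v in enumerate(d):
        if v > thr and i - last >= min_gap:                   # duration threshold
            epochs.append(i)
            last = i
    return epochs
```

The pitch period is the spacing between successive epochs; for a synthetic train of decaying resonances repeated every 80 samples, successive epochs after the first come out 80 samples apart.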
Speaker recognition and speaker change detection
A method of speaker recognition comprises: receiving an audio signal comprising speech; performing a biometric process on a first part of the audio signal, wherein the first part of the audio signal extends over a first time period; obtaining a speaker recognition score from the biometric process for the first part of the audio signal; performing a biometric process on a plurality of second parts of the audio signal, wherein the second parts of the audio signal are successive sections of the first part of the audio signal, and wherein each second part of the audio signal extends over a second time period and the second time period is shorter than the first time period; obtaining a respective speaker recognition score from the biometric process for each second part of the audio signal; and determining whether there has been a speaker change based on the respective speaker recognition scores for successive second parts of the audio signal.
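One way to read the change-detection step is sketched below. It assumes, hypothetically, that the biometric process yields one score per short frame, that a part's score is the mean of its frame scores, and that a speaker change is declared when two successive second parts differ by more than `max_jump`; the abstract does not specify the decision rule, and the function names are illustrative.

```python
from statistics import mean

def second_part_scores(frame_scores, part_len):
    """Score each successive, non-overlapping second part of the first part."""
    return [mean(frame_scores[i:i + part_len])
            for i in range(0, len(frame_scores) - part_len + 1, part_len)]

def detect_speaker_change(frame_scores, part_len, max_jump):
    """Return (first-part score, second-part scores, index of change or None)."""
    first_score = mean(frame_scores)                   # score over the whole first part
    parts = second_part_scores(frame_scores, part_len)
    for i in range(1, len(parts)):
        if abs(parts[i] - parts[i - 1]) > max_jump:    # large jump between successive parts
            return first_score, parts, i
    return first_score, parts, None
```

With six frames scoring 0.9 followed by six scoring 0.1 and `part_len=3`, the change is reported at the third second part (index 2), while the single first-part score (about 0.5) alone would not reveal it.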
Machine-learning-based call routing system
Machine learning technology can analyze, in real time, the data from a call between a person and a customer service representative. Based on this analysis, a server can determine a sentiment score that describes the sentiment expressed by the person or by the customer service representative. If the server determines that the sentiment score is less than or equal to a pre-determined value, it can inform the customer service representative's manager so that the manager can take further action to help the person and/or the representative.
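The escalation rule itself reduces to a comparison against the pre-determined value. In this sketch the sentiment score is assumed to have been computed already by the machine-learning model, and `notify_manager` is a hypothetical callback standing in for whatever alerting mechanism the server uses.

```python
from typing import Callable

def check_sentiment(score: float, threshold: float,
                    notify_manager: Callable[[float], None]) -> bool:
    """Alert the representative's manager when the score is at or below the threshold."""
    if score <= threshold:   # "less than or equal to a pre-determined value"
        notify_manager(score)
        return True
    return False
```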
Speech processing method, apparatus, and non-transitory computer-readable storage medium for storing a computer program for pitch frequency detection based upon a learned value
A speech processing method for estimating a pitch frequency includes: for each frame determined to be a speech-like frame, specifying a fundamental sound by using a plurality of local maximum values included in the spectrum of that frame; obtaining a learned value by performing learning processing on the magnitudes of the fundamental sounds so specified, the learned value including the average value and the variance of those magnitudes; and executing a detection process using the learned value, the detection process including detecting the pitch frequency of each frame determined to be a speech-like frame by using a threshold obtained by subtracting the variance included in the learned value from the average value included in the learned value.
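A sketch of the learned-threshold idea follows. The rule for picking the fundamental from the local maxima (the lowest-frequency peak whose magnitude is at least half that of the strongest peak) is an assumption made for illustration, as are the function names and the bin resolution `bin_hz`. The threshold is computed literally as the average minus the variance, as the abstract states.

```python
from statistics import mean, pvariance

def local_maxima(spec):
    """Indices of interior bins that are local maxima of the magnitude spectrum."""
    return [k for k in range(1, len(spec) - 1) if spec[k - 1] < spec[k] >= spec[k + 1]]

def fundamental(spec, rel=0.5):
    """Assumed rule: lowest-frequency local maximum at least `rel` times the strongest one."""
    peaks = local_maxima(spec)
    if not peaks:
        return None
    top = max(spec[k] for k in peaks)
    for k in peaks:                      # peaks are in ascending frequency order
        if spec[k] >= rel * top:
            return k, spec[k]
    return None

def learn(speech_like_frames):
    """Learned value: (average, variance) of the fundamental's magnitude over frames."""
    mags = [f[1] for f in map(fundamental, speech_like_frames) if f is not None]
    return mean(mags), pvariance(mags)

def detect_pitch(spec, learned, bin_hz):
    """Pitch in Hz if the fundamental's magnitude reaches average - variance, else None."""
    avg, var = learned
    f = fundamental(spec)
    if f is None:
        return None
    k, mag = f
    return k * bin_hz if mag >= avg - var else None
```

Note that the variance carries squared magnitude units, so in practice the threshold's behavior depends on how the magnitudes are scaled.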
Method and system for speaker loudness control
A mechanism for adjusting far-end signal loudness based on environmental noise levels and device speaker characteristics has a noise-level analyzer that receives feedback from an intelligent speaker-boosting logic circuit, which in turn provides a signal to a class-D amplifier that drives the speaker. The noise-level analyzer analyzes near-end environmental noise levels and far-end speech input signal levels across critical frequency bands. It performs a masking analysis of the far-end and near-end signals and guides the speaker-boosting logic circuit to apply the determined boost levels in selected bands. The speaker-boosting logic circuit monitors system activity together with the band-selective boosting guidance from the noise-level analyzer. Using device speaker information and the speaker excursion pattern, the speaker-boosting logic circuit adjusts far-end speech loudness without over-excursion of the speaker or damage to the speaker hardware.
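Band-selective boosting can be illustrated with a deliberately simplified model. Instead of a full masking analysis, each band is assumed to need a fixed target margin of far-end speech level over near-end noise, and the boost in each band is capped by a limit that stands in for the speaker-excursion constraint. The function name, target margin, and caps are hypothetical.

```python
def band_boosts_db(speech_db, noise_db, target_margin_db, max_boost_db):
    """Per-band boost (dB) that restores a target speech-over-noise margin, capped per band."""
    boosts = []
    for s, n, cap in zip(speech_db, noise_db, max_boost_db):
        deficit = target_margin_db - (s - n)          # dB still missing in this band
        boosts.append(min(max(deficit, 0.0), cap))    # never cut, never exceed the cap
    return boosts
```

For example, with speech at 60, 55, and 50 dB, noise at 40, 50, and 52 dB, a 10 dB target, and caps of 6, 6, and 3 dB, the first band needs no boost, the second gets 5 dB, and the third is limited to its 3 dB cap.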
Harmony generation device and storage medium
A harmony generation device and a program for the same which can generate a natural harmony sound are provided. The harmony generation device (1) generates first and second harmony tones to which a voice input through a microphone (M) is shifted in pitch by first and second shift amounts calculated based on both the voice input through the microphone (M) and a chord determined from performance information of an electric guitar (G) input through an input device (34). That is, since the first and second harmony tones can be tones based on the chord of the electric guitar (G) that changes from moment to moment, the harmony sound obtained by mixing the first and second harmony tones with the voice input through the microphone (M) can be a natural harmony sound that is rich in variation according to the chord of the electric guitar (G).
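The shift amounts can be derived in several ways. The sketch below assumes one simple rule that the abstract does not spell out: round the sung pitch to the nearest equal-tempered note and take the next two chord tones above it as the first and second harmony tones. The chord is given as a set of pitch classes (0 = C), an illustrative stand-in for the chord determined from the guitar's performance information. A shift of `s` semitones corresponds to a frequency ratio of `2 ** (s / 12)`.

```python
import math

def hz_to_midi(freq_hz):
    """Fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69 + 12 * math.log2(freq_hz / 440.0)

def harmony_shifts(voice_hz, chord_pitch_classes):
    """Semitone shifts to the next two chord tones above the (rounded) sung note."""
    note = round(hz_to_midi(voice_hz))
    above = [n for n in range(note + 1, note + 13) if n % 12 in chord_pitch_classes]
    return above[0] - note, above[1] - note
```

Singing A4 (440 Hz) over a C major chord `{0, 4, 7}` gives shifts of 3 and 7 semitones (C5 and E5); over D minor `{2, 5, 9}` the same note gives 5 and 8 semitones (D5 and F5), so the harmony follows the chord as it changes.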
EXTRACTING CONTENT FROM SPEECH PROSODY
A prosodic speech recognition engine is configured to identify prosodic features and patterns in a speech continuum in order to extract linguistic content, including para-syntactic content, discourse function, information structure, meaning, and speaker sentiment.
Singing assisting system, singing assisting method, and non-transitory computer-readable medium comprising instructions for executing the same
A singing assisting system, a singing assisting method, and a non-transitory computer-readable medium including instructions for executing the method are provided. When the performed singing track is absent during a period in which singing ought to be performed, a singing-continuing procedure is executed. When the performed singing track is off pitch, a pitch adjustment procedure is executed.
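The two procedures can be sketched as a per-period decision. This assumes, for illustration, that the system knows a target note for each period in which singing is expected and has per-frame f0 estimates of the performed track (`None` for frames with no singing), and that "off pitch" means a median deviation of more than 50 cents. The function name, tolerance, and return values are hypothetical.

```python
import math
from statistics import median

def assist(frame_f0, target_hz, tol_cents=50.0):
    """Choose the assisting action for one period in which singing is expected."""
    voiced = [f for f in frame_f0 if f is not None]
    if not voiced:
        return "continue", None                       # trigger singing-continuing procedure
    sung = median(voiced)
    deviation = 1200.0 * math.log2(sung / target_hz)  # deviation in cents
    if abs(deviation) > tol_cents:
        return "adjust", target_hz / sung             # pitch-shift ratio toward the target
    return "ok", 1.0
```

A track sung about a semitone sharp (466.16 Hz against a 440 Hz target) yields an "adjust" action with a ratio of roughly 0.944, while one only a few cents off is left unchanged.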
System and method for relative enhancement of vocal utterances in an acoustically cluttered environment
The invention discloses systems and methods for enhancing the sound of vocal utterances of interest in an acoustically cluttered environment. The system generates canceling signals (sound suppression signals) for the ambient audio environment, and it identifies and characterizes the desired vocal signals and hence one or more vocal streams of interest. Each canceling signal, or collectively the noise-canceling stream, is processed so that signals associated with the desired audio stream or streams are dynamically removed from it. The modified noise-canceling stream is combined (electronically or acoustically) with the ambient sound to produce destructive interference with all ambient sound except the removed audio streams, thus enhancing the vocal streams relative to the unwanted ambient sound. Cepstral analysis may be used to identify the fundamental frequency of a voiced human utterance, and filtering derived from that analysis removes the voiced utterance from the canceling signal.
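Cepstral identification of the fundamental frequency can be sketched as follows: take the log magnitude spectrum of a windowed frame, transform it back to the quefrency domain, and pick the largest peak in the quefrency range of plausible voice pitch. This version uses a naive O(N^2) DFT so that it runs with the standard library alone, and the 60 to 500 Hz search range and `eps` floor are assumed defaults. It covers only the identification step, not the subsequent filtering of the canceling signal.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for short frames)."""
    n = len(x)
    tw = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    return [sum(x[t] * tw[(k * t) % n] for t in range(n)) for k in range(n)]

def estimate_f0_cepstrum(frame, fs, fmin=60.0, fmax=500.0, eps=1e-9):
    """Return fs / q for the quefrency q with the largest real-cepstrum value."""
    n = len(frame)
    win = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]  # periodic Hann
    spec = dft([s * w for s, w in zip(frame, win)])
    log_mag = [math.log(abs(v) + eps) for v in spec]
    qmin = int(fs // fmax)
    qmax = min(int(fs // fmin), n // 2)

    def ceps(q):
        # real cepstrum at quefrency q; the log spectrum is real and even
        return sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n

    best = max(range(qmin, qmax + 1), key=ceps)
    return fs / best
```

For a 400-sample frame at 8 kHz containing the first ten harmonics of 200 Hz, the cepstral peak falls at a quefrency of 40 samples, giving 200 Hz.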