G10L25/24

Device wakeup method and apparatus, electronic device, and storage medium

The present disclosure relates to a device wakeup method and apparatus, an electronic device, and a storage medium. The wakeup method is applied to a first electronic device and includes: receiving a wakeup message from a second electronic device and, when the present state is determined to be an unawakened state, acquiring locally collected voice data; performing Mel-frequency cepstral coefficient (MFCC) extraction on the voice data to obtain a first MFCC; parsing the wakeup message to obtain a second MFCC included in it; matching the first MFCC against the second MFCC and generating a wakeup instruction when the difference between them is less than or equal to a set threshold value; and waking up the first electronic device responsive to the wakeup instruction.
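
The matching step can be illustrated with a minimal sketch. The abstract does not say how the difference between the two MFCCs is measured, so the mean per-frame Euclidean distance used here is an assumption, as are the function names `mfcc_difference` and `should_wake` and the choice to compare only the overlapping frames.

```python
import numpy as np

def mfcc_difference(first_mfcc, second_mfcc):
    """Mean per-frame Euclidean distance between two MFCC matrices.

    Both inputs are (frames x coefficients). The distance measure is an
    assumption; the abstract only speaks of "a difference".
    """
    a = np.asarray(first_mfcc, dtype=float)
    b = np.asarray(second_mfcc, dtype=float)
    n = min(len(a), len(b))  # assumption: compare only the overlapping frames
    if n == 0:
        raise ValueError("both MFCC matrices need at least one frame")
    return float(np.mean(np.linalg.norm(a[:n] - b[:n], axis=1)))

def should_wake(is_awake, first_mfcc, second_mfcc, threshold):
    """Return True when a wakeup instruction should be generated."""
    if is_awake:
        # The method only applies in the unawakened state.
        return False
    return mfcc_difference(first_mfcc, second_mfcc) <= threshold
```

Here `first_mfcc` stands for the features extracted from the locally collected voice data and `second_mfcc` for the features parsed from the received wakeup message.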

Smart device input method based on facial vibration
11662610 · 2023-05-30

A smart device input method based on facial vibration includes: collecting a facial vibration signal generated when a user performs voice input; extracting a Mel-frequency cepstral coefficient from the facial vibration signal; and taking the Mel-frequency cepstral coefficient as an observation sequence to obtain text input corresponding to the facial vibration signal by using a trained hidden Markov model. The facial vibration signal is collected by a vibration sensor arranged on glasses. The vibration signal is processed by: amplifying the collected facial vibration signal; transmitting the amplified facial vibration signal to the smart device via a wireless module; and intercepting a section from the received facial vibration signal as an effective portion and extracting the Mel-frequency cepstral coefficient from the effective portion by the smart device.
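
A rough sketch of the recognition step follows, treating a sequence of MFCC frames as the observation sequence of a hidden Markov model. The abstract does not describe the model structure, so everything specific here is an assumption: one HMM per candidate word, diagonal Gaussian emissions, Viterbi scoring, and the names `log_gauss`, `viterbi_score`, and `recognize`.

```python
import numpy as np

def log_gauss(obs, means, variances):
    """Log density of each frame under each state's diagonal Gaussian.

    obs: (T, D) MFCC frames; means, variances: (S, D). Returns (T, S).
    """
    diff = obs[:, None, :] - means[None, :, :]
    terms = np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None]
    return -0.5 * np.sum(terms, axis=2)

def viterbi_score(obs, log_start, log_trans, means, variances):
    """Log probability of the best state path for the observation sequence."""
    log_b = log_gauss(obs, means, variances)
    delta = log_start + log_b[0]
    for t in range(1, len(obs)):
        # log_trans[i, j] is the log probability of moving from state i to j.
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_b[t]
    return float(np.max(delta))

def recognize(obs, models):
    """Pick the word whose (hypothetical) trained HMM scores the frames best."""
    return max(models, key=lambda word: viterbi_score(obs, *models[word]))
```

Each entry of `models` maps a word to a tuple `(log_start, log_trans, means, variances)`; in practice these parameters would come from training, which this sketch does not cover.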

Sound event detection learning

A device includes a processor configured to receive audio data samples and provide the audio data samples to a first neural network to generate a first output corresponding to a first set of sound classes. The processor is further configured to provide the audio data samples to a second neural network to generate a second output corresponding to a second set of sound classes. A second count of classes of the second set of sound classes is greater than a first count of classes of the first set of sound classes. The processor is also configured to provide the first output to a neural adapter to generate a third output corresponding to the second set of sound classes. The processor is further configured to provide the second output and the third output to a merger adapter to generate sound event identification data based on the audio data samples.
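
The data flow between the two networks, the neural adapter, and the merger adapter can be sketched as below. The abstract does not specify the internals of either adapter, so the single linear layers, the concatenation inside the merger, and the parameter names in `params` are all assumptions made for illustration.

```python
import numpy as np

def neural_adapter(first_output, weight, bias):
    """Map the first network's output (n1 classes) to the second class set (n2).

    Assumed here to be a single linear layer with weight shape (n2, n1).
    """
    return weight @ first_output + bias

def merger_adapter(second_output, third_output, weight, bias):
    """Combine the second and third outputs into final class scores.

    Assumed here: concatenate both n2-sized vectors, then one linear layer
    with weight shape (n2, 2 * n2).
    """
    merged = np.concatenate([second_output, third_output])
    return weight @ merged + bias

def detect(first_output, second_output, params):
    """Return the index of the best-scoring sound class and all scores."""
    third = neural_adapter(first_output, params["adapter_w"], params["adapter_b"])
    scores = merger_adapter(second_output, third,
                            params["merger_w"], params["merger_b"])
    return int(np.argmax(scores)), scores
```

`first_output` and `second_output` stand for the outputs of the first and second neural networks on the same audio data samples; the second class set is the larger one, so `n2 > n1`.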

HEALTH MONITORING SYSTEM AND APPLIANCE
20230162732 · 2023-05-25

Systems and methods are disclosed. A digitized human vocal expression of a user and digital images are received over a network from a remote device. The digitized human vocal expression is processed to determine characteristics of the human vocal expression, including: pitch, volume, rapidity, a magnitude spectrum identify, and/or pauses in speech. Digital images are received and processed to detect characteristics of the user's face, including detecting if any of the following is present: a sagging lip, a crooked smile, uneven eyebrows, and/or facial droop. Using the human vocal expression characteristics and face characteristics, a determination is made as to what action is to be taken. A cepstrum pitch may be determined using an inverse Fourier transform of a logarithm of a spectrum of a human vocal expression signal. The volume may be determined using peak heights in a power spectrum of the human vocal expression.
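
The cepstrum pitch computation described above (an inverse Fourier transform of the logarithm of the spectrum) can be sketched as follows. The Hann window, the 50 to 400 Hz search range, the small constant added before the logarithm, and the function name `cepstrum_pitch` are assumptions, not details from the abstract.

```python
import numpy as np

def cepstrum_pitch(signal, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate pitch in Hz from the peak of the real cepstrum.

    fmin and fmax bound the search and are assumed values for speech.
    """
    signal = np.asarray(signal, dtype=float)
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed))
    log_spectrum = np.log(spectrum + 1e-12)  # small offset avoids log(0)
    cepstrum = np.fft.irfft(log_spectrum)    # inverse transform of log spectrum
    # Quefrency is measured in samples; a pitch f maps to sample_rate / f.
    qmin = int(sample_rate / fmax)
    qmax = min(int(sample_rate / fmin), len(cepstrum) // 2 - 1)
    peak = qmin + int(np.argmax(cepstrum[qmin:qmax + 1]))
    return sample_rate / peak
```

The function expects one frame of a voiced signal that spans several pitch periods; frames without voicing would still return a value, so a real system would need a voicing check that this sketch omits.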

KEYWORD SPOTTING APPARATUS, METHOD, AND COMPUTER-READABLE RECORDING MEDIUM THEREOF
20230162724 · 2023-05-25

A keyword spotting apparatus, method, and computer-readable recording medium are disclosed. The keyword spotting method using an artificial neural network according to an embodiment of the disclosure may include obtaining an input feature map from an input voice; performing a first convolution operation on the input feature map for each of n different filters having the same channel length as the input feature map, wherein a width of each of the filters is w1 and the width w1 is less than a width of the input feature map; performing a second convolution operation on a result of the first convolution operation for each of different filters having the same channel length as the input feature map; storing a result of the second convolution operation as an output feature map; and extracting a voice keyword by applying the output feature map to a learned machine learning model.
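
The shape of the first convolution can be illustrated with a minimal sketch: each filter spans every channel of the input feature map and slides only along its width. The `(channels, width)` layout, the absence of stride and padding, and the function name `temporal_conv` are assumptions. Applying the same function a second time, as in the usage note below, additionally assumes that the number of first-stage filters equals the channel count so that the second-stage filters fit.

```python
import numpy as np

def temporal_conv(feature_map, filters):
    """Slide n full-channel filters along the width of a feature map.

    feature_map: (C, W); filters: (n, C, w) with w < W.
    Returns an array of shape (n, W - w + 1).
    """
    channels, width = feature_map.shape
    n, f_channels, w = filters.shape
    if f_channels != channels:
        raise ValueError("filters must span every channel of the feature map")
    if w >= width:
        raise ValueError("filter width must be less than the feature map width")
    out = np.empty((n, width - w + 1))
    for i in range(width - w + 1):
        window = feature_map[:, i:i + w]  # (C, w) slice under the filters
        out[:, i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return out
```

With a feature map of shape `(4, 10)` and four filters of width 3, the first call yields shape `(4, 8)`, and a second call with another four filters of width 3 yields `(4, 6)`.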

SYSTEMS AND METHODS FOR SPEECH ENHANCEMENT USING ATTENTION MASKING AND END TO END NEURAL NETWORKS
20230162758 · 2023-05-25

A neural network-based end-to-end single-channel speech enhancement system is designed for joint suppression of noise and reverberation and can include attention masking. The neural network architecture can contain both an enhancement path and an autoencoder path, so that disabling the masking mechanism causes reconstruction of the input speech signal. A novel loss function that includes a perceptually-motivated waveform distance measure can be used to train the enhancement and autoencoder paths simultaneously. Examples enable dynamic control of the level of suppression applied via a minimum gain level. Examples provide significant levels of noise suppression while maintaining high speech quality. Examples can also improve the performance of automated speech systems, such as speaker and language recognition, when used as a pre-processing step.
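
The role of the minimum gain level can be shown with a small sketch. It assumes the mask is a set of multiplicative gains in the range 0 to 1 and that the minimum gain is given in decibels; the abstract states neither, and the name `apply_mask` and its parameters are hypothetical.

```python
import numpy as np

def apply_mask(features, mask, min_gain_db=-20.0, masking_enabled=True):
    """Apply a mask to features with a floor on the gain.

    Raising min_gain_db reduces the amount of suppression. With masking
    disabled the features pass through unchanged, which corresponds to the
    autoencoder path reconstructing the input.
    """
    features = np.asarray(features, dtype=float)
    if not masking_enabled:
        return features.copy()
    min_gain = 10.0 ** (min_gain_db / 20.0)  # decibels to linear gain
    gain = np.clip(np.asarray(mask, dtype=float), min_gain, 1.0)
    return features * gain
```

Because the floor is applied at inference time, the same trained mask can be used with different suppression levels by changing `min_gain_db` alone.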
