Patent classifications
G10L15/12
Method and terminal device for video recording
Embodiments of the present disclosure provide a method for video recording, an apparatus for video recording, and a terminal device. The method can be applied to a first terminal. The first terminal can be configured to play a live video of a second terminal when the second terminal is live streaming. The method can include: obtaining a user identifier of a target audience logging into the first terminal; obtaining voice data of an anchor in the live video within a time period in response to detecting a first event, wherein the time period is after a current time point; and generating a video through screen recording based on the user identifier and the voice data.
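A minimal Python sketch of the flow this abstract claims on the first (viewer) terminal: keep the logged-in viewer's identifier, and when a trigger event fires, capture the anchor's voice data over a window starting at the current time, tagging the resulting recording with both. All class and method names here are hypothetical illustrations, not the patent's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingSession:
    """Recording state keyed by the viewer's user identifier."""
    user_id: str
    clips: list = field(default_factory=list)


class LiveClient:
    """Hypothetical first-terminal client playing another terminal's live video."""

    def __init__(self, user_id: str):
        # User identifier of the target audience logged into this terminal.
        self.session = RecordingSession(user_id)

    def on_event(self, voice_source, duration_frames: int) -> dict:
        """On detecting the first event, collect the anchor's voice data for a
        window after the current time point, then emit recording metadata."""
        clip = [next(voice_source) for _ in range(duration_frames)]
        self.session.clips.append(clip)
        # The generated video would combine screen capture with this clip;
        # here we just return metadata tying the clip to the user identifier.
        return {"user": self.session.user_id, "frames": len(clip)}
```

In the claimed method the clip would feed a screen-recording step; the sketch stops at collecting the windowed voice data and associating it with the user identifier.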
USER VOICE ACTIVITY DETECTION USING DYNAMIC CLASSIFIER
A device includes a memory configured to store instructions and one or more processors configured to execute the instructions. The one or more processors are configured to execute the instructions to receive audio data including first audio data corresponding to a first output of a first microphone and second audio data corresponding to a second output of a second microphone. The one or more processors are also configured to execute the instructions to provide the audio data to a dynamic classifier. The dynamic classifier is configured to generate a classification output corresponding to the audio data. The one or more processors are further configured to execute the instructions to determine, at least partially based on the classification output, whether the audio data corresponds to user voice activity.
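A rough Python sketch of the two-microphone pipeline the abstract outlines: extract per-frame features from both microphone outputs, pass them to a classifier, and read off a voice-activity decision. The abstract does not specify the dynamic classifier's internals, so a simple adaptive noise-floor tracker stands in for it here; all names and thresholds are illustrative assumptions.

```python
import numpy as np


def frame_features(mic1: np.ndarray, mic2: np.ndarray, frame: int = 256) -> np.ndarray:
    """Per-frame log-energy of each microphone plus the inter-mic level
    difference (the user's mouth is assumed closer to mic1)."""
    n = min(len(mic1), len(mic2)) // frame
    feats = []
    for i in range(n):
        a = mic1[i * frame:(i + 1) * frame]
        b = mic2[i * frame:(i + 1) * frame]
        ea = np.log(np.mean(a ** 2) + 1e-12)
        eb = np.log(np.mean(b ** 2) + 1e-12)
        feats.append((ea, eb, ea - eb))
    return np.array(feats)


class AdaptiveVoiceClassifier:
    """Stand-in 'dynamic classifier': tracks a running noise-floor estimate
    and flags frames whose primary-mic energy and inter-mic difference both
    exceed adaptive thresholds (thresholds in log-energy units)."""

    def __init__(self, alpha=0.95, energy_margin=3.0, diff_margin=1.0):
        self.alpha = alpha
        self.energy_margin = energy_margin
        self.diff_margin = diff_margin
        self.noise_floor = None

    def classify(self, feats) -> list:
        out = []
        for ea, eb, diff in feats:
            if self.noise_floor is None:
                self.noise_floor = ea
            voiced = (ea > self.noise_floor + self.energy_margin
                      and diff > self.diff_margin)
            if not voiced:  # adapt the floor only on non-voice frames
                self.noise_floor = (self.alpha * self.noise_floor
                                    + (1 - self.alpha) * ea)
            out.append(voiced)
        return out
```

The classification output here is a boolean per frame; the claimed determination of user voice activity would be made "at least partially" from such an output, possibly combined with other signals.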
Augmentation of audiographic images for improved machine learning
Generally, the present disclosure is directed to systems and methods that generate augmented training data for machine-learned models via application of one or more augmentation techniques to audiographic images that visually represent audio signals. In particular, the present disclosure provides a number of novel augmentation operations which can be performed directly upon the audiographic image (e.g., as opposed to the raw audio data) to generate augmented training data that results in improved model performance. As an example, the audiographic images can be or include one or more spectrograms or filter bank sequences.
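The masking-style operations described, applied to the audiographic image rather than the raw audio, can be sketched in a few lines of NumPy. This is a generic SpecAugment-style illustration under assumed parameter names, not the disclosure's specific novel operations.

```python
import numpy as np


def augment_spectrogram(spec: np.ndarray, rng,
                        n_freq_masks: int = 1, n_time_masks: int = 1,
                        max_freq_width: int = 8, max_time_width: int = 16) -> np.ndarray:
    """Mask random frequency bands and time spans directly on an
    audiographic image (shape: freq_bins x time_frames), filling the
    masked regions with the image mean. The input image is not modified."""
    out = spec.copy()
    n_freq, n_time = out.shape
    fill = out.mean()
    for _ in range(n_freq_masks):
        w = rng.integers(1, max_freq_width + 1)   # mask width in bins
        f0 = rng.integers(0, max(1, n_freq - w))  # mask start bin
        out[f0:f0 + w, :] = fill
    for _ in range(n_time_masks):
        w = rng.integers(1, max_time_width + 1)   # mask width in frames
        t0 = rng.integers(0, max(1, n_time - w))  # mask start frame
        out[:, t0:t0 + w] = fill
    return out
```

Because the operation works on the image (e.g., a spectrogram or filter bank sequence), it can be applied on the fly during training without re-synthesizing or re-transforming the raw audio.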
Diagnostic Techniques Based on Speech-Sample Alignment
A method includes obtaining a first sequence of reference-sample feature vectors that quantify acoustic features of different respective portions of at least one reference speech sample, which was produced by a subject at a first time while a physiological state of the subject was known, and a second sequence of test-sample feature vectors that quantify the acoustic features of different respective portions of at least one test speech sample, which was produced by the subject at a second time while the physiological state of the subject was unknown. The test-sample feature vectors are mapped to respective ones of the reference-sample feature vectors, under predefined constraints, such that a total distance between the test-sample feature vectors and the respective ones of the reference-sample feature vectors is minimized. In response to the mapping, an output indicating the physiological state of the subject at the second time is generated.
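The constrained, distance-minimizing mapping between test-sample and reference-sample feature vectors is, in effect, a dynamic-time-warping alignment. A minimal sketch, assuming Euclidean distance between feature vectors and the standard monotonicity/continuity constraints (the patent's actual constraints may differ):

```python
import numpy as np


def align(test_feats: np.ndarray, ref_feats: np.ndarray):
    """Map each test-sample feature vector to a reference-sample feature
    vector so that the total pairwise distance along a monotonic,
    continuity-constrained path is minimized (classic DTW)."""
    n, m = len(test_feats), len(ref_feats)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(test_feats[i - 1] - ref_feats[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover the test-index -> reference-index mapping.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = {(i - 1, j - 1): cost[i - 1, j - 1],
                 (i - 1, j): cost[i - 1, j],
                 (i, j - 1): cost[i, j - 1]}
        i, j = min(moves, key=moves.get)
    return cost[n, m], path[::-1]
```

Given such a mapping, a diagnostic output could then be derived from the per-pair distances; how the physiological state is inferred from the alignment is left to the claims.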
Dialogue system and dialogue processing method
A dialogue system for a vehicle may include: an input processor configured to receive a user's utterance, to acquire an utterance text by recognizing the user's utterance, to recognize a dialogue subject based on the acquired utterance text, and to identify the user; and a dialogue manager including a memory storing program instructions and a processor configured to execute the stored program instructions, the dialogue manager configured to verify whether a chat room related to the dialogue subject is present, and to determine whether to add the identified user as a participant of the chat room based on a result of the verification.
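The dialogue manager's verify-then-add flow can be sketched compactly: look up a chat room by the recognized dialogue subject, create one if absent, and add the identified user as a participant. Class and method names below are hypothetical; the real system would sit behind speech recognition and user identification components.

```python
from dataclasses import dataclass, field


@dataclass
class ChatRoom:
    """A chat room keyed by its dialogue subject."""
    subject: str
    participants: set = field(default_factory=set)


class DialogueManager:
    """Verifies whether a chat room for the recognized dialogue subject
    exists, then decides whether to add the identified user to it."""

    def __init__(self):
        self.rooms = {}

    def handle_utterance(self, user_id: str, subject: str) -> ChatRoom:
        room = self.rooms.get(subject)      # verify whether the room exists
        if room is None:                    # create it on first mention
            room = ChatRoom(subject)
            self.rooms[subject] = room
        room.participants.add(user_id)      # add user if not yet a participant
        return room
```

In the claimed system, the input processor would supply `user_id` and `subject` from the recognized utterance; the sketch takes them as plain arguments.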
Diagnostic techniques based on speech-sample alignment
Reference-sample feature vectors that quantify acoustic features of different respective portions of at least one reference speech sample, which was produced by a subject at a first time while a physiological state of the subject was known, are obtained. At least one test speech sample that was produced by the subject at a second time, while the physiological state of the subject was unknown, is received. Test-sample feature vectors that quantify the acoustic features of different respective portions of the test speech sample are computed. The test-sample feature vectors are mapped to respective ones of the reference-sample feature vectors, under predefined constraints, such that a total distance between the test-sample feature vectors and the respective ones of the reference-sample feature vectors is minimized. In response to the mapping, an output indicating the physiological state of the subject at the second time is generated. Other embodiments are also described.
AUTOMATIC SPEECH RECOGNITION SYSTEM ADDRESSING PERCEPTUAL-BASED ADVERSARIAL AUDIO ATTACKS
A computer-implemented method creates a combined audio signal in a speech recognition system. The method includes: sampling an audio input signal to generate a time-domain sampled input signal; converting the time-domain sampled input signal to a frequency-domain input signal; generating perceptual weights in response to frequency components of critical bands of the frequency-domain input signal; creating a time-domain adversary signal in response to the perceptual weights; and combining the time-domain adversary signal with the audio input signal to create a combined audio signal, wherein speech processing of the combined audio signal will output a different result from speech processing of the audio input signal.
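The signal path in the abstract (sample, transform to frequency domain, derive per-band perceptual weights, shape a time-domain adversary, combine) can be sketched as below. The weighting rule and the random perturbation standing in for an optimized adversary are illustrative assumptions; a real attack or defense would derive the perturbation from a psychoacoustic masking model and a target recognizer.

```python
import numpy as np


def perceptual_weights(spectrum: np.ndarray, band_edges: list) -> np.ndarray:
    """Per-bin weight derived from the energy of each critical band:
    louder bands get larger weights, i.e. more masking headroom under
    which a perturbation can hide (a simplifying assumption)."""
    mag = np.abs(spectrum)
    weights = np.ones_like(mag)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band_energy = np.sum(mag[lo:hi] ** 2) + 1e-9
        weights[lo:hi] = band_energy
    return weights / weights.max()


def perceptual_adversary(signal: np.ndarray, band_edges: list,
                         eps: float = 0.05, rng=None):
    """Shape a random perturbation by the perceptual weights in the
    frequency domain, bring it back to the time domain, scale it to a
    small fraction of the signal's peak, and combine with the input."""
    rng = rng or np.random.default_rng(0)
    spectrum = np.fft.rfft(signal)                         # frequency-domain input
    w = perceptual_weights(spectrum, band_edges)
    noise = rng.standard_normal(len(w)) + 1j * rng.standard_normal(len(w))
    adv = np.fft.irfft(w * noise, n=len(signal))           # time-domain adversary
    adv *= eps * np.max(np.abs(signal)) / (np.max(np.abs(adv)) + 1e-12)
    return adv, signal + adv
```

The band edges here index bins of the half spectrum; in practice they would follow a critical-band (e.g., Bark) scale rather than the uniform split used in the test.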