G10L15/25

VIRTUAL OBJECT LIP DRIVING METHOD, MODEL TRAINING METHOD, RELEVANT DEVICES AND ELECTRONIC DEVICE
20220383574 · 2022-12-01

A virtual object lip driving method performed by an electronic device includes: obtaining a speech segment and target face image data about a virtual object; and inputting the speech segment and the target face image data into a first target model to perform a first lip driving operation, so as to obtain first lip image data about the virtual object driven by the speech segment. The first target model is trained in accordance with a first model and a second model, the first model is a lip-speech synchronization discriminative model with respect to lip image data, and the second model is a lip-speech synchronization discriminative model with respect to a lip region in the lip image data.
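The abstract describes training against two lip-speech synchronization discriminators, one scoring the full lip image and one scoring only the lip region. A minimal sketch of how those two discriminative signals might be combined into a single training loss (the weighting and the cross-entropy form are assumptions, not details from the patent):

```python
import math

def sync_loss(score: float) -> float:
    """Binary cross-entropy against the 'in sync' label (1.0)."""
    eps = 1e-7
    return -math.log(max(score, eps))

def combined_driving_loss(full_image_sync_score: float,
                          lip_region_sync_score: float,
                          lip_weight: float = 0.5) -> float:
    """Blend the full-image and lip-region discriminator signals.

    lip_weight is an assumed knob; the patent only states that both
    discriminative models are used to train the first target model.
    """
    return ((1.0 - lip_weight) * sync_loss(full_image_sync_score)
            + lip_weight * sync_loss(lip_region_sync_score))
```

In this sketch, perfectly synchronized predictions (both scores at 1.0) drive the loss to zero, while a desynchronized lip region is penalized even when the full-image score is high.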

Spatially informed audio signal processing for user speech

A device implementing a system for processing speech in an audio signal includes at least one processor configured to receive an audio signal corresponding to at least one microphone of a device, and to determine, using a first model, a first probability that a speech source is present in the audio signal. The at least one processor is further configured to determine, using a second model, a second probability that an estimated location of a source of the audio signal corresponds to an expected position of a user of the device, and to determine a likelihood that the audio signal corresponds to the user of the device based on the first and second probabilities.
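The abstract combines a speech-presence probability from a first model with a location-match probability from a second model. A minimal sketch of that fusion, assuming a logistic stand-in for the speech model, a Gaussian falloff around the expected user angle for the location model, and a simple product fusion (the patent's models are learned; all formulas here are placeholders):

```python
import math

def speech_presence_prob(frame_energy: float, noise_floor: float = 1.0) -> float:
    """Stand-in for the first model: logistic function of the SNR."""
    snr = frame_energy / noise_floor
    return 1.0 / (1.0 + math.exp(-(snr - 1.0)))

def location_match_prob(estimated_angle_deg: float,
                        expected_angle_deg: float = 0.0,
                        tolerance_deg: float = 30.0) -> float:
    """Stand-in for the second model: Gaussian falloff around the
    expected position of the user relative to the device."""
    d = estimated_angle_deg - expected_angle_deg
    return math.exp(-0.5 * (d / tolerance_deg) ** 2)

def user_speech_likelihood(frame_energy: float,
                           estimated_angle_deg: float) -> float:
    """Fuse both probabilities; a product rule is assumed here."""
    return (speech_presence_prob(frame_energy)
            * location_match_prob(estimated_angle_deg))
```

With this fusion, loud speech arriving from far off the expected user position scores much lower than the same speech from the expected direction.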

Intelligence device and user selection method thereof
11514725 · 2022-11-29

Disclosed are an intelligence device and a method of selecting a user of the intelligence device. According to an embodiment of the disclosure, the intelligence device may analyze the eye blinks and pupil shapes of persons and select the person gazing at the intelligence device as a user. According to an embodiment, the intelligence device may be related to artificial intelligence (AI) modules, robots, augmented reality (AR) devices, virtual reality (VR) devices, and 5G service-related devices.
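The abstract describes scoring each detected person's eye blinks and pupil shape, then selecting the person gazing at the device as the user. A minimal sketch of that selection step; the feature names and scoring rule below are illustrative assumptions, not the patented analysis:

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    blink_rate_hz: float      # frequent blinking suggests the eyes are not fixated
    pupil_offset_deg: float   # angular offset of the pupil from the device camera

def gaze_score(p: Person) -> float:
    """Higher when the pupil points at the device and blinking is infrequent."""
    fixation = max(0.0, 1.0 - p.pupil_offset_deg / 45.0)
    steadiness = 1.0 / (1.0 + p.blink_rate_hz)
    return fixation * steadiness

def select_user(persons: list[Person]) -> Person:
    """Select the person most likely to be gazing at the intelligence device."""
    return max(persons, key=gaze_score)
```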

Whispering voice recovery method, apparatus and device, and readable storage medium

A method, an apparatus and a device for converting whispered speech, and a readable storage medium, are provided. The method is based on a whispered speech conversion model, trained in advance using the recognition results and acoustic features of whispered speech training data as samples, and the acoustic features of normal speech data parallel to that training data as sample labels. At inference, a whispered speech acoustic feature and a preliminary recognition result are acquired from the whispered speech data, then both are input into the pre-trained conversion model to obtain the normal speech acoustic feature it outputs. In this way, whispered speech can be converted to normal speech.
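The inference flow described in the abstract — extract whisper acoustic features, obtain a preliminary recognition result, feed both to the pre-trained conversion model — can be sketched as a pipeline. Every function body below is a placeholder for illustration; the real front end, recognizer, and conversion model are learned from parallel whispered/normal data:

```python
from typing import List

def extract_whisper_features(samples: List[float]) -> List[float]:
    """Placeholder acoustic front end (here, per-sample magnitudes)."""
    return [abs(s) for s in samples]

def preliminary_recognition(features: List[float]) -> str:
    """Placeholder recognizer producing a rough transcript."""
    return "hello" if features else ""

def conversion_model(features: List[float], transcript: str) -> List[float]:
    """Placeholder for the trained whisper-to-normal converter: here it
    just boosts feature magnitudes to mimic restored voicing energy."""
    gain = 2.0 if transcript else 1.0
    return [f * gain for f in features]

def convert_whisper(samples: List[float]) -> List[float]:
    """Full pipeline: features + preliminary recognition -> normal features."""
    feats = extract_whisper_features(samples)
    text = preliminary_recognition(feats)
    return conversion_model(feats, text)
```

The design point the abstract emphasizes is that the preliminary recognition result is a second input to the converter, not just an end product, so linguistic content can condition the acoustic mapping.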

Generating modified digital images utilizing a dispersed multimodal selection model

The present disclosure relates to systems, methods, and non-transitory computer readable media for generating modified digital images based on verbal and/or gesture input by utilizing a natural language processing neural network and one or more computer vision neural networks. The disclosed systems can receive verbal input together with gesture input. The disclosed systems can further utilize a natural language processing neural network to generate a verbal command based on verbal input. The disclosed systems can select a particular computer vision neural network based on the verbal input and/or the gesture input. The disclosed systems can apply the selected computer vision neural network to identify pixels within a digital image that correspond to an object indicated by the verbal input and/or gesture input. Utilizing the identified pixels, the disclosed systems can generate a modified digital image by performing one or more editing actions indicated by the verbal input and/or gesture input.
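The selection step described in the abstract — choosing a particular computer vision neural network based on the verbal and/or gesture input — amounts to a routing decision. A minimal sketch, where the network names and routing rules are assumptions for illustration rather than details from the disclosure:

```python
def select_vision_network(verbal_command: str, has_gesture: bool) -> str:
    """Route to a gesture-guided, referring-expression, or salient-object
    network; the three candidate networks are hypothetical names."""
    if has_gesture:
        return "gesture_guided_segmentation"
    if any(word in verbal_command for word in ("this", "that", "the")):
        return "referring_expression_segmentation"
    return "salient_object_detection"

def edit_image(verbal_command: str, has_gesture: bool) -> dict:
    """Pair the selected network with a (placeholder) editing action
    parsed from the verbal command."""
    network = select_vision_network(verbal_command, has_gesture)
    action = "remove" if "remove" in verbal_command else "adjust"
    return {"network": network, "action": action}
```

The selected network would then identify the pixels of the referenced object, and the editing action would be applied to those pixels.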
