Patent classification: G10L21/10
METHOD AND SYSTEM FOR INSTRUMENT SEPARATING AND REPRODUCING FOR MIXTURE AUDIO SOURCE
A method and a system for separating and reproducing instruments from a mixture audio source are provided. The method and/or the system includes inputting selected music into an instrument separation model to extract features, determining multi-channel audio source signals in which all instruments are separated, each channel containing the sound of one instrument, and transmitting the signals of the different channels to multiple speakers placed at designated positions for playback, which can reproduce or recreate an immersive sound-field listening experience for users.
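The per-instrument channel structure described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the separation model is stubbed out (a real system would infer the stems from learned features), and the speaker layout (instrument names and azimuth angles) is an assumption.

```python
import numpy as np

# Assumed designated speaker positions, keyed by instrument (azimuth, degrees).
SPEAKER_LAYOUT = {"drums": 0, "bass": -30, "guitar": 30, "vocals": 180}

def separate_instruments(mixture, stems):
    """Stand-in for the trained instrument separation model: a real model
    would recover one channel per instrument from the mixture's features.
    Here the reference stems are returned directly to show the channel shape."""
    return {name: stem for name, stem in stems.items()}

def route_to_speakers(channels, layout=SPEAKER_LAYOUT):
    """Map each separated instrument channel to its designated speaker position."""
    return {layout[name]: signal for name, signal in channels.items() if name in layout}

t = np.linspace(0, 1, 8000)
stems = {
    "drums":  np.sin(2 * np.pi * 100 * t),
    "vocals": np.sin(2 * np.pi * 440 * t),
}
mixture = sum(stems.values())
channels = separate_instruments(mixture, stems)
feeds = route_to_speakers(channels)   # one speaker feed per instrument
```

Each entry in `feeds` is one speaker feed; playing them simultaneously from speakers at the mapped positions is what recreates the spatial sound field.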
RECORDING SYSTEM FOR GENERATING A TRANSCRIPT OF A DIALOGUE
A recording system has a listener processor for automatically capturing events involving computer applications during a dialogue involving the user of the computer. The system generates a visual transcript of events on a timeline. It automatically detects the start of a dialogue, then detects events and determines whether they are configured as transcript events, before detecting the end of the dialogue. The system may associate dialogue events with audio clips using meta tags.
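The event flow above can be sketched as a small recorder: detect dialogue start, capture only events configured as transcript events (optionally with an associated audio clip), detect dialogue end, and emit a timeline-ordered transcript. The event type names and fields are illustrative, not from the patent.

```python
from dataclasses import dataclass, field

# Assumed configuration: which event types count as transcript events.
TRANSCRIPT_EVENT_TYPES = {"app_opened", "note_added", "call_transferred"}

@dataclass
class Recorder:
    in_dialogue: bool = False
    events: list = field(default_factory=list)

    def start_dialogue(self):
        self.in_dialogue = True
        self.events.clear()

    def on_event(self, timestamp, event_type, detail, audio_clip=None):
        # Capture only while a dialogue is active, and only configured types.
        if self.in_dialogue and event_type in TRANSCRIPT_EVENT_TYPES:
            self.events.append({"t": timestamp, "type": event_type,
                                "detail": detail, "audio": audio_clip})

    def end_dialogue(self):
        self.in_dialogue = False
        # Sort by timestamp to produce the timeline view.
        return sorted(self.events, key=lambda e: e["t"])

r = Recorder()
r.start_dialogue()
r.on_event(2.0, "note_added", "customer address updated", audio_clip="clip_02.wav")
r.on_event(1.0, "app_opened", "CRM")
r.on_event(1.5, "mouse_moved", "not a configured transcript event")  # filtered out
transcript = r.end_dialogue()
```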
Systems, methods, devices and apparatuses for detecting facial expression
A system, method, and apparatus for detecting facial expressions from EMG (electromyography) signals.
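One simple way such detection can work is nearest-template matching over per-muscle EMG amplitudes. This sketch is purely illustrative: the muscle channel names, the templates, and the matching rule are assumptions, not the patented method.

```python
# Illustrative expression templates: expected EMG amplitude per muscle channel.
TEMPLATES = {
    "smile":   {"zygomaticus": 0.9, "corrugator": 0.1},
    "frown":   {"zygomaticus": 0.1, "corrugator": 0.9},
    "neutral": {"zygomaticus": 0.1, "corrugator": 0.1},
}

def classify_expression(emg):
    """Return the template expression with the smallest squared distance
    to the measured EMG amplitudes."""
    def distance(template):
        return sum((emg[ch] - v) ** 2 for ch, v in template.items())
    return min(TEMPLATES, key=lambda name: distance(TEMPLATES[name]))
```

For example, strong zygomaticus activity with little corrugator activity would classify as a smile.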
Message and user profile indications in speech-based systems
A speech-based system utilizes a speech interface device located in the home of a user. The system may interact with different users based on different user profiles. The system may include messaging services that generate and/or provide messages to the user through the speech interface device. The speech interface device may have indicators that are capable of being illuminated in different colors. To notify a user of the currently active user profile, each user profile is associated with a different color, and the color of the active profile is displayed on the speech interface device when the user is interacting with the system. To notify the user of waiting messages, different types of messages are associated with different colors, and the colors of the waiting messages' types are displayed on the speech interface device whenever the user is not interacting with the system.
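The indication rule above reduces to a simple selection: while interacting, show the active profile's color; otherwise, show the colors of the waiting message types. The profile names, message types, and color assignments below are illustrative placeholders.

```python
# Assumed color assignments (illustrative).
PROFILE_COLORS = {"alice": "blue", "bob": "green"}
MESSAGE_TYPE_COLORS = {"voicemail": "red", "reminder": "yellow"}

def indicator_colors(interacting, active_profile, waiting_message_types):
    """Return the colors the device indicator should display."""
    if interacting:
        # Interaction in progress: show only the active profile's color.
        return [PROFILE_COLORS[active_profile]]
    # Idle: show one color per distinct waiting message type.
    return sorted({MESSAGE_TYPE_COLORS[t] for t in waiting_message_types})
```

A device loop would call this whenever interaction state, active profile, or the message queue changes, and drive the multi-color indicator accordingly.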
Processing speech signals of a user to generate a visual representation of the user
A computing system for generating image data representing a speaker's face includes a detection device configured to route data representing a voice signal to one or more processors, and a data processing device comprising the one or more processors, configured to generate a representation of the speaker who produced the voice signal in response to receiving it. The data processing device executes a voice embedding function to generate, from the voice signal, a feature vector representing one or more signal features of the voice signal; maps a signal feature of the feature vector to a visual feature of the speaker by a modality transfer function specifying a relationship between the visual feature of the speaker and the signal feature of the feature vector; and generates a visual representation of at least a portion of the speaker based on the mapping, the visual representation comprising the visual feature.
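The voice-to-visual pipeline can be sketched in two steps: an embedding function turns the voice signal into a feature vector, and a modality transfer function maps that vector to visual features. Both functions here are toy stand-ins (windowed energy statistics and a random linear map) for the learned functions the abstract describes; the dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def voice_embedding(voice_signal, dim=4):
    """Toy embedding: RMS energy of `dim` consecutive windows of the signal.
    A real system would use a learned speaker embedding."""
    chunks = np.array_split(np.asarray(voice_signal, dtype=float), dim)
    return np.array([np.sqrt(np.mean(c ** 2)) for c in chunks])

# Assumed learned modality transfer matrix: 4 signal features -> 3 visual
# features (e.g. face-shape parameters).
W = rng.standard_normal((3, 4))

def modality_transfer(feature_vector):
    """Linear modality transfer: visual features from signal features."""
    return W @ feature_vector

signal = np.sin(np.linspace(0, 20 * np.pi, 800))
visual_features = modality_transfer(voice_embedding(signal))
```

A rendering stage would then consume `visual_features` to produce the actual face image comprising those visual features.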
APPARATUS, METHOD, AND COMPUTER PROGRAM FOR PROVIDING LIP-SYNC VIDEO AND APPARATUS, METHOD, AND COMPUTER PROGRAM FOR DISPLAYING LIP-SYNC VIDEO
Provided is a lip-sync video providing apparatus for providing a video in which a voice and lip shapes are synchronized. The lip-sync video providing apparatus is configured to obtain a template video including at least one frame and depicting a target object, obtain a target voice to be used as the voice of the target object, generate a lip image corresponding to the voice for each frame of the template video by using a trained first artificial neural network, and generate lip-sync data including frame identification information of a frame of the template video, the lip image, and position information of the lip image within that frame.
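The lip-sync data described above is essentially a per-frame record of (frame id, generated lip image, lip position). The sketch below stubs out the trained neural network and uses illustrative field names; it only shows how the data structure is assembled.

```python
def generate_lip_image(frame, voice_chunk):
    """Stand-in for the trained first artificial neural network, which would
    synthesize a lip image matching the voice chunk for this frame."""
    return f"lip_image_for_frame_{frame['id']}"

def build_lip_sync_data(template_frames, voice_chunks):
    """Assemble one lip-sync record per template frame."""
    records = []
    for frame, chunk in zip(template_frames, voice_chunks):
        records.append({
            "frame_id": frame["id"],               # frame identification info
            "lip_image": generate_lip_image(frame, chunk),
            "position": frame["mouth_box"],        # (x, y, w, h) within the frame
        })
    return records

frames = [{"id": 0, "mouth_box": (40, 60, 16, 8)},
          {"id": 1, "mouth_box": (41, 60, 16, 8)}]
lip_sync_data = build_lip_sync_data(frames, ["voice_chunk_0", "voice_chunk_1"])
```

A display apparatus can then composite each `lip_image` into its frame at `position` to play the synchronized video.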
END-TO-END SPEECH CONVERSION
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for end-to-end speech conversion are disclosed. In one aspect, a method includes the actions of receiving first audio data of a first utterance of one or more first terms spoken by a user. The actions further include providing the first audio data as an input to a model that is configured to receive first given audio data in a first voice and output second given audio data in a synthesized voice without performing speech recognition on the first given audio data. The actions further include receiving second audio data of a second utterance of the one or more first terms spoken in the synthesized voice. The actions further include providing, for output, the second audio data of the second utterance of the one or more first terms spoken in the synthesized voice.
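The key property of the interface above is audio-in, audio-out with no intermediate transcript. The sketch below illustrates only that interface shape: the "model" is a toy resampling transform standing in for a learned audio-to-audio network, and no text is ever produced.

```python
import numpy as np

def convert_voice(first_audio, pitch_ratio=1.5):
    """Toy end-to-end conversion: resample the waveform to shift its apparent
    pitch, mapping audio in one voice directly to audio in another. A real
    system would use a trained audio-to-audio model; crucially, no speech
    recognition (and no text) is involved at any point."""
    audio = np.asarray(first_audio, dtype=float)
    n = len(audio)
    idx = np.clip((np.arange(n) * pitch_ratio).astype(int), 0, n - 1)
    return audio[idx]

first_audio = np.sin(np.linspace(0, 8 * np.pi, 400))   # first utterance, first voice
second_audio = convert_voice(first_audio)               # same terms, synthesized voice
```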