Patent classifications
G10L15/25 — Speech recognition using position of the lips, movement of the lips or face analysis
Method of performing function of electronic device and electronic device using same
An electronic device includes: a camera; a microphone; a display; a memory; and a processor configured to receive an input for activating an intelligent agent service from a user while at least one application is executed, identify context information of the electronic device, control the camera to acquire image information of the user based on the identified context information, detect movement of the user's lips in the acquired image information to recognize the user's speech, and perform a function corresponding to the recognized speech.
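As a sketch of how the context check might gate the camera-based recognition path, consider the Python fragment below; the DeviceContext fields, the noise threshold, and the modality labels are illustrative assumptions, not details from the patent.

    from dataclasses import dataclass

    @dataclass
    class DeviceContext:
        ambient_noise_db: float   # e.g., measured through the microphone
        camera_available: bool    # camera not held by the foreground app

    NOISE_THRESHOLD_DB = 65.0     # hypothetical cutoff for reliable audio ASR

    def choose_input_modality(ctx: DeviceContext) -> str:
        # Mirrors the abstract's flow: once the intelligent agent is
        # activated, context information decides whether speech is
        # recognized from lip movement (camera) or from audio (microphone).
        if ctx.camera_available and ctx.ambient_noise_db >= NOISE_THRESHOLD_DB:
            return "visual"   # recognize speech from lip movement in frames
        return "audio"        # conventional microphone-based recognition

    print(choose_input_modality(DeviceContext(72.0, True)))   # visual
    print(choose_input_modality(DeviceContext(40.0, True)))   # audio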
SPEECH TRANSCRIPTION FROM FACIAL SKIN MOVEMENTS
Systems and methods are disclosed for determining a textual transcription from minute facial skin movements. In one implementation, a system may include at least one coherent light source; at least one sensor configured to receive light reflections from the at least one coherent light source; and a processor configured to control the at least one coherent light source to illuminate a region of a face of a user. The processor may receive, from the at least one sensor, reflection signals indicative of coherent light reflected from the face in a time interval. The reflection signals may be analyzed to determine minute facial skin movements in the time interval. Then, based on the determined minute facial skin movements in the time interval, the processor may determine a sequence of words associated with the minute facial skin movements and output a textual transcription corresponding to the determined sequence of words.
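A minimal Python sketch of the reflection-signal analysis step, assuming the sensor yields a stack of intensity frames; the frame differencing and threshold segmentation are placeholder stand-ins for the patent's movement analysis and word decoding.

    import numpy as np

    def estimate_skin_motion(frames: np.ndarray) -> np.ndarray:
        # Collapse a stack of reflection frames (T, H, W) into a 1-D motion
        # signal: the mean absolute intensity change between consecutive
        # frames, a crude proxy for the minute facial skin movements.
        diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
        return diffs.mean(axis=(1, 2))

    def words_from_motion(motion: np.ndarray, threshold: float = 5.0) -> list:
        # Placeholder decoder: segment the motion signal wherever it rises
        # above the threshold and emit one token per active segment; a real
        # system would classify each segment with a trained model.
        words, in_word = [], False
        for active in motion > threshold:
            if active and not in_word:
                words.append("<word>")
            in_word = bool(active)
        return words

    rng = np.random.default_rng(0)
    frames = rng.normal(100.0, 1.0, (50, 8, 8))
    frames[10:20] += 20.0 * rng.normal(size=(10, 8, 8))  # a burst of skin motion
    print(words_from_motion(estimate_skin_motion(frames)))  # ['<word>']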
Methods and systems for processing audio signals containing speech data
Methods and systems for processing audio signals containing speech data are disclosed. Biometric data associated with at least one speaker are extracted from an audio input. A correspondence is determined between the extracted biometric data and stored biometric data associated with a consenting user profile, where a consenting user profile is a user profile that indicates consent to store biometric data. If no correspondence is determined, the speech data is discarded, optionally after having been processed.
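The consent gate could look roughly like this in Python; the cosine-style match_score, the 0.9 threshold, and the process_once hook are hypothetical stand-ins for the speaker matching and optional transient processing the abstract describes.

    import math
    from dataclasses import dataclass

    @dataclass
    class UserProfile:
        voiceprint: list          # stored biometric template (format assumed)
        consents_to_storage: bool

    def match_score(a, b) -> float:
        # Toy cosine similarity; a real system would compare biometric data
        # with a trained speaker-verification model.
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def process_once(speech_data):
        pass  # transient handling only; nothing is persisted

    def handle_speech(voiceprint, speech_data, profiles, threshold=0.9):
        # Retain speech data only when the extracted biometric data
        # corresponds to a profile that records consent; otherwise process
        # it once (optional, per the abstract) and discard it.
        for p in profiles:
            if p.consents_to_storage and match_score(voiceprint, p.voiceprint) >= threshold:
                return speech_data
        process_once(speech_data)
        return None  # discarded

    alice = UserProfile(voiceprint=[0.9, 0.1, 0.4], consents_to_storage=True)
    print(handle_speech([0.9, 0.11, 0.39], b"audio", [alice]))  # retained
    print(handle_speech([0.1, 0.9, 0.1], b"audio", [alice]))    # None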
Lip-language identification method and apparatus, and augmented reality (AR) device and storage medium which identifies an object based on an azimuth angle associated with the AR field of view
A lip-language identification method and apparatus, an augmented reality device, and a storage medium. The lip-language identification method includes: acquiring a sequence of face images of an object to be identified; performing lip-language identification based on the sequence of face images to determine semantic information of the speech content of the object to be identified corresponding to lip actions in the face images; and outputting the semantic information.
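The azimuth-angle selection called out in the title might be sketched as follows; the 52-degree field of view, the face dictionaries, and the select_target helper are invented for illustration.

    def select_target(faces, gaze_azimuth_deg, fov_deg=52.0):
        # Pick which detected face to lip-read in the AR scene: it must fall
        # inside the device's field of view, and among those the face whose
        # azimuth angle is closest to the wearer's gaze wins. The azimuth
        # test is the aspect the patent title highlights.
        half = fov_deg / 2.0
        in_view = [f for f in faces if abs(f["azimuth_deg"]) <= half]
        if not in_view:
            return None
        return min(in_view, key=lambda f: abs(f["azimuth_deg"] - gaze_azimuth_deg))

    faces = [{"id": "A", "azimuth_deg": -30.0},
             {"id": "B", "azimuth_deg": 10.0},
             {"id": "C", "azimuth_deg": 80.0}]
    print(select_target(faces, gaze_azimuth_deg=5.0))  # face B is selected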
Rescoring Automatic Speech Recognition Hypotheses Using Audio-Visual Matching
A method (400) includes receiving audio data (112) corresponding to an utterance (101) spoken by a user (10), receiving video data (114) representing motion of lips of the user while the user was speaking the utterance, and obtaining multiple candidate transcriptions (135) for the utterance based on the audio data. For each candidate transcription of the multiple candidate transcriptions, the method also includes generating a synthesized speech representation (145) of the corresponding candidate transcription and determining an agreement score (155) indicating a likelihood that the synthesized speech representation matches the motion of the lips of the user while the user speaks the utterance. The method also includes selecting one of the multiple candidate transcriptions for the utterance as a speech recognition output (175) based on the agreement scores determined for the multiple candidate transcriptions for the utterance.
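A toy Python version of the rescoring loop; the injected tts and agreement callables stand in for the real speech synthesizer and audio-visual matching model the abstract describes, and the character-overlap score exists only so the sketch runs.

    def rescore_hypotheses(candidates, lip_motion, tts, agreement):
        # For each ASR candidate: synthesize a speech representation of its
        # text, score how well that synthesis agrees with the observed lip
        # motion, and return the candidate with the highest agreement score.
        scored = []
        for text in candidates:
            synth = tts(text)                      # synthesized speech
            scored.append((agreement(synth, lip_motion), text))
        return max(scored)[1]                      # speech recognition output

    # Stand-ins: "synthesis" is the text itself and the agreement score is
    # character overlap with a fake lip reading.
    candidates = ["recognize speech", "wreck a nice beach"]
    pick = rescore_hypotheses(
        candidates,
        "recognize speech",
        tts=lambda t: t,
        agreement=lambda s, v: sum(a == b for a, b in zip(s, v)),
    )
    print(pick)  # -> "recognize speech"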
CURIOSITY BASED ACTIVATION AND SEARCH DEPTH
Embodiments of the present invention determine a curiosity of a user based on data received from an electronic device associated with the user, where the data includes audible speech captured from the user and one or more facial expressions of the user. Embodiments of the present invention identify a first wavelength for audible speech from the user to initiate a command detection mode based on a plurality of wavelengths associated with a user profile for the user. Embodiments of the present invention identify a topic for the audible speech from the user and, responsive to determining that an intelligent virtual assistant is an intended recipient based on the topic, suspend an activation word for the intelligent virtual assistant.
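One way to read the abstract's flow as code, assuming wavelength matching against the user profile gates command detection; the tolerance, topic set, and numeric values are purely illustrative.

    def should_suspend_wake_word(speech_wavelengths, profile_wavelengths,
                                 topic, assistant_topics, tolerance=5.0):
        # Enter command-detection mode when a detected speech wavelength
        # matches one stored in the user's profile, then suspend the
        # activation word if the topic implies the assistant is the
        # intended recipient, per the abstract.
        detected = any(
            abs(w - p) <= tolerance
            for w in speech_wavelengths for p in profile_wavelengths
        )
        return detected and topic in assistant_topics

    print(should_suspend_wake_word(
        speech_wavelengths=[210.0],          # units illustrative only
        profile_wavelengths=[208.0, 340.0],
        topic="weather",
        assistant_topics={"weather", "timers", "music"},
    ))  # -> True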