G10L15/25

Lip language recognition method and mobile terminal using sound and silent modes

A lip language recognition method, applied to a mobile terminal having a sound mode and a silent mode, includes: training a deep neural network in the sound mode; collecting a user's lip images in the silent mode; and identifying content corresponding to the user's lip images with the deep neural network trained in the sound mode. The method further includes: switching from the sound mode to the silent mode when a privacy need of the user arises.

WEARABLE APPARATUS AND METHODS FOR APPROVING TRANSCRIPTION AND/OR SUMMARY

System and methods for processing audio signals are disclosed. In one implementation, a system may include a wearable device. The wearable device may include a microphone having an audio sensor configured to capture the audio signals from an environment of the user; and a processor. The processor may be programmed to receive the audio signals captured by the microphone; analyze the audio signals to generate a transcription; generate a summary of the transcription; cause the summary to be displayed to the user; receive a confirmation input from the user indicating that the displayed summary is correct; and cause the displayed summary to be stored.
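The processor steps in this abstract (transcribe, summarize, display, confirm, store) can be sketched as a short pipeline. This is only an illustration under stated assumptions: `transcribe`, `summarize`, `show`, and `confirm` are hypothetical callables standing in for components the abstract does not specify, and a plain list stands in for storage.

```python
from typing import Callable, List


def approve_and_store(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # hypothetical speech-to-text step
    summarize: Callable[[str], str],      # hypothetical summarization step
    show: Callable[[str], None],          # hypothetical display to the user
    confirm: Callable[[], bool],          # hypothetical confirmation input
    storage: List[str],                   # stand-in for persistent storage
) -> bool:
    """Store the summary only after the user confirms it is correct."""
    transcription = transcribe(audio)
    summary = summarize(transcription)
    show(summary)
    if confirm():
        storage.append(summary)
        return True
    return False
```

In this sketch a summary that the user does not confirm is simply not stored; the abstract does not say what happens in that case.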

System and Method for Automated Digital Twin Behavior Modeling for Multimodal Conversations
20230099393 · 2023-03-30

Methods and systems for a multimodal conversational system are described. A method for interactive multimodal conversation includes parsing multimodal conversation from a physical human for content, recognizing and sensing one or more multimodal content from the parsed content, identifying verbal and non-verbal behavior of the physical human from the one or more multimodal content, generating learned patterns from the identified verbal and non-verbal behavior of the physical human, training a multimodal dialog manager with and using the learned patterns to provide responses to end-user multimodal conversations and queries, and training a virtual human clone of the physical human with interactive verbal and non-verbal behaviors of the physical human, wherein appropriate interactive verbal and non-verbal behaviors are provided by the virtual human clone when providing the responses to the end-user multimodal conversations and queries.

Portable terminal device and information processing system
11487502 · 2022-11-01

A portable terminal device in an information processing system and method includes a camera and a microphone. Data of obtained images and voice are transmitted to a server that identifies operations to be executed based on the received voice and image data. The server transmits an identification of one or more results of the plurality of operations to the portable terminal device. When the portable terminal device receives only one result from the server, an operation corresponding to the one result is executed, and when a plurality of results is received, the portable terminal device displays information corresponding to the plurality of results as candidates. Additional voice is captured for selecting one of the plurality of results during the displaying of the information. A determination of one result from the plurality of results is made based on the captured voice, and an operation corresponding to the determined result is executed.
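The device-side branching described here (execute a single result directly, otherwise show candidates and pick one from additional voice input) might look like the following minimal sketch. The function name, the `capture_selection` callback, and the rule of matching the spoken phrase against candidate names are all assumptions for illustration; the abstract does not say how the captured voice is matched to a candidate.

```python
from typing import Callable, List, Optional


def resolve_operation(
    results: List[str],
    capture_selection: Callable[[], str],  # hypothetical: returns recognized text
    display: Callable[[str], None] = print,
) -> Optional[str]:
    """Pick the operation to execute from the results returned by the server."""
    if not results:
        return None
    if len(results) == 1:
        # Only one result: execute it without asking the user.
        return results[0]
    # Several results: display them as candidates, then capture more voice.
    for candidate in results:
        display(candidate)
    spoken = capture_selection().strip().lower()
    for candidate in results:
        # Assumed matching rule: case-insensitive exact match on the name.
        if candidate.lower() == spoken:
            return candidate
    return None
```

Returning `None` when nothing matches is a design choice of this sketch, leaving it to the caller to prompt again or give up.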

SPEECH INTERPRETATION BASED ON ENVIRONMENTAL CONTEXT
20230035941 · 2023-02-02

Systems and processes for speech interpretation based on environmental context are provided. For example, a user gaze direction is detected, and a speech input is received from a first user of the electronic device. In accordance with a determination that the user gaze is directed at a digital assistant object, the speech input is processed by the digital assistant. In accordance with a determination that the user gaze is not directed at a digital assistant object, contextual information associated with the electronic device is obtained, wherein the contextual information includes speech from a second user. Determination is made whether the speech input is directed to a digital assistant of the electronic device. In accordance with a determination that the speech input is directed to a digital assistant of the electronic device, the speech input is processed by the digital assistant.
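The decision flow in this abstract can be sketched as a two-stage check: gaze first, then context. The names below are hypothetical, and `is_directed_at_assistant` stands in for whatever classifier the system actually uses, which the abstract does not describe.

```python
from typing import Callable, List


def should_assistant_process(
    gaze_on_assistant: bool,
    speech_input: str,
    get_context_speech: Callable[[], List[str]],  # hypothetical: second user's speech
    is_directed_at_assistant: Callable[[str, List[str]], bool],  # hypothetical classifier
) -> bool:
    """Decide whether the digital assistant should process the speech input."""
    if gaze_on_assistant:
        # Gaze at the assistant object is sufficient on its own.
        return True
    # Otherwise obtain contextual information, including another user's speech,
    # and decide from that whether the input was meant for the assistant.
    context_speech = get_context_speech()
    return is_directed_at_assistant(speech_input, context_speech)
```

Passing the context lookup as a callable means it is only fetched on the branch that needs it, mirroring the order of determinations in the abstract.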

End-to-end multi-speaker audio-visual automatic speech recognition
11615781 · 2023-03-28

A single audio-visual automated speech recognition model for transcribing speech from audio-visual data includes an encoder frontend and a decoder. The encoder includes an attention mechanism configured to receive an audio track of the audio-visual data and a video portion of the audio-visual data. The video portion of the audio-visual data includes a plurality of video face tracks each associated with a face of a respective person. For each video face track of the plurality of video face tracks, the attention mechanism is configured to determine a confidence score indicating a likelihood that the face of the respective person associated with the video face track includes a speaking face of the audio track. The decoder is configured to process the audio track and the video face track of the plurality of video face tracks associated with the highest confidence score to determine a speech recognition result of the audio track.
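The selection step, scoring each face track against the audio and keeping the track with the highest confidence, can be illustrated with a toy example. This sketch assumes dot-product scores normalized by a softmax over fixed-size feature vectors; the abstract does not state which attention formulation the model uses, and the function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def select_speaking_face(
    audio_query: Sequence[float],
    face_tracks: Sequence[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return the index of the most likely speaking face and all confidences."""
    # Assumed scoring: dot product between audio and face-track features.
    scores = [sum(a * v for a, v in zip(audio_query, track)) for track in face_tracks]
    # Softmax turns the scores into confidences that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    confidences = [e / total for e in exps]
    best = max(range(len(confidences)), key=confidences.__getitem__)
    return best, confidences


audio = [1.0, 0.0]
tracks = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.3]]
best_index, confidences = select_speaking_face(audio, tracks)
```

Here the second track (index 1) has the largest dot product with the audio vector, so it would be the one passed to the decoder together with the audio track.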