Patent classifications
G10L15/25 — Speech recognition using non-acoustical features, e.g. position of the lips, movement of the lips or face analysis
VIDEO IMAGE COMPOSITION METHOD AND ELECTRONIC DEVICE
The present disclosure provides a video image composition method including the following steps. A priority level list is obtained, which includes multiple priority levels for multiple person identities. Multiple video streams are received. Identity labels corresponding to human face frame images from the video streams are determined. Display levels of the human face frame images are determined according to the identity labels and the priority level list. The human face frame images that are in a speaking state are detected. According to the display levels, at least one of the detected speaking face frame images is composed into the main display area of a video image.
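The selection logic the abstract describes (priority lookup, speaking-status filter, then display-level ranking) can be sketched in a few lines. This is a minimal illustration, not the patented implementation; the data structures and the convention that a lower number means a higher priority are assumptions.

```python
from dataclasses import dataclass

@dataclass
class FaceFrame:
    identity: str       # identity label determined for the face frame image
    is_speaking: bool   # detected speaking status

def compose_main_display(face_frames, priority_levels):
    """Pick the speaking face(s) with the best display level.

    priority_levels maps a person identity to a priority level;
    in this sketch a lower number means a higher priority.
    """
    speaking = [f for f in face_frames if f.is_speaking]
    if not speaking:
        return []
    # Display level of each face frame follows its identity's priority.
    best = min(priority_levels.get(f.identity, float("inf")) for f in speaking)
    # All speaking faces sharing the top display level go to the main area.
    return [f for f in speaking
            if priority_levels.get(f.identity, float("inf")) == best]

if __name__ == "__main__":
    frames = [FaceFrame("ceo", True), FaceFrame("guest", True),
              FaceFrame("analyst", False)]
    print(compose_main_display(frames, {"ceo": 0, "guest": 2}))
```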
ELECTRONIC DEVICE AND METHOD FOR PROCESSING SPEECH BY CLASSIFYING SPEECH TARGET
Various embodiments of the disclosure provide an electronic device and a method. The device includes multiple cameras arranged at different positions, multiple microphones arranged at different positions, a memory, and a processor operatively connected to at least one of the multiple cameras, the multiple microphones, and the memory. The processor is configured to: determine, using at least one of the multiple cameras, whether at least one of a user wearing the electronic device or a counterpart having a conversation with the user makes an utterance; configure the directivity of at least one of the multiple microphones based on the determination; obtain audio from at least one of the multiple microphones based on the configured directivity; obtain an image including a mouth shape of the user or the counterpart from at least one of the multiple cameras; and process speech of the utterance target in a different manner based on the obtained audio and image.
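The abstract does not disclose how directivity is configured, so the sketch below substitutes a plain delay-and-sum beamformer steered toward whichever talker the cameras detected; the array geometry and the two look directions are invented for illustration.

```python
import numpy as np

def steer_weights(mic_positions, direction, speed_of_sound=343.0, freq=1000.0):
    """Delay-and-sum steering weights for a mic array toward a unit direction."""
    delays = mic_positions @ direction / speed_of_sound  # seconds per mic
    return np.exp(-2j * np.pi * freq * delays)           # phase-align the mics

def configure_directivity(user_speaking: bool, mic_positions: np.ndarray):
    # Hypothetical geometry: the wearer's mouth is below the device,
    # the counterpart is straight ahead.
    direction = (np.array([0.0, 0.0, -1.0]) if user_speaking
                 else np.array([1.0, 0.0, 0.0]))
    return steer_weights(mic_positions, direction)

if __name__ == "__main__":
    mics = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.0, 0.05, 0.0]])
    print(configure_directivity(user_speaking=True, mic_positions=mics))
```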
System and method to determine outcome probability of an event based on videos
A system and method for determining an outcome probability of an event based on videos are disclosed. The method includes receiving videos of an event, creating a building block model, extracting audio content or video content from the videos, analysing the extracted content, generating an analysis result, analysing the engagement between a speaker and participants of the event, generating a data lake comprising a keyword library, computing the outcome probability of the event, enabling the building block model to learn from the data lake and the computed outcome probability, and representing the outcome probability in a pre-defined format.
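How the outcome probability might combine the keyword library with the engagement analysis is left open by the abstract; the sketch below is a hypothetical scoring step with invented weights, not the patented building block model or its learning loop.

```python
def outcome_probability(transcript_tokens, keyword_library, engagement_score,
                        keyword_weight=0.6, engagement_weight=0.4):
    """Blend keyword coverage with speaker/participant engagement into [0, 1].

    engagement_score is assumed to already be normalized to [0, 1].
    """
    hits = sum(1 for tok in transcript_tokens if tok.lower() in keyword_library)
    coverage = hits / max(len(keyword_library), 1)
    return keyword_weight * min(coverage, 1.0) + engagement_weight * engagement_score

print(outcome_probability(
    ["pricing", "demo", "timeline"],
    {"pricing", "timeline", "contract", "budget"},
    engagement_score=0.8,
))
```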
END-TO-END MULTI-SPEAKER AUDIO-VISUAL AUTOMATIC SPEECH RECOGNITION
An audio-visual automated speech recognition model for transcribing speech from audio-visual data includes an encoder frontend and a decoder. The encoder includes an attention mechanism configured to receive an audio track of the audio-visual data and a video portion of the audio-visual data. The video portion includes a plurality of video face tracks, each associated with the face of a respective person. For each video face track, the attention mechanism is configured to determine a confidence score indicating a likelihood that the face associated with that track is the speaking face for the audio track. The decoder is configured to process the audio track together with the video face track having the highest confidence score to determine a speech recognition result for the audio track.
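The per-track confidence scoring reduces, at its core, to attention between an audio representation and each face-track representation, with the best-scoring track handed to the decoder. A toy sketch assuming precomputed embeddings; the actual model is trained end to end.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_speaking_track(audio_emb: np.ndarray, track_embs: np.ndarray):
    """Score each face track against the audio and return the best track index.

    track_embs has shape (num_tracks, dim); the scores play the role of the
    attention mechanism's per-track confidence that the face is speaking.
    """
    scores = softmax(track_embs @ audio_emb)  # dot-product attention logits
    return int(np.argmax(scores)), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=16)
    tracks = rng.normal(size=(3, 16))
    best, conf = select_speaking_track(audio, tracks)
    print(best, conf)
```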
Apparatuses and methods for selectively inserting text into a video resume
Aspects relate to apparatuses and methods for selectively inserting text into a video resume. An exemplary apparatus includes a processor and a memory communicatively connected to the processor, the memory containing instructions configuring the processor to receive a video resume from a user, divide the video resume into temporal sections, acquire a plurality of textual inputs from the user, wherein the plurality of textual inputs pertains to the same user as the received video resume, classify the plurality of textual inputs to corresponding temporal sections of the received video resume, and display, as a function of the classification, the received video resume with the corresponding plurality of textual inputs.
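The classifier that routes textual inputs to temporal sections is not specified in the abstract; a minimal stand-in is bag-of-words overlap between each input and each section's transcript, as sketched below with invented section names.

```python
def classify_inputs(sections, textual_inputs):
    """sections: {section_id: transcript text}.

    Returns {section_id: [textual inputs assigned to that section]}.
    """
    assignments = {sid: [] for sid in sections}
    section_words = {sid: set(text.lower().split())
                     for sid, text in sections.items()}
    for inp in textual_inputs:
        words = set(inp.lower().split())
        # Assign each input to the section whose transcript it overlaps most.
        best = max(sections, key=lambda sid: len(words & section_words[sid]))
        assignments[best].append(inp)
    return assignments

print(classify_inputs(
    {"intro": "my name and background", "skills": "python and data analysis"},
    ["Ten years of data analysis", "Background in finance"],
))
```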
Digital assistant and a corresponding method for voice-based interactive communication based on detected user gaze indicating attention
A method for voice-based interactive communication using a digital assistant, wherein the method comprises: an attention detection step, in which the digital assistant detects user attention and, as a result, is set into a listening mode; a speaker detection step, in which the digital assistant detects the user as the current speaker; a speech sound detection step, in which the digital assistant detects and records speech uttered by the current speaker, and which further comprises a lip movement detection step, in which the digital assistant detects a lip movement of the current speaker; a speech analysis step, in which the digital assistant parses said recorded speech and extracts speech-based verbal informational content from it; and a subsequent response step, in which the digital assistant provides feedback to the user based on said recorded speech.
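These steps chain naturally into a small pipeline. The sketch below shows only the control flow; every detector function is a hypothetical stub standing in for the gaze, lip-movement, and speech models the method presumes.

```python
def run_assistant(detect_attention, detect_speaker, record_speech,
                  detect_lip_movement, parse_speech, respond):
    if not detect_attention():            # gaze indicates attention -> listening mode
        return None
    speaker = detect_speaker()            # identify the current speaker
    audio = record_speech(speaker)        # detect and record the utterance
    if not detect_lip_movement(speaker):  # gate out speech not addressed here
        return None
    content = parse_speech(audio)         # extract verbal informational content
    return respond(content)               # feedback based on the recorded speech

# Toy wiring with constant stubs:
print(run_assistant(lambda: True, lambda: "user", lambda s: "audio",
                    lambda s: True, lambda a: "turn on the light",
                    lambda c: f"OK: {c}"))
```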
Systems and methods for machine-generated avatars
Systems and methods are disclosed for creating a machine-generated avatar. A machine-generated avatar is an avatar generated by processing video and audio information extracted from a recording of a human speaking a reading corpus, enabling the created avatar to say an unlimited number of utterances, i.e., utterances that were not recorded. The video and audio processing uses machine learning algorithms that may create predictive models based upon pixel, semantic, phonetic, intonation, and wavelet features.
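One ingredient of saying unrecorded utterances is mapping the phonemes of a new utterance onto visual units captured during the recording. The phoneme-to-viseme table below is a small invented subset for illustration, not the patent's predictive models.

```python
# Invented subset of a phoneme-to-viseme mapping (ARPAbet-style symbols).
PHONEME_TO_VISEME = {
    "AA": "open", "IY": "smile", "UW": "round",
    "M": "closed", "B": "closed", "P": "closed", "F": "lip-teeth",
}

def visemes_for(phonemes):
    """Return the viseme sequence to render for an arbitrary phoneme string."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

# Visemes for a word that was never recorded:
print(visemes_for(["M", "AA", "P", "IY"]))
```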
Method of performing function of electronic device and electronic device using same
An electronic device includes a camera, a microphone, a display, a memory, and a processor configured to: receive an input for activating an intelligent agent service from a user while at least one application is executed; identify context information of the electronic device; acquire image information of the user through the camera based on the identified context information; detect movement of the user's lips included in the acquired image information to recognize a speech of the user; and perform a function corresponding to the recognized speech.
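Lip-movement detection of this kind can be approximated from mouth landmarks: if the mouth-open ratio varies across frames beyond a threshold, the lips are moving. The landmark format and the threshold below are assumptions, with landmarks presumed to come from any off-the-shelf face-landmark detector.

```python
def mouth_open_ratio(upper_lip_y, lower_lip_y, mouth_width):
    # Vertical lip gap normalized by mouth width, so the ratio is
    # roughly invariant to how far the face is from the camera.
    return abs(lower_lip_y - upper_lip_y) / max(mouth_width, 1e-6)

def lips_moving(frames, threshold=0.05):
    """frames: list of (upper_lip_y, lower_lip_y, mouth_width) per video frame.

    Lips count as moving if the open ratio varies beyond the threshold.
    """
    ratios = [mouth_open_ratio(*f) for f in frames]
    return (max(ratios) - min(ratios)) > threshold

print(lips_moving([(10, 12, 40), (10, 18, 40), (10, 13, 40)]))  # True
```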