Patent classifications
H04N21/2368
METHOD AND DEVICE FOR GENERATING SPEECH VIDEO ON BASIS OF MACHINE LEARNING
A device for generating a speech video may include a first encoder to receive a person background image corresponding to a video part of a speech video of a person and extract an image feature vector from the person background image, a second encoder to receive a speech audio signal corresponding to an audio part of the speech video and extract a voice feature vector from the speech audio signal, a combiner to generate a combined vector by combining the image feature vector output from the first encoder and the voice feature vector output from the second encoder, and a decoder to reconstruct the speech video of the person using the combined vector as an input. The person background image input to the first encoder includes a face and an upper body of the person, with a portion related to speech of the person covered with a mask.
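The encoder-combiner-decoder pipeline described above can be sketched minimally. This is an illustrative stand-in, not the patented model: the feature dimensions, the random-projection "encoders", and the masking of the mouth region are all assumptions for demonstration; the actual device would use trained neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not specified in the abstract).
IMG_DIM, VOICE_DIM = 128, 64

def first_encoder(person_background_image: np.ndarray) -> np.ndarray:
    """Stand-in for the image encoder: flatten the image and project it
    to an image feature vector."""
    w = rng.standard_normal((person_background_image.size, IMG_DIM))
    return person_background_image.ravel() @ w

def second_encoder(speech_audio: np.ndarray) -> np.ndarray:
    """Stand-in for the audio encoder: project the waveform to a voice
    feature vector."""
    w = rng.standard_normal((speech_audio.size, VOICE_DIM))
    return speech_audio @ w

def combiner(image_feat: np.ndarray, voice_feat: np.ndarray) -> np.ndarray:
    """Combine the two feature vectors; concatenation is one simple choice."""
    return np.concatenate([image_feat, voice_feat])

# Person background image with the speech-related portion covered by a mask
# (here: lower half zeroed out, purely illustrative), plus a short audio clip.
image = rng.standard_normal((16, 16))
image[8:, :] = 0.0
audio = rng.standard_normal(256)

combined = combiner(first_encoder(image), second_encoder(audio))
# The decoder (omitted) would reconstruct the speech video from `combined`.
print(combined.shape)
```

The combined vector then feeds the decoder, which in the abstract reconstructs the full speech video, including the masked speech region, conditioned on the voice features.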
Overlay processing method in 360 video system, and device thereof
A 360 image data processing method performed by a 360 video receiving device, according to the present invention, comprises the steps of: receiving 360 image data; acquiring information and metadata on an encoded picture from the 360 image data; decoding the picture on the basis of the information on the encoded picture; and rendering the decoded picture and an overlay on the basis of the metadata, wherein the metadata includes overlay-related metadata, the overlay is rendered on the basis of the overlay-related metadata, and the overlay-related metadata includes information on a region of the overlay.
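The final rendering step can be illustrated with a toy compositor. The metadata layout below (`region` with `x`/`y` offsets) is a hypothetical simplification; the actual overlay-related metadata in the specification carries considerably more fields (timing, depth, opacity, and so on).

```python
import numpy as np

def render_with_overlay(decoded_picture: np.ndarray,
                        overlay: np.ndarray,
                        overlay_metadata: dict) -> np.ndarray:
    """Composite an overlay onto a decoded picture at the region
    given by the overlay-related metadata."""
    x = overlay_metadata["region"]["x"]
    y = overlay_metadata["region"]["y"]
    h, w = overlay.shape[:2]
    out = decoded_picture.copy()
    out[y:y + h, x:x + w] = overlay   # place overlay inside its region
    return out

picture = np.zeros((8, 8), dtype=np.uint8)      # decoded 360 picture (toy size)
logo = np.full((2, 3), 255, dtype=np.uint8)     # overlay content
meta = {"region": {"x": 1, "y": 2}}             # hypothetical metadata layout
rendered = render_with_overlay(picture, logo, meta)
```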
CONTENT ACCESS DEVICES THAT USE LOCAL AUDIO TRANSLATION FOR CONTENT PRESENTATION
A content access device uses local audio translation for content presentation. The content access device receives video and first audio data associated with a first language. The content access device uses translation software and/or other automated translation services to translate the first audio data to second audio data associated with a second language. The content access device synchronizes the video with the second audio data and outputs the video and the second audio data for presentation. The first audio data may be audio, text, and so on. The second audio data may be output as audio, text, and so on.
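The translate-then-synchronize flow can be sketched as follows. The phrasebook translator is a deliberate stub standing in for the translation software or automated service the abstract mentions; the segment structure and field names are assumptions. Synchronization here simply preserves each segment's original presentation time.

```python
from dataclasses import dataclass

@dataclass
class AudioSegment:
    start_ms: int   # presentation time in the video timeline
    text: str       # audio data represented as text for this sketch

# Stub translator; a real content access device would invoke translation
# software and/or an automated translation service.
PHRASEBOOK = {"hello": "hola", "goodbye": "adios"}

def translate_segment(seg: AudioSegment) -> AudioSegment:
    """Translate first-language audio data to second-language audio data,
    keeping the original start time so the result stays in sync with video."""
    translated = " ".join(PHRASEBOOK.get(w, w) for w in seg.text.split())
    return AudioSegment(start_ms=seg.start_ms, text=translated)

first_audio = [AudioSegment(0, "hello"), AudioSegment(1500, "goodbye")]
second_audio = [translate_segment(s) for s in first_audio]
```

Because each translated segment inherits its source segment's timestamp, outputting `second_audio` alongside the unmodified video keeps the two presentations aligned.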
Method and apparatus for selection of content from a stream of data
A main stream contains successive content elements of video and/or audio information that encode video and/or audio information at a first data rate. A computation circuit (144) computes main fingerprints from the successive content elements. A reference stream is received having a second data rate lower than the first data rate. The reference stream defines a sequence of reference fingerprints. A comparator unit (144) compares the main fingerprints with the reference fingerprints. The main stream is monitored for the presence of inserted content elements between original content elements, where the original content elements have main fingerprints that match successive reference fingerprints and the inserted content elements have main fingerprints that do not match reference fingerprints. Rendering of the inserted content elements may then be skipped. In one embodiment, when more than one content element matches, only one is rendered. In another embodiment, matching is used to control zapping to or from the main stream. In another embodiment, matching is used to control linking of separately received mark-up information, such as subtitles, to points in the main stream.
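The insert-detection step can be sketched with a cryptographic hash standing in for the fingerprint function. This is an assumption: real audio/video fingerprints are perceptual and robust to re-encoding, whereas a hash only matches bit-identical elements; the matching logic, however, is the same.

```python
import hashlib

def fingerprint(element: bytes) -> str:
    """Main fingerprint of one content element (a hash is a crude stand-in
    for a perceptual fingerprint)."""
    return hashlib.sha256(element).hexdigest()[:16]

def mark_elements(main_stream, reference_fingerprints):
    """Flag each main-stream element as original (its fingerprint matches a
    reference fingerprint) or inserted (no match), so that rendering of the
    inserted elements can be skipped."""
    ref = set(reference_fingerprints)
    return [(elem, fingerprint(elem) in ref) for elem in main_stream]

original = [b"scene-1", b"scene-2", b"scene-3"]
reference_fps = [fingerprint(e) for e in original]   # low-rate reference stream

# Main stream with one inserted element between the original ones.
main = [b"scene-1", b"advert", b"scene-2", b"scene-3"]
flags = mark_elements(main, reference_fps)
to_render = [e for e, is_original in flags if is_original]
```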
Method and apparatus for efficient delivery and usage of audio messages for high quality of experience
A method and a system for a virtual reality, augmented reality, mixed reality, or 360-degree video environment are disclosed. The system receives Video Streams associated with the audio and video scenes to be reproduced and Audio Streams associated with those scenes. There are provided a Video decoder, which decodes the signal from the Video Stream for the representation of the audio and video scene; an Audio decoder, which decodes the signal from the Audio Stream for the representation of the audio and video scene to the user; and a region of interest processor, which decides, based e.g. on the user's viewport, head orientation, movement data, or metadata, whether an Audio information message is to be reproduced. Upon that decision, the reproduction of the Audio information message is caused.
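One plausible region-of-interest decision rule is sketched below. Everything here is an assumption for illustration: the abstract does not specify how viewport data maps to the decision, so this uses a simple angular-distance threshold between the viewport center and the region of interest.

```python
def should_reproduce_audio_message(viewport_center, roi_center,
                                   threshold_deg=30.0):
    """Decide whether an Audio information message is to be reproduced,
    based on how far the user's viewport is from the region of interest.
    Angles are (yaw, pitch) in degrees; the rule cues the user toward the
    ROI when they are looking away from it."""
    yaw_v, pitch_v = viewport_center
    yaw_r, pitch_r = roi_center
    distance = max(abs(yaw_v - yaw_r), abs(pitch_v - pitch_r))
    return distance > threshold_deg

# Looking far from the ROI: reproduce the message to guide the user.
print(should_reproduce_audio_message((0.0, 0.0), (90.0, 10.0)))
# Already looking at the ROI: no message needed.
print(should_reproduce_audio_message((85.0, 5.0), (90.0, 10.0)))
```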
System for jitter recovery from a transcoder
A system for transcoding a digital video stream using a transcoder includes receiving a digital video stream that includes an input video stream, and extracting a first set of presentation time stamps from the input video stream, which are stored in a table. The system transcodes the input video stream, including the first set of presentation time stamps, from an initial set of characteristics to a modified set of characteristics including a second set of presentation time stamps that are different from the first set. The system processes the second set of presentation time stamps of the transcoded video stream to determine whether they include jitter and, upon determining that the second set of presentation time stamps includes jitter, modifies the second set of presentation time stamps based upon the first set of presentation time stamps stored in the table.
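The recovery step can be sketched as follows. The constant-offset model and the jitter threshold are assumptions for illustration; the abstract only states that jittered output time stamps are corrected using the input time stamps stored in the table.

```python
def recover_pts(original_pts, transcoded_pts, max_jitter=5):
    """Detect jitter in the transcoder's output presentation time stamps
    and, where found, rebuild them from the table of input-stream PTS."""
    # Assume the transcoder applies a uniform offset to all time stamps;
    # estimate it from the first sample (an illustrative simplification).
    offset = transcoded_pts[0] - original_pts[0]
    corrected = []
    for orig, trans in zip(original_pts, transcoded_pts):
        expected = orig + offset
        if abs(trans - expected) > max_jitter:   # jitter detected
            corrected.append(expected)           # restore from stored table
        else:
            corrected.append(trans)              # within tolerance, keep
    return corrected

table = [0, 3000, 6000, 9000]      # first set of PTS, stored on input
out = [1000, 4000, 7350, 10000]    # second set, jitter on the third stamp
print(recover_pts(table, out))     # [1000, 4000, 7000, 10000]
```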
Playback Modification Based On Proximity
Techniques described herein may involve modification of playback based on the proximity of a user to a playback device. An example technique involves a device determining that a listener is within a given proximity of a first playback device and, based on that determination, causing the first playback device to begin playback of first media and causing a second playback device to modify playback of second media.
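The trigger logic can be sketched minimally. The device names, the distance threshold, and the specific modification applied to the second device (here a generic "modify_playback" action) are all assumptions; the abstract leaves the form of the modification open.

```python
def on_listener_proximity(distance_m, threshold_m,
                          first_device, second_device):
    """When the listener comes within the given proximity of the first
    playback device, begin playback of first media there and modify
    playback of second media on the second device."""
    actions = []
    if distance_m <= threshold_m:
        actions.append((first_device, "begin_playback"))
        actions.append((second_device, "modify_playback"))
    return actions

# Listener 1.2 m from the kitchen speaker, within a 2 m threshold.
print(on_listener_proximity(1.2, 2.0, "kitchen", "living-room"))
# Listener out of range: no playback changes.
print(on_listener_proximity(5.0, 2.0, "kitchen", "living-room"))
```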