G10L21/055

Presentation of communications
11482240 · 2022-10-25

A method to present communications is provided. The method may include obtaining, at a device, a request from a user to play back a stored message that includes audio. In response to obtaining the request, the method may include directing the audio of the message to a transcription system from the device. In these and other embodiments, the transcription system may be configured to generate text that is a transcription of the audio in real-time. The method may further include obtaining, at the device, the text from the transcription system and presenting, by the device, the text generated by the transcription system in real-time. In response to obtaining the text from the transcription system, the method may also include presenting, by the device, the audio such that the text as presented is substantially aligned with the audio.
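As a rough illustration of keeping the presented text substantially aligned with the audio, the sketch below assumes the transcription system returns per-word timestamps; the `Word` structure and the `lead` margin are illustrative, not from the patent:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds into the audio where the word begins
    end: float    # seconds into the audio where the word ends

def words_to_present(words, playback_pos, lead=0.2):
    """Return the transcript words whose audio has begun playing.

    `lead` lets a word appear slightly before its audio, so the
    presented text stays substantially aligned with the audio.
    """
    return [w.text for w in words if w.start <= playback_pos + lead]

# Hypothetical per-word timestamps for a stored voicemail.
transcript = [
    Word("please", 0.00, 0.30),
    Word("call", 0.35, 0.60),
    Word("me", 0.65, 0.80),
    Word("back", 0.85, 1.20),
]
```

At playback position 0.4 s, only the words whose audio has started (within the lead margin) would be shown.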

Network Microphone Devices with Automatic Do Not Disturb Actuation Capabilities
20230122316 · 2023-04-20

Embodiments disclosed herein include a networked microphone device (NMD) determining whether a Do Not Disturb (DND) feature should be activated and, in response to determining that the DND feature should be activated, activating the DND feature. In some embodiments, the NMD determines whether to activate the DND feature based on various configuration and operational states. And in some embodiments, activating the DND feature includes activating the DND feature at one or more additional NMDs based on the configuration and operational states of the NMD and the one or more additional NMDs.
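One way to read this: the NMD applies a decision rule over its own state, and activation propagates to additional NMDs that share a configuration such as a playback group. The decision rule and field names below are hypothetical, a minimal sketch only:

```python
def should_activate_dnd(nmd):
    """Hypothetical rule: auto-activate DND while this NMD is in
    active playback and DND is not already on."""
    return nmd["now_playing"] and not nmd["dnd"]

def activate_dnd(nmd, all_nmds):
    """Activate DND on this NMD, and also on additional NMDs whose
    configuration (here, the playback group) matches."""
    nmd["dnd"] = True
    for other in all_nmds:
        if other is not nmd and other["group"] == nmd["group"]:
            other["dnd"] = True

# Illustrative household: kitchen and den are grouped together.
kitchen = {"now_playing": True, "dnd": False, "group": "A"}
den = {"now_playing": False, "dnd": False, "group": "A"}
office = {"now_playing": False, "dnd": False, "group": "B"}
```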

Text-driven editor for audio and video editing

The disclosed technology is a system and computer-implemented method for assembling and editing a video program from spoken words or soundbites. The disclosed technology imports source audio/video clips in any of multiple formats. Spoken audio is transcribed into searchable text. The text transcript is synchronized to the video track by timecode markers. Each spoken word corresponds to a timecode marker, which in turn corresponds to a video frame or frames. Using word processing operations and text editing functions, a user selects video segments by selecting corresponding transcribed text segments. By selecting text and arranging that text, a corresponding video program is assembled. The selected video segments are assembled on a timeline display in any order chosen by the user. The sequence of video segments may be reordered and edited, as desired, to produce a finished video program for export.
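The word-to-timecode mapping can be sketched as follows. Because each word carries a timecode marker, a selected span of transcript text resolves to an in/out pair of timecodes on the timeline; `WordMarker` and the selection format are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class WordMarker:
    word: str
    start_tc: float  # timecode (seconds) of the first frame for this word
    end_tc: float    # timecode just past the last frame for this word

def segment_for_selection(markers, first_word_idx, last_word_idx):
    """Map a selected span of transcript words to the corresponding
    video segment as (in, out) timecodes."""
    return (markers[first_word_idx].start_tc, markers[last_word_idx].end_tc)

def assemble_timeline(markers, selections):
    """Build an edit list from text selections, in the order the
    user arranged them on the timeline."""
    return [segment_for_selection(markers, a, b) for a, b in selections]

# Illustrative transcript of a short clip.
markers = [
    WordMarker("we", 0.0, 0.4),
    WordMarker("begin", 0.4, 0.9),
    WordMarker("the", 1.0, 1.3),
    WordMarker("demo", 1.3, 2.0),
]
```

Selecting "the demo" and then "we begin" yields two segments in that user-chosen order, which is the reordering behavior the abstract describes.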

Remote visualization of real-time three-dimensional (3D) facial animation with synchronized voice

Described herein are methods and systems for remote visualization of real-time three-dimensional (3D) facial animation with synchronized voice. A sensor captures frames of a face of a person, each frame comprising color images of the face, depth maps of the face, voice data associated with the person, and a timestamp. The sensor generates a 3D face model of the person using the depth maps. A computing device receives the frames of the face and the 3D face model. The computing device preprocesses the 3D face model. For each frame, the computing device: detects facial landmarks using the color images; matches the 3D face model to the depth maps using non-rigid registration; updates a texture on a front part of the 3D face model using the color images; synchronizes the 3D face model with a segment of the voice data using the timestamp; and transmits the synchronized 3D face model and voice data to a remote device.
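The per-frame synchronization step can be sketched as a timestamp lookup: each captured frame is paired with the voice-data segment covering its timestamp. The segment representation below is an assumption for illustration:

```python
import bisect

def voice_segment_for_frame(frame_ts, segment_starts, segments):
    """Pick the voice segment whose start time is the latest one not
    after the frame's timestamp. `segment_starts` must be sorted and
    parallel to `segments`."""
    i = bisect.bisect_right(segment_starts, frame_ts) - 1
    return segments[max(i, 0)]

# Illustrative voice segments starting every 0.5 s.
starts = [0.0, 0.5, 1.0]
segments = ["seg0", "seg1", "seg2"]
```

A frame timestamped at 0.7 s would be transmitted together with the voice segment that began at 0.5 s, keeping the animated face model and the voice in sync at the remote device.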

FRAGMENT-ALIGNED AUDIO CODING

Audio-video synchronization, or alignment of audio to some other external clock, is rendered more effective or easier by treating the fragment grid and the frame grid as independent values while, nevertheless, aligning the frame grid to each fragment's beginning. A compression effectiveness loss may be kept low by appropriately selecting the fragment size. On the other hand, the alignment of the frame grid with the fragments' beginnings allows for an easy and fragment-synchronized way of handling the fragments in connection with, for example, parallel audio-video streaming, bitrate-adaptive streaming, or the like.
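Treating the two grids as independent but restarting the frame grid at each fragment boundary can be sketched like this (durations in seconds; the AAC-style numbers in the example, 1024-sample frames at 48 kHz with 2-second fragments, are illustrative):

```python
import math

def frames_per_fragment(fragment_dur, frame_dur):
    """With the frame grid re-aligned to each fragment's beginning,
    a fragment holds ceil(fragment_dur / frame_dur) frames; the last
    frame may overshoot the boundary and be trimmed there."""
    return math.ceil(fragment_dur / frame_dur)

def frame_start_times(fragment_start, fragment_dur, frame_dur):
    """Frame grid for one fragment, aligned to the fragment start
    rather than continuing a global frame grid."""
    n = frames_per_fragment(fragment_dur, frame_dur)
    return [fragment_start + k * frame_dur for k in range(n)]
```

Because every fragment's frame grid starts exactly at the fragment boundary, a fragment can be decoded or swapped (e.g. for bitrate-adaptive streaming) without knowing where the previous fragment's last frame ended.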

AUDIO SIGNAL PROCESSING DEVICE AND OPERATING METHOD THEREFOR

An audio signal processing method including obtaining a first audio signal by generating a pattern in association with the first audio signal to be output, outputting the first audio signal, receiving, through an external voice input device while the external voice input device is communicatively connected to the audio signal processing device, a second audio signal including the output first audio signal, detecting the pattern from the second audio signal, and synchronizing the second audio signal with the first audio signal based on the pattern detected from the second audio signal and the pattern included in the first audio signal.
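A minimal sketch of the pattern-detection step, assuming the embedded pattern can be located by brute-force cross-correlation over sample lags; a real system would likely use frequency-domain correlation and a pattern designed to survive acoustic distortion:

```python
def find_pattern_offset(signal, pattern):
    """Locate the embedded pattern in the recorded (second) signal
    by cross-correlation; the best-matching lag is the offset used
    to synchronize it with the original (first) signal."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(signal) - len(pattern) + 1):
        score = sum(signal[lag + i] * pattern[i] for i in range(len(pattern)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Illustrative samples: the pattern reappears 4 samples into the
# signal captured by the external voice input device.
pattern = [1.0, -1.0, 1.0]
captured = [0.0, 0.1, 0.0, 0.0, 1.0, -1.0, 1.0, 0.0]
```

The returned lag (here, 4 samples) is the delay to compensate for when aligning the second audio signal with the first.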
