Patent classifications
G10L19/0018
TECHNOLOGIES FOR ENHANCING AUDIO QUALITY DURING LOW-QUALITY CONNECTION CONDITIONS
Techniques for teleconferencing with enhanced audio during low-quality connection conditions are disclosed. In the illustrative embodiment, a user of a compute device is teleconferencing with users of one or more remote compute devices. The compute device monitors a connection quality with a remote compute device. If the connection quality drops below a threshold, risking gaps in the audio data, the compute device generates speech code data that can be used to fill in the gaps in the audio data. The remote compute device can use the speech code data to augment the audio data by using a voice model to create additional audio data based on the speech code data.
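The monitoring-and-fallback logic described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the quality metric, threshold, and payload fields are all assumed names.

```python
# Hypothetical sketch of the sender-side logic: monitor connection quality
# and, below a threshold, attach compact "speech code" data that the
# receiver's voice model could use to fill gaps in the audio.

QUALITY_THRESHOLD = 0.7  # assumed normalized quality score in [0, 1]

def estimate_quality(packet_loss: float, jitter_ms: float) -> float:
    """Toy quality metric combining loss and jitter (illustrative only)."""
    return max(0.0, 1.0 - packet_loss - min(jitter_ms, 100.0) / 200.0)

def build_payload(audio_frame: bytes, speech_codes: bytes,
                  packet_loss: float, jitter_ms: float) -> dict:
    """Attach speech code data only when the link quality is at risk."""
    quality = estimate_quality(packet_loss, jitter_ms)
    payload = {"audio": audio_frame, "quality": quality}
    if quality < QUALITY_THRESHOLD:
        # The receiver can feed these codes to a voice model to synthesize
        # replacement audio for frames lost in transit.
        payload["speech_codes"] = speech_codes
    return payload
```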
DURATION INFORMED ATTENTION NETWORK (DURIAN) FOR AUDIO-VISUAL SYNTHESIS
A method and apparatus include receiving a text input that includes a sequence of text components. Respective temporal durations of the text components are determined using a duration model. A spectrogram frame is generated based on the duration model. An audio waveform is generated based on the spectrogram frame. Video information is generated based on the audio waveform. The audio waveform is provided as an output along with a corresponding video.
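The duration-model step, in which each text component is assigned a temporal duration that governs how many spectrogram frames it spans, can be sketched as a simple length-regulation pass. The function name and integer frame counts are illustrative assumptions.

```python
def expand_by_duration(text_components, durations):
    """Repeat each text component for its predicted number of frames,
    yielding the frame-aligned sequence from which spectrogram frames
    would be generated (illustrative of the duration-model step)."""
    assert len(text_components) == len(durations)
    frames = []
    for component, num_frames in zip(text_components, durations):
        frames.extend([component] * num_frames)
    return frames
```

Given per-component durations, the output length equals the total frame count, which is what downstream spectrogram and waveform generation would consume.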
Virtualized speech in a distributed network environment
Aspects of the disclosure relate to various systems and techniques that provide a method and apparatus for transmitting speech as text to a remote server and converting the text stream back to speech for delivery to a remote application. For example, a person, through workspace virtualization, accesses a remote application that accepts speech as its input. The user speaks into a microphone, and the speech is converted into text by a local speech-to-text converter. The text version of the speech is sent to a remote server, which converts the text back to speech using a server-based text-to-speech converter; the reconstructed speech is then usable as input to the remote application or device.
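The round trip described above can be sketched as a three-stage pipeline. The converter functions here are stand-ins, not a real STT/TTS API; only the text stage would actually cross the network.

```python
# Minimal sketch of the virtualized-speech round trip: speech is converted
# to text locally, only the text crosses the network, and a server-side
# text-to-speech stage reconstructs audio for the remote application.

def local_speech_to_text(audio: bytes) -> str:
    """Stand-in for a local STT engine (here it just decodes UTF-8)."""
    return audio.decode("utf-8")

def remote_text_to_speech(text: str) -> bytes:
    """Stand-in for a server-side TTS engine."""
    return text.encode("utf-8")

def virtualized_speech_roundtrip(audio: bytes) -> bytes:
    text = local_speech_to_text(audio)   # client side
    # ... the text is transmitted to the remote server,
    #     far cheaper than streaming the audio itself ...
    return remote_text_to_speech(text)   # server side, fed to the app
```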
Neural network model for generation of compressed haptic actuator signal from audio input
A method comprises inputting an audio signal into a machine learning circuit to compress the audio signal into a sequence of actuator signals. The machine learning circuit is trained by receiving a training set of acoustic signals and pre-processing the training set into pre-processed audio data, which includes at least a spectrogram. The training further includes training the machine learning circuit using the pre-processed audio data. The neural network has a cost function based on a reconstruction error and a plurality of constraints. The machine learning circuit generates a sequence of haptic cues corresponding to the audio input. The sequence of haptic cues is transmitted to a plurality of cutaneous actuators to generate a sequence of haptic outputs.
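The cost function named above, reconstruction error plus constraint terms, can be sketched numerically. The specific constraint (an actuator amplitude limit) and the penalty weight are assumptions; the patent only states that a plurality of constraints exists.

```python
# Sketch of the training cost: mean squared reconstruction error plus a
# penalty term for violating an assumed actuator amplitude constraint.

def cost(reconstructed, original, actuator_signal,
         max_amplitude=1.0, penalty_weight=10.0):
    # Reconstruction error between original and decoded audio samples.
    mse = sum((r - o) ** 2 for r, o in zip(reconstructed, original)) / len(original)
    # Constraint penalty: actuator drive must stay within physical limits.
    violation = sum(max(0.0, abs(a) - max_amplitude) for a in actuator_signal)
    return mse + penalty_weight * violation
```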
CAPTION ASSISTED CALLING TO MAINTAIN CONNECTION IN CHALLENGING NETWORK CONDITIONS
Systems are provided for managing and coordinating STT/TTS systems and the communications between these systems when they are connected in online meetings, and for mitigating connectivity issues that may arise during the online meetings, to provide a seamless and reliable meeting experience with live captions and/or rendered audio. Initially, online meeting communications are transmitted over a lossy, connectionless-type protocol/channel. Then, in response to detected connectivity problems with one or more systems involved in the online meeting, which can cause jitter or packet loss, for example, an instruction is dynamically generated and processed for causing one or more of the connected systems to transmit and/or process the online meeting content with a more reliable connection/protocol, such as a connection-oriented protocol. Codecs at the systems are used, when needed, to convert speech to text with related speech attribute information and to convert text to speech.
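The fallback decision described above can be sketched as a simple transport selector. The jitter and loss thresholds are illustrative assumptions; the abstract only says the switch is triggered by detected connectivity problems.

```python
# Sketch of the fallback: meeting traffic starts on a lossy connectionless
# channel (UDP-like) and switches to a connection-oriented channel
# (TCP-like) when connectivity problems are detected.

JITTER_LIMIT_MS = 30.0
LOSS_LIMIT = 0.05  # 5% packet loss

def choose_transport(jitter_ms: float, packet_loss: float) -> str:
    """Return which channel the meeting content should use."""
    if jitter_ms > JITTER_LIMIT_MS or packet_loss > LOSS_LIMIT:
        # Degraded link: fall back to the reliable, connection-oriented
        # path and rely on STT/TTS codecs for captions or rendered audio.
        return "connection-oriented"
    return "connectionless"
```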
Method for communicating a non-speech message as audio
A method is provided for communicating a non-speech message as audio from a first device to a second device such that information can be passed between the first and second devices. The method includes: encoding the non-speech message as a dissimilar speech message having a plurality of phonemes; transmitting the speech message over one or more audio communications channels from the first device; receiving the speech message at the second device; recognizing the speech message; and decoding the dissimilar speech message back to the non-speech message. By using existing audio functionality and increasingly reliable voice-recognition applications, an improved method is provided for sharing complex data messages over commonly available communication channels.
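The encode/decode steps can be sketched as a toy codec that maps message bytes to pronounceable phoneme-like syllables and back. The 16-syllable alphabet is invented for illustration; the patent does not specify an encoding table.

```python
# Toy codec in the spirit of the claim: encode arbitrary bytes as a
# sequence of pronounceable syllables, transmit them over an audio
# channel, then decode the recognized syllables back to bytes.

SYLLABLES = ["ba", "be", "bi", "bo", "da", "de", "di", "do",
             "ka", "ke", "ki", "ko", "ma", "me", "mi", "mo"]
INDEX = {s: i for i, s in enumerate(SYLLABLES)}

def encode_message(data: bytes) -> str:
    """Map each nibble (4 bits) of the message to a syllable."""
    out = []
    for byte in data:
        out.append(SYLLABLES[byte >> 4])
        out.append(SYLLABLES[byte & 0x0F])
    return " ".join(out)

def decode_message(speech: str) -> bytes:
    """Inverse mapping, as applied to a speech recognizer's output."""
    nibbles = [INDEX[s] for s in speech.split()]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))
```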
Audio recording optimization for calls serviced by an artificial intelligence agent
Artificial agents utilized for voice interactions continue to improve in their capacity to conduct sophisticated interactions. Rather than presenting only a limited set of options, artificial agents continue to narrow the gap between generated speech and natural human speech. A requirement is often in place that spoken interactions be recorded; however, storing speech, even with data compression, is a resource-demanding task. Generated speech may be produced from content, such as text, together with speech data. By recording an identifier of the content and the associated speech data, storage processing and space requirements can be greatly reduced. Playback may be provided from a waveform of the audio supplied by the human participant and, for the agent, by selecting the content associated with the content identifier and generating speech of the content using the settings provided by the speech data.
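The storage optimization described above can be sketched as follows: instead of recording the agent's synthesized audio, store only a content identifier and the speech settings, and regenerate the speech at playback. All names and the content store are invented for illustration.

```python
# Illustrative content store mapping identifiers to the text the agent spoke.
CONTENT_STORE = {"greet-001": "Hello, how can I help you today?"}

def record_agent_turn(content_id: str, speech_settings: dict) -> dict:
    """What gets written to the call recording for an agent utterance:
    an identifier plus settings, not the audio waveform itself."""
    return {"content_id": content_id, "settings": speech_settings}

def playback_agent_turn(record: dict) -> str:
    """Regenerate the agent's speech from the stored identifier/settings.
    A real system would run TTS; here we return the text it would speak,
    tagged with the voice setting."""
    text = CONTENT_STORE[record["content_id"]]
    return f"[voice={record['settings'].get('voice', 'default')}] {text}"
```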
Real time digital voice communication method
A communication system includes at least one first device and at least one second device, which are linked in a manner that enables data transfer between them. The first device expresses the speech signal it receives as input in terms of energy functions representing the energy patterns, information functions representing the information patterns, and noise functions of the frames of the real speech samples; it then transfers the database indexes of these functions and the frame gain factor of each frame to the second device. The second device retrieves the functions via the indexes from a copy of the database and reconstructs the speech signal from these functions and the frame gain factors, providing it as the voice output.
Phase reconstruction in a speech decoder
Innovations in phase quantization during speech encoding and phase reconstruction during speech decoding are described. For example, to encode a set of phase values, a speech encoder omits higher-frequency phase values and/or represents at least some of the phase values as a weighted sum of basis functions. Or, as another example, to decode a set of phase values, a speech decoder reconstructs at least some of the phase values using a weighted sum of basis functions and/or reconstructs lower-frequency phase values then uses at least some of the lower-frequency phase values to synthesize higher-frequency phase values. In many cases, the innovations improve the performance of a speech codec in low bitrate scenarios, even when encoded data is delivered over a network that suffers from insufficient bandwidth or transmission quality problems.
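The basis-function idea can be sketched with the simplest possible basis, a constant term and a linear term: fit the transmitted lower-frequency phase values to the basis by least squares, then synthesize higher-frequency phases from the same weights. A real codec would use richer bases; this is purely illustrative.

```python
# Sketch: represent low-frequency phase values as a weighted sum of basis
# functions (here, 1 and k), then synthesize higher-frequency phases from
# the fitted weights.

def fit_linear_phase(phases):
    """Closed-form least squares for phi_k ~ w0 + w1 * k."""
    n = len(phases)
    ks = list(range(n))
    mean_k = sum(ks) / n
    mean_p = sum(phases) / n
    var_k = sum((k - mean_k) ** 2 for k in ks)
    w1 = sum((k - mean_k) * (p - mean_p) for k, p in zip(ks, phases)) / var_k
    w0 = mean_p - w1 * mean_k
    return w0, w1

def synthesize_phases(w0, w1, total_bins):
    """Reconstruct all phases, including higher-frequency bins the encoder
    omitted, from the transmitted basis weights."""
    return [w0 + w1 * k for k in range(total_bins)]
```

Note how the decoder extends the fit past the last transmitted bin, which mirrors the claim that lower-frequency phase values are used to synthesize higher-frequency ones.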