G10L25/69

Method, system, and device for cloud voice quality monitoring

Systems and methods for communications are disclosed. The systems and methods can monitor a cloud-based voice over internet protocol (VoIP) calling system to determine an active call. The systems and methods can also analyze the active call to determine an indication of call quality, with the analysis occurring during the active call. Additionally, the systems and methods can compare the indication of call quality to a quality threshold. The comparison can occur during the active call to determine when the active call has poor call quality. The systems and methods can also report the poor call quality based on comparing the indication of call quality to the quality threshold.
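
The compare-and-report step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `CallSample` type, the `find_poor_calls` function, and the MOS-like quality scale are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CallSample:
    """One quality measurement taken while a call is still active (hypothetical type)."""
    call_id: str
    quality: float  # assumed scale: higher is better, e.g. a MOS-like score from 1.0 to 5.0


def find_poor_calls(samples: Iterable[CallSample], threshold: float) -> List[str]:
    """Return the IDs of calls with at least one sample below the quality threshold."""
    poor: List[str] = []
    for sample in samples:
        if sample.quality < threshold and sample.call_id not in poor:
            poor.append(sample.call_id)
    return poor


# Three measurements from two active calls; only call "b" drops below 3.0.
samples = [CallSample("a", 4.2), CallSample("b", 2.1), CallSample("a", 3.9)]
poor_calls = find_poor_calls(samples, threshold=3.0)
```

In a live system the samples would arrive as a stream and the report would be raised as soon as a call crosses the threshold; the batch form here only keeps the example short.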

METHOD AND APPARATUS FOR AUDIO SIGNAL PROCESSING EVALUATION
20220208171 · 2022-06-30

A method and an apparatus for audio signal processing evaluation are provided. The audio signal processing is performed on a synthesized audio signal to generate a processed audio signal. The synthesized audio signal is generated by adding a secondary signal to a master signal. The master signal consists solely of speech. The audio signal processing relates to removing the secondary signal from the synthesized audio signal. The sound characteristics of the processed audio signal and of the master signal are obtained. The sound characteristics include text content, which is generated by performing speech-to-text on the processed audio signal and on the master signal. The audio signal processing is evaluated according to the result of comparing the sound characteristics of the processed audio signal with those of the master signal. The comparison result includes the correctness of the text content of the processed audio signal relative to that of the master signal.
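
One way to score the correctness of the processed signal's text content against the master signal's text content is a word error rate. The sketch below assumes both transcripts are already available as strings. The abstract does not name a specific metric, so the choice of word error rate and the function name `word_error_rate` are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Transcript of the master (speech-only) signal versus the processed signal.
master_text = "turn on the light"
processed_text = "turn off the light"
wer = word_error_rate(master_text, processed_text)
```

A lower value suggests the processing removed the secondary signal without damaging the speech content.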

Streaming voice conversion method and apparatus and computer readable storage medium using the same

The present disclosure provides a streaming voice conversion method as well as an apparatus and a computer readable storage medium using the same. The method includes: obtaining to-be-converted voice data; partitioning the to-be-converted voice data, in order of data obtaining time, into a plurality of to-be-converted partition voices, where each to-be-converted partition voice carries a partition mark; performing a voice conversion on each of the to-be-converted partition voices to obtain a converted partition voice, where the converted partition voice carries a partition mark; performing a partition restoration on each of the converted partition voices to obtain a restored partition voice, where the restored partition voice carries a partition mark; and outputting each of the restored partition voices according to the partition mark carried by the restored partition voice. In this manner, the response time is shortened, and the conversion speed is improved.
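
The partition, convert, and output-by-mark flow can be sketched as follows. The function names, the fixed partition size, and the use of an integer index as the partition mark are assumptions made for illustration. The partition restoration step is left out because the abstract does not say what it involves.

```python
from typing import Callable, Iterator, List, Tuple

Chunk = List[float]


def partition(samples: Chunk, size: int) -> Iterator[Tuple[int, Chunk]]:
    """Split samples in arrival order; the integer index serves as the partition mark."""
    for mark, start in enumerate(range(0, len(samples), size)):
        yield mark, samples[start:start + size]


def stream_convert(samples: Chunk, size: int, convert: Callable[[Chunk], Chunk]) -> Chunk:
    """Convert each partition independently, then emit the results ordered by mark."""
    converted = [(mark, convert(chunk)) for mark, chunk in partition(samples, size)]
    output: Chunk = []
    for _, chunk in sorted(converted, key=lambda pair: pair[0]):
        output.extend(chunk)
    return output


def double(chunk: Chunk) -> Chunk:
    """Placeholder "conversion" that doubles each sample."""
    return [2 * x for x in chunk]


result = stream_convert([1.0, 2.0, 3.0, 4.0, 5.0], size=2, convert=double)
```

Because each partition carries its own mark, the conversions could run concurrently and still be emitted in the original order.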

DYNAMIC VIRTUAL ASSISTANT SPEECH MODULATION

A method, a computer system, and a computer program product for dynamic speech modulation are provided. The present invention may include transmitting a first response to a received command. The present invention may include determining that the first response is not understood by a user. The present invention may include transmitting a second response to the received command.

Self-Supervised Speech Representations for Fake Audio Detection
20220172739 · 2022-06-02

A method for determining synthetic speech includes receiving audio data characterizing speech in audio data obtained by a user device. The method also includes generating, using a trained self-supervised model, a plurality of audio feature vectors each representative of audio features of a portion of the audio data. The method also includes generating, using a shallow discriminator model, a score indicating a presence of synthetic speech in the audio data based on the corresponding audio features of each audio feature vector of the plurality of audio feature vectors. The method also includes determining whether the score satisfies a synthetic speech detection threshold. When the score satisfies the synthetic speech detection threshold, the method includes determining that the speech in the audio data obtained by the user device comprises synthetic speech.
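
The scoring and threshold steps can be sketched as follows. The self-supervised feature extractor is not modeled, so the feature vectors are taken as given. Using a single logistic layer as the "shallow discriminator", averaging the per-vector scores, and the names `synthetic_score` and `is_synthetic` are all assumptions of this sketch rather than details from the abstract.

```python
import math
from typing import Sequence


def synthetic_score(
    feature_vectors: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Score each feature vector with a logistic layer and average the results."""
    scores = []
    for vector in feature_vectors:
        logit = sum(w * x for w, x in zip(weights, vector)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return sum(scores) / len(scores)


def is_synthetic(score: float, threshold: float) -> bool:
    """Treat the threshold as satisfied when the score reaches or exceeds it."""
    return score >= threshold


# Two hypothetical 2-dimensional feature vectors and hand-picked weights.
score = synthetic_score([[0.0, 0.0], [0.0, 0.0]], weights=[1.0, 1.0], bias=0.0)
flagged = is_synthetic(score, threshold=0.5)
```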

METHOD AND APPARATUS FOR DESIGNING AND TESTING AUDIO CODEC BY USING WHITE NOISE MODELING

Provided are a method and an apparatus for designing and testing an audio codec using quantization based on white noise modeling. A neural network-based audio encoder design method includes generating a quantized latent vector and a reconstructed signal corresponding to an input signal by using a white noise modeling-based quantization process, computing a total loss for training a neural network-based audio codec based on the input signal, the reconstructed signal, and the quantized latent vector, training the neural network-based audio codec by using the total loss, and validating the trained neural network-based audio codec to select the best neural network-based audio codec.
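
A common way to model quantization as white noise in learned codecs is to add uniform noise of one quantization step in width during training and to round at test time. The abstract does not confirm that this is the disclosed process, so the sketch below is only an illustration of that general idea; the function names and the step size are assumptions.

```python
import random
from typing import List


def quantize_round(latent: List[float], step: float) -> List[float]:
    """Hard quantization, as would be used at test time: snap each value to the step grid."""
    return [step * round(x / step) for x in latent]


def quantize_noise(latent: List[float], step: float, rng: random.Random) -> List[float]:
    """Training-time stand-in: model the quantization error as uniform white noise."""
    return [x + rng.uniform(-step / 2, step / 2) for x in latent]


latent = [0.26, -0.74]
hard = quantize_round(latent, step=0.5)
soft = quantize_noise(latent, step=0.5, rng=random.Random(0))
```

The noisy version keeps the latent differentiable with respect to the encoder output, which is why it is often preferred while training.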

Managing jitter buffer length for improved audio quality

A technique for managing real-time communications includes generating, during a communication session between at least a first computing device and a second computing device over a computer network, multiple audio factors of the communication session, each of the audio factors being susceptible to degradation in a way that affects audio quality of the communication session. The technique further includes combining the audio factors to produce an overall measure of audio quality and taking remedial action to improve the overall measure of audio quality by adjusting a setting on the first computing device.
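
The combine-then-remediate idea can be sketched as follows. The factor names, the normalization of each factor to the range 0 to 1, the weighted average, and the fixed 20 ms step are all assumptions; the abstract does not specify how the factors are combined or how the setting is adjusted.

```python
from typing import Dict


def overall_quality(factors: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-factor scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights[name] for name in factors)
    return sum(factors[name] * weights[name] for name in factors) / total_weight


def adjust_jitter_buffer(
    current_ms: int,
    quality: float,
    target: float = 0.8,
    step_ms: int = 20,
    max_ms: int = 200,
) -> int:
    """Lengthen the jitter buffer when overall quality is below target, up to a cap."""
    if quality < target:
        return min(current_ms + step_ms, max_ms)
    return current_ms


factors = {"jitter": 0.5, "packet_loss": 1.0}
weights = {"jitter": 1.0, "packet_loss": 1.0}
quality = overall_quality(factors, weights)
new_buffer_ms = adjust_jitter_buffer(60, quality)
```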