G10L2025/786

Voice Recognition Accuracy in High Noise Conditions

Systems and methods for voice recognition determine energy levels for speech and noise and generate adaptive thresholds based on the determined energy levels. The adaptive thresholds are applied to determine the presence of speech and to generate noise-dependent triggers for indicating the presence of speech during high-noise conditions. In an embodiment, the signal energy is averaged in the presence of speech and in the presence of background noise. Audio energy calculations may be made by averaging via a sliding window or via a memory filter.
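The energy averaging and adaptive thresholding described in this abstract could be sketched roughly as follows. This is a minimal illustration, not the patented method: the class and function names, the `margin`, `initial_ratio`, and `alpha` parameters, the rule that places the threshold between the noise and speech averages, and the assumption that the first frame contains only noise are all hypothetical choices made for the example.

```python
from collections import deque


def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


class SlidingWindowAverage:
    """Average of the most recent `size` values (sliding window)."""

    def __init__(self, size):
        self.values = deque(maxlen=size)

    def update(self, x):
        self.values.append(x)
        return sum(self.values) / len(self.values)


class MemoryFilterAverage:
    """Recursive average: avg = alpha * avg + (1 - alpha) * x."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.avg = None

    def update(self, x):
        if self.avg is None:
            self.avg = x
        else:
            self.avg = self.alpha * self.avg + (1 - self.alpha) * x
        return self.avg


def detect_speech(frames, margin=0.5, initial_ratio=4.0, alpha=0.9):
    """Return one boolean per frame (True = speech).

    Assumption: the threshold sits a fraction `margin` of the way from the
    running noise energy to the running speech energy.
    """
    noise = MemoryFilterAverage(alpha)
    speech = MemoryFilterAverage(alpha)
    decisions = []
    for frame in frames:
        energy = frame_energy(frame)
        if noise.avg is None:
            # Assumption: the first frame is background noise only.
            noise.update(energy)
            speech.update(energy * initial_ratio)
            decisions.append(False)
            continue
        threshold = noise.avg + margin * (speech.avg - noise.avg)
        is_speech = energy > threshold
        # Update only the average that matches the decision.
        (speech if is_speech else noise).update(energy)
        decisions.append(is_speech)
    return decisions
```

Either averager can be swapped into `detect_speech`; the memory filter needs only one stored value per average, while the sliding window forgets old frames completely after `size` updates.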

VOICE ACTIVITY DETECTION

Methods, systems, and computer-readable media are provided for detecting voice activity. A primary signal is configured to include a speech component representative of a user's speech when the user is speaking in a detection region, or environment. A reference signal is configured to include a reduced speech component relative to the primary signal. One or more conditions of the detection region is/are detected, and a threshold value is selected (or, optionally, calculated) based upon the detected condition(s). The primary signal is compared to the reference signal, with respect to the selected threshold value. An indication of whether the user is speaking is selectively output, based at least in part upon the comparison.
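One way to picture the comparison of the primary and reference signals is an energy ratio checked against a condition-dependent threshold. The sketch below is only an illustration under stated assumptions: the condition labels, the threshold values in `THRESHOLDS_DB`, and the choice of an energy ratio in decibels as the comparison are hypothetical and are not taken from the abstract.

```python
import math

# Hypothetical thresholds (dB) keyed by a detected condition of the region.
THRESHOLDS_DB = {"quiet": 3.0, "noisy": 9.0}


def signal_energy(samples):
    """Mean squared amplitude of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def user_is_speaking(primary, reference, condition):
    """Compare primary to reference energy against the selected threshold."""
    threshold_db = THRESHOLDS_DB[condition]
    eps = 1e-12  # avoids division by zero and log of zero on silent input
    ratio = (signal_energy(primary) + eps) / (signal_energy(reference) + eps)
    ratio_db = 10.0 * math.log10(ratio)
    return ratio_db > threshold_db
```

With a primary signal about 6 dB stronger than the reference, this sketch reports speech under the "quiet" threshold but not under the stricter "noisy" one.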

CLASSIFICATION OF AUDIO AS ORIGINATING FROM A HUMAN SOURCE OR A NON-HUMAN TO AVOID FALSE WAKE-WORD DETECTION
20220157333 · 2022-05-19

An audio processing device determines a correlation between a location of a source of the audio signal associated with the audio command and a human positioned at the location of the source of the audio signal associated with the audio command. The audio command is identified as originating from the human positioned at the location based on the position data and the location of the source of the audio signal associated with the audio command correlating. Based on identifying the audio command as originating from the human positioned at the location, the audio command is executed.

VOICE ACTIVITY DETECTION (VAD) BASED ON MULTIPLE INDICIA

In an example, a machine-implemented method for detecting voice activity may include receiving a digital representation of an audio signal. The method may also include applying a first stage which may include determining a first frequency-domain indicator from the digital representation of the audio signal to identify a candidate speech duration. The method may also include applying a second stage which may include determining at least one of a mel-frequency cepstral (MFC) indicator or a pitch indicator from the digital representation of the audio signal to assess whether the identified candidate speech duration contains speech.

Voice Activity Detection Using Zero Crossing Detection
20220130410 · 2022-04-28

A first VAD system outputs a pulse stream for zero crossings in an audio signal. The pulse density of the pulse stream is evaluated to identify speech. The audio signal may have noise added to it before evaluating zero crossings. A second VAD system rectifies each audio signal sample and processes each rectified sample by updating a first statistic and evaluating the rectified sample per a first threshold condition that is a function of the first statistic. Rectified samples meeting the first threshold condition may be used to update a second statistic and the rectified sample evaluated per a second threshold condition that is a function of the second statistic. Rectified samples meeting the second threshold condition may be used to update a third statistic. The audio signal sample may be selected as speech if the second statistic is less than a downscaled third statistic.
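The first system's pulse stream and pulse density can be sketched as below. The function names and the use of non-overlapping windows are assumptions for illustration. The abstract does not say how the density maps to a speech decision, so the sketch stops at computing the density.

```python
def zero_crossing_pulses(samples):
    """Emit 1 at each sample whose sign differs from the previous sample."""
    pulses = [0]  # the first sample has no predecessor, so no pulse
    for prev, cur in zip(samples, samples[1:]):
        pulses.append(1 if (prev < 0) != (cur < 0) else 0)
    return pulses


def pulse_density(pulses, window):
    """Fraction of positions carrying a pulse, per non-overlapping window."""
    return [
        sum(pulses[i:i + window]) / window
        for i in range(0, len(pulses) - window + 1, window)
    ]
```

For example, a signal that alternates sign on every sample yields a density close to 1, while a signal that never changes sign yields 0.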

METHOD AND SYSTEM FOR IMPROVING ESTIMATION OF SOUND SOURCE LOCALIZATION BY USING INDOOR POSITION DATA FROM WIRELESS SYSTEM
20220130416 · 2022-04-28

A method, an electronic device or customer-premise equipment, and a computer readable medium are disclosed for estimating a sound source. The method includes detecting, on an electronic device, voice data from a space; calculating, on the electronic device, an estimated voice source location from the detected voice data; detecting, on the electronic device, wireless location data from a positioning system within the space; calculating, on the electronic device, a probability of a user within one or more regions from the calculated estimated voice source location and the detected wireless location data, the one or more regions being regions of a plurality of regions within the space; and steering, from the electronic device, a microphone array for voice detection toward the one or more regions having the probability of the user within the one or more regions.

SYSTEMS AND METHODS FOR DYNAMICALLY ADJUSTING A LISTENING TIME OF A VOICE ASSISTANT DEVICE

A method of adjusting a predefined listening time of a voice assistant device includes receiving an audio input; extracting at least one of a speech component and a non-speech artifact from the audio input; determining a user breathing pattern based on the at least one of the speech component and the non-speech artifact; identifying at least one attribute that impacts the user breathing pattern based on at least one non-speech component captured from an environment and the voice assistant device; determining, after detecting a pause in the audio input, whether a user's intention is to continue a conversation based on an analysis of the user breathing pattern and the at least one attribute; and dynamically adjusting the predefined listening time of the voice assistant device to continue listening for voice commands in the conversation based on a determination that the user's intention is to continue the conversation.

Discontinuous Transmission on Short-Range Packet-Based Radio Links
20210360735 · 2021-11-18

Described herein are systems and methods for discontinuous transmission on short-range packet-based radio links. As one option, during a communication session with a far-end system on a short-range packet-based radio link, an audio stream received on an audio line in is monitored for voice-based signals. Based on monitoring the audio stream for the voice-based signals, a voice activity estimation signal is generated. While the voice activity estimation signal exceeds a predetermined threshold, one or more voice packets are generated based on the audio stream, and transmitted to the far-end system at one or more times. Also, in response to determining that the voice activity estimation signal is below the predetermined threshold, one or more zero-payload packets are transmitted to the far-end system at one or more subsequent times.

Adaptive Energy Limiting for Transient Noise Suppression
20220122625 · 2022-04-21

The present disclosure describes aspects of adaptive energy limiting for transient noise suppression. In some aspects, an adaptive energy limiter sets a limiter ceiling for an audio signal to full scale and receives a portion of the audio signal. For the portion of the audio signal, the adaptive energy limiter determines a maximum amplitude and evaluates the portion with a neural network to provide a voice likelihood estimate. Based on the maximum amplitude and the voice likelihood estimate, the adaptive energy limiter determines that the portion of the audio signal includes noise. In response to determining that the portion of the audio signal includes noise, the adaptive energy limiter decreases the limiter ceiling and provides the limiter ceiling to a limiter module effective to limit an amount of energy of the audio signal. This may be effective to prevent audio signals from carrying full energy transient noise into conference audio.
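The control flow in this abstract can be sketched as follows. This is a simplified illustration under several assumptions: the neural network is replaced by a caller-supplied `voice_likelihood` function, the noise decision is "high peak and low voice likelihood", the ceiling is lowered by a fixed factor `step` down to a `floor`, and the limiter is modeled as hard clipping. None of these specifics come from the abstract.

```python
FULL_SCALE = 1.0


def limit_transients(portions, voice_likelihood, amp_threshold=0.5,
                     voice_threshold=0.5, step=0.5, floor=0.1):
    """Return (limited portions, final ceiling) for a sequence of portions.

    `voice_likelihood` is a hypothetical stand-in for the neural network:
    it takes a portion and returns a value in [0, 1].
    """
    ceiling = FULL_SCALE  # the limiter ceiling starts at full scale
    limited = []
    for portion in portions:
        peak = max(abs(s) for s in portion)
        likelihood = voice_likelihood(portion)
        # Assumption: loud and unlikely to be voice means transient noise.
        if peak > amp_threshold and likelihood < voice_threshold:
            ceiling = max(floor, ceiling * step)
        # Hard clipping stands in for the limiter module.
        limited.append([max(-ceiling, min(ceiling, s)) for s in portion])
    return limited, ceiling
```

A production limiter would normally apply smooth gain reduction rather than clipping, and would need some rule for raising the ceiling again; both are outside what the abstract states and are left out here.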

Adaptive energy limiting for transient noise suppression
11217262 · 2022-01-04

The present disclosure describes aspects of adaptive energy limiting for transient noise suppression. In some aspects, an adaptive energy limiter sets a limiter ceiling for an audio signal to full scale and receives a portion of the audio signal. For the portion of the audio signal, the adaptive energy limiter determines a maximum amplitude and evaluates the portion with a neural network to provide a voice likelihood estimate. Based on the maximum amplitude and the voice likelihood estimate, the adaptive energy limiter determines that the portion of the audio signal includes noise. In response to determining that the portion of the audio signal includes noise, the adaptive energy limiter decreases the limiter ceiling and provides the limiter ceiling to a limiter module effective to limit an amount of energy of the audio signal. This may be effective to prevent audio signals from carrying full energy transient noise into conference audio.