G10L2025/937

Detection of live speech
11705109 · 2023-07-18 ·

A method of detecting live speech comprises: receiving a signal containing speech; obtaining a first component of the received signal in a first frequency band, wherein the first frequency band includes audio frequencies; and obtaining a second component of the received signal in a second frequency band higher than the first frequency band. Then, modulation of the first component of the received signal is detected; modulation of the second component of the received signal is detected; and the modulation of the first component of the received signal and the modulation of the second component of the received signal are compared. It may then be determined that the speech may not be live speech, if the modulation of the first component of the received signal differs from the modulation of the second component of the received signal.
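
The band split and modulation comparison described above can be sketched as follows. This is an illustrative reading only: the band edges, the Hilbert-envelope modulation measure, and the correlation threshold are all assumptions, not values from the patent. The intuition is that a loudspeaker replaying recorded speech typically fails to reproduce the higher band's modulation, so the two envelopes diverge.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def band_envelope(x, fs, lo, hi):
    """Band-pass the signal and return its amplitude-modulation envelope."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return np.abs(hilbert(sosfilt(sos, x)))

def is_possibly_not_live(x, fs, corr_threshold=0.5):
    """Compare modulation in an audio band and a higher band (assumed edges)."""
    env_low = band_envelope(x, fs, 100.0, 4000.0)       # first (audio) band
    env_high = band_envelope(x, fs, 8000.0, 0.45 * fs)  # second, higher band
    # Compare the two modulation envelopes via correlation; a low
    # correlation suggests the modulations differ, i.e. possibly not live.
    corr = np.corrcoef(env_low, env_high)[0, 1]
    return corr < corr_threshold
```

With a co-modulated two-band test tone the envelopes correlate strongly and the function returns `False`; if the higher band carries no speech modulation, the correlation collapses and it returns `True`.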

APPARATUS AND METHOD FOR SIGNAL PROCESSING

A signal processing apparatus includes a frequency detector configured to receive a user input including at least one of a vibration input and a user voice, vibrate in response to the received user input, and detect a frequency of the received user input, based on the vibration, and a processor configured to determine a type of the user input received by the frequency detector, based on the frequency detected by the frequency detector, and perform a function corresponding to the user input of the determined type.
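
The processor's mapping from detected frequency to input type might look like the sketch below. The dominant-frequency estimate and the band boundary separating a vibration (tap) input from a voice input are assumptions for illustration; the patent does not specify them.

```python
import numpy as np

def dominant_frequency(x, fs):
    """Estimate the dominant frequency of the detected vibration signal."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    return np.argmax(spec) * fs / len(x)

def input_type(freq, tap_max_hz=80.0):
    """Map the detected frequency to an input type (boundary assumed)."""
    return "vibration" if freq < tap_max_hz else "voice"
```

The apparatus would then dispatch to the function associated with the returned type.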

Voice processing method, apparatus, electronic device, and storage medium

Provided in the present disclosure are a voice processing method and apparatus, an electronic device, and a storage medium. The method comprises: detecting the working state of a current call system and, when the working state is a double-talk (both ends speaking) state or a far-end speaking state, performing compression processing on the subsequent far-end voice signal; acquiring a near-end voice signal by means of a microphone; performing echo processing on the basis of the near-end voice signal and the compression-processed far-end voice signal to obtain an echo-processed near-end voice signal and a residual echo signal; performing non-linear suppression processing on the near-end voice signal and the residual echo signal; and performing gain control on the suppression-processed near-end voice signal.
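
Two of the stages above, compressing the far-end (remote-end) signal and cancelling its echo from the near-end microphone signal, can be sketched as below. The static compressor curve and the NLMS adaptive filter are standard stand-ins chosen for illustration; the patent does not disclose these particular algorithms or parameters.

```python
import numpy as np

def compress(far_end, threshold=0.3, ratio=4.0):
    """Simple static compressor applied to the far-end signal (assumed curve)."""
    mag = np.abs(far_end)
    over = mag > threshold
    out = far_end.copy()
    out[over] = np.sign(far_end[over]) * (threshold + (mag[over] - threshold) / ratio)
    return out

def nlms_echo_cancel(near, ref, taps=64, mu=0.5, eps=1e-6):
    """NLMS adaptive filter: returns the echo-reduced near-end signal and
    the echo estimate (whose residual would feed the nonlinear suppressor)."""
    w = np.zeros(taps)
    buf = np.zeros(taps)
    out = np.zeros_like(near)
    echo_est = np.zeros_like(near)
    for n in range(len(near)):
        buf[1:] = buf[:-1]          # shift in the newest reference sample
        buf[0] = ref[n]
        echo_est[n] = w @ buf       # predicted echo
        e = near[n] - echo_est[n]   # echo-reduced output sample
        out[n] = e
        w += mu * e * buf / (buf @ buf + eps)  # normalized LMS update
    return out, echo_est
```

The echo-reduced output and the residual echo estimate would then pass to the nonlinear suppression and gain-control stages named in the abstract.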

SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, AND SIGNAL PROCESSING PROGRAM

A signal processing apparatus includes a neural network ("NN"), a sorting unit, and a spatial covariance matrix calculation unit. The NN converts a mixed signal, in which the sounds of a plurality of sound sources captured over a plurality of channels are mixed, directly in the time domain into separated signals, one for each sound source, and outputs the separated signals. The sorting unit sorts the separated signals of each channel output from the NN such that the sound sources of the separated signals are aligned in the same order across the plurality of channels. The spatial covariance matrix calculation unit calculates a spatial covariance matrix corresponding to each sound source from the sorted separated signals of each channel output from the sorting unit.
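
The final stage, computing a spatial covariance matrix per source from the channel-aligned separated signals, might look like this sketch. The array shapes, STFT parameters, and the frame-averaged outer product are assumptions about a typical implementation, not details from the patent.

```python
import numpy as np

def spatial_covariance(separated, n_fft=256, hop=128):
    """separated: array of shape (sources, channels, samples), already sorted
    so that source s occupies index s on every channel.
    Returns covariances of shape (sources, freqs, channels, channels)."""
    S, C, _ = separated.shape
    window = np.hanning(n_fft)

    def stft(x):
        frames = [window * x[i:i + n_fft]
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.fft.rfft(np.array(frames), axis=1).T  # (freqs, frames)

    covs = []
    for s in range(S):
        X = np.array([stft(separated[s, c]) for c in range(C)])  # (C, F, T)
        # Average the outer product over time frames -> (F, C, C) per source.
        cov = np.einsum("cft,dft->fcd", X, X.conj()) / X.shape[2]
        covs.append(cov)
    return np.array(covs)
```

Each resulting matrix is Hermitian by construction and would typically feed a downstream beamformer or mask-refinement step.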

Acoustic voice activity detection (AVAD) for electronic systems

Acoustic Voice Activity Detection (AVAD) methods and systems are described. The AVAD methods and systems, including corresponding algorithms or programs, use microphones to generate virtual directional microphones which have very similar noise responses and very dissimilar speech responses. The ratio of the energies of the virtual microphones is then calculated over a given window size and the ratio can then be used with a variety of methods to generate a VAD signal. The virtual microphones can be constructed using either an adaptive or a fixed filter.
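
The windowed energy-ratio decision described above reduces to a short loop. The window length and threshold below are illustrative assumptions; the construction of the virtual directional microphones themselves (adaptive or fixed filters) is outside this sketch.

```python
import numpy as np

def energy_ratio_vad(v1, v2, window=256, threshold=2.0, eps=1e-12):
    """Per-window VAD decision from the energy ratio of two virtual
    microphone signals v1, v2 (speech-sensitive vs. speech-insensitive)."""
    n = min(len(v1), len(v2)) // window
    decisions = []
    for k in range(n):
        sl = slice(k * window, (k + 1) * window)
        e1 = np.sum(v1[sl] ** 2)
        e2 = np.sum(v2[sl] ** 2)
        # Similar noise responses make the ratio ~1 in noise; dissimilar
        # speech responses push it well above 1 during speech.
        decisions.append((e1 / (e2 + eps)) > threshold)
    return np.array(decisions)
```

Windows where both virtual microphones carry similar energy (noise) yield `False`; windows where the speech-sensitive microphone dominates yield `True`.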

AUDIO RECOGNIZING METHOD, APPARATUS, DEVICE, MEDIUM AND PRODUCT
20230206943 · 2023-06-29 ·

An audio recognizing method, including: performing acoustic feature prediction on audio to be recognized to obtain a first audio prediction result and an acoustic feature reference quantity for predicting an audio recognition result; obtaining a second audio prediction result based on the acoustic feature reference quantity; and determining the audio recognition result of the audio to be recognized based on the first audio prediction result and the second audio prediction result, the audio recognition result indicating unvoiced sound or voiced sound. When determining whether the audio is unvoiced or voiced, the first audio prediction result obtained by acoustic feature prediction is used in combination with the second audio prediction result obtained from the acoustic feature reference quantity, thereby making the unvoiced/voiced determination more accurate and improving audio quality in speech processing.
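
A toy stand-in for this two-stage decision is sketched below. The concrete features (zero-crossing rate for the first prediction, frame energy as the reference quantity) and the tie-break rule are assumptions chosen to illustrate the structure; the patent does not name them.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of sample pairs whose signs differ."""
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0

def classify_frame(frame, zcr_thresh=0.25, energy_thresh=0.01):
    """Combine a first prediction with a second one from a reference quantity."""
    # First prediction: unvoiced sounds (fricatives, noise) have high ZCR.
    first = "unvoiced" if zero_crossing_rate(frame) > zcr_thresh else "voiced"
    # Second prediction from the reference quantity: voiced frames are louder.
    second = "voiced" if np.mean(frame ** 2) > energy_thresh else "unvoiced"
    # Agreement decides; on disagreement, trust the reference-quantity
    # prediction (assumed tie-break, not from the patent).
    return first if first == second else second
```

A loud low-frequency tone classifies as voiced, while low-level wideband noise classifies as unvoiced, because both cues agree in each case.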

VOICE ACTIVITY DETECTION METHOD AND APPARATUS, AND STORAGE MEDIUM
20230186943 · 2023-06-15 ·

Provided are a voice activity detection method and apparatus, an electronic device and a storage medium, which relate to the technical field of voice processing, for example, to the technical field of artificial intelligence and deep learning. The specific implementation solution is described below. A first audio signal is acquired, and a frequency domain feature of the first audio signal is extracted; and the frequency domain feature of the first audio signal is input into a voice activity detection model, and a voice presence detection result output by the voice activity detection model is obtained, where the voice activity detection model is configured to detect whether voice is present in the first audio signal.
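
The two claimed steps, extracting a frequency-domain feature and feeding it to a detection model, can be outlined as follows. The log-magnitude-spectrum feature and the per-frame logistic model are placeholders for the trained voice activity detection model, which the abstract does not specify.

```python
import numpy as np

def frame_log_spectrum(x, n_fft=512, hop=256):
    """Frequency-domain feature: log-magnitude STFT frames (assumed feature)."""
    window = np.hanning(n_fft)
    frames = [window * x[i:i + n_fft]
              for i in range(0, len(x) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return np.log(spec + 1e-8)  # shape: (frames, freq_bins)

def vad_model(features, w, b):
    """Placeholder for the trained model: per-frame logistic regression
    producing a voice-presence probability in [0, 1]."""
    logits = features @ w + b
    return 1.0 / (1.0 + np.exp(-logits))
```

In a real deployment `w` and `b` (or a deep-network equivalent) would come from training; thresholding the returned probabilities yields the voice presence detection result.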

ACOUSTIC ANALYSIS OF CROWD SOUNDS

A method, computer system, and computer program product for detecting face mask usage based on crowd sound are provided. The present invention may include capturing an audio stream including crowd voice data. The present invention may also include analyzing the crowd voice data using a machine learning model to determine the number of people wearing masks. The present invention may further include, in response to determining that the number of people wearing masks does not meet a compliance threshold, displaying content to promote face mask usage.
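
The compliance-threshold decision at the end of the pipeline reduces to a simple check. The threshold value is an assumption, and the mask-count estimate would come from the trained audio model, stood in for here by caller-supplied numbers.

```python
def mask_compliance_action(est_masked, est_total, threshold=0.8):
    """Return True when promotional content should be displayed, i.e. the
    estimated masked fraction falls below the compliance threshold (assumed)."""
    if est_total == 0:
        return False  # no crowd detected; nothing to display
    return (est_masked / est_total) < threshold
```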

METHODS AND SYSTEMS FOR CLASSIFYING AUDIO SEGMENTS OF AN AUDIO SIGNAL
20170309297 · 2017-10-26 ·

The disclosed embodiments illustrate a method for classifying one or more audio segments of an audio signal. The method includes determining one or more first features of a first audio segment of the one or more audio segments. The method further includes determining one or more second features based on the one or more first features. The method includes determining one or more third features of the first audio segment, wherein each of the one or more third features is determined based on a second feature of the one or more second features of the first audio segment and at least one second feature associated with a second audio segment. Additionally, the method includes classifying the first audio segment either in an interrogative category or a non-interrogative category based on one or more of the one or more second features and the one or more third features.
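
The three-level feature scheme can be illustrated as below. The concrete choices (frame energy as first features, mean and slope as second features, cross-segment differences as third features, and a rising-contour rule for the interrogative class) are all assumptions for the sake of the example; the patent leaves the features abstract.

```python
import numpy as np

def first_features(segment):
    """First features: frame-level measurements (assumed: energy per 80-sample frame)."""
    frames = segment.reshape(-1, 80)
    return np.sum(frames ** 2, axis=1)

def second_features(first):
    """Second features, derived from the first (assumed: mean and linear slope)."""
    t = np.arange(len(first))
    slope = np.polyfit(t, first, 1)[0]
    return np.array([first.mean(), slope])

def third_features(second_cur, second_prev):
    """Third features: comparison with a neighbouring segment's second
    features (assumed: elementwise difference)."""
    return second_cur - second_prev

def classify(second_cur, third):
    """Toy rule standing in for the claimed classifier: a rising contour,
    both within the segment and relative to its neighbour, suggests a question."""
    rising = second_cur[1] > 0 and third[1] > 0
    return "interrogative" if rising else "non-interrogative"
```

A segment whose energy contour rises relative to its predecessor lands in the interrogative category under this toy rule; a falling one does not.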

Methods and Systems for Providing Consistency in Noise Reduction during Speech and Non-Speech Periods
20170221501 · 2017-08-03 ·

Methods and systems for providing consistency in noise reduction during speech and non-speech periods are provided. First and second signals are received. The first signal includes at least a voice component. The second signal includes at least the voice component modified by human tissue of a user. First and second weights may be assigned per subband to the first and second signals, respectively. The first and second signals are processed to obtain respective first and second full-band power estimates. During periods when the user's speech is not present, the first weight and the second weight are adjusted based at least partially on the first full-band power estimate and the second full-band power estimate. The first and second signals are blended based on the adjusted weights to generate an enhanced voice signal. The second signal may be aligned with the first signal prior to the blending.
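
The weight-adjustment and blending steps can be sketched as follows. The subband layout and the specific adjustment rule (nudging toward the lower-power signal during non-speech periods) are assumptions standing in for the patented logic, which the abstract states only in terms of the two full-band power estimates.

```python
import numpy as np

def full_band_power(x):
    """Full-band power estimate of a signal."""
    return np.mean(np.abs(x) ** 2)

def adjust_weights(w1, w2, p1, p2, speech_present, step=0.05):
    """During non-speech periods, nudge the per-subband weights toward the
    signal with the lower full-band power estimate (illustrative rule)."""
    if not speech_present:
        if p1 > p2:
            w1 = np.maximum(w1 - step, 0.0)
            w2 = np.minimum(w2 + step, 1.0)
        else:
            w1 = np.minimum(w1 + step, 1.0)
            w2 = np.maximum(w2 - step, 0.0)
    return w1, w2

def blend(sub1, sub2, w1, w2):
    """sub1, sub2: (subbands, samples) time-aligned signals; w1, w2: per-subband
    weights. Returns the enhanced voice signal, subband by subband."""
    return w1[:, None] * sub1 + w2[:, None] * sub2
```

Keeping the weights moving only during non-speech periods is what provides the consistency named in the title: the noise floor heard between utterances matches the one heard under speech.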