G10L2025/783

METHOD, APPARATUS, AND SYSTEM FOR VOICE ACTIVITY DETECTION BASED ON RADIO SIGNALS

Methods, apparatus and systems for radio-based voice activity detection are described. In one example, a described system comprises: a transmitter configured to transmit a radio signal through a wireless channel of a venue; a receiver configured to receive the radio signal through the wireless channel, wherein the wireless channel is impacted by a voice activity of a target voice source in the venue; and a processor. The processor is configured for: computing a time series of channel information (CI) of the wireless channel based on the radio signal, and detecting the voice activity of the target voice source based on the time series of CI (TSCI) of the wireless channel, without using any media signal.

METHODS, APPARATUS, AND NON-TRANSITORY COMPUTER READABLE MEDIUM FOR AUDIO PROCESSING
20230080446 · 2023-03-16 ·

An audio processing method is provided. The method includes: obtaining to-be-processed audio acquired by an audio acquisition end; performing filtering processing on the to-be-processed audio to obtain a processing result, wherein the filtering processing is used for filtering out partial audio signal components from the to-be-processed audio, and frequencies of the partial audio signal components are lower than a preset threshold; extracting a plurality of speech frames within a first preset duration from the processing result; obtaining an energy variation amount of the plurality of speech frames; and determining a category of the to-be-processed audio based on the energy variation amount.
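The steps in this abstract can be sketched as follows. This is a minimal illustration, not the claimed implementation: the first-order high-pass filter, the 20 ms frame length, the use of max-minus-min frame energy as the "energy variation amount", the threshold values, and the category names "varying" and "steady" are all assumptions chosen for the example.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz):
    """First-order RC high-pass filter: attenuates components below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_x = prev_y = 0.0
    for i, x in enumerate(samples):
        y = x if i == 0 else alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def classify_audio(samples, sample_rate, cutoff_hz=100.0, duration_s=1.0,
                   frame_ms=20, variation_threshold=0.01):
    """Filter, take frames from the first duration_s seconds, compare energy spread."""
    filtered = high_pass(samples, sample_rate, cutoff_hz)
    window = filtered[:int(duration_s * sample_rate)]
    energies = frame_energies(window, int(sample_rate * frame_ms / 1000))
    if not energies:
        raise ValueError("audio is shorter than one frame")
    # Assumption: the "energy variation amount" is the spread of frame energies.
    variation = max(energies) - min(energies)
    return "varying" if variation > variation_threshold else "steady"
```

A constant-amplitude tone yields nearly identical frame energies and falls in the "steady" category, while audio whose loudness changes sharply within the window falls in the "varying" category.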

SOUND SOURCE LOCALIZATION MODEL TRAINING AND SOUND SOURCE LOCALIZATION METHOD, AND APPARATUS

The present disclosure provides a method for training a sound source localization model and a sound source localization method, and relates to the field of artificial intelligence technologies such as voice processing and deep learning. The method for training the sound source localization model includes: obtaining a sample audio according to an audio signal including a wake-up word; extracting an audio feature of at least one audio frame in the sample audio, and marking a direction label and a mask label of the at least one audio frame; and training a neural network model by using the audio feature of the at least one audio frame and the direction label and the mask label of the at least one audio frame, to obtain a sound source localization model. The sound source localization method includes: acquiring a to-be-processed audio signal, and extracting an audio feature of each audio frame in the to-be-processed audio signal; inputting the audio feature of each audio frame into a sound source localization model, to obtain sound source direction information outputted by the sound source localization model for each audio frame; determining a wake-up word endpoint frame in the to-be-processed audio signal; and obtaining a sound source direction of the to-be-processed audio signal according to sound source direction information corresponding to the wake-up word endpoint frame.

DEEP NEURAL NETWORKS-BASED VOICE-AI PLUGIN FOR HUMAN-COMPUTER INTERFACES
20230079775 · 2023-03-16 ·

A method for implementing channels with a voice-based artificial intelligence (AI) functionality that enables human users to interact and transact with a business entity through one or more natural voice conversations, the method comprising: implementing user identification and authentication on the voice input from the voice channel; generating a transcription of the voice input; passing the transcription to a natural language understanding (NLU) engine and, with the NLU engine, implementing a machine learning algorithm for intent, entity, and context identification on the input; with a dialogue manager, understanding the conversation state and predicting the right action and response based on the intent, entity, context, and user emotion; with a natural language generation module that comprises a natural language generation functionality, implementing computerized voice generation, generating a voice output comprising a relevant response to the voice input, and providing a voice output channel; and providing the voice output to the user.

Privacy device for smart speakers
11606658 · 2023-03-14 ·

Systems, apparatuses, and methods are described for a privacy blocking device configured to prevent receipt, by a listening device, of video and/or audio data until a trigger occurs. A blocker may be configured to prevent receipt of video and/or audio data by one or more microphones and/or one or more cameras of a listening device. The blocker may use the one or more microphones, the one or more cameras, and/or one or more second microphones and/or one or more second cameras to monitor for a trigger. The blocker may process the data. Upon detecting the trigger, the blocker may transmit data to the listening device. For example, the blocker may transmit all or a part of a spoken phrase to the listening device.

SPEECH SEGMENTATION BASED ON COMBINATION OF PAUSE DETECTION AND SPEAKER DIARIZATION

An apparatus includes at least one processor to, in response to a request to perform speech-to-text conversion: perform a pause detection technique including analyzing speech audio to identify pauses, and analyzing lengths of the pauses to identify likely sentence pauses; perform a speaker diarization technique including dividing the speech audio into fragments, analyzing vocal characteristics of speech sounds of each fragment to identify a speaker of a set of speakers, and identifying instances of a change in speakers between each temporally consecutive pair of fragments to identify likely speaker changes; and perform speech-to-text operations including dividing the speech audio into segments based on at least the likely sentence pauses and likely speaker changes, using at least an acoustic model with each segment to identify likely speech sounds in the speech audio, and generating a transcript of the speech audio based at least on the likely speech sounds.
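The way pause detection and speaker diarization combine into segment boundaries can be sketched as follows. This is an illustrative sketch only: it assumes per-frame energies and per-frame speaker labels are already available and aligned, and the silence threshold, the minimum pause length, and the choice to cut at the midpoint of a pause are hypothetical parameters, not details from the abstract.

```python
def find_pauses(energies, silence_threshold):
    """Return (start_frame, length_in_frames) for each run of low-energy frames."""
    pauses = []
    start = None
    for i, energy in enumerate(energies):
        if energy < silence_threshold:
            if start is None:
                start = i
        elif start is not None:
            pauses.append((start, i - start))
            start = None
    if start is not None:
        pauses.append((start, len(energies) - start))
    return pauses


def segment_boundaries(energies, speaker_labels,
                       silence_threshold=0.01, min_sentence_pause=3):
    """Combine likely sentence pauses and likely speaker changes into cut points."""
    # Pause detection: keep only pauses long enough to be likely sentence pauses.
    sentence_pauses = [
        (start, length)
        for start, length in find_pauses(energies, silence_threshold)
        if length >= min_sentence_pause
    ]
    # Assumption: cut at the midpoint of each likely sentence pause.
    cuts = {start + length // 2 for start, length in sentence_pauses}
    # Speaker diarization: a change between consecutive labels is a likely speaker change.
    cuts |= {
        i for i in range(1, len(speaker_labels))
        if speaker_labels[i] != speaker_labels[i - 1]
    }
    return sorted(cuts)
```

Each resulting segment would then be passed to the acoustic model separately; that stage is outside the scope of this sketch.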

SYSTEM AND METHOD FOR REAL-TIME DETECTION OF USER'S ATTENTION SOUND BASED ON NEURAL SIGNALS, AND AUDIO OUTPUT DEVICE USING THE SAME

A system for detecting a sound to which a user is attending based on neural signals includes an audio signal collection unit to collect audio signals including two or more sounds from a surrounding environment around the user; a neural signal collection unit to collect the neural signals of the user; an attended sound detection unit to analyze correlations between the two or more sounds included in the audio signals and the neural signals of the user in real time and determine the sound to which the user is attending based on the correlations; a database unit to store a result of the detection; and an output unit to select and output the stored individual audio signal or output the result of detecting the attended sound in real time according to the presence or absence of the audio signal in the surrounding environment around the user.
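The correlation step can be illustrated with a deliberately simplified sketch. It assumes each candidate sound and the neural signal have already been reduced to equal-length sequences of numbers (for example, amplitude envelopes), and it uses a plain Pearson correlation; the abstract does not specify the correlation measure, so both choices are assumptions, as are all names below.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def attended_sound(sound_envelopes, neural_signal):
    """Return the name of the sound most correlated with the neural signal."""
    correlations = {
        name: pearson(envelope, neural_signal)
        for name, envelope in sound_envelopes.items()
    }
    return max(correlations, key=correlations.get), correlations
```

In a real-time setting the same comparison would run over a sliding window of recent samples rather than over whole recordings.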

Speech endpointing based on word comparisons

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech endpointing based on word comparisons are described. In one aspect, a method includes the actions of obtaining a transcription of an utterance. The actions further include determining, as a first value, a quantity of text samples in a collection of text samples that (i) include terms that match the transcription, and (ii) do not include any additional terms. The actions further include determining, as a second value, a quantity of text samples in the collection of text samples that (i) include terms that match the transcription, and (ii) include one or more additional terms. The actions further include classifying the utterance as a likely incomplete utterance or not a likely incomplete utterance based at least on comparing the first value and the second value.
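The two counts described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: a text sample "matches" when its leading terms equal the transcription's terms after lowercasing and whitespace splitting, and the utterance is classified as likely incomplete when the second value exceeds the first. The abstract only says the classification is based on comparing the two values, so the specific comparison rule is hypothetical.

```python
def classify_utterance(transcription, text_samples):
    """Count exact matches versus longer matches and compare the two values."""
    terms = transcription.lower().split()
    first_value = 0   # samples that match and contain no additional terms
    second_value = 0  # samples that match and contain one or more additional terms
    for sample in text_samples:
        sample_terms = sample.lower().split()
        if sample_terms == terms:
            first_value += 1
        elif sample_terms[:len(terms)] == terms:
            second_value += 1
    # Assumption: more "longer" samples than exact ones suggests the speaker is not done.
    label = ("likely incomplete" if second_value > first_value
             else "not likely incomplete")
    return label, first_value, second_value
```

For example, if a collection contains "what is" once but "what is the weather" and "what is the time" as well, the transcription "what is" is classified as likely incomplete, so an endpointer could wait longer before closing the microphone.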

Apparatus, system and method for directing voice input in a controlling device
11631403 · 2023-04-18 ·

A system and method for controlling a controllable appliance resident in an environment which includes a device adapted to receive speech input. The system and method establish a noise threshold for the environment in which the device is operating, receive at the device a speech input, determine a noise level for the environment at the time the speech input is received by the device, compare the determined noise level to the established noise threshold, and cause one or more commands to be automatically issued to the controllable appliance to thereby cause the controllable appliance to transition from a first volume level to a second volume level that is less than the first volume level when the determined noise level for the environment is greater than the established noise threshold for the environment.
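The threshold comparison can be sketched as follows. The abstract does not say how the noise threshold is established or what the commands look like, so the "mean ambient level plus a margin" rule, the decibel units, and the `SET_VOLUME` command tuple are all hypothetical choices made for illustration.

```python
from statistics import mean


def establish_noise_threshold(ambient_levels_db, margin_db=6.0):
    """Assumed rule: threshold is the mean ambient level plus a fixed margin."""
    return mean(ambient_levels_db) + margin_db


def volume_commands(noise_level_db, noise_threshold_db,
                    first_volume, second_volume):
    """Return commands to issue when speech input arrives in a noisy environment."""
    if second_volume >= first_volume:
        raise ValueError("second volume level must be less than the first")
    if noise_level_db > noise_threshold_db:
        # Too noisy to hear the speaker well: lower the appliance volume.
        return [("SET_VOLUME", second_volume)]
    return []
```

When the measured noise level does not exceed the threshold, no command is issued and the appliance stays at its first volume level.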

Automatic Multi-Camera Production In Video Conferencing
20230036861 · 2023-02-02 ·

A video conference system has a multi-camera setup to allow automatic switching of video feeds without intervention from a video conference host. The video conference system obtains video feeds from multiple cameras at the same location and displays the video feed from one of the cameras in a primary area of a display. The video conference system determines a relevance score for each of the video feeds. The relevance scores are associated with a participant engagement score. The video conference system automatically displays a video feed in the primary area of the display based on a corresponding relevance score of the video feed.
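The selection step can be sketched as picking the feed with the highest relevance score. How the scores are computed from participant engagement is not described in the abstract, so this sketch takes the scores as given; the function name and the rule of keeping the current feed on a tie are assumptions.

```python
def select_primary_feed(relevance_scores, current_primary=None):
    """Return the feed to show in the primary area, given a score per feed."""
    if not relevance_scores:
        raise ValueError("at least one video feed is required")
    best_score = max(relevance_scores.values())
    # Assumption: avoid needless switching by keeping the current feed on a tie.
    if relevance_scores.get(current_primary) == best_score:
        return current_primary
    return max(relevance_scores, key=relevance_scores.get)
```

Calling this each time the scores are refreshed gives automatic switching without intervention from the host.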