G10L2025/783

Learning device, voice activity detector, and method for detecting voice activity

A likelihood of voice, which serves as a discrimination measure between noise and voice, is corrected using a Gaussian mixture model of noise learned over a time section in which the input signal is noise, and voice activity is detected on the basis of the corrected likelihood of voice.
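The abstract does not give the correction formula, but one plausible reading is that the raw voice likelihood of each frame is adjusted by how well the learned noise GMM explains that frame. The sketch below assumes a diagonal-covariance GMM and a simple subtractive correction; the parameter layout and threshold are illustrative, not from the patent.

```python
import numpy as np

def noise_gmm_loglik(frame_feat, means, variances, weights):
    # Log-likelihood of one feature frame under a diagonal-covariance
    # Gaussian mixture model of noise (hypothetical parameterization:
    # means/variances are (K, D), weights is (K,)).
    diff = frame_feat - means
    exps = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=1)
    return float(np.log(np.sum(weights * np.exp(exps))))

def corrected_voice_likelihood(voice_loglik, frame_feat, means, variances, weights):
    # One plausible "correction": subtract how well the noise model
    # explains the frame, so frames that look like learned noise are
    # penalized. The patent does not specify the exact formula.
    return voice_loglik - noise_gmm_loglik(frame_feat, means, variances, weights)

def detect_voice(frames, voice_logliks, means, variances, weights, threshold=0.0):
    # Frame-wise voice activity decision on the corrected likelihood.
    return [corrected_voice_likelihood(v, f, means, variances, weights) > threshold
            for f, v in zip(frames, voice_logliks)]
```

A frame close to the noise model's mean receives a much smaller corrected likelihood than a frame far from it, which is the behavior the abstract describes.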

Methods and systems for correcting, based on speech, input generated using automatic speech recognition
11521608 · 2022-12-06

Methods and systems for correcting, based on subsequent second speech, an error in an input generated from first speech using automatic speech recognition, without an explicit indication in the second speech that a user intended to correct the input with the second speech, include determining that a time difference between when search results in response to the input were displayed and when the second speech was received is less than a threshold time, and based on the determination, correcting the input based on the second speech. The methods and systems also include determining that a difference in acceleration of a user input device, used to input the first speech and second speech, between when the search results in response to the input were displayed and when the second speech was received is less than a threshold acceleration, and based on the determination, correcting the input based on the second speech.
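The two determinations in the abstract reduce to a pair of threshold tests. This sketch assumes concrete threshold values and a flat function signature, neither of which is specified in the patent.

```python
TIME_THRESHOLD_S = 5.0   # assumed value; the patent leaves the threshold unspecified
ACCEL_THRESHOLD = 0.5    # assumed value, in arbitrary acceleration units

def should_correct(results_shown_at, second_speech_at,
                   accel_at_results, accel_at_second_speech):
    # Treat the second utterance as an implicit correction when it arrives
    # quickly after the search results were displayed and the input device
    # was held steadily (small change in acceleration) in between.
    quick_followup = (second_speech_at - results_shown_at) < TIME_THRESHOLD_S
    steady_device = abs(accel_at_second_speech - accel_at_results) < ACCEL_THRESHOLD
    return quick_followup and steady_device
```

Only when both tests pass would the system rewrite the first input using the second speech.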

Neural network accelerator with compact instruction set
11520561 · 2022-12-06

Described herein is a neural network accelerator with a set of neural processing units and an instruction set for execution on the neural processing units. The instruction set is a compact instruction set including various compute and data move instructions for implementing a neural network. Among the compute instructions are an instruction for performing a fused operation comprising sequential computations, one of which involves matrix multiplication, and an instruction for performing an elementwise vector operation. The instructions in the instruction set are highly configurable and can handle data elements of variable size. The instructions also implement a synchronization mechanism that allows asynchronous execution of data move and compute operations across different components of the neural network accelerator as well as between multiple instances of the neural network accelerator.

Anchored speech detection and speech recognition

A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component which may allow the ASR component to focus its processing on the desired speech.
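The patent classifies frames with a trained neural network against an RNN-encoded reference vector. As an illustrative simplification only, the sketch below replaces both networks with fixed embeddings and cosine similarity; the `0.8` threshold and the label names are assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_frames(reference_vec, frame_vecs, sim_threshold=0.8):
    # Stand-in for the trained classifier: label each frame embedding as
    # desired speech when it lies close to the reference speaker embedding.
    # The patent uses an RNN encoder plus a neural classifier; cosine
    # similarity here is only a sketch of the frame-by-frame labeling.
    return ["desired" if cosine(reference_vec, f) >= sim_threshold else "other"
            for f in frame_vecs]
```

The resulting per-frame labels are what would be handed to the ASR component so it can focus on the desired speech.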

Apparatus and method for voice event detection

A voice event detection apparatus is disclosed. The apparatus comprises a vibration-to-digital converter and a computing unit. The vibration-to-digital converter is configured to convert an input audio signal into vibration data. The computing unit is configured to trigger a downstream module according to a sum of vibration counts of the vibration data over a number X of frames. In an embodiment, the voice event detection apparatus is capable of correctly distinguishing a wake phoneme from the input vibration data so as to trigger a downstream module of a computing system, reducing the power consumption of the computing system.
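The triggering rule itself is simple: sum vibration counts over the last X frames and compare against a threshold. In this sketch the "vibration count" is modeled as level crossings per frame, and both X and the threshold are assumed values not given in the abstract.

```python
def vibration_counts(samples, level=0.1):
    # Hypothetical vibration-to-digital conversion: count level crossings
    # within one frame as a proxy for vibration activity.
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev < level) != (cur < level):
            count += 1
    return count

def should_trigger(frames, x=4, threshold=20):
    # Trigger the downstream module when the summed vibration counts over
    # the last X frames reach a threshold (X and threshold are assumed).
    recent = frames[-x:]
    return sum(vibration_counts(f) for f in recent) >= threshold
```

Because only counters and a comparison run while idle, the downstream (power-hungry) module stays off until real activity appears, which is the power-saving point of the design.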

Method and system for controlling speaker tracking in a video conferencing system
11589005 · 2023-02-21

A video conferencing device for video conferencing between at least one local participant and a remote participant includes a video camera, a microphone array, and a speaker tracker. The video camera provides a local video input signal. The microphone array provides a local audio input signal. The speaker tracker is configured to identify a local speaker from the at least one local participant using a sound source localizer. The video conferencing device processes the local video input signal without the local speaker, based on the video conferencing device receiving a signal from a computing system, the signal dependent on a loopback audio output signal indicating that the remote participant is speaking.

Multithreaded speech-to-text processing

An apparatus includes a processor to: receive a request to perform speech-to-text conversion of a speech data set; perform pause detection to identify a set of likely sentence pauses and/or a speaker diarization technique to identify a set of likely speaker changes; based on the set of likely sentence pauses and/or the set of likely speaker changes, divide the speech data set into data segments representing speech segments; use an acoustic model with the data segments to derive sets of probabilities of speech sounds uttered; store the sets of probabilities in temporal order within a buffer queue; distribute the sets of probabilities from the buffer queue in temporal order among threads of a thread pool; and within each thread, and based on set(s) of probabilities, derive one candidate word and select either the candidate word or an alternate candidate word derived from a language model as the next word most likely spoken.
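The buffer-queue-plus-thread-pool part of that pipeline can be sketched in a few lines. Here the acoustic/language-model word choice is abstracted into a caller-supplied `pick_word` function, and segment indices are used to restore temporal order after parallel decoding; both are assumptions about structure, not details from the patent.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def decode_segments(prob_sets, pick_word, workers=4):
    # Store per-segment probability sets in temporal order in a buffer
    # queue, distribute them among worker threads, then reassemble the
    # decoded words in temporal order by segment index.
    buf = queue.Queue()
    for idx, probs in enumerate(prob_sets):
        buf.put((idx, probs))

    items = []
    while not buf.empty():
        items.append(buf.get())

    def worker(item):
        idx, probs = item
        # The candidate-vs-language-model selection would live inside
        # pick_word; here it is a placeholder supplied by the caller.
        return idx, pick_word(probs)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(worker, items))
    results.sort(key=lambda r: r[0])
    return [word for _, word in results]
```

Tagging each set of probabilities with its index is what lets the threads run asynchronously while the final transcript still comes out in temporal order.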

Microphone having a digital output determined at different power consumption levels
11617048 · 2023-03-28

An acoustic device is described and includes an acoustic sensor element configured to sense acoustic energy and produce an output signal, and a threshold detector circuit including a switch having an input coupled to the output of the acoustic sensor element to receive the output signal, a control port that receives a control signal, and first and second output ports; a first channel including an analog-to-digital converter that operates at a first power level; a second analog-to-digital converter that operates at a second, higher power level relative to the first power level; and a threshold level detector that receives an output from the first analog-to-digital converter to produce the control signal, which has a first state that causes the switch to feed the output signal from the acoustic sensor element to the second analog-to-digital converter when the first digitized output signal meets a threshold criterion.
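The control flow of the two-channel scheme can be modeled behaviorally: a coarse, always-on ADC watches the signal, and once its output crosses the threshold, the switch routes the sensor output to the fine, higher-power ADC. The 4-bit/12-bit resolutions and the threshold below are assumptions for illustration; the patent specifies only "first" and "second, higher" power levels.

```python
def process_samples(samples, threshold=0.5):
    # Behavioral model of the two-ADC power-gating scheme. Inputs are
    # normalized to [0, 1]; resolutions are assumed, not from the patent.
    def low_power_adc(x):
        return round(x * 15) / 15        # coarse 4-bit quantizer, always on
    def high_power_adc(x):
        return round(x * 4095) / 4095    # fine 12-bit quantizer, gated

    out = []
    high_res_on = False
    for s in samples:
        coarse = low_power_adc(s)
        if coarse >= threshold:
            high_res_on = True           # control signal flips the switch
        out.append(high_power_adc(s) if high_res_on else coarse)
    return out
```

Until the threshold criterion is met, only the coarse (cheap) conversion runs; afterward the output comes from the high-resolution converter, trading power for fidelity exactly when the signal warrants it.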

Jump counting method for jump rope
11484752 · 2022-11-01 ·

A jump counting method for jump rope is provided. The jump counting method comprises: S1, obtaining original video data of a jump rope movement, and extracting audio data and image data from the original video data; S2, calculating the number of jumps of the rope jumper according to audio information and image information extracted from the audio data and the image data; and S3, outputting and displaying the calculation result.
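For the audio side of step S2, one plausible approach is to count energy peaks in the audio envelope, since each rope pass produces an audible tick. The threshold-with-refractory-window logic below is an assumed implementation; the patent does not disclose how audio and image information are combined.

```python
def count_jumps_from_audio(envelope, threshold=0.5, refractory=3):
    # Hypothetical audio-side counter: each rope "tick" appears as an
    # energy peak; count threshold crossings with a refractory window so
    # one tick is not counted twice across adjacent samples.
    jumps, cooldown = 0, 0
    for v in envelope:
        if cooldown > 0:
            cooldown -= 1
        elif v >= threshold:
            jumps += 1
            cooldown = refractory
    return jumps
```

A full implementation would cross-check this audio count against an image-based count before displaying the result, per step S3.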

Audio system for spatializing virtual sound sources
20230093585 · 2023-03-23

An audio system for spatializing virtual sound sources is described. A microphone array of the audio system is configured to monitor sound in a local area. A controller of the audio system identifies sound sources within the local area using the monitored sound from the microphone array and determines their locations. The controller of the audio system generates a target position for a virtual sound source based on one or more constraints. The one or more constraints include that the target position be at least a threshold distance away from each of the determined locations of the identified sound sources. The controller generates one or more sound filters based in part on the target position to spatialize the virtual sound source. A transducer array of the audio system presents spatialized audio including the virtual sound source content based in part on the one or more sound filters.
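The core constraint, that the target position lie at least a threshold distance from every identified source, can be satisfied by rejection sampling over the candidate space. The 2-D search region, the minimum distance, and the attempt budget below are all assumptions; the patent works in a real 3-D local area and may combine further constraints.

```python
import math
import random

def pick_target_position(source_positions, min_dist=1.0, bounds=(-5.0, 5.0),
                         attempts=1000, rng=None):
    # Rejection-sample a 2-D target position at least min_dist away from
    # every identified sound source (simplified sketch of the constraint
    # in the abstract; all numeric parameters are assumed).
    rng = rng or random.Random(0)
    for _ in range(attempts):
        cand = (rng.uniform(*bounds), rng.uniform(*bounds))
        if all(math.dist(cand, s) >= min_dist for s in source_positions):
            return cand
    return None  # no admissible position found within the attempt budget
```

The chosen position would then feed the sound-filter generation stage that spatializes the virtual source for the transducer array.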