G10L25/00

Automatic isolation of multiple instruments from musical mixtures

A system, method and computer product for training a neural network system. The method comprises inputting an audio signal to the system to generate plural outputs f(X, Θ). The audio signal includes one or more of vocal content and/or musical instrument content, and each output f(X, Θ) corresponds to a respective one of the different content types. The method also comprises comparing individual outputs f(X, Θ) of the neural network system to corresponding target signals. For each compared output f(X, Θ), at least one parameter of the system is adjusted to reduce a result of the comparing performed for the output f(X, Θ), to train the system to estimate the different content types. In one example embodiment, the system comprises a U-Net architecture. After training, the system can estimate various different types of vocal and/or instrument components of an audio signal, depending on which type of component(s) the system is trained to estimate.
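The per-output training step the abstract describes can be sketched numerically. Below, each output f(X, θ) is compared with its target signal and a parameter is adjusted to reduce the result of that comparison (a squared error here). The scalar linear "network", the learning rate, and the gradient-descent update are illustrative assumptions; the patent's example embodiment is a U-Net, not this toy model.

```python
def train_step(x, thetas, targets, lr=0.1):
    """One update: thetas[i] maps input x to the estimate of content type i."""
    updated = []
    for theta, target in zip(thetas, targets):
        output = theta * x                 # f(X, theta) for one content type
        error = output - target            # result of the comparison
        grad = 2.0 * error * x             # d(error**2)/d(theta)
        updated.append(theta - lr * grad)  # adjust to reduce the comparison
    return updated

# Toy run: two content types (say, vocals and accompaniment) in one mixture.
thetas = [0.0, 0.0]
for _ in range(50):
    thetas = train_step(1.0, thetas, targets=[0.7, 0.3])
```

Each output gets its own comparison and its own parameter update, which is the mechanism by which the system learns to estimate different content types from a single input.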

Devices, systems, and methods for real-time surveillance of audio streams and uses therefor

Various examples are provided for surveillance of an audio stream. In one example, a method includes identifying the presence or absence of a sound type of interest at a location during a time period; selecting the sound type from a library of sound type information to provide a collection of sound type information; incorporating the collection on a device proximate to the location; acquiring an audio stream from the location via the device to provide a locational audio stream; analyzing the locational audio stream to determine whether a sound type in the collection is present in the audio stream; and generating a notification to a user or computer if a sound type in the collection is present. The device can both acquire and process the audio stream. In another example, a bulk sound type information library can be generated by identifying sound types of interest and including them based upon a confidence level.
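The surveillance flow above can be sketched as follows: a collection of sound types is selected from a larger library, incorporated on a device at the location, and the locational audio stream is analyzed so that a notification is generated whenever a sound type in the collection is present. The scalar "signature" per sound type and the tolerance-based match are placeholder assumptions standing in for a real detector.

```python
def analyze_stream(stream_signatures, collection, tolerance=0.05):
    """Return (time_index, sound_type) notifications for detected sounds."""
    notifications = []
    for t, sig in enumerate(stream_signatures):
        for name, ref_sig in collection.items():
            if abs(sig - ref_sig) <= tolerance:
                notifications.append((t, name))
    return notifications

# A bulk library of sound type information, and the subset of interest.
library = {"glass_break": 0.91, "smoke_alarm": 0.42, "dog_bark": 0.17}
collection = {k: library[k] for k in ("glass_break", "smoke_alarm")}
alerts = analyze_stream([0.18, 0.43, 0.90], collection)
```

Restricting the device to the selected collection, rather than the full library, is what keeps the on-device matching cheap.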

Mask estimation apparatus, model learning apparatus, sound source separation apparatus, mask estimation method, model learning method, sound source separation method, and program

A mask estimation apparatus for estimating mask information, which specifies a mask used to extract the signal of a specific sound source from an input audio signal, includes a converter that converts the input audio signal into embedded vectors of a predetermined dimension using a trained neural network model, and a mask calculator that calculates the mask information by fitting the embedded vectors to a Gaussian mixture model.
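The mask-calculation step can be illustrated with a small example: embedded vectors (reduced to scalars here for brevity) are fit with a Gaussian mixture via EM, and the mask is the posterior probability that each point belongs to the target source's component. The one-dimensional embeddings, the two-component mixture, and the initialization are all assumptions; only the "fit a Gaussian mixture, read off the mask" step comes from the abstract.

```python
import math

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_mask(embeddings, iters=20):
    # Initialise the two components at the extremes with small variance.
    mus = [min(embeddings), max(embeddings)]
    vars_ = [0.01, 0.01]
    ws = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of component 0 for each embedded vector.
        resp = []
        for x in embeddings:
            p = [ws[k] * gaussian(x, mus[k], vars_[k]) for k in (0, 1)]
            resp.append(p[0] / (p[0] + p[1]))
        # M-step: re-estimate means, variances, and mixture weights.
        for k, r in ((0, resp), (1, [1.0 - ri for ri in resp])):
            n = sum(r)
            mus[k] = sum(ri * x for ri, x in zip(r, embeddings)) / n
            vars_[k] = max(sum(ri * (x - mus[k]) ** 2
                               for ri, x in zip(r, embeddings)) / n, 1e-6)
            ws[k] = n / len(embeddings)
    # The soft mask: P(each point belongs to component 0).
    mask = []
    for x in embeddings:
        p = [ws[k] * gaussian(x, mus[k], vars_[k]) for k in (0, 1)]
        mask.append(p[0] / (p[0] + p[1]))
    return mask

mask = fit_mask([0.1, 0.2, 0.15, 0.9, 0.95, 0.85])
```

Points embedded near 0.15 receive a mask value near 1 (target source) and points near 0.9 a value near 0, which is exactly the soft mask a separation stage would apply.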

Hotword-based speaker recognition
11557301 · 2023-01-17

Systems, methods performed by data processing apparatus and computer storage media encoded with computer programs for receiving an utterance from a user in a multi-user environment, each user having an associated set of available resources, determining that the received utterance includes at least one predetermined word, comparing speaker identification features of the uttered predetermined word with speaker identification features of each of a plurality of previous utterances of the predetermined word, the plurality of previous predetermined word utterances corresponding to different known users in the multi-user environment, attempting to identify the user associated with the uttered predetermined word as matching one of the known users in the multi-user environment, and based on a result of the attempt to identify, selectively providing the user with access to one or more resources associated with a corresponding known user.
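The matching step can be sketched as follows: speaker-identification features extracted from the uttered predetermined word (hotword) are compared with stored features from previous hotword utterances by known users, and access is granted only when the best match is confident enough. Cosine similarity, the 3-dimensional feature vectors, and the 0.8 threshold are illustrative assumptions, not the patent's method.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify_speaker(utterance_features, enrolled, threshold=0.8):
    """Return the matching known user, or None when no match is confident."""
    best_user, best_score = None, threshold
    for user, ref_features in enrolled.items():
        score = cosine(utterance_features, ref_features)
        if score > best_score:
            best_user, best_score = user, score
    return best_user  # None -> selectively deny access to user resources

enrolled = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
user = identify_speaker([0.88, 0.15, 0.25], enrolled)
```

Returning `None` on a failed attempt is what makes the resource access selective: only a speaker matched to a known user in the multi-user environment reaches that user's resources.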

Low-complexity tonality-adaptive audio signal quantization

The invention provides an audio encoder for encoding an audio signal so as to produce therefrom an encoded signal, the audio encoder including: a framing device configured to extract frames from the audio signal; a quantizer configured to map spectral lines of a spectrum derived from a frame of the audio signal to quantization indices, wherein the quantizer has a dead-zone in which input spectral lines are mapped to quantization index zero; and a control device configured to modify the dead-zone; wherein the control device includes a tonality calculating device configured to calculate at least one tonality-indicating value for at least one spectral line or at least one group of spectral lines, and the control device is configured to modify the dead-zone for the at least one spectral line or the at least one group of spectral lines depending on the respective tonality-indicating value.
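The tonality-adaptive dead-zone can be illustrated directly: each spectral line is mapped to a quantization index, and lines falling inside the dead-zone around zero map to index zero. Here the dead-zone half-width narrows for tonal lines and widens for noise-like lines; the concrete widths and the tonality scale in [0, 1] are illustrative assumptions, not values from the patent.

```python
def quantize_line(value, step=1.0, tonality=0.0):
    """Map one spectral line to a quantization index."""
    # A tonal line (tonality near 1) gets a narrower dead-zone, so small
    # but perceptually important tonal components survive quantization;
    # a noise-like line (tonality near 0) gets a wider dead-zone and is
    # more likely to be quantized to zero, saving bits.
    dead_zone = step * (1.0 - 0.5 * tonality)  # half-width around zero
    if abs(value) < dead_zone:
        return 0
    sign = 1 if value > 0 else -1
    return sign * int((abs(value) - dead_zone) / step + 1.0)

tonal_index = quantize_line(0.8, tonality=1.0)  # dead-zone 0.5: survives
noise_index = quantize_line(0.8, tonality=0.0)  # dead-zone 1.0: zeroed
```

The same input amplitude thus quantizes to zero or to a nonzero index depending only on the tonality-indicating value, which is the low-complexity adaptation the title refers to.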

Speech style transfer

Computer-implemented methods for speech synthesis are provided. A speech synthesizer may be trained to generate synthesized audio data that corresponds to words uttered by a source speaker but rendered according to the speech characteristics of a target speaker. The speech synthesizer may be trained on time-stamped phoneme sequences, pitch contour data, and speaker identification data, and may include a voice modeling neural network and a conditioning neural network.
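A small sketch shows how the three training inputs named above might be combined into per-frame conditioning features for the voice modeling network: the time-stamped phoneme sequence is expanded to one phoneme id per audio frame, then paired with the pitch contour and the target speaker's id. The 100 Hz frame rate and the flat (phoneme, pitch, speaker) layout are assumptions; the patent does not describe the conditioning at this level of detail.

```python
def conditioning_features(phonemes, pitch_contour, speaker_id, frame_rate=100):
    """phonemes: list of (phoneme_id, start_sec, end_sec) entries."""
    frames = []
    for phoneme_id, start, end in phonemes:
        for frame in range(round(start * frame_rate), round(end * frame_rate)):
            # Pair each frame's phoneme id with its pitch value and the
            # identity of the target speaker whose characteristics are wanted.
            pitch = pitch_contour[frame] if frame < len(pitch_contour) else 0.0
            frames.append((phoneme_id, pitch, speaker_id))
    return frames

# 20 ms of phoneme 7, then 10 ms of phoneme 3, rendered as speaker 2.
features = conditioning_features(
    [(7, 0.00, 0.02), (3, 0.02, 0.03)],
    pitch_contour=[220.0, 221.0, 219.0],
    speaker_id=2)
```

Swapping only the speaker id while keeping the phonemes and pitch fixed is what lets the same utterance be re-rendered in a different voice.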

Audio coding method based on spectral recovery scheme

The inventive concept relates to an audio coding method that applies CNN-based frequency-spectrum recovery. The encoder transmits only a part of the frequency spectral coefficients generated in transform coding to the decoder, and the decoder recovers the coefficients that were not transmitted. In addition, the signs of the frequency spectral coefficients are transmitted from the encoder to the decoder according to a sign transmission rule.
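A hedged sketch of the scheme: the encoder transmits only a subset of the frequency spectral coefficients plus the signs of the omitted ones, and the decoder recovers the missing magnitudes and applies the transmitted signs. Averaging the neighbouring kept magnitudes stands in for the patent's CNN-based recovery, and the keep-every-other-coefficient rule is an assumption.

```python
def encode(coeffs):
    # Transmit even-indexed coefficients in full; for odd-indexed ones,
    # transmit only the sign (the assumed sign transmission rule).
    kept = {i: c for i, c in enumerate(coeffs) if i % 2 == 0}
    signs = {i: (1 if c >= 0 else -1)
             for i, c in enumerate(coeffs) if i % 2 == 1}
    return kept, signs, len(coeffs)

def decode(kept, signs, n):
    out = [0.0] * n
    for i, c in kept.items():
        out[i] = c
    for i, sign in signs.items():
        # Placeholder recovery: average of the neighbouring kept magnitudes.
        magnitude = (abs(kept.get(i - 1, 0.0)) + abs(kept.get(i + 1, 0.0))) / 2.0
        out[i] = sign * magnitude  # apply the transmitted sign
    return out

kept, signs, n = encode([1.0, -0.9, 0.8, 0.7, -0.6])
recovered = decode(kept, signs, n)
```

Transmitting the signs separately matters because a magnitude predictor (whether an average or a CNN) has no reliable way to infer them from neighbouring coefficients.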

Voice controlled system

A distributed voice controlled system has a primary assistant and at least one secondary assistant. The primary assistant has a housing that holds one or more microphones, one or more speakers, and various computing components. The secondary assistant is similar in structure but lacks speakers. The voice controlled assistants perform transactions and other functions primarily through verbal interactions with a user. The assistants within the system are coordinated and synchronized to perform acoustic echo cancellation, selection of the best audio input from among the assistants, and distributed processing.
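The best-input selection can be sketched as follows: every assistant in the distributed system captures the same utterance, and the coordinated system keeps the capture with the highest estimated signal-to-noise ratio. The energy-based SNR estimate and the per-device noise-floor values are assumptions; the patent does not pin the selection down to this metric.

```python
import math

def snr_db(samples, noise_floor):
    # Mean signal energy relative to the device's noise floor, in dB.
    energy = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(energy / noise_floor)

def select_best_input(captures):
    """captures: {assistant: (samples, noise_floor)} -> name of best input."""
    return max(captures, key=lambda name: snr_db(*captures[name]))

captures = {
    "primary":   ([0.40, -0.50, 0.45], 0.01),  # close to the talker
    "secondary": ([0.10, -0.12, 0.11], 0.01),  # farther away, weaker pickup
}
best = select_best_input(captures)
```

Because the secondary assistant has no speakers, its capture needs no echo cancellation of its own playback, but it can still win the selection whenever the user is closer to it.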

Methods and systems for passive wakeup of a user interaction device

The embodiments herein disclose methods and systems for passive wakeup of a user interaction device by configuring a dynamic wakeup time for the device. A method includes detecting an occurrence of at least one first non-voice event associated with at least one device present in an Internet of Things (IoT) environment, and detecting an occurrence of at least one successive event associated with the at least one device. The method includes estimating a contextual probability of the user initiating at least one interaction with the user interaction device on detecting the occurrence of at least one of the at least one first event and the at least one successive event. On determining that the estimated contextual probability is above a pre-defined threshold value, the method includes configuring the dynamic wakeup time to switch the user interaction device to a passive wakeup state.
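The wakeup decision can be sketched as follows: a first non-voice IoT event and its successive events feed a contextual-probability estimate of the user initiating an interaction, and the device switches to a passive wakeup state when that probability exceeds a pre-defined threshold. The per-event probabilities and the independence-based combination rule are illustrative assumptions, not the patent's estimator.

```python
# Assumed per-event probabilities that the event precedes a user interaction.
EVENT_PROBABILITY = {
    "door_opened": 0.4,      # first non-voice event
    "lights_on": 0.3,        # successive event on another IoT device
    "tv_standby_off": 0.2,
}

def contextual_probability(events):
    """P(interaction) = 1 - prod(1 - p_event), treating events as independent."""
    p_no_interaction = 1.0
    for event in events:
        p_no_interaction *= 1.0 - EVENT_PROBABILITY.get(event, 0.0)
    return 1.0 - p_no_interaction

def should_wake(events, threshold=0.5):
    """True -> configure the dynamic wakeup time and enter passive wakeup."""
    return contextual_probability(events) > threshold

wake = should_wake(["door_opened", "lights_on"])  # 1 - 0.6 * 0.7 = 0.58
```

A single weak event (say, the TV leaving standby) stays below the threshold, while the accumulation of a first event and a successive event crosses it, which is the point of estimating the probability contextually rather than per event.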