G10L21/00

Neural Network Audio Scene Classifier for Hearing Implants

An audio scene classifier classifies an audio input signal from an audio scene and includes a pre-processing neural network configured for pre-processing the audio input signal based on initial classification parameters to produce an initial scene classification, and a scene classifier neural network configured for processing the initial scene classification based on scene classification parameters to produce an audio scene classification output. The initial classification parameters reflect neural network training based on a first set of initial audio training data, and the scene classification parameters reflect neural network training on a second set of classification audio training data separate and different from the first set of initial audio training data. A hearing implant signal processor is configured to process the audio input signal and the audio scene classification output to generate stimulation signals to the hearing implant for perception by the patient as sound.
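The two-stage cascade described above can be sketched as two small networks run back to back. Everything concrete here (layer sizes, the tanh and softmax read-outs, random weights standing in for trained parameters) is an illustrative assumption, not the patent's architecture; the point is only that each stage carries its own separately trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class PreProcessingNetwork:
    """First stage: maps audio-input features to an initial scene
    classification. Its parameters would reflect training on the first
    set of initial audio training data."""
    def __init__(self, w, b):
        self.w, self.b = w, b
    def __call__(self, features):
        return np.tanh(features @ self.w + self.b)

class SceneClassifierNetwork:
    """Second stage: maps the initial classification to scene
    probabilities. Its parameters would reflect training on the second,
    separate set of classification audio training data."""
    def __init__(self, w, b):
        self.w, self.b = w, b
    def __call__(self, initial):
        return softmax(initial @ self.w + self.b)

rng = np.random.default_rng(0)
pre = PreProcessingNetwork(rng.standard_normal((8, 4)), np.zeros(4))
scene = SceneClassifierNetwork(rng.standard_normal((4, 3)), np.zeros(3))

features = rng.standard_normal(8)     # e.g. one frame of audio features
scene_probs = scene(pre(features))    # cascaded two-stage inference
print(scene_probs.sum())              # a proper distribution: sums to 1
```

In a hearing-implant pipeline, `scene_probs` would then steer how the signal processor generates stimulation signals for the chosen scene.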

Method and apparatus for predicting mouth-shape feature, and electronic device

A method and apparatus for predicting a mouth-shape feature, and an electronic device, are provided. A specific implementation of the method comprises: recognizing a phonetic posteriorgram (PPG) of a phonetic feature; and performing a prediction on the PPG by using a neural network model to predict a mouth-shape feature of the phonetic feature, the neural network model being obtained by training with training samples, an input thereof including a PPG and an output thereof including a mouth-shape feature, and the training samples including a PPG training sample and a mouth-shape feature training sample.
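A minimal sketch of the prediction step, assuming a small feed-forward regressor stands in for the trained neural network model. The PPG dimensionality (40 phoneme classes) and the mouth-shape feature size (6 values, e.g. lip-opening parameters) are illustrative assumptions.

```python
import numpy as np

def predict_mouth_shape(ppg, w1, b1, w2, b2):
    """Map one PPG frame (a posterior over phoneme classes) to a
    mouth-shape feature vector, as the trained model would."""
    hidden = np.maximum(ppg @ w1 + b1, 0.0)   # ReLU hidden layer
    return hidden @ w2 + b2                   # linear mouth-shape output

rng = np.random.default_rng(1)
w1, b1 = rng.standard_normal((40, 16)), np.zeros(16)   # stand-in weights
w2, b2 = rng.standard_normal((16, 6)), np.zeros(6)

# A PPG frame is a posterior distribution: non-negative, summing to 1.
ppg = rng.random(40)
ppg /= ppg.sum()

mouth = predict_mouth_shape(ppg, w1, b1, w2, b2)
print(mouth.shape)   # one 6-dimensional mouth-shape feature per PPG frame
```

Training would fit `w1..b2` on paired PPG and mouth-shape training samples; inference is just this forward pass per frame.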

Audio Generation Methods and Systems

A method of generating audio assets, comprising the steps of: receiving a plurality of input audio assets, converting each input audio asset into an input graphical representation, generating an input multi-channel image by stacking each input graphical representation in a separate channel of the image, feeding the input multi-channel image into a generative model to train the generative model and generate one or more output multi-channel images, each output multi-channel image comprising an output graphical representation, extracting the output graphical representations from each output multi-channel image and converting each output graphical representation into an output audio asset.
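The stacking step above can be sketched directly: each input asset is converted to a 2-D graphical representation and the representations become separate channels of one image. The toy magnitude "spectrogram" below is a stand-in; the claim does not fix a particular representation.

```python
import numpy as np

def to_graphical(audio, n_fft=16):
    """Toy magnitude spectrogram via short non-overlapping FFT frames."""
    frames = audio[: len(audio) // n_fft * n_fft].reshape(-1, n_fft)
    return np.abs(np.fft.rfft(frames, axis=1))

def stack_assets(assets):
    """Stack each asset's graphical representation in its own channel
    of a single multi-channel image."""
    reps = [to_graphical(a) for a in assets]
    return np.stack(reps, axis=-1)   # shape: (frames, bins, channels)

rng = np.random.default_rng(2)
assets = [rng.standard_normal(256) for _ in range(3)]  # three input assets
image = stack_assets(assets)
print(image.shape)   # (16, 9, 3): one channel per input audio asset
```

The resulting multi-channel image is what would be fed to the generative model for training, and the model's output images would be unstacked the same way.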

Audio Generation Methods and System

A method of generating audio assets, comprising the steps of: receiving an input multi-layered audio asset comprising a plurality of audio layers, generating an input multi-channel image, wherein each channel of the input multi-channel image comprises an input image representative of one of the audio layers, training a generative model on the input multi-channel image and implementing the trained generative model to generate an output multi-channel image, wherein each channel of the output multi-channel image comprises an output image representative of an output audio layer, and generating an output multi-layered audio asset based on a combination of output audio layers derived from the output images.
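For the layered variant, the reverse direction can be sketched as well: each channel of an output multi-channel image is treated as the representation of one audio layer, converted back to a waveform, and the layers are combined into one asset. The phase-less inverse of a magnitude representation and plain summation as the combiner are illustrative assumptions; a real system would use an invertible representation.

```python
import numpy as np

def channel_to_layer(channel, n_fft=16):
    """Invert one channel's magnitude frames back to audio,
    assuming zero phase (an illustrative simplification)."""
    frames = np.fft.irfft(channel, n=n_fft, axis=1)
    return frames.reshape(-1)

def combine_layers(image):
    """Derive one output audio layer per channel, then mix the layers
    into a single multi-layered output asset."""
    layers = [channel_to_layer(image[..., c])
              for c in range(image.shape[-1])]
    return np.sum(layers, axis=0)

rng = np.random.default_rng(3)
image = rng.random((16, 9, 4))   # 4 output channels = 4 audio layers
asset = combine_layers(image)
print(asset.shape)               # (256,): one combined multi-layer asset
```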

Hotword-based speaker recognition
11557301 · 2023-01-17

Systems, methods performed by data processing apparatus and computer storage media encoded with computer programs for receiving an utterance from a user in a multi-user environment, each user having an associated set of available resources, determining that the received utterance includes at least one predetermined word, comparing speaker identification features of the uttered predetermined word with speaker identification features of each of a plurality of previous utterances of the predetermined word, the plurality of previous predetermined word utterances corresponding to different known users in the multi-user environment, attempting to identify the user associated with the uttered predetermined word as matching one of the known users in the multi-user environment, and based on a result of the attempt to identify, selectively providing the user with access to one or more resources associated with a corresponding known user.
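The comparison-and-identification step can be sketched by assuming the speaker identification features are fixed-length embedding vectors compared by cosine similarity against each known user's previous hotword utterance. The similarity measure and the threshold value are illustrative assumptions, not the claimed method.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(utterance_features, enrolled, threshold=0.8):
    """Compare the uttered hotword's features against each known user's
    stored hotword features; return the best-matching user, or None if
    no comparison clears the threshold (access would be denied)."""
    best_user, best_score = None, threshold
    for user, ref in enrolled.items():
        score = cosine(utterance_features, ref)
        if score > best_score:
            best_user, best_score = user, score
    return best_user

rng = np.random.default_rng(4)
alice, bob = rng.standard_normal(32), rng.standard_normal(32)
enrolled = {"alice": alice, "bob": bob}   # known users' stored utterances

# An utterance close to Alice's stored hotword features matches Alice,
# which would grant access to Alice's associated resources.
probe = alice + 0.05 * rng.standard_normal(32)
print(identify_speaker(probe, enrolled))   # "alice"
```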

Adaptive multichannel dereverberation for automatic speech recognition

Utilizing an adaptive multichannel technique to mitigate reverberation present in received audio signals, prior to providing corresponding audio data to one or more additional component(s), such as automatic speech recognition (ASR) components. Implementations disclosed herein are “adaptive”, in that they utilize a filter, in the reverberation mitigation, that is online, causal and varies depending on characteristics of the input. Implementations disclosed herein are “multichannel”, in that a corresponding audio signal is received from each of multiple audio transducers (also referred to herein as “microphones”) of a client device, and the multiple audio signals (e.g., frequency domain representations thereof) are utilized in updating of the filter—and dereverberation occurs for audio data corresponding to each of the audio signals (e.g., frequency domain representations thereof) prior to the audio data being provided to ASR component(s) and/or other component(s).
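A heavily simplified sketch of such an online, causal, per-frequency adaptive filter: delayed past STFT frames from all microphones predict the reverberant part of the current frame, and the prediction is subtracted before the data would go to ASR. The filter length, prediction delay, and NLMS-style step size are illustrative assumptions, not the disclosed method.

```python
import numpy as np

class AdaptiveDereverb:
    """One frequency bin's adaptive dereverberation filter; a real system
    would run one such filter per STFT bin."""
    def __init__(self, n_mics, taps=4, delay=2, step=0.05):
        self.delay, self.taps, self.step = delay, taps, step
        self.w = np.zeros(n_mics * taps, dtype=complex)   # adaptive filter
        self.history = []                                 # past frames

    def process(self, frame):
        """frame: this bin's complex STFT value from each microphone.
        Returns the dereverberated value for the reference microphone."""
        self.history.append(frame)
        needed = self.delay + self.taps
        if len(self.history) < needed:
            return frame[0]                 # not enough past context yet
        past = np.concatenate(self.history[-needed:-self.delay])
        est = np.vdot(self.w, past)         # predicted late reverberation
        clean = frame[0] - est
        # Causal, online update using only current and past data
        norm = np.real(np.vdot(past, past)) + 1e-8
        self.w += self.step * np.conj(clean) * past / norm
        self.history = self.history[-needed:]
        return clean

rng = np.random.default_rng(5)
derev = AdaptiveDereverb(n_mics=2)
out = [derev.process(rng.standard_normal(2) + 1j * rng.standard_normal(2))
       for _ in range(50)]
print(len(out))   # one dereverberated value per input frame
```

Because the update uses only frames already received, the filter stays causal and varies with the characteristics of the input, matching the "adaptive" sense described above.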

DEVICE FOR DETECTING MUSIC DATA FROM VIDEO CONTENTS, AND METHOD FOR CONTROLLING SAME

A data processing method according to the present invention comprises the steps of: receiving an input of video contents including a video stream and an audio stream; detecting music data from the audio stream; and filtering the audio stream so that the music data detected from the audio stream is removed.
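The detect-then-filter pipeline can be sketched with a toy detector. The spectral-flatness test below (tonal frames treated as "music data") and zeroing as the removal filter are assumptions standing in for whatever detector and filter the device actually uses.

```python
import numpy as np

def spectral_flatness(frame):
    """Geometric / arithmetic mean of the magnitude spectrum:
    near 1 for noise-like frames, near 0 for tonal (music-like) ones."""
    mag = np.abs(np.fft.rfft(frame)) + 1e-12
    return float(np.exp(np.mean(np.log(mag))) / np.mean(mag))

def remove_music(audio, frame_len=64, flatness_threshold=0.3):
    """Filter the audio stream so detected music data is removed
    (here: tonal frames are zeroed)."""
    out = audio.copy()
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        if spectral_flatness(audio[start:start + frame_len]) < flatness_threshold:
            out[start:start + frame_len] = 0.0   # music removed
    return out

rng = np.random.default_rng(6)
noise = 0.1 * rng.standard_normal(128)                 # speech-like stand-in
tone = np.sin(2 * np.pi * 8 * np.arange(128) / 64)     # tonal, music-like
audio = np.concatenate([noise, tone])

filtered = remove_music(audio)
print(np.abs(filtered[128:]).max())   # 0.0: the tonal half was removed
```

In the claimed device this would run on the audio stream of the input video contents, leaving the video stream untouched.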

Electronic device and operation method therefor

Various embodiments of the present invention pertain to an electronic device and an operation method therefor. The electronic device comprises: a housing that includes a circular upper end surface comprising a plurality of openings having a selected pattern, a flat circular lower end surface and a side surface surrounding a space between the upper end surface and the lower end surface; an audio output interface that is formed on the side surface; a power input interface that is formed on the side surface; a microphone that is located inside the housing, and that faces the openings; a wireless communication circuit; a processor that is operatively connected to the audio output interface, the power input interface, the microphone and the communication circuit; and a memory that is operatively connected to the processor, wherein the memory can store instructions which, when executed, cause the processor to receive a wake-up command through the microphone, recognize the wake-up command, transmit to a server information regarding reception of the wake-up command using the communication circuit, receive a response from the server using the communication circuit, generate a first audio signal based on the response, and output the first audio signal using the audio output interface when the microphone is available, wherein the first audio signal can be a non-language sound. Various embodiments are also possible.
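The stored-instruction flow can be sketched as a short sequence: wake-up command in via the microphone, a notification out to the server over the communication circuit, a response back, and a (possibly non-language) audio signal out. The server stub, its response shape, and the tone generation are all assumptions for illustration.

```python
import math

class ServerStub:
    """Stands in for the remote server reached via the wireless circuit."""
    def notify_wakeup(self, info):
        # Assumed response shape; the real response format is not specified.
        return {"ack": True, "tone_hz": 440.0}

def handle_wakeup(command, server, mic_available=True, rate=8000, dur=0.01):
    """Execute the claimed instruction sequence for one wake-up event."""
    if command != "wake-up":
        return None                       # only the wake-up command triggers
    # Transmit reception info to the server; receive its response.
    response = server.notify_wakeup({"event": "wakeup"})
    # Generate the first audio signal from the response: here a short
    # tone, i.e. a non-language sound.
    hz = response["tone_hz"]
    signal = [math.sin(2 * math.pi * hz * n / rate)
              for n in range(int(rate * dur))]
    # Output via the audio output interface only when the mic is available.
    return signal if mic_available else None

samples = handle_wakeup("wake-up", ServerStub())
print(len(samples))   # 80 samples of the non-language response tone
```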

Coding device, decoding device, and method and program thereof

A coding method and a decoding method are provided which can use in combination a predictive coding and decoding method which is a coding and decoding method that can accurately express coefficients which are convertible into linear prediction coefficients with a small code amount and a coding and decoding method that can obtain correctly, by decoding, coefficients which are convertible into linear prediction coefficients of the present frame if a linear prediction coefficient code of the present frame is correctly input to a decoding device. A coding device includes: a predictive coding unit that obtains a first code by coding a differential vector formed of differentials between a vector of coefficients which are convertible into linear prediction coefficients of more than one order of the present frame and a prediction vector containing at least a predicted vector from a past frame, and obtains a quantization differential vector corresponding to the first code; and a non-predictive coding unit that generates a second code by coding a correction vector which is formed of differentials between the vector of the coefficients which are convertible into the linear prediction coefficients of more than one order of the present frame and the quantization differential vector or formed of some of elements of the differentials.
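A toy sketch of the two coding paths: the predictive path quantizes the difference from a prediction built on the past frame, and the non-predictive path quantizes a correction vector relative to the quantized differential vector. Uniform scalar quantization and the one-tap predictor are illustrative assumptions; real codecs use codebooks over LSP-like coefficients.

```python
import numpy as np

STEP = 0.1   # assumed uniform quantizer step

def quantize(v):
    """Uniform scalar quantizer; the integer codes are what is transmitted."""
    code = np.round(v / STEP).astype(int)
    return code, code * STEP

def encode_frame(coeffs, prev_coeffs):
    """coeffs: coefficients convertible into linear prediction
    coefficients of the present frame."""
    prediction = 0.5 * prev_coeffs                      # from the past frame
    first_code, q_diff = quantize(coeffs - prediction)  # predictive path
    correction = coeffs - q_diff                        # non-predictive input
    second_code, q_corr = quantize(correction)
    return first_code, second_code, q_diff, q_corr

rng = np.random.default_rng(8)
prev, coeffs = rng.random(4), rng.random(4)
first, second, q_diff, q_corr = encode_frame(coeffs, prev)

# Predictive decoding needs the past frame; non-predictive decoding
# recovers the frame from the two codes alone, which is why a lost past
# frame does not corrupt it.
recon_pred = 0.5 * prev + q_diff
recon_nonpred = q_diff + q_corr
print(float(np.max(np.abs(recon_nonpred - coeffs))) < STEP)   # True
```

The key property mirrored here is that `recon_nonpred` depends only on the present frame's codes, so a correctly received code yields correct coefficients regardless of past-frame errors.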

Identifier

A computer device (100), configured to encode identifiers by providing audio identifiers therefrom, is described. The computer device (100) is configured to provide a set of audio signals as respective bitstreams. Each audio signal of the set of audio signals is defined based, at least in part, on audio signal information including at least one of a type, a fundamental frequency, a time signature and a time. Each audio signal comprises a set of audio segments. Each audio segment of the set of audio segments is defined based, at least in part, on audio segment information including at least one of a frequency, an amplitude, a transform, a time duration and an envelope. The computer device (100) is configured to receive an identifier and select a subset of audio signals from the set of audio signals according to the received identifier based, at least in part, on the audio signal information and/or the audio segment information. The computer device (100) is configured to process the selected subset of audio signals by combining the selected subset of audio signals to provide an audio identifier. The computer device (100) is configured to output the audio identifier in an output audio signal as an output bitstream, wherein the audio identifier encodes the identifier. Also described is a method of encoding identifiers by providing audio identifiers therefrom.
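The select-and-combine idea can be sketched with a fixed bank of audio signals (here simple tones, each defined only by a fundamental frequency) and an identifier whose bits select the subset to combine. The bank frequencies, the bit-per-signal mapping, and plain summation as the combiner are illustrative assumptions.

```python
import numpy as np

RATE, DUR = 8000, 0.05
BANK_HZ = [400.0, 500.0, 600.0, 700.0]   # the set of audio signals

def make_tone(hz):
    n = np.arange(int(RATE * DUR))
    return np.sin(2 * np.pi * hz * n / RATE)

def encode_identifier(identifier):
    """Select one bank signal per set bit of the identifier, then
    combine the selected subset into the output audio identifier."""
    selected = [make_tone(hz) for i, hz in enumerate(BANK_HZ)
                if identifier >> i & 1]
    if not selected:
        return np.zeros(int(RATE * DUR))
    return np.sum(selected, axis=0)

audio_id = encode_identifier(0b0101)   # selects the 400 and 600 Hz tones
print(audio_id.shape)                  # (400,) samples of combined audio
```

A decoder would recover the identifier by detecting which bank frequencies are present in the received audio, inverting the same mapping.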