G10L19/00

DISTRIBUTED SENSOR DATA PROCESSING USING MULTIPLE CLASSIFIERS ON MULTIPLE DEVICES
20230230597 · 2023-07-20

According to an aspect, a method for distributed sound/image recognition using a wearable device includes receiving, via at least one sensor device, sensor data, and detecting, by a classifier of the wearable device, whether or not the sensor data includes an object of interest. The classifier is configured to execute a first machine learning (ML) model. The method includes transmitting, via a wireless connection, the sensor data to a computing device in response to the object of interest being detected within the sensor data, where the sensor data is configured to be used by a second ML model on the computing device or a server computer for further sound/image classification.
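
Below is a minimal Python sketch of the two-stage pattern this abstract describes: a small on-device gate decides whether to spend radio bandwidth, and a larger model finishes the job elsewhere. The class and function names, the energy-threshold "detector", and the variance-based "classifier" are illustrative stand-ins, not from the patent.

    import numpy as np

    class OnDeviceDetector:
        """First ML model: a cheap binary gate for 'is an object of interest present?'."""
        def __init__(self, threshold: float = 0.5):
            self.threshold = threshold

        def detect(self, sensor_data: np.ndarray) -> bool:
            score = float(np.mean(np.abs(sensor_data)))   # placeholder for real inference
            return score > self.threshold

    def second_stage_classify(sensor_data: np.ndarray) -> str:
        """Second ML model, running on the companion computing device or server."""
        return "speech" if float(np.var(sensor_data)) > 0.1 else "noise"   # placeholder

    def process_frame(sensor_data, detector, transmit):
        """Transmit over the wireless link only when the on-device gate fires."""
        if detector.detect(sensor_data):
            transmit(sensor_data)

    rng = np.random.default_rng(0)
    frame = rng.normal(0.0, 1.0, 16000)           # one second of 16 kHz sensor samples
    process_frame(frame, OnDeviceDetector(0.5),
                  lambda x: print("second-stage label:", second_stage_classify(x)))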

Apparatus and method for post-processing an audio signal using prediction-based shaping

What is described is an apparatus for post-processing an audio signal, having: a time-spectrum-converter for converting the audio signal into a spectral representation having a sequence of spectral frames; a prediction analyzer for calculating prediction filter data for a prediction over frequency within a spectral frame; a shaping filter controlled by the prediction filter data for shaping the spectral frame to enhance a transient portion within the spectral frame; and a spectrum-time-converter for converting a sequence of spectral frames having a shaped spectral frame into the time domain.
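
The prediction over frequency here is reminiscent of temporal noise shaping (TNS): a linear predictor fitted across the frequency bins of a single spectral frame captures that frame's temporal envelope, so filtering the frame with the prediction-error filter can sharpen a transient. A rough numpy sketch under that reading; the least-squares predictor and the function names are assumptions, not taken from the patent.

    import numpy as np

    def lpc_over_frequency(frame: np.ndarray, order: int = 8) -> np.ndarray:
        """Fit a linear predictor across the frequency bins of one spectral frame."""
        n = len(frame)
        # Regression: frame[k] ~ sum_i a[i] * frame[k - 1 - i]
        X = np.column_stack([frame[order - 1 - i : n - 1 - i] for i in range(order)])
        a, *_ = np.linalg.lstsq(X, frame[order:], rcond=None)
        return a

    def shape_frame(frame: np.ndarray, a: np.ndarray) -> np.ndarray:
        """Apply the prediction-error (shaping) filter along the frequency axis."""
        order = len(a)
        shaped = frame.copy()
        for k in range(order, len(frame)):
            shaped[k] = frame[k] - np.dot(a, frame[k - order : k][::-1])
        return shaped

    x = np.zeros(1024)
    x[100] = 1.0                                   # a click: a strong transient
    x += 0.01 * np.random.default_rng(1).normal(size=1024)
    spec = np.fft.rfft(x)                          # one spectral frame
    shaped = shape_frame(spec, lpc_over_frequency(spec))
    print("residual / original energy:",
          float(np.sum(np.abs(shaped) ** 2) / np.sum(np.abs(spec) ** 2)))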

TRANSMISSION DEVICE, TRANSMISSION METHOD, RECEPTION DEVICE, AND RECEPTION METHOD
20230230601 · 2023-07-20

The processing load at the receiving side is reduced in a case where a plurality of classes of audio data is transmitted. A predetermined number of audio streams including coded data of a plurality of groups is generated, and a container of a predetermined format having this predetermined number of audio streams is transmitted. Command information for creating a command specifying a group to be decoded from among the plurality of groups is inserted into the container and/or the audio stream. For example, a command insertion area for the receiving side to insert a command specifying a group to be decoded is provided in at least one audio stream among the predetermined number of audio streams.
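
A small Python sketch of the command-insertion idea, assuming a toy container with per-group payloads and a writable command area; the data layout and names are invented for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class AudioStream:
        groups: dict                                       # group id -> coded audio payload
        command_area: list = field(default_factory=list)   # command insertion area

    def insert_decode_command(stream: AudioStream, group_id: int) -> None:
        # The receiving side writes a command naming the group it wants decoded.
        stream.command_area.append({"decode_group": group_id})

    def decode(stream: AudioStream) -> dict:
        # The decoder touches only the commanded groups and skips the rest,
        # which is where the reduced processing load comes from.
        wanted = {c["decode_group"] for c in stream.command_area}
        return {g: p for g, p in stream.groups.items() if g in wanted}

    stream = AudioStream(groups={1: b"main", 2: b"commentary", 3: b"effects"})
    insert_decode_command(stream, 1)
    print(decode(stream))                                  # {1: b'main'}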

Apparatus for encoding and decoding of integrated speech and audio

Provided is an encoding apparatus for integrally encoding and decoding a speech signal and an audio signal, which may include: an input signal analyzer to analyze a characteristic of an input signal; a stereo encoder to downmix the input signal to a mono signal when the input signal is a stereo signal, and to extract stereo sound-image information; a frequency band expander to expand a frequency band of the input signal; a sampling rate converter to convert a sampling rate; a speech signal encoder to encode the input signal using a speech encoding module when the input signal is a speech-characteristic signal; an audio signal encoder to encode the input signal using an audio encoding module when the input signal is an audio-characteristic signal; and a bitstream generator to generate a bitstream.
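
A toy Python sketch of the routing architecture: an analyzer labels the input, a stereo front end downmixes, and the label picks the coding module. The zero-crossing-rate heuristic and the fake payloads are stand-ins for the real analyzer and codecs, not the patented design.

    import numpy as np

    def analyze(signal: np.ndarray) -> str:
        """Input signal analyzer stand-in: a high zero-crossing rate suggests speech."""
        zcr = float(np.mean(np.abs(np.diff(np.sign(signal))))) / 2.0
        return "speech" if zcr > 0.1 else "audio"

    def stereo_downmix(stereo: np.ndarray):
        """Stereo encoder front end: mono downmix plus placeholder sound-image info."""
        return stereo.mean(axis=0), stereo[0] - stereo[1]

    def encode(signal: np.ndarray) -> bytes:
        if signal.ndim == 2:                       # stereo input: downmix before coding
            signal, _side_info = stereo_downmix(signal)
        module = b"SPEECH" if analyze(signal) == "speech" else b"AUDIO-"
        payload = signal[:8].astype(np.float32).tobytes()   # pretend coded data
        return module + payload                    # bitstream generator stand-in

    t = np.arange(16000) / 16000.0
    tone = np.sin(2 * np.pi * 440 * t)             # low ZCR: routed to the audio module
    print(encode(np.stack([tone, tone]))[:6])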

Methods and systems for encoding frequency-domain data

An illustrative frequency-domain encoder system transforms time-domain data representative of a content instance into frequency-domain data representative of the content instance. The frequency-domain data includes a plurality of complex coefficients, each representing a different frequency component of a plurality of frequency components incorporated by the content instance. The frequency-domain encoder system generates a frequency-domain data container that includes the complex coefficients of the frequency-domain data and metadata descriptive of the frequency-domain data. Additionally, within the frequency-domain data container, the frequency-domain encoder system integrates the complex coefficients of the frequency-domain data with timing data representative of a time-dependent feature of the content instance. Corresponding systems and methods are also disclosed.
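
A compact sketch of such a container in Python, assuming an rfft for the transform and a sample-magnitude envelope as the time-dependent feature; the dataclass fields are illustrative, not the patented format.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class FrequencyDomainContainer:
        coefficients: np.ndarray    # complex coefficients, one per frequency component
        metadata: dict              # descriptive data: sample rate, transform length, ...
        timing: np.ndarray          # time-dependent feature integrated with the data

    def encode(time_data: np.ndarray, sample_rate: int) -> FrequencyDomainContainer:
        coefficients = np.fft.rfft(time_data)      # time domain -> frequency domain
        envelope = np.abs(time_data)               # crude time-dependent feature
        return FrequencyDomainContainer(
            coefficients=coefficients,
            metadata={"sample_rate": sample_rate, "length": len(time_data)},
            timing=envelope,
        )

    t = np.arange(48000) / 48000.0
    container = encode(np.sin(2 * np.pi * 440 * t), 48000)
    print(len(container.coefficients), container.metadata)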

Audio encoder and bandwidth extension decoder

An audio encoder for providing an output signal using an input audio signal includes a patch generator, a comparator and an output interface. The patch generator generates at least one bandwidth extension high-frequency signal, wherein a bandwidth extension high-frequency signal includes a high-frequency band. The high-frequency band of the bandwidth extension high-frequency signal is based on a low-frequency band of the input audio signal. The comparator calculates a plurality of comparison parameters. A comparison parameter is calculated based on a comparison of the input audio signal and a generated bandwidth extension high-frequency signal. Each comparison parameter of the plurality of comparison parameters is calculated based on a different offset frequency between the input audio signal and a generated bandwidth extension high-frequency signal. Further, the comparator determines a comparison parameter from the plurality of comparison parameters, wherein the determined comparison parameter fulfils a predefined criterion.
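
A numpy sketch of the offset search: one candidate patch per offset frequency, one comparison parameter per candidate, and the best-scoring offset wins. Normalized envelope correlation as the comparison parameter, and "maximize it" as the predefined criterion, are assumptions made for illustration.

    import numpy as np

    def best_patch_offset(spec, cut, hf_len, offsets):
        """spec: magnitude spectrum; cut: first high-band bin; offsets: in bins."""
        target = spec[cut : cut + hf_len]          # true high band of the input signal
        scores = {}
        for off in offsets:
            # Patch generator: a low-band region copied up by `off` bins.
            candidate = spec[cut - off : cut - off + hf_len]
            # Comparison parameter: normalized correlation with the true high band.
            scores[off] = float(np.dot(candidate, target) /
                                (np.linalg.norm(candidate) * np.linalg.norm(target) + 1e-12))
        return max(scores, key=scores.get)         # predefined criterion: maximum

    rng = np.random.default_rng(3)
    spec = np.abs(np.fft.rfft(rng.normal(size=2048)))
    print("chosen offset:", best_patch_offset(spec, cut=512, hf_len=256,
                                              offsets=[128, 256, 384, 512]))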

Advancing the Use of Text and Speech in ASR Pretraining With Consistency and Contrastive Losses

A method includes receiving training data that includes unspoken text utterances, un-transcribed non-synthetic speech utterances, and transcribed non-synthetic speech utterances. Each unspoken text utterance is not paired with any corresponding spoken utterance of non-synthetic speech. Each un-transcribed non-synthetic speech utterance is not paired with a corresponding transcription. Each transcribed non-synthetic speech utterance is paired with a corresponding transcription. The method also includes generating a corresponding synthetic speech representation for each unspoken text utterance of the received training data using a text-to-speech model. The method also includes pre-training an audio encoder on the synthetic speech representations generated for the unspoken text utterances, the un-transcribed non-synthetic speech utterances, and the transcribed non-synthetic speech utterances to teach the audio encoder to jointly learn shared speech and text representations.
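
A toy Python sketch of how the three data sources could feed the two losses named in the title: a consistency term ties real speech to speech synthesized from its own transcription, and a contrastive (InfoNCE-style) term uses other utterances as negatives. The stub TTS, the mean-pooling "encoder", and both loss formulations are illustrative assumptions, not the patented training recipe.

    import numpy as np

    def tts(text: str) -> np.ndarray:
        """Stand-in text-to-speech model: text -> synthetic speech feature frames."""
        rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
        return rng.normal(size=(10 * len(text), 80))       # frames x mel bins

    def encode(features: np.ndarray) -> np.ndarray:
        """Stand-in audio encoder: mean-pool frames to a single embedding."""
        return features.mean(axis=0)

    def contrastive_loss(anchor, positive, negatives, temp=0.1):
        """InfoNCE-style loss over cosine similarities."""
        def sim(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        logits = np.array([sim(anchor, positive)] +
                          [sim(anchor, n) for n in negatives]) / temp
        logits -= logits.max()
        return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

    # Transcribed non-synthetic speech: pair real audio with TTS of its transcription.
    text = "good morning"
    real_audio = np.random.default_rng(2).normal(size=(90, 80))
    e_real, e_synth = encode(real_audio), encode(tts(text))
    consistency = float(np.mean((e_real - e_synth) ** 2))

    # Un-transcribed non-synthetic speech supplies negatives for the contrastive term.
    negatives = [encode(np.random.default_rng(s).normal(size=(120, 80))) for s in (3, 4)]
    print("consistency:", consistency,
          "contrastive:", contrastive_loss(e_real, e_synth, negatives))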
