Patent classifications
G10L19/035
Self-supervised audio representation learning for mobile devices
Systems and methods for training a machine-learned model are provided. A method can include obtaining an unlabeled audio signal, sampling the unlabeled audio signal to select one or more sampled slices, inputting the one or more sampled slices into a machine-learned model, receiving, as an output of the machine-learned model, one or more determined characteristics associated with the audio signal, determining a loss function for the machine-learned model based at least in part on a difference between the one or more determined characteristics and one or more corresponding ground truth characteristics of the audio signal, and training the machine-learned model from end to end based at least in part on the loss function. The one or more determined characteristics can include one or more reconstructed portions of the audio signal temporally adjacent to the one or more sampled slices or an estimated distance between two sampled slices.
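The training loop in this abstract (sample slices from unlabeled audio, have the model reconstruct a temporally adjacent portion, compute a loss against the ground-truth portion, and train end to end) can be sketched as follows. This is a toy illustration, not the patented implementation: the one-weight `toy_model`, the MSE loss, and the helper names are all assumptions.

```python
import random

def sample_slice(signal, start, width):
    """Select one sampled slice of the unlabeled audio signal."""
    return signal[start:start + width]

def toy_model(slice_, weight):
    """Stand-in for the machine-learned model: its output 'characteristic'
    is a reconstruction of the temporally adjacent portion of the signal."""
    return [weight * s for s in slice_]

def mse_loss(determined, ground_truth):
    """Loss based on the difference between the determined characteristics
    and the ground-truth characteristics (MSE is an assumption here)."""
    return sum((d - g) ** 2 for d, g in zip(determined, ground_truth)) / len(determined)

random.seed(0)
signal = [random.gauss(0.0, 1.0) for _ in range(100)]  # unlabeled audio (toy)
slice_ = sample_slice(signal, 40, 10)                  # one sampled slice
adjacent = signal[50:60]                               # ground truth: adjacent portion

# One end-to-end training step on the model's single weight.
w, lr = 0.0, 0.1
loss_before = mse_loss(toy_model(slice_, w), adjacent)
grad = sum(2.0 * (w * s - a) * s for s, a in zip(slice_, adjacent)) / len(slice_)
w -= lr * grad
loss_after = mse_loss(toy_model(slice_, w), adjacent)
```

A single gradient step reduces the reconstruction loss, which is the self-supervision signal: no labels are needed because the target is another part of the same signal.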
Signal encoding method and apparatus and signal decoding method and apparatus
A spectrum coding method includes quantizing spectral data of a current band based on a first quantization scheme, generating a lower bit of the current band using the spectral data and the quantized spectral data, quantizing a sequence of lower bits including the lower bit of the current band based on a second quantization scheme, and generating a bitstream based on an upper bit excluding N bits, where N is 1 or greater, from the quantized spectral data and the quantized sequence of lower bits.
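The upper/lower bit split this abstract describes (an upper-bit portion of the quantized spectral data that excludes the N least-significant bits, with those N bits carried separately) can be sketched as below. This simplifies the scheme to pure bit extraction; in the abstract the lower bit is derived from both the spectral data and its quantized form, and the N-bit sequence is itself quantized again, which is not modeled here.

```python
N = 2  # number of excluded least-significant bits (N >= 1 per the abstract)

def split_bits(q, n):
    """Split a quantized spectral value into the upper bits (excluding the
    n least-significant bits) and the excluded n lower bits."""
    upper = q >> n
    lower = q & ((1 << n) - 1)
    return upper, lower

def merge_bits(upper, lower, n):
    """Reassemble the quantized spectral value at the decoder side."""
    return (upper << n) | lower

quantized = [13, 7, 42, 0]  # quantized spectral data (toy values)
pairs = [split_bits(q, N) for q in quantized]
restored = [merge_bits(u, l, N) for u, l in pairs]
```

Coding the upper bits and the lower-bit sequence with separate schemes lets the encoder exploit their different statistics, while the decoder can reconstruct the original values losslessly by recombining them.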
QUANTIZATION OF SPATIAL AUDIO DIRECTION PARAMETERS
A method for spatial audio signal encoding comprising: obtaining, for a first frame, a plurality of audio direction parameters, wherein each parameter comprises an elevation value and an azimuth value and wherein each parameter has an ordered position; determining whether, for a preceding frame, any of the plurality of audio direction parameters was differentially encoded based on a difference between the preceding frame parameter elevation value and a further preceding frame parameter elevation value and the preceding frame parameter azimuth value and a further preceding frame parameter azimuth value; generating, for any audio direction parameter which was not differentially encoded in the considered preceding frame, a differential parameter value based on a difference between the frame parameter elevation value and a preceding frame parameter elevation value and a difference between the frame parameter azimuth value and a preceding frame parameter azimuth value; generating for each of the plurality of audio direction parameters a difference parameter value based on a difference between the audio direction parameter and a rotated derived audio direction parameter; quantizing the difference between the audio direction parameter and a rotated derived audio direction parameter and the differential parameter value; and selecting for each of the plurality of audio direction parameters, either the quantized difference parameter value or the quantized differential parameter value.
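The inter-frame differential step in this claim (a differential parameter value formed from the elevation and azimuth differences between the current and preceding frame's direction parameter, which is then quantized) could look like the following sketch. The azimuth wrapping, the uniform quantizer, and the 2.5-degree step size are assumptions; the claim does not specify a quantization scheme.

```python
def wrap_deg(angle):
    """Wrap an angle difference into (-180, 180] degrees so that azimuth
    differences across the +/-180 boundary stay small."""
    return (angle + 180.0) % 360.0 - 180.0

def differential(curr, prev):
    """Differential parameter value: per-component difference between the
    current frame's and the preceding frame's (elevation, azimuth)."""
    d_elev = curr[0] - prev[0]
    d_azi = wrap_deg(curr[1] - prev[1])
    return (d_elev, d_azi)

def quantize(value, step=2.5):
    """Uniform scalar quantizer (an assumed placeholder)."""
    return round(value / step)

curr = (30.0, 170.0)    # current frame direction parameter
prev = (28.0, -175.0)   # preceding frame direction parameter
d = differential(curr, prev)
q = tuple(quantize(v) for v in d)
```

Because successive frames point in similar directions, the wrapped differences are small and cheap to quantize; the claim's final step then picks, per parameter, whichever of the two quantized candidates is used.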
Quantization of spatial audio parameters
There is disclosed inter alia an apparatus for spatial audio signal encoding which determines at least one spatial audio parameter comprising a direction parameter with an elevation component and an azimuth component. The elevation component and azimuth component of the direction parameter are then converted to an index value.
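The conversion of an (elevation, azimuth) direction parameter to a single index value could be sketched as a uniform angular grid, as below. This is a simplification: the 5-degree resolution, the grid sizes, and the row-major index layout are all assumptions rather than the patent's actual codebook.

```python
N_ELEV, N_AZI = 37, 72  # assumed grid: 5-degree steps in elevation and azimuth

def direction_to_index(elev, azi):
    """Map elevation in [-90, 90] and azimuth in [0, 360) degrees to a
    single index value on a uniform angular grid."""
    e = round((elev + 90.0) / 5.0)   # elevation row: 0 .. 36
    a = round(azi / 5.0) % N_AZI     # azimuth column: 0 .. 71
    return e * N_AZI + a

def index_to_direction(idx):
    """Decoder side: recover the grid direction from the index value."""
    e, a = divmod(idx, N_AZI)
    return e * 5.0 - 90.0, a * 5.0

idx = direction_to_index(45.0, 90.0)
elev, azi = index_to_direction(idx)
```

Collapsing the two components into one index lets the encoder transmit a single codeword per direction, at the cost of the grid's angular resolution.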
Decoding audio bitstreams with enhanced spectral band replication metadata in at least one fill element
Embodiments relate to an audio processing unit that includes a buffer, bitstream payload deformatter, and a decoding subsystem. The buffer stores at least one block of an encoded audio bitstream. The block includes a fill element that begins with an identifier followed by fill data. The fill data includes at least one flag identifying whether enhanced spectral band replication (eSBR) processing is to be performed on audio content of the block. A corresponding method for decoding an encoded audio bitstream is also provided.
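The deformatting step this abstract describes (locating a fill element that begins with an identifier, then reading a flag in its fill data that says whether eSBR processing applies) can be sketched as below. The identifier value, the byte-aligned layout, and the flag position are illustrative assumptions, not the MPEG bitstream syntax.

```python
FILL_ELEMENT_ID = 0x06  # hypothetical identifier value, not from the spec

def parse_fill_element(block):
    """Scan a block (bytes) for a fill element: an identifier byte followed
    by fill data whose top bit is taken as the eSBR flag. Returns True/False
    for the flag, or None if the block has no fill element."""
    for i, b in enumerate(block[:-1]):
        if b == FILL_ELEMENT_ID:
            fill_data = block[i + 1]
            return bool(fill_data & 0x80)  # flag: perform eSBR processing?
    return None

block = bytes([0x21, 0x06, 0x80, 0x00])  # toy block containing a fill element
esbr_enabled = parse_fill_element(block)
```

Carrying the flag inside a fill element keeps the bitstream backward compatible: legacy decoders skip fill data they do not recognize, while eSBR-aware decoders read the flag and enable the extra processing.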