G10L19/008

Synthetic speech processing

A speech-processing system receives input data representing text. A first encoder processes segments of the text to determine embedding data representing the text, and a second encoder processes corresponding audio data to determine prosodic data corresponding to the text. The embedding data and prosodic data are processed to create output data including a representation of speech corresponding to the text and the prosody.
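The two-encoder arrangement described above can be sketched in a few lines of numpy. This is a toy illustration only: the dimensions, the fixed random projection matrices standing in for trained encoders, and all function names are assumptions, not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical, not from the patent).
VOCAB, EMB_DIM, PROS_DIM, SPEC_DIM, FRAME = 128, 16, 4, 8, 40

# Fixed random projections stand in for trained encoder/decoder weights.
W_text = rng.standard_normal((VOCAB, EMB_DIM))
W_pros = rng.standard_normal((FRAME, PROS_DIM))
W_dec = rng.standard_normal((EMB_DIM + PROS_DIM, SPEC_DIM))

def text_encoder(text):
    """First encoder: map each text segment (here, a character) to an embedding."""
    ids = [ord(c) % VOCAB for c in text]
    return W_text[ids]                            # (len(text), EMB_DIM)

def prosody_encoder(audio_frames):
    """Second encoder: map audio frames to prosodic feature vectors."""
    return audio_frames @ W_pros                  # (n_frames, PROS_DIM)

def decoder(emb, pros):
    """Combine embedding and prosodic data into output (a toy spectrogram)."""
    n = min(len(emb), len(pros))
    joint = np.concatenate([emb[:n], pros[:n]], axis=1)
    return joint @ W_dec                          # (n, SPEC_DIM)

emb = text_encoder("hello")
pros = prosody_encoder(rng.standard_normal((5, FRAME)))
out = decoder(emb, pros)
print(out.shape)  # (5, 8)
```

In a real system each projection would be a learned neural network, but the data flow — text embeddings and audio-derived prosodic features concatenated and decoded into a speech representation — matches the abstract's description.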

Methods and systems for improved signal decomposition

A method for improving decomposition of digital signals using training sequences is presented. A method for improving decomposition of digital signals using initialization is also provided. A method for sorting digital signals using frames based upon energy content in the frame is further presented. A method for utilizing user input for combining parts of a decomposed signal is also presented.
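Of the four methods listed, the frame-sorting step is the most concrete, and can be illustrated with a short numpy sketch. The frame length and the descending sort order are illustrative choices, not specified by the abstract.

```python
import numpy as np

def sort_frames_by_energy(signal, frame_len):
    """Split a signal into fixed-length frames and sort them by per-frame energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)          # energy content in each frame
    order = np.argsort(energy)[::-1]              # highest-energy frames first
    return frames[order], energy[order]

# Three frames of constant amplitude 0.1, 2.0, and 0.5.
sig = np.concatenate([0.1 * np.ones(4), 2.0 * np.ones(4), 0.5 * np.ones(4)])
frames, energy = sort_frames_by_energy(sig, frame_len=4)
print(energy)  # [16.    1.    0.04]
```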

Reconstruction of audio scenes from a downmix

Audio objects are associated with positional metadata. A received downmix signal comprises downmix channels that are linear combinations of one or more audio objects and are associated with respective positional locators. In a first aspect, the downmix signal, the positional metadata and frequency-dependent object gains are received. An audio object is reconstructed by applying the object gain to an upmix of the downmix signal in accordance with coefficients based on the positional metadata and the positional locators. In a second aspect, audio objects have been encoded together with at least one bed channel positioned at a positional locator of a corresponding downmix channel. The decoding system receives the downmix signal and the positional metadata of the audio objects. A bed channel is reconstructed by suppressing the content representing audio objects from the corresponding downmix channel on the basis of the positional locator of the corresponding downmix channel.
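The first aspect — upmixing with coefficients derived from positional metadata and locators, then applying an object gain — can be sketched as follows. The distance-based panning law, the pseudo-inverse upmix, and the broadband gain are all illustrative assumptions; the patent does not disclose these specific formulas.

```python
import numpy as np

def render_coeffs(obj_positions, channel_positions):
    """Toy panning law: coefficients fall off with the distance between an
    object's positional metadata and each downmix channel's positional locator."""
    d = np.linalg.norm(obj_positions[:, None, :] - channel_positions[None, :, :],
                       axis=2)
    c = 1.0 / (1.0 + d)
    return c / c.sum(axis=1, keepdims=True)       # (n_objects, n_channels)

rng = np.random.default_rng(1)
obj_pos = np.array([[0.0, 1.0], [1.0, 0.0]])      # positional metadata (toy 2-D)
ch_pos = np.array([[-1.0, 0.0], [1.0, 0.0]])      # channel positional locators
C = render_coeffs(obj_pos, ch_pos)

# Encoder side: downmix channels are linear combinations of the objects.
objects = rng.standard_normal((2, 64))            # 2 audio objects, 64 samples
downmix = C.T @ objects

# Decoder side: rebuild the upmix coefficients from the transmitted metadata,
# upmix via the pseudo-inverse, then apply a (here broadband) object gain.
object_gain = np.array([[1.0], [0.8]])
reconstructed = object_gain * (np.linalg.pinv(C.T) @ downmix)
```

With as many channels as objects, the pseudo-inverse recovers the objects exactly; in the practical underdetermined case the upmix would only approximate them, which is why the frequency-dependent object gains are transmitted.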

Generating binaural audio in response to multi-channel audio using at least one feedback delay network

In some embodiments, virtualization methods generate a binaural signal in response to channels of a multi-channel audio signal by applying a binaural room impulse response (BRIR) to each channel, including by using at least one feedback delay network (FDN) to apply a common late reverberation to a downmix of the channels. In some embodiments, the input signal channels are processed in a first processing path that applies to each channel the direct-response and early-reflection portion of a single-channel BRIR for that channel, and the downmix of the channels is processed in a second processing path including at least one FDN which applies the common late reverberation. Typically, the common late reverberation emulates collective macro attributes of the late-reverberation portions of at least some of the single-channel BRIRs. Other aspects are headphone virtualizers configured to perform any embodiment of the method.
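A minimal mono FDN of the kind used in the second processing path can be sketched as below. The delay lengths, loop gain, and Hadamard feedback matrix are conventional textbook choices for a stable FDN reverberator, not values from the patent; a real virtualizer would use frequency-dependent gains tuned to the target BRIRs.

```python
import numpy as np

def fdn_late_reverb(downmix, delays=(149, 211, 263, 293), g=0.8):
    """Minimal feedback delay network: several delay lines coupled through an
    orthogonal (Hadamard) feedback matrix, fed by the downmixed input."""
    n = len(delays)
    # Orthogonal 4x4 Hadamard matrix; with per-loop gain g < 1 the
    # feedback loop is stable and the tail decays exponentially.
    H = np.array([[1,  1,  1,  1],
                  [1, -1,  1, -1],
                  [1,  1, -1, -1],
                  [1, -1, -1,  1]]) / 2.0
    lines = [np.zeros(d) for d in delays]
    idx = [0] * n
    out = np.zeros_like(downmix)
    for t in range(len(downmix)):
        reads = np.array([lines[i][idx[i]] for i in range(n)])
        out[t] = reads.sum()                      # common late reverberation
        feedback = g * (H @ reads)
        for i in range(n):
            lines[i][idx[i]] = downmix[t] + feedback[i]
            idx[i] = (idx[i] + 1) % len(lines[i])
    return out

impulse = np.zeros(2000)
impulse[0] = 1.0
tail = fdn_late_reverb(impulse)                   # decaying reverb tail
```

Because one FDN processes the downmix of all channels, the expensive late-reverberation stage runs once rather than once per channel, which is the efficiency argument behind this structure.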

Method and apparatus for determining weighting factor during stereo signal encoding

Various embodiments provide a method and an apparatus for determining a weighting factor during stereo signal encoding. In those embodiments, a parameter value corresponding to the encoding mode of a to-be-encoded signal in a stereo signal is determined based on that encoding mode and a correspondence between encoding modes and parameter values. Based on the determined parameter value and an energy spectrum of a linear prediction filter corresponding to an original line spectral frequency parameter of the to-be-encoded signal, a weighting factor is calculated for computing a distance between the original line spectral frequency parameter and a target line spectral frequency parameter.
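The computation described can be sketched as follows. The mode-to-parameter table, the first-order LPC filter, and the quadratic distance form are all hypothetical placeholders; the patent's actual correspondence table and weighting formula are not reproduced here.

```python
import numpy as np

# Hypothetical mapping from encoding mode to parameter value.
MODE_PARAM = {"stereo_mid": 1.0, "stereo_side": 0.8}

def lpc_energy_spectrum(a, freqs):
    """Energy spectrum |1/A(e^jw)|^2 of the linear prediction filter with
    coefficients a, evaluated at normalized frequencies (radians/sample)."""
    k = np.arange(len(a))
    A = np.exp(-1j * np.outer(freqs, k)) @ a
    return 1.0 / np.abs(A) ** 2

def weighted_lsf_distance(lsf, target_lsf, a, mode):
    """Weighting factor from the mode parameter and the LPC energy spectrum,
    applied to a quadratic distance between original and target LSFs."""
    beta = MODE_PARAM[mode]                       # parameter value from mode
    w = beta * lpc_energy_spectrum(a, lsf)        # weighting factor per LSF
    return np.sum(w * (lsf - target_lsf) ** 2)

a = np.array([1.0, -0.9])                         # toy first-order LPC filter
lsf = np.array([0.5, 1.2, 2.0])                   # original LSF parameters
target = np.array([0.52, 1.15, 2.05])             # target LSF parameters
d = weighted_lsf_distance(lsf, target, a, "stereo_mid")
```

Weighting the distance by the filter's energy spectrum emphasizes LSF errors near spectral peaks, where quantization errors are perceptually most audible.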
