Patent classifications
G10L19/022
METHODS AND APPARATUS TO PERFORM AUDIO WATERMARKING AND WATERMARK DETECTION AND EXTRACTION
Methods and apparatus to perform audio watermarking and watermark detection and extraction are disclosed. Example apparatus disclosed herein are to select frequency components to be used to represent a code, with different sets of frequency components representing respectively different information, and respective ones of the frequency components in the sets located in respective code bands, there being multiple code bands, with the spacing between adjacent code bands equal to or less than the spacing between adjacent frequency components in the code bands. Disclosed example apparatus are also to synthesize the frequency components to be used to represent the code, combine the synthesized frequency components with an audio block of an audio signal, and output the audio signal and a video signal associated with the audio signal.
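The embedding side of the abstract above can be sketched as follows. This is a minimal illustration, not the patented implementation: the band layout, frequencies, and amplitude are invented for the example, and each code band simply holds two candidate components, with the code bit selecting which one is synthesized and added to the audio block.

```python
import numpy as np

def embed_code(audio_block, code_bits, fs=48000, band_start=1000.0,
               component_spacing=50.0, amplitude=0.002):
    """Add a watermark built from one synthesized frequency component per
    code band; the bit value selects which of two candidate components in
    the band represents the information. All parameters are illustrative."""
    n = len(audio_block)
    t = np.arange(n) / fs
    watermark = np.zeros(n)
    for band, bit in enumerate(code_bits):
        # Two candidate components per band; the bit picks one of them.
        f = band_start + band * 2 * component_spacing + bit * component_spacing
        watermark += amplitude * np.sin(2.0 * np.pi * f * t)
    # Combine the synthesized components with the audio block.
    return audio_block + watermark
```

A detector would correlate the received block against the same candidate frequencies per band and pick the stronger component to recover each bit.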
OVERSAMPLING IN A COMBINED TRANSPOSER FILTERBANK
The present invention relates to coding of audio signals, and in particular to high frequency reconstruction methods including a frequency domain harmonic transposer. A system and method for generating a high frequency component of a signal from a low frequency component of the signal is described. The system comprises an analysis filter bank (501) comprising an analysis transformation unit (601) having a frequency resolution of Δf; and an analysis window (611) having a duration of D.sub.A; the analysis filter bank (501) being configured to provide a set of analysis subband signals from the low frequency component of the signal; a nonlinear processing unit (502, 650) configured to determine a set of synthesis subband signals based on a portion of the set of analysis subband signals, wherein the portion of the set of analysis subband signals is phase shifted by a transposition order T; and a synthesis filter bank (504) comprising a synthesis transformation unit (602) having a frequency resolution of QΔf; and a synthesis window (612) having a duration of D.sub.s; the synthesis filter bank (504) being configured to generate the high frequency component of the signal from the set of synthesis subband signals; wherein Q is a frequency resolution factor with Q≥1 and smaller than the transposition order T; and wherein the value of the product of the frequency resolution Δf and the duration D.sub.A of the analysis filter bank is selected based on the frequency resolution factor Q.
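The core nonlinear processing step described above (unit 502/650) can be sketched in isolation: each complex analysis subband sample has its phase multiplied by the transposition order T while its magnitude is preserved. The surrounding analysis and synthesis filter banks, with resolutions Δf and QΔf and windows of durations D.sub.A and D.sub.s, are omitted here; this is only an illustration of the phase operation, not the patented system.

```python
import numpy as np

def transpose_subbands(analysis_subbands, T=2):
    """Harmonic-transposer nonlinearity: scale the phase of each complex
    analysis subband sample by the transposition order T, keeping the
    magnitude unchanged. Input: array of complex subband samples."""
    magnitude = np.abs(analysis_subbands)
    phase = np.angle(analysis_subbands)
    return magnitude * np.exp(1j * T * phase)
```

Feeding the result into a synthesis filter bank with Q times the analysis frequency resolution (1 ≤ Q < T) is what shifts the generated content up in frequency.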
SOUND SIGNAL ENCODING METHOD, SOUND SIGNAL DECODING METHOD, SOUND SIGNAL ENCODING APPARATUS, SOUND SIGNAL DECODING APPARATUS, PROGRAM, AND RECORDING MEDIUM
A downmix unit 110 obtains downmix signals by mixing the input sound signals of a left channel input and the input sound signals of a right channel input. A left channel signal subtraction unit 130 and a right channel signal subtraction unit 150 encode, for each of the left channel and the right channel, the difference between the input sound signal and the product of the downmix signal and a subtraction gain. In this configuration, a left channel subtraction gain estimation unit 120 and a right channel subtraction gain estimation unit 140 determine the subtraction gains such that the quantization errors resulting from the two coding/decoding processes are reduced.
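The downmix/residual structure above can be sketched as follows. As a stand-in for the patent's quantization-aware gain estimation, this sketch uses a simple least-squares subtraction gain; the 0.5 downmix weighting is likewise an assumption for illustration.

```python
import numpy as np

def encode_stereo(left, right):
    """Compute a mono downmix, then per-channel residuals:
    residual = input - (subtraction gain * downmix).
    The least-squares gain here is illustrative, standing in for the
    patent's estimator that minimizes the combined quantization error."""
    downmix = 0.5 * (left + right)
    energy = np.dot(downmix, downmix)

    def subtraction_gain(channel):
        return np.dot(channel, downmix) / energy if energy > 0.0 else 0.0

    g_left, g_right = subtraction_gain(left), subtraction_gain(right)
    residual_left = left - g_left * downmix
    residual_right = right - g_right * downmix
    return downmix, (g_left, residual_left), (g_right, residual_right)

def decode_channel(downmix, gain, residual):
    """Invert the subtraction: channel = gain * downmix + residual."""
    return gain * downmix + residual
```

Without quantization the decoder reconstructs each channel exactly; the interesting design question the abstract raises is how to pick the gains once both the downmix and the residuals pass through lossy coding.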
Apparatus and method for encoding or decoding an audio signal using a transient-location dependent overlap
An apparatus for encoding an audio or image signal, includes: a controllable windower for windowing the audio or image signal to provide the sequence of blocks of windowed samples; a converter for converting the sequence of blocks of windowed samples into a spectral representation including a sequence of frames of spectral values; a transient location detector for identifying a location of a transient within a transient look-ahead region of a frame; and a controller for controlling the controllable windower to apply a specific window having a specified overlap length to the audio or image signal in response to an identified location of the transient, wherein the controller is configured to select the specific window from a group of at least three windows, wherein the specific window is selected based on the transient location.
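The controller's window-selection step can be sketched as a simple mapping from the detected transient location to one of at least three candidate windows. The specific mapping here (an earlier transient selects a shorter overlap) and the window names are assumptions for illustration; the abstract only requires that the choice depend on the transient location.

```python
def select_window(transient_pos, lookahead_len,
                  windows=("short_overlap", "medium_overlap", "long_overlap")):
    """Map a transient location within the look-ahead region of a frame
    to one of at least three windows with different overlap lengths.
    transient_pos is a sample index in [0, lookahead_len), or None if no
    transient was detected."""
    if transient_pos is None:
        return windows[-1]  # no transient: keep the default long overlap
    # Partition the look-ahead region evenly among the candidate windows.
    idx = min(len(windows) - 1, transient_pos * len(windows) // lookahead_len)
    return windows[idx]
```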
HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO
A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting an audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
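The merge-and-insert step above can be sketched as follows. The `<WC>` token string is an assumption for illustration, and the consolidation step, which the abstract assigns to a trained network-based stitcher, is not modeled here.

```python
def merge_hypotheses(segment_hypotheses, wc_symbol="<WC>"):
    """Merge short-segment ASR hypotheses into a single token sequence,
    inserting a window-change (WC) symbol at each segment boundary.
    A network-based stitcher would then consolidate this merged sequence
    into one hypothesis for the full long-form audio."""
    merged = []
    for i, hypothesis in enumerate(segment_hypotheses):
        if i > 0:
            merged.append(wc_symbol)  # mark the recognition-window change
        merged.extend(hypothesis.split())
    return merged
```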
Method and Apparatus for Determining Inter-Channel Time Difference Parameter
A method for determining an inter-channel time difference (ITD) parameter includes determining a reference parameter according to a time-domain signal on a first sound channel and a time-domain signal on a second sound channel, where the reference parameter corresponds to a sequence of obtaining the time-domain signal on the first sound channel and the time-domain signal on the second sound channel, determining a search range according to the reference parameter and a limiting value (T.sub.max), where the T.sub.max is determined according to a sampling rate of the time-domain signal on the first sound channel, and performing search processing within the search range based on a frequency-domain signal on the first sound channel and a frequency-domain signal on the second sound channel to determine a first ITD parameter corresponding to the first sound channel and the second sound channel.
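The bounded search step can be sketched with a frequency-domain cross-correlation whose peak is sought only within the limited lag range. This is a generic illustration, not the patented method: the reference parameter and the derivation of T.sub.max from the sampling rate are omitted, and T.sub.max is taken here as a plain sample-count bound.

```python
import numpy as np

def estimate_itd(left, right, t_max):
    """Estimate the inter-channel time difference (in samples) by
    cross-correlating the two channels in the frequency domain and
    searching for the peak only within lags [-t_max, t_max].
    A positive result means the right channel lags the left."""
    n = len(left) + len(right)
    spec_left = np.fft.rfft(left, n)
    spec_right = np.fft.rfft(right, n)
    cc = np.fft.irfft(spec_right * np.conj(spec_left), n)
    # Circular correlation: negative lags sit at the end of cc.
    window = np.concatenate((cc[-t_max:], cc[:t_max + 1]))
    return int(np.argmax(window)) - t_max
```

Restricting the search to [-t_max, t_max] both cuts the cost of the peak search and rejects spurious correlation peaks at physically implausible delays.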