Systems and Methods for Audio Upmixing
20220400351 · 2022-12-15
Assignee
Inventors
- Christos Kyriakakis (Venice, CA, US)
- Matthias Kronlachner (Venice, CA, US)
- Lasse Vetter (Venice, CA, US)
CPC classification
H04S5/005
ELECTRICITY
H04R5/04
ELECTRICITY
International classification
H04S5/00
ELECTRICITY
Abstract
Systems and methods for audio upmixing in accordance with embodiments of the invention are illustrated. One embodiment includes a method for upmixing audio, including receiving an audio track which includes an input plurality of channels, each channel having an encoded audio signal, decoding the audio signals, calculating a first frequency spectrum for a low frequency component of the signal using a first window, calculating a second frequency spectrum for a high frequency component of the signal using a second window, determining at least one direct signal by estimating panning coefficients, estimating at least one ambient signal based on the at least one direct signal, and generating an output plurality of channels based on the at least one direct signal and the at least one ambient signal.
Claims
1. A method for upmixing audio, comprising: receiving an audio track comprising an input plurality of channels, each channel having an encoded audio signal; decoding the audio signals; calculating a first frequency spectrum for a low frequency component of the signal using a first window; calculating a second frequency spectrum for a high frequency component of the signal using a second window; determining at least one direct signal by estimating panning coefficients; estimating at least one ambient signal based on the at least one direct signal; and generating an output plurality of channels based on the at least one direct signal and the at least one ambient signal.
2. The method for upmixing audio of claim 1, wherein the output plurality of channels comprises more channels than the input plurality of channels.
3. The method for upmixing audio of claim 1, further comprising determining a spatial representation of the audio track.
4. The method for upmixing audio of claim 1, wherein the input plurality of channels comprises two channels.
5. The method for upmixing audio of claim 4, wherein the two channels comprise a right and left channel.
6. The method for upmixing audio of claim 1, wherein the output plurality of channels comprises a center channel.
7. The method for upmixing audio of claim 6, wherein the center channel is determined using the at least one direct signal and the panning coefficients.
8. The method for upmixing audio of claim 1, wherein a decorrelation method is applied to the resulting surround channels.
9. The method for upmixing audio of claim 1, wherein a decorrelation method is applied to the resulting left and right channels.
10. The method for upmixing audio of claim 1, wherein the low frequency component comprises frequencies up to 1000 Hz.
11. The method for upmixing audio of claim 1, wherein calculating the first frequency spectrum and calculating the second frequency spectrum comprises using a Short-time Fourier transform (STFT).
12. The method for upmixing audio of claim 11, wherein the first window has a length suitable for the STFT to produce 2048 frequency coefficients.
13. The method for upmixing audio of claim 11, wherein the second window has a length suitable for the STFT to produce 128 frequency coefficients.
14. The method for upmixing audio of claim 1, further comprising smoothing the panning coefficients.
15. A system for upmixing audio, comprising: a processor; and a memory containing an upmixing application that configures the processor to: receive an audio track comprising an input plurality of channels, each channel having an encoded audio signal; decode the audio signals; calculate a first frequency spectrum for a low frequency component of the signal using a first window; calculate a second frequency spectrum for a high frequency component of the signal using a second window; determine at least one direct signal by estimating panning coefficients; estimate at least one ambient signal based on the at least one direct signal; and generate an output plurality of channels based on the at least one direct signal and the at least one ambient signal.
16. The system for upmixing audio of claim 15, wherein the output plurality of channels comprises more channels than the input plurality of channels.
17. The system for upmixing audio of claim 15, wherein the upmixing application further directs the processor to determine a spatial representation of the audio track.
18. The system for upmixing audio of claim 15, wherein the input plurality of channels comprises two channels.
19. The system for upmixing audio of claim 18, wherein the two channels comprise a right and left channel.
20. The system for upmixing audio of claim 15, wherein the output plurality of channels comprises a center channel.
21. The system for upmixing audio of claim 20, wherein the center channel is determined using the at least one direct signal and the panning coefficients.
22. The system for upmixing audio of claim 15, wherein the upmixing application further directs the processor to apply a decorrelation method to the resulting surround channels.
23. The system for upmixing audio of claim 15, wherein the upmixing application further directs the processor to apply a decorrelation method to the resulting left and right channels.
24. The system for upmixing audio of claim 15, wherein the low frequency component comprises frequencies up to 1000 Hz.
25. The system for upmixing audio of claim 15, wherein to calculate the first frequency spectrum and the second frequency spectrum, the upmixing application directs the processor to use a Short-time Fourier transform (STFT).
26. The system for upmixing audio of claim 25, wherein the first window has a length suitable for the STFT to produce 2048 frequency coefficients.
27. The system for upmixing audio of claim 25, wherein the second window has a length suitable for the STFT to produce 128 frequency coefficients.
28. The system for upmixing audio of claim 15, wherein the upmixing application further directs the processor to smooth the panning coefficients.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0035] The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.
DETAILED DESCRIPTION
[0046] Advancements in film sound have resulted in an increase in the number of audio channels. As a result, home surround sound systems are becoming more commonplace. Where homes may previously have had only 2-channel stereo systems, 5.1 surround sound and even higher order surround sound systems are now ubiquitous. However, music catalogues are rarely in a surround sound format. For example, recordings made by the Beatles, often cited as the most influential band of all time, are in mono and stereo. As such, surround sound systems, and even some stereo systems, are unable to provide a surround sound experience when playing back Beatles recordings.
[0047] To remedy this, systems and methods described herein provide audio upmixing techniques that enable lower channel audio to be converted into higher channel audio without introducing significant, if any, distortion. Conventional methodologies tend to focus on cinema audio and can be suboptimal for music reproduction. Further, conventional methodologies can introduce artifacts and/or other distortions into the played back audio. For many applications, systems and methods described herein may need to be performed in near-real time, and therefore increased efficiency over existing methods is beneficial.
[0048] For example, home surround sound systems are often provided music as a source input whose channel format does not match the speaker layout 1:1, but the listener expects the music they've selected to be played back immediately from all the loudspeakers in their system. As such, a track may need to be upmixed into a higher number of channels with as little lag as possible. Systems and methods described herein can upmix audio tracks to higher channel formats in near real time.
[0049] The Discrete Fourier Transform (DFT) is a mathematical method used to analyze the frequency content of audio signals. The Fast Fourier Transform (FFT) is an efficient computational implementation of the DFT that reduces the number of mathematical operations needed for the analysis. In many embodiments, the entire signal is not known in advance. For example, when music is streaming from the internet digital audio samples are arriving continuously in time. The Short-time Fourier Transform (STFT) can be used to determine frequency and phase content of specific time portions (time slices) of the audio signal. The STFT computes the FFT of consecutive time slices of the incoming signal and calculates the frequency content of the signal continuously in time. One issue with STFTs (and the Fourier Transform in general) is that the transform has a fixed resolution. Specifically, the number of coefficients used in the analysis (“FFT Length”) determines the frequency resolution of the analyzed frequency content of the signal. In the STFT case, the consecutive time slices are composed of a number of digital audio samples, N, and this slicing process is achieved through the use of a windowing function (“a window”). The number of audio samples per second is called the sampling rate, f.sub.s. When the number of coefficients of the FFT is set to be equal to the window size (N), the resulting spacing between analyzed frequencies (frequency resolution) of the FFT is f.sub.s/N. That implies that as the number of FFT coefficients (N) increases, the FFT has the ability to resolve frequencies that are closer together. However, an increase in the number of coefficients, N, implies that the size of the window used to create the time slices becomes larger. This results in a reduction of the ability to resolve rapid time changes of the audio signal. This time-frequency resolution tradeoff is one of the fundamental properties of the Fourier Transform. 
A wider window gives better frequency resolution but worse time resolution. Conversely, a narrower window gives better time resolution but worse frequency resolution. An additional downside of using an STFT window that yields high frequency resolution is that significantly more computations are typically performed in order to analyze the frequency content. Systems and methods described herein can leverage this tradeoff to increase computational efficiency while maintaining quality by extracting from the audio signal of each channel a number of frequency bands that can then be processed separately.
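The time-frequency resolution tradeoff can be illustrated numerically (a sketch using NumPy and SciPy's `scipy.signal.stft`; the 48 kHz sampling rate and the 4096/256-sample window lengths are illustrative values, not taken from the text). The bin spacing of an N-sample analysis window is f.sub.s/N, so a longer window yields finer frequency bins at the cost of fewer analysis frames per second:

```python
import numpy as np
from scipy.signal import stft

fs = 48000  # illustrative sampling rate (Hz)
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)  # one second of noise as a stand-in audio signal

resolutions = {}
for nperseg in (4096, 256):
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    # bin spacing (frequency resolution) is fs / nperseg; the number of
    # analysis frames reflects time resolution
    resolutions[nperseg] = (f[1] - f[0], len(t))

# the 4096-sample window has ~11.7 Hz bins but few frames; the 256-sample
# window has coarse 187.5 Hz bins but many frames
print(resolutions)
```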
[0050] In various embodiments, the frequency bands are selected by identifying frequency ranges that benefit from high resolution in time and those that benefit from high resolution in frequency. The bands that benefit from high resolution in frequency tend to be lower frequency bands, which can be allocated more compute resources. The power spectra of lower frequency bands in musical audio signals tend to change much more slowly than those of higher frequencies, but changes in frequency within lower frequency bands are much more noticeable to the human ear (e.g. the perceived difference between a 50 Hz audio signal and a 53 Hz audio signal is significantly more noticeable than the difference between a 5000 Hz audio signal and a 5003 Hz audio signal). As such, high resolution in frequency is typically more important than high resolution in time for low frequency audio signals in music. In contrast, the power spectra of higher frequency audio signals (where most melody instruments tend to reside, including the human voice) tend to change more rapidly in time, and so high resolution in time is typically more important than high resolution in frequency at higher frequency bands. As is discussed further below, extracting different frequency bands and determining the power spectra of the frequency bands by applying STFT processes using different length time windows to achieve different tradeoffs between frequency and time resolution can reduce processing load within a processing system (e.g. a CPU), and in many embodiments, increase the parallelizability of the processing. As a result, systems and methods in accordance with many embodiments of the invention can achieve low latency, near real-time upmixing of audio signals.
[0051] By way of example, turning now to
Audio Upmixing Processes
[0052] Audio upmixing processes can involve converting an audio track with a given number of channels to a version of the audio track with a higher number of channels. In many embodiments, audio upmixing processes described herein can operate in real time. For example, processes described herein can upmix a stereo audio stream to a 5.1 channel stream which is played back using speakers designed and/or placed to render 5.1 channel audio without noticeable latency to the user. As can be readily appreciated, a stereo to 5.1 upmix is merely an example, and any arbitrary number of channels can be upmixed using processes described herein. However, in order to provide a concrete example to enhance understanding, an upmix from stereo to 5.1 channel surround sound is used as an example below.
[0053] Turning now to
[0054] Same frequency band L and R channel pairs are split (230) into frames. In many embodiments, frames are generated using a sliding window. The window size can be dependent upon which frequency band is being processed. For example, a high frequency band may have a smaller window size (and therefore frame size) because, when performing an STFT (240) on the frame, high frequencies benefit from high resolution in time but require only low resolution in frequency, whereas low frequencies tolerate low resolution in time but benefit from higher resolution in frequency.
[0055] In many embodiments, the window sizes are allocated such that the high frequency window yields a first number of spectral coefficients (e.g. 128 or fewer spectral coefficients), and the low frequency window yields a second, larger number of spectral coefficients (e.g. 2048 or more spectral coefficients). The specific number of spectral frequency coefficients that are generated with respect to each frequency band (and the number of frequency bands) is largely dependent upon the requirements of specific applications in accordance with various embodiments of the invention, and may be tuned based on the particular piece of content and available computational resources. For example, different musical genres may be accounted for using different numbers of spectral coefficients. Indeed, in a number of embodiments the characteristics (e.g. genre) of the music can be specified and/or detected, and parameters such as (but not limited to) frequency cutoff(s) and/or number(s) of spectral coefficients with respect to one or more of the frequency bands can be adapted based upon the characteristics of the music. Further, as noted above, multiple frequency bands can be generated, and therefore different window sizes can be used as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. In numerous embodiments, the window utilized to determine the FFT of a given spectral band (e.g. using an STFT) operates in a sliding fashion and may overlap previously processed samples from the signal. In some embodiments, the window shares between 40% and 60% of its samples with the samples utilized to determine the FFT of the spectral band during the previous time window. However, this overlap can be adjusted depending on the type of content being processed, the frequency band being processed, and/or any other parameter as appropriate to the requirements of specific applications in accordance with various embodiments of the invention. This splitting can provide significant computational efficiency increases because, as noted, Fourier transforms break up a frequency range into spectral coefficients (or frequency sub-bands called bins), and processing requirements are roughly the square of the number of spectral coefficients.
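The band splitting and per-band window sizing can be sketched as follows (an illustrative sketch using SciPy; the Butterworth filters, the 1000 Hz cutoff, and the 4096/256-sample windows are assumptions chosen for demonstration, and `noverlap=nperseg//2` gives the roughly 50% window overlap described above):

```python
import numpy as np
from scipy.signal import butter, sosfilt, stft

fs = 48000  # illustrative sampling rate (Hz)
cutoff = 1000.0  # low/high band split frequency (the claims mention up to 1000 Hz)

# Hypothetical band-splitting filter pair (4th-order Butterworth)
sos_lo = butter(4, cutoff, btype="lowpass", fs=fs, output="sos")
sos_hi = butter(4, cutoff, btype="highpass", fs=fs, output="sos")

rng = np.random.default_rng(0)
x = rng.standard_normal(fs)  # one second of noise as a stand-in channel signal

lo = sosfilt(sos_lo, x)
hi = sosfilt(sos_hi, x)

# Long window for the low band (fine frequency resolution), short window
# for the high band (fine time resolution); noverlap = nperseg // 2 gives
# ~50% overlap between consecutive windows
f_lo, t_lo, Z_lo = stft(lo, fs=fs, nperseg=4096, noverlap=2048)
f_hi, t_hi, Z_hi = stft(hi, fs=fs, nperseg=256, noverlap=128)

print(Z_lo.shape, Z_hi.shape)  # (frequency bins, frames) per band
```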
[0056] In many embodiments, the Fourier transform is a Fast Fourier transform (FFT), which may be an implementation of a Short-time Fourier transform (STFT). The frequency components corresponding to the spectral coefficients can be assigned (250) to new channels. An inverse Fourier transform (e.g. an inverse STFT, called iSTFT) can be performed (260) on the spectral coefficients in each new channel to produce new audio signals for each channel. These new audio signals can then be output (270).
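The analysis/resynthesis round trip described above can be sketched as follows (using SciPy's `stft`/`istft`; the tone frequency and window length are illustrative). In a full upmixer, the spectral coefficients would be reassigned to new output channels between the forward and inverse transforms:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 48000
n = np.arange(fs)
x = np.sin(2 * np.pi * 440 * n / fs)  # a 440 Hz test tone

# Forward STFT; Z holds the spectral coefficients that would be
# assigned to new output channels
f, t, Z = stft(x, fs=fs, nperseg=1024)

# Inverse STFT (iSTFT) resynthesizes a time-domain signal per channel
_, y = istft(Z, fs=fs, nperseg=1024)
```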
[0057] Assigning frequency components to new channels can be performed in a number of ways. Turning now to
[0058] Panning coefficients for the L and R channels are estimated (320). In many embodiments, the stereo signals are represented as a weighted sum of J source signals d.sub.j(n) and a term that corresponds to an uncorrelated ambient signal n.sub.L(n):

x.sub.L(n)=Σ.sub.j=1.sup.J a.sub.L,j d.sub.j(n)+n.sub.L(n)

x.sub.R(n)=Σ.sub.j=1.sup.J a.sub.R,j d.sub.j(n)+n.sub.R(n)

Panning coefficients a.sub.L,j and a.sub.R,j satisfy the power summing condition:

a.sub.L,j.sup.2+a.sub.R,j.sup.2=1

In the frequency domain, after application of a Fourier transform (e.g. an STFT), the signal model is given as:

X.sub.L(b, k)=Σ.sub.j=1.sup.J a.sub.L,j D.sub.j(b, k)+N.sub.L(b, k)

X.sub.R(b, k)=Σ.sub.j=1.sup.J a.sub.R,j D.sub.j(b, k)+N.sub.R(b, k)
[0059] In many embodiments, it is assumed that at any given time instant b, and frequency band k, only one dominant source D is active in the track. In various embodiments, it is assumed that the ambient left and right signals have the same amplitude, but different phase (φ) due to variations in path lengths that arise from room acoustic reflections:
N.sub.L(b, k)=N(b, k), N.sub.R(b, k)=e.sup.jϕ·N(b, k)
From the above, a simplified signal model can be written as:
X.sub.L(b, k)=a.sub.L(b, k)D(b, k)+N(b, k)

X.sub.R(b, k)=a.sub.R(b, k)D(b, k)+e.sup.jϕN(b, k)
[0060] It is to be understood that each equation is computed for each time-frequency bin as above. As the magnitude of the ambient signal can be assumed to be significantly smaller than that of the direct signal, let:
|X.sub.L(b, k)|≈a.sub.L(b, k)|D(b, k)|
|X.sub.R(b, k)|≈a.sub.R(b, k)|D(b, k)|
which, combined with the power summing condition of the panning coefficients, gives an estimate of each coefficient based on the magnitudes of the original left and right channels:

ã.sub.L(b, k)=|X.sub.L(b, k)|/√{square root over (|X.sub.L(b, k)|.sup.2+|X.sub.R(b, k)|.sup.2)}

ã.sub.R(b, k)=|X.sub.R(b, k)|/√{square root over (|X.sub.L(b, k)|.sup.2+|X.sub.R(b, k)|.sup.2)}
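A minimal sketch of this magnitude-based estimate follows (the function name `estimate_panning` and the small `eps` regularizer guarding against silent bins are my additions, not from the text):

```python
import numpy as np

def estimate_panning(XL, XR, eps=1e-12):
    """Per-bin panning estimates from STFT magnitudes, normalized so
    that a_L^2 + a_R^2 = 1 (the power summing condition)."""
    norm = np.sqrt(np.abs(XL) ** 2 + np.abs(XR) ** 2) + eps
    return np.abs(XL) / norm, np.abs(XR) / norm

# a source panned mostly right: magnitudes 3 (left) and 4 (right)
aL, aR = estimate_panning(np.array([3.0 + 0j]), np.array([4.0 + 0j]))
```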
[0061] In many embodiments, the panning coefficient estimates change too rapidly between consecutive STFT frames, which can cause audible distortion. In order to resolve this, the estimates of the panning coefficients â.sub.L and â.sub.R are smoothed (330) over time. In numerous embodiments, smoothing is achieved using an exponential moving average filter:
â.sub.L(b, k)=γ.sub.L(b, k)ã.sub.L(b, k)+(1−γ.sub.L(b, k))â.sub.L(b−1, k)

â.sub.R(b, k)=γ.sub.R(b, k)ã.sub.R(b, k)+(1−γ.sub.R(b, k))â.sub.R(b−1, k)
where γ is a smoothing coefficient which can be tuned to minimize distortion. However, in some embodiments, smoothing can reduce variance in a way that tends to pull audio towards the center channel. In various embodiments, this is rectified by selecting between two smoothing coefficients (γ.sub.1 or γ.sub.2) with a decision-directed approach, which reduces artifacts while preserving a wide sound stage. That is, the value for γ may change for each STFT bin calculation. The decision-directed approach can be formalized as:
If ã.sub.L(b, k)>â.sub.L(b−1, k); then γ.sub.L=γ.sub.1; else γ.sub.L=γ.sub.2
If ã.sub.R(b, k)>â.sub.R(b−1, k); then γ.sub.R=γ.sub.1; else γ.sub.R=γ.sub.2
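The decision-directed smoothing can be sketched as follows (the function name and the `gamma1`/`gamma2` values are illustrative assumptions; the text does not specify coefficient values):

```python
import numpy as np

def smooth_panning(a_tilde, gamma1=0.9, gamma2=0.5):
    """Exponential moving average over frames with a decision-directed
    smoothing coefficient: gamma1 when the raw estimate rises above the
    previous smoothed value, gamma2 otherwise."""
    a_hat = np.empty_like(a_tilde)
    a_hat[0] = a_tilde[0]
    for b in range(1, len(a_tilde)):
        g = gamma1 if a_tilde[b] > a_hat[b - 1] else gamma2
        a_hat[b] = g * a_tilde[b] + (1 - g) * a_hat[b - 1]
    return a_hat

# rising estimates track quickly (gamma1); falling estimates decay slowly (gamma2)
smoothed = smooth_panning(np.array([0.0, 1.0, 1.0, 0.0]))
```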
[0062] For notational simplicity, (b, k) is dropped in the equations below. Using the panning coefficients, direct and ambient components can be estimated (340). In many embodiments, using the panning coefficients in the above simplified signal model and solving for the direct and ambient signals gives the following estimates:

D̂=â.sub.LX.sub.L+â.sub.RX.sub.R

N̂.sub.L=X.sub.L−â.sub.LD̂, N̂.sub.R=X.sub.R−â.sub.RD̂
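One common least-squares solution of a simplified two-channel model of this form can be sketched as follows (a sketch, not necessarily the exact estimator the text intends; with the power summing condition a.sub.L.sup.2+a.sub.R.sup.2=1 the direct estimate reduces to a.sub.LX.sub.L+a.sub.RX.sub.R, and the ambient estimates are the residuals):

```python
import numpy as np

def estimate_direct_ambient(XL, XR, aL, aR):
    """Least-squares direct estimate for X_L = a_L*D + N_L and
    X_R = a_R*D + N_R; the ambient components are estimated as the
    residuals after removing the panned direct signal."""
    D = (aL * XL + aR * XR) / (aL ** 2 + aR ** 2)
    return D, XL - aL * D, XR - aR * D

# with no ambient energy, the direct signal is recovered exactly
aL, aR = 0.6, 0.8
D_true = np.array([2.0 + 0j])
D, NL, NR = estimate_direct_ambient(aL * D_true, aR * D_true, aL, aR)
```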
[0063] With the estimate of the direct component from the generalized model above, a left, center and right channel can be derived (350) from the original stereo channels (L and R) using vector analysis:
X.sub.L=L+√{square root over (0.5)}C
X.sub.R=R+√{square root over (0.5)}C
In many embodiments, it is assumed that the ambient components are uncorrelated and that the L and R components do not usually contain a common dominant source, so:
L·R=0
which can be written using the above equation as:
(X.sub.L−√{square root over (0.5)}C)·(X.sub.R−√{square root over (0.5)}C)=0
This produces a quadratic equation for |C|. In many embodiments, the solution with the negative sign (for minimum energy) is selected to find |C| (but it is not required):
|C|=√{square root over (0.5)}(|X.sub.L+X.sub.R|−|X.sub.L−X.sub.R|)
The C channel component can be represented as a vector in the direction of the vector sum X.sub.L+X.sub.R, weighted by the magnitude estimate |C|:

C=|C|·(X.sub.L+X.sub.R)/|X.sub.L+X.sub.R|
In many embodiments, the center channel can alternatively be estimated by using D.sub.L=a.sub.L×D and D.sub.R=a.sub.R×D to estimate |C| and C from the panning coefficients above. Once the center channel is determined, new L and R channels can be found by subtracting the center channel from the original L and R:
L=X.sub.L−√{square root over (0.5)}C
R=X.sub.R−√{square root over (0.5)}C
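The center-extraction steps above can be sketched as follows (the function name and the small `eps` regularizer guarding against a silent sum are illustrative additions; the minimum-energy root is used for |C| as described):

```python
import numpy as np

def extract_center(XL, XR, eps=1e-12):
    """Vector-based center extraction: |C| from the minimum-energy root
    of the quadratic, C pointed along X_L + X_R, then new L/R channels
    by subtracting sqrt(0.5)*C from the originals."""
    s = XL + XR
    magC = np.sqrt(0.5) * (np.abs(s) - np.abs(XL - XR))
    C = magC * s / (np.abs(s) + eps)
    return XL - np.sqrt(0.5) * C, C, XR - np.sqrt(0.5) * C

# identical left/right content should move entirely into the center
L, C, R = extract_center(np.array([1.0 + 0j]), np.array([1.0 + 0j]))
```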
[0064] Left and right surround channels are assigned (360) as the left and right ambient estimates above. In some embodiments, it is advantageous to further process the surround channels using decorrelation. While some degree of decorrelation is achieved through the addition of a phase rotation in one of the two channels, several other methods for decorrelation can be used. In some embodiments in which a realistic acoustic reproduction is desired, the L, R, and C channels are intended to be precisely localized by the listener while the surround channels (LS and RS) are intended to sound diffuse and not localizable. This can be achieved by adding a decorrelation processing block to the surround signals prior to directing them to the loudspeakers. Decorrelation methods include phase changes, frequency-dependent delay, frequency subband based randomization of phase, all-pass filters and other methods. These methods can be particularly advantageous when the surround channel is directed to a single loudspeaker behind the listener as is described in U.S. patent application Ser. No. 16/839,021 titled “Systems and Methods for Spatial Audio Rendering”. In some embodiments, decorrelation can be applied to the upmixed X.sub.L and X.sub.R signals to enhance the spatial impression of the track when all of the upmixed channels are reproduced from a single loudspeaker (as is described in U.S. patent application Ser. No. 16/839,021 titled “Systems and Methods for Spatial Audio Rendering”) placed in front of the listener.
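One of the decorrelation methods listed above, frequency subband based randomization of phase, can be sketched as follows (the function name and RNG seed are illustrative assumptions; each frequency bin is rotated by a random unit-magnitude factor, leaving the magnitude spectrum unchanged so that the decorrelated channel sounds diffuse without altering its spectral balance):

```python
import numpy as np

def decorrelate_random_phase(X, seed=0):
    """Per-frequency-bin phase randomization of an STFT matrix X
    (shape: frequency bins x frames). Magnitudes are preserved; only
    phases change, decorrelating the output from the input."""
    rng = np.random.default_rng(seed)
    phases = np.exp(1j * rng.uniform(-np.pi, np.pi, size=X.shape[0]))
    return X * phases[:, None] if X.ndim == 2 else X * phases

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))
Y = decorrelate_random_phase(X)
```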
[0065] While a particular method for upmixing and assigning frequencies to new channels is illustrated in
Upmixing Systems
[0066] Upmixing systems in accordance with many embodiments of the invention can upmix audio tracks in near real time to enable a pleasing live listening experience on surround sound audio setups fed by suboptimal input channel configurations. In many embodiments, the upmixing is performed on streaming media content with an imperceptible amount of latency as experienced by the listener. However, upmixing systems can operate on any number of tracks provided in a non-live context as well.
[0067] Turning now to
[0068] Further, in many embodiments, the connected speaker layout may be a spatial audio system such as that described in U.S. patent application Ser. No. 16/839,021. In various embodiments, the audio upmixer can provide upmixed audio as input to a virtual speaker layout used to render spatial audio. An audio upmixer connected to an example spatial audio system in accordance with an embodiment of the invention is illustrated in
[0069] Turning now to
[0070] The audio upmixer 1000 further includes a memory 1030. The memory can be implemented using volatile memory, nonvolatile memory, or any combination thereof. The memory contains an upmixing application 1032 which can configure the processor to perform various audio upmixing processes. In many embodiments, the memory further contains audio data 1034 which describes one or more audio tracks, and/or a filter bank 1036. In many embodiments, the filter bank is a data structure that contains a list of different bandpass filters to use in splitting channels as described above. However, in many embodiments, the filter bank can be implemented as its own distinct circuit.
[0071] While particular audio upmixing systems are illustrated in