Binaural Dialogue Enhancement
20220060838 · 2022-02-24
Assignee
- Dolby Laboratories Licensing Corporation (San Francisco, CA)
- Dolby International AB (Amsterdam Zuidoost, NL)
Inventors
- Leif Jonas Samuelsson (Sundbyberg, SE)
- Dirk Jeroen Breebaart (Ultimo, AU)
- David Matthew Cooper (Carlton, AU)
- Jeroen Koppens (Nederweert, NL)
CPC classification
- H04S3/00
- H04S1/002
- H04S2420/01
- H04S2420/03
- H04R5/04
- H04S3/02
- H04S3/008
International classification
- H04R5/04
- H04S3/00
Abstract
Methods for dialogue enhancing audio content are described, comprising: providing a first audio signal presentation of the audio components; providing a second audio signal presentation; receiving a set of dialogue estimation parameters configured to enable estimation of dialogue components from the first audio signal presentation; applying the set of dialogue estimation parameters to the first audio signal presentation to form a dialogue presentation of the dialogue components; and combining the dialogue presentation with the second audio signal presentation to form a dialogue enhanced audio signal presentation for reproduction on a second audio reproduction system, wherein at least one of the first and second audio signal presentations is a binaural audio signal presentation.
Claims
1. A method of processing immersive audio content, comprising: receiving a first audio signal presentation of the immersive audio content, the first audio signal presentation configured to reproduce on a first audio reproduction system; receiving a second audio signal presentation of the immersive audio content, the second audio signal presentation configured to reproduce on a second audio reproduction system; receiving a set of dialogue estimation parameters configured to enable estimation of dialogue components from the first audio signal presentation; forming a dialogue presentation of the dialogue components by applying the set of dialogue estimation parameters to the first audio signal presentation; and combining the dialogue presentation with the second audio signal presentation to form a dialogue enhanced audio signal presentation for reproduction on the second audio reproduction system, wherein at least one of the first or second audio signal presentation is a binaural audio signal presentation.
2. The method of claim 1, wherein the immersive audio content includes one or more spatial audio components.
3. The method of claim 1, wherein both said first and second audio signal presentations are binaural audio signal presentations.
4. The method of claim 1, wherein only one of said first and second audio signal presentations is a binaural audio signal presentation.
5. A system comprising: one or more processors; and a non-transitory computer readable medium storing instructions that, upon execution by the one or more processors, cause the one or more processors to perform operations of dialogue enhancing immersive audio content, the operations comprising: receiving a first audio signal presentation of the immersive audio content, the first audio signal presentation configured to reproduce on a first audio reproduction system; receiving a second audio signal presentation of the immersive audio content, the second audio signal presentation configured to reproduce on a second audio reproduction system; receiving a set of dialogue estimation parameters configured to enable estimation of dialogue components from the first audio signal presentation; forming a dialogue presentation of the dialogue components by applying the set of dialogue estimation parameters to the first audio signal presentation; and combining the dialogue presentation with the second audio signal presentation to form a dialogue enhanced audio signal presentation for reproduction on the second audio reproduction system, wherein at least one of the first or second audio signal presentation is a binaural audio signal presentation.
6. A non-transitory computer readable medium storing instructions that, upon execution by one or more processors, cause the one or more processors to perform operations of dialogue enhancing immersive audio content, the operations comprising: receiving a first audio signal presentation of the immersive audio content, the first audio signal presentation configured to reproduce on a first audio reproduction system; receiving a second audio signal presentation of the immersive audio content, the second audio signal presentation configured to reproduce on a second audio reproduction system; receiving a set of dialogue estimation parameters configured to enable estimation of dialogue components from the first audio signal presentation; forming a dialogue presentation of the dialogue components by applying the set of dialogue estimation parameters to the first audio signal presentation; and combining the dialogue presentation with the second audio signal presentation to form a dialogue enhanced audio signal presentation for reproduction on the second audio reproduction system, wherein at least one of the first or second audio signal presentation is a binaural audio signal presentation.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0029] Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings.
DETAILED DESCRIPTION
[0043] Systems and methods disclosed in the following may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks referred to as “stages” in the below description does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation. Certain components or all components may be implemented as software executed by a digital signal processor or microprocessor, or be implemented as hardware or as an application-specific integrated circuit. Such software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Further, it is well known to the skilled person that communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
[0044] Various ways to implement embodiments of the invention will be discussed below with reference to the accompanying drawings.
[0045] In the presented embodiments the input signals are preferably analyzed in time/frequency tiles, for example by means of a filter bank such as a quadrature mirror filter (QMF) bank, a discrete Fourier transform (DFT), a discrete cosine transform (DCT), or any other means to split input signals into a variety of frequency bands. The result of such a transform is that an input signal $x_i[n]$, for input with index i and discrete-time index n, is represented by sub-band signals $x_i[b,k]$ for time slot (or frame) k and sub-band b. Consider for example the estimation of the binaural dialogue presentation from a stereo presentation. Let $x_j[b,k],\ j=1,2$ denote the sub-band signals of the left and right stereo channels, and $\hat{d}_i[b,k],\ i=1,2$ denote the sub-band signals of the estimated left and right binaural dialogue signals. The dialogue estimate may be computed as

$$\hat{d}_i[b,k]=\sum_{j=1}^{2}\sum_{m} w_{ijm}^{B_p,K}\,x_j[b,k-m],\qquad b\in B_p,\ k\in K,$$

with $B_p$, $K$ sets of frequency (b) and time (k) indices corresponding to a desired time/frequency tile, p the parameter band index, m a convolution tap index, and $w_{ijm}^{B_p,K}$ the dialogue estimation parameters for that tile.
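To make the estimator concrete, the following Python sketch applies one tile's dialogue estimation parameters to stereo sub-band signals. It is a minimal illustration under assumed array shapes and names (x, w, estimate_dialogue are not from the source), not the patent's implementation:

```python
import numpy as np

def estimate_dialogue(x, w):
    """Apply d_hat[i,b,k] = sum_j sum_m w[i,j,m] * x[j,b,k-m] within one tile.

    x: sub-band inputs, shape (J, bands, slots); w: parameters, shape (I, J, M).
    """
    I, J, M = w.shape
    d_hat = np.zeros((I,) + x.shape[1:], dtype=x.dtype)
    for m in range(M):
        # Delay every input channel by m time slots (zeros before the tile).
        x_delayed = np.zeros_like(x)
        x_delayed[:, :, m:] = x[:, :, : x.shape[2] - m]
        # Mix the delayed channels through the tap-m parameter matrix.
        d_hat += np.einsum('ij,jbk->ibk', w[:, :, m], x_delayed)
    return d_hat

# Example: J=2 stereo channels, 8 bands, 16 slots, M=2 convolution taps.
rng = np.random.default_rng(0)
d_hat = estimate_dialogue(rng.standard_normal((2, 8, 16)),
                          0.1 * rng.standard_normal((2, 2, 2)))
```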
[0046] The dialogue parameters w may be computed in the encoder, and encoded using techniques disclosed in U.S. Provisional Patent Application Ser. No. 62/209,735, filed Aug. 25, 2015, hereby incorporated by reference. The parameters w are then transmitted in the bitstream and decoded by a decoder prior to application using the above equation. Due to the linear nature of the estimate, the encoder computation can be implemented using minimum mean squared error (MMSE) methods in cases where the target signal (the clean dialogue or an estimate of the clean dialogue) is available.
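Because the estimator is linear, the MMSE solution reduces to a regularized least-squares problem per tile. The following is a minimal sketch under the same assumed shapes, restricted to a single tap (m = 0) for brevity; the regularizer eps is an assumption, not from the source:

```python
import numpy as np

def mmse_parameters(X, D, eps=1e-9):
    """Solve W = argmin ||X W - D||^2, i.e. W = (X^H X + eps*I)^(-1) X^H D.

    X: vectorized tile inputs (samples x J); D: target dialogue (samples x I).
    """
    J = X.shape[1]
    gram = X.conj().T @ X + eps * np.eye(J)  # regularized Gram matrix
    return np.linalg.solve(gram, X.conj().T @ D)

# Example: recover a known 2x2 mixing matrix from noisy observations.
rng = np.random.default_rng(1)
X = rng.standard_normal((128, 2))
W_true = np.array([[0.8, 0.1], [0.2, 0.7]])
D = X @ W_true + 0.01 * rng.standard_normal((128, 2))
W = mmse_parameters(X, D)  # approximately W_true
```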
[0047] The choice of P (the number of parameter bands p), and the choice of the number of time slots in K, is a trade-off between quality and bit rate. Furthermore, the parameters w can be constrained in order to lower the bit rate (at the cost of lower quality), e.g., by restricting the form of $w_{ijm}^{B_p,K}$.
[0048] In general it is proposed to use estimators of the form

$$\hat{y}_i[b,k]=\sum_{j=1}^{J}\sum_{m} w_{ijm}^{B_p,K}\,x_j[b,k-m],\qquad i=1,\ldots,I,$$

where at least one of $\hat{y}$ and $x$ is a binaural signal, i.e., I=2 or J=2 or I=J=2. For notational convenience we will in the following often omit the time/frequency tile indexing $B_p$, $K$ as well as the i, j, m indexing when referring to different parameter sets used to estimate dialogue.
[0049] The above estimator can conveniently be expressed in matrix notation as (omitting the time/frequency tile indexing for ease of notation)

$$\hat{Y}=\sum_{m} X_m W_m,$$

where $X_m=[x_1(m)\ \ldots\ x_J(m)]$ and $\hat{Y}=[\hat{y}_1\ \ldots\ \hat{y}_I]$ contain vectorized versions of $x_j[b,k-m]$ and $\hat{y}_i[b,k]$, respectively, in their columns, and $W_m$ is a parameter matrix with J rows and I columns. The above form of the estimator may be used when performing only dialogue extraction, or when performing only a presentation transform, as well as in the case where both extraction and presentation transform are done using a single set of parameters, as is detailed in embodiments below.
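In this matrix form, applying the parameters is just a sum of matrix products. A minimal sketch (array shapes and names are assumptions):

```python
import numpy as np

def apply_estimator(X_taps, W_taps):
    """Y_hat = sum_m X_m @ W_m, with X_m of shape (N, J) and W_m of shape (J, I)."""
    return sum(X @ W for X, W in zip(X_taps, W_taps))

# Example with two taps, N=128 vectorized samples, J=2 inputs, I=2 outputs.
rng = np.random.default_rng(2)
X_taps = [rng.standard_normal((128, 2)) for _ in range(2)]
W_taps = [0.1 * rng.standard_normal((2, 2)) for _ in range(2)]
Y_hat = apply_estimator(X_taps, W_taps)  # shape (128, 2)
```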
[0050] With reference to
[0051] According to the present invention, at least one of the presentations is a binaural presentation (echoic or anechoic). As will be further discussed in the following, the first and second presentations may be different, and the dialogue presentation may or may not correspond to the second presentation. For example, the first audio signal presentation may be intended for playback on a first audio reproduction system, e.g. a set of loudspeakers, while the second audio signal presentation may be intended for playback on a second audio reproduction system, e.g. headphones.
Single Presentation
[0052] In the decoder embodiment in
[0053] In the embodiment in
Two Presentations
[0054] In the decoder embodiment in
[0055] As indicated in
[0056] In
[0057] Further, it is noted that the dialogue extraction can be one dimensional, such that the extracted dialogue is a mono representation. The transform parameters D2 are then positional metadata, and the presentation transform comprises rendering the mono dialogue using HRTFs, HRIRs or BRIRs corresponding to the position. Alternatively, if the desired rendered dialogue presentation is intended for loudspeaker playback, the mono dialogue could be rendered using loudspeaker rendering techniques such as amplitude panning or vector-based amplitude panning (VBAP).
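For the loudspeaker case, a simple constant-power pan law illustrates the kind of amplitude panning referred to above (VBAP generalizes this to arbitrary loudspeaker layouts). The following is a minimal two-speaker sketch; the mapping from positional metadata to the pan value is an assumption:

```python
import numpy as np

def pan_mono_dialogue(d_mono, pan):
    """Constant-power stereo pan of a mono dialogue signal.

    pan in [-1, 1]: -1 = full left, +1 = full right.
    """
    theta = (pan + 1.0) * np.pi / 4.0               # map pan to [0, pi/2]
    g_left, g_right = np.cos(theta), np.sin(theta)  # g_left^2 + g_right^2 = 1
    return np.stack([g_left * d_mono, g_right * d_mono])

left_right = pan_mono_dialogue(np.ones(4), pan=-0.5)  # biased toward the left
```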
Simulcast Implementation
[0059] As illustrated in
[0060] In the embodiment in
[0062] In the embodiment in
[0064] It is noted that the set of parameters w(D1) may be identical to the dialogue enhancement parameters used to provide dialogue enhancement of the stereo signal in a simulcast implementation. This alternative is illustrated in
[0066] In one embodiment, the aforementioned dedicated presentation transform w(D2) in
[0068] It is noted that combining signals with different presentations, e.g., adding a stereo dialogue signal to a binaural signal (which contains non-enhanced binaural dialogue components), naturally leads to spatial imaging artifacts, since the non-enhanced binaural dialogue components are perceived as spatially different from a stereo presentation of the same components.
[0069] It is further noted that combining signals with different presentations can lead to constructive summing of dialogue components in certain frequency bands, and destructive summing in other frequency bands. The reason is that binaural processing introduces interaural time differences, ITDs (i.e., phase differences), so signals are summed that are in phase in certain frequency bands and out of phase in others, leading to coloring artifacts in the dialogue components (moreover, the coloring can differ between the left and right ear). In one embodiment, phase differences above the phase/magnitude cut-off frequency are avoided in the binaural processing so as to reduce this type of artifact.
[0070] As a final note on the case of combining signals with different presentations, it is acknowledged that, in general, binaural processing can reduce the intelligibility of dialogue. In cases where the goal of dialogue enhancement is to maximize intelligibility, it may therefore be advantageous to extract and level modify (e.g., boost) a dialogue signal that is non-binaural. To elaborate, even if the final presentation intended for playback is binaural, it may in such a case be advantageous to extract and level modify (e.g., boost) a stereo dialogue signal and combine that with the binaural presentation, trading off the coloring and spatial imaging artifacts described above for increased intelligibility.
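A minimal sketch of that intelligibility-oriented variant, assuming the binaural mix already contains the dialogue at its original level so that only the excess (G - 1) portion of the stereo dialogue estimate is added (signal names and the default gain are illustrative, not from the source):

```python
import numpy as np

def enhance_with_stereo_dialogue(binaural, dialogue_stereo, gain_db=6.0):
    """Boost dialogue by adding (G - 1) times a stereo dialogue estimate
    to the binaural presentation, trading spatial/coloring artifacts for
    intelligibility (assumes an accurate dialogue estimate)."""
    G = 10.0 ** (gain_db / 20.0)
    return binaural + (G - 1.0) * dialogue_stereo
```

If the estimate were perfect and the presentations matched, the dialogue level in the output would be scaled exactly by G; with a binaural mix the combination is only approximate, which is the trade-off described above.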
[0071] In the embodiment in
[0073] In some applications, it may be desirable to apply different processing depending on the desired value of the dialogue level modification factor G. In one embodiment, for example, appropriate processing is selected based on a determination of whether the factor G is greater or smaller than a given threshold. Of course, there may also be more than one threshold, and more than one alternative processing: for example, a first processing when G<th1, a second processing when th1<=G<th2, and a third processing when G>=th2, where th1 and th2 are two given threshold values.
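A minimal sketch of such threshold-based selection with two thresholds (the threshold values and path labels are placeholders, not from the source):

```python
def select_processing(G, th1=1.0, th2=4.0):
    """Pick one of three processing alternatives from the desired
    dialogue level modification factor G."""
    if G < th1:
        return "first processing"   # G < th1
    elif G < th2:
        return "second processing"  # th1 <= G < th2
    else:
        return "third processing"   # G >= th2
```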
[0074] In a specific example, illustrated in
[0075] When the switch is in position A, the circuit is configured to combine the estimated stereo dialogue from matrix transform 86 with the stereo signal z, and then to perform the matrix transform 73 on the combined signal to generate a reconstructed anechoic binaural signal. The output from the feedback delay network 75 is then combined with this signal in 78. It is noted that this processing essentially corresponds to
[0076] When the switch is in position B, the circuit is configured to apply the transform parameters w(D2) to the stereo dialogue from matrix transform 86 in order to provide a binaural dialogue estimate. This estimate is then added to the anechoic binaural signal from transform 73 and to the output from the feedback delay network 75. It is noted that this processing essentially corresponds to
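The two switch positions can be summarized as combine-then-transform versus transform-then-combine. The following is a minimal sketch with placeholder callables for the matrix transforms and the feedback delay network output; all names are illustrative, and details of blocks 73, 75, 78 and 86 are omitted:

```python
def render_position_a(z, d_stereo, to_binaural, fdn_out):
    # Position A: combine the estimated stereo dialogue with the stereo
    # signal z, transform the combined signal to anechoic binaural, then
    # add the feedback delay network (late reverberation) output.
    return to_binaural(z + d_stereo) + fdn_out

def render_position_b(z, d_stereo, to_binaural, dialogue_to_binaural, fdn_out):
    # Position B: transform the stereo dialogue separately to binaural
    # (parameters w(D2)), then add it to the anechoic binaural signal
    # and the feedback delay network output.
    return to_binaural(z) + dialogue_to_binaural(d_stereo) + fdn_out
```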
[0077] The skilled person will realize that many other alternatives exist for the processing in positions A and B, respectively. For example, the processing when the switch is in position B could instead correspond to that in
Interpretation
[0078] Reference throughout this specification to “one embodiment”, “some embodiments” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment”, “in some embodiments” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment, but may be. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.
[0079] As used herein, unless otherwise specified, the use of the ordinal adjectives “first”, “second”, “third”, etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
[0080] In the claims below and the description herein, any one of the terms comprising, comprised of or which comprises is an open term that means including at least the elements/features that follow, but not excluding others. Thus, the term comprising, when used in the claims, should not be interpreted as being limitative to the means or elements or steps listed thereafter. For example, the scope of the expression a device comprising A and B should not be limited to devices consisting only of elements A and B. Any one of the terms including or which includes or that includes as used herein is also an open term that also means including at least the elements/features that follow the term, but not excluding others. Thus, including is synonymous with and means comprising.
[0081] As used herein, the term “exemplary” is used in the sense of providing examples, as opposed to indicating quality. That is, an “exemplary embodiment” is an embodiment provided as an example, as opposed to necessarily being an embodiment of exemplary quality.
[0082] It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.
[0083] Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
[0084] Furthermore, some of the embodiments are described herein as a method or combination of elements of a method that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or element of a method forms a means for carrying out the method or element of a method. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention.
[0085] In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
[0086] Similarly, it is to be noticed that the term coupled, when used in the claims, should not be interpreted as being limited to direct connections only. The terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Thus, the scope of the expression a device A coupled to a device B should not be limited to devices or systems wherein an output of device A is directly connected to an input of device B. It means that there exists a path between an output of A and an input of B which may be a path including other devices or means. “Coupled” may mean that two or more elements are either in direct physical or electrical contact, or that two or more elements are not in direct contact with each other but yet still co-operate or interact with each other.
[0087] Thus, while specific embodiments of the invention have been described, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as falling within the scope of the invention. For example, any formulas given above are merely representative of procedures that may be used. Functionality may be added to or deleted from the block diagrams, and operations may be interchanged among functional blocks. Steps may be added to or deleted from methods described, within the scope of the present invention.