Pre-processing of a channelized music signal
09848266 · 2017-12-19
Assignee
Inventors
Cpc classification
H04S2420/07
ELECTRICITY
G10H2210/056
PHYSICS
H04R5/04
ELECTRICITY
G10H2210/305
PHYSICS
H04S5/00
ELECTRICITY
H04S2400/05
ELECTRICITY
H04R2205/041
ELECTRICITY
H04R2430/03
ELECTRICITY
International classification
H04S3/00
ELECTRICITY
H04R5/04
ELECTRICITY
H04S5/00
ELECTRICITY
Abstract
A method for pre-processing a channelized music signal to improve perception and appreciation for a hearing prosthesis recipient. In one example, the channelized music signal is a stereo input signal. A device, such as a handheld device, hearing prosthesis, or audio cable, for example, applies a mask to a stereo input signal to extract a center-mixed component from the stereo signal and outputs an output signal comprised of a weighted combination of the extracted center-mixed component and a residual signal comprising a non-extracted part of the stereo input signal. The center-mixed component may contain components, such as leading vocals and/or drums, preferred by hearing prosthesis recipients relative to other components, such as backing vocals or other instruments.
Claims
1. A method comprising: applying a mask to a stereo input signal to extract a center-mixed component from the stereo signal; and generating an output signal comprised of a weighted combination of the extracted center mixed component and a residual signal, wherein the residual signal comprises a non-extracted component of the stereo input signal.
2. The method of claim 1, wherein the center-mixed component comprises at least one of drums, bass, and leading vocals.
3. The method of claim 2, wherein the center-mixed component comprises each of drums, bass, and leading vocals.
4. The method of claim 1, further comprising separating a percussive component from the stereo input signal, such that the percussive component includes leading vocals, and wherein applying the mask to the stereo input includes applying the mask to the percussive component.
5. The method of claim 4, further comprising applying a high pass filter to the stereo input signal, and wherein separating the percussive component includes separating the percussive component from the high-pass filtered stereo input signal.
6. The method of claim 4, further comprising applying a low-pass filter to the stereo input signal, and wherein the output signal includes the low-pass filtered stereo input signal.
7. The method of claim 6, wherein applying the mask to the stereo input signal includes applying the mask to a combined signal comprised of the low-pass filtered stereo input signal and the percussive component.
8. The method of claim 1, wherein the output signal is a mono output signal, further comprising providing the mono output signal to a hearing prosthesis.
9. The method of claim 1, wherein the output signal is a stereo output signal, further comprising providing the stereo output signal to bilateral hearing prostheses.
10. The method of claim 1, wherein generating the output signal comprised of the weighted combination of the extracted center-mixed component and the residual signal comprises: weighting the extracted center-mixed component by a first weighting factor; and weighting the residual signal by a second weighting factor, wherein the first weighting factor is different from the second weighting factor.
11. The method of claim 10, wherein the first weighting factor has a value of approximately 1 in a range of 0 to 1, and wherein the second weighting factor has a value of approximately 0.25-0.5 in the range of 0 to 1.
12. A method for creating an audio output signal for a hearing prosthesis, the method comprising: separating a preferred musical instrument component from a channelized audio input signal; and enhancing separation of the preferred musical instrument component by applying a mask to the separated preferred musical instrument component.
13. The method of claim 12, wherein the audio output signal is a mono audio output signal, further comprising providing the audio output signal to the hearing prosthesis.
14. The method of claim 12, wherein the audio output signal is a stereo audio output signal further comprising providing the audio output signal to bilateral hearing prostheses comprising a first hearing prosthesis and a second hearing prosthesis.
15. The method of claim 12, wherein the channelized audio input signal is a stereo input signal, and wherein applying the mask further comprises applying a stereo mask to the separated preferred musical instrument component.
16. The method of claim 15, wherein the stereo mask masks components that are outside a middle portion of a stereo image associated with the stereo input signal.
17. The method of claim 15, further comprising separating the stereo input signal into percussive components and harmonic components, wherein the preferred musical instrument component includes the percussive components.
18. The method of claim 17 further comprises: high-pass filtering the stereo input signal prior to the separating, wherein separating the preferred musical instrument component includes separating the preferred musical instrument component from the high-pass filtered stereo input signal; low-pass filtering the stereo input signal prior to the applying the mask, wherein the mask is applied to a combination of the percussive components and the low-pass-filtered stereo input signal; and weighting the masked combination relative to a residual signal comprising at least the harmonic components to create the audio output signal.
19. The method of claim 12, wherein the at least one preferred musical instrument component includes leading vocals and drums.
20. The method of claim 12, wherein the preferred musical instrument component is a first preferred musical instrument component, further comprising separating a second preferred musical component from the channelized audio input signal, and wherein applying the mask includes applying the mask to a combination of the first and second preferred musical instrument components.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
DETAILED DESCRIPTION
(7) Referring to the drawings, as noted above,
(8) When music is recorded and mixed, such as in a studio or at a live event, the mixer frequently tries to duplicate the relative placement of instrumental components to approximate the experience that a listener (such as the listener 114) would experience at the live event. In one example for a stereo mix, each instrument (including leading vocals) is first recorded as a separate track, so that the mixer can independently adjust (pan) the volume and channel (e.g. left and/or right in a stereo signal) of each track to produce a recorded music track that provides a listener with a sensation of spatially arranged instrumental components. In a second example, a stereo recording is made at a live event using a separate microphone for each channel (e.g. left and right microphones for a stereo signal). By suitably placing the left and right microphones in front of the arrangement (e.g. arrangement 100) of instruments, the recording is, to some extent, approximating what the listener (e.g. listener 114) hears with his two ears (e.g. 116a-b). As a further extension to this second example, the live-music recording could also be performed using microphones present in the left and right sides of binaural or bilateral hearing devices. However, in this further extension, the stereo image would be less than ideal unless the listener were positioned in the center (in front of a live band).
(9) According to the first example described above, in which the mixer performs a panning function to create a stereo image having a left channel and a right channel, the mixer may follow a set of panning rules to give the listener the feeling that he or she is looking at (listening to) the band on stage. A typical set of panning rules for a stereo mix may specify, for example, that a kick (bass) drum and snare drum are panned in the center, together with a bass. Tom-tom drums and a hi-hat cymbal are panned slightly off center, and the sound recorded by two overhead microphones is panned completely to the left or right. Other instruments are panned as they are (or would typically be) located on stage, typically off-center. A piano (keyboard) is typically a stereo signal and is divided between the left and right channels. Finally, the leading vocals are in the center, with backing vocals located completely left or right. At least some of the embodiments described herein utilize aspects of this typical stereo mix to assist in pre-processing music to improve music perception and appreciation for hearing prosthesis recipients. In further embodiments, information pertaining to location of instruments in the stereo (or other channelized) mix is included as metadata embedded in the channelized recording. This metadata can be utilized to extract and enhance preferred components (e.g. leading vocals, bass, and drum) relative to non-preferred (less preferred) components.
(10) As described in detail below, with respect to the accompanying figures, various preferred embodiments set forth herein exploit the center-panning of leading vocal, bass, and drum relative to other instruments in a stereo signal in order to separate (extract) and enhance the leading vocal, bass, and drums relative to those other instruments. This separation and enhancement is applicable to modify commercially recorded stereo music intended for listeners having normal hearing. While instrument-location metadata could be included in the recording itself, as described above, musical recordings might not maintain information pertaining to separate tracks for each instrument, which is one reason why separating the leading vocal, bass, and drum from the stereo signal is advantageous. By relatively enhancing (i.e. pre-processing) the leading vocal, bass, and drums, a hearing prosthesis recipient may experience better perception and appreciation of the music.
(11)
(12) As illustrated in blocks 206-212 of
(13) As illustrated in blocks 214-220, each extracted component is preferably weighted by a respective weighting factor W1-W4. For example, if a first component is to be weighted more heavily than a second component, then the first weighting factor should be larger than the second weighting factor, according to one embodiment. According to one embodiment, weighting factors W1-W4 have values between 0 and 1, where a weighting factor of 0 means the extracted component is completely suppressed and a weighting factor of 1 means the extracted component is unaltered (i.e. no decrease in relative volume). In the example of
(14) The scheme 200 may be implemented using one or more algorithms, such as those illustrated in
(15) Alternatively, the scheme 200 (or a similar such scheme) may be run in advance on a library of mp3 files to create a corresponding library of pre-processed mp3 files intended for the hearing prosthesis recipient. In such a case, accuracy of extraction and enhancement will likely be more important than latency, and thus, algorithms that are more data-intensive might be preferable.
(16) As yet another alternative, the scheme 200 may be run in near-real-time (i.e. with low latency) on a streamed music source (such as a streamed on-line radio station or other source) to allow the hearing prosthesis recipient to listen to a delayed version of the music stream that is more conducive to the recipient being able to perceive and appreciate musical aspects (e.g. lyrics and/or melody) of the stream.
(17) As still yet another alternative, the scheme 200 may be applied to a live music performance, such as through two or more microphones (e.g. left and right microphones on binaural or bilateral hearing prostheses) to pre-process the live music to produce a corresponding version (with some latency, depending on processor speed and the choice of extraction algorithm used) that allows for better perception and appreciation of the live music performance by the recipient. Application of the scheme 200 to a live-music context preferably includes using an algorithm with very low latency, such as less than 20 msec., which will better allow the hearing prosthesis recipient to concurrently perform lip-reading of a vocalist, for example. In addition, the hearing prosthesis recipient should be physically located in a relatively central location in front of the live-music stage/source (the stereo-recording sweet spot), so that the signals from the left and right microphones on the hearing prosthesis provide input signals more amenable to the separation algorithms set forth herein. Other examples, including other file and signal types, are possible as well, and are intended to be within the scope of this disclosure, unless indicated otherwise.
(18) The scheme of
(19)
(20) The input power spectrum W from block 302 is filtered by a high-pass filter (block 304) and a low-pass filter (block 306). An unfiltered version of the input power spectrum W from block 302 is utilized elsewhere (to create a residual signal), as will be described in block 316. The output of the low-pass filter (e.g. up to 400 Hz) of block 306 includes bass (low frequency) components that provide more fullness and better continuity (less beating), which will generally result in an improved listening experience for hearing prosthesis recipients.
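The band split of blocks 304 and 306 can be sketched in the frequency domain as follows. This is a minimal illustration, assuming an idealized brick-wall split at 400 Hz applied directly to spectrogram bins; the function name `split_bands`, the 16 kHz sample rate, and the 1024-point frame are assumptions, and a practical filter would have a gradual transition band.

```python
import numpy as np

def split_bands(W, freqs, cutoff_hz=400.0):
    """Split a spectrogram W (n_freq, n_frames) into low and high bands.

    Brick-wall split at cutoff_hz (an assumption for illustration).
    """
    low_bins = freqs <= cutoff_hz
    # Keep low-frequency rows in W_low and the remaining rows in W_high.
    W_low = np.where(low_bins[:, None], W, 0.0)
    W_high = np.where(low_bins[:, None], 0.0, W)
    return W_low, W_high

# Example: 1024-point frames at 16 kHz give 513 frequency bins.
freqs = np.fft.rfftfreq(1024, d=1.0 / 16000.0)
W = np.ones((freqs.size, 4))
W_low, W_high = split_bands(W, freqs)
```

Because the two bands are complementary, summing them returns the unfiltered spectrogram, which mirrors the way the low-pass output is later recombined with the processed high-pass output.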
(21) The output of the high-pass filter (e.g. above 400 Hz) from block 304 is subjected to a separation algorithm (block 310), to separate out (extract) various musical components. In a preferred embodiment, and as illustrated, the separation algorithm is the Harmonic/Percussive Sound Separation (HPSS) algorithm described by Ono et al., Separation of a Monaural Audio Signal into Harmonic/Percussive Components by Complementary Diffusion on Spectrogram, Proc. EUSIPCO, 2008, which is incorporated by reference herein in its entirety. Tachibana et al., Comparative evaluations of various harmonic/percussive sound separation algorithms based on anisotropic continuity of spectrogram, Proc. ICASSP, pp. 465-468, 2012, is also incorporated by reference herein in its entirety. The HPSS algorithm separates the harmonic and percussive components of an audio signal based on the anisotropic smoothness of these components in the spectrogram, using an iteratively-solved optimization problem. The optimization problem is solved by minimizing the cost function J in equation (1) below:
(22)
under constraints (2) and (3) below:
$H_{\tau,\omega}^2 + P_{\tau,\omega}^2 = W_{\tau,\omega}^2$ (2)
$H_{\tau,\omega} \ge 0$, $P_{\tau,\omega} \ge 0$ (3)
where H and P are sets of $H_{\tau,\omega}$ and $P_{\tau,\omega}$, respectively, and weights $\sigma_H$ and $\sigma_P$ are parameters to control the horizontal and vertical numerical smoothness in the cost function. Minimization of the cost function J results from minimizing the sum of the time-shifted version of H (harmonic components, horizontal) and the frequency-shifted version of P (percussive components, vertical) through numeric iteration. Constraint (2), above, ensures that the sum of the harmonic and percussive components makes up the original input power spectrogram. Constraint (3), above, ensures that all harmonic and percussive components are non-negative. The result of applying the separation algorithm (310) is to separate the high-pass-filtered signal from block 304 into harmonic components H and percussive components P. As stated above, the HPSS algorithm is iterative (with the iterations being subject to the additional constraint (4) described below with respect to block 314); a few iterations will generally be necessary to reach convergence, in accordance with a preferred embodiment. In addition, temporal-variable tones, such as vocals, can be harmonic or percussive depending on the frame length of the STFT (Short Time Fourier Transform) used in the HPSS algorithm. This frame-length dependence is illustrated in
(23) Note that, in
(24) The percussive components P resulting from the separation algorithm of block 310 are combined (summed) with the bass (low-frequency) components resulting from the low-pass-filtered input power spectrum W output from block 306.
(25) A stereo binary mask is applied at block 314 to the percussive components P, and, preferably, the low-pass-filtered (block 306) version of the input power spectrum W (block 302). The stereo binary mask identifies the center of the stereo image (see formula (12), below), which is where leading vocals, bass, and drum are typically mixed (assuming that the stereo input signal does not contain metadata indicating instrument arrangement; see the discussion infra and supra regarding such metadata). In this respect, the stereo binary mask acts as an additional constraint (i.e. a center stereo constraint) on the separation algorithm (e.g. HPSS) of block 310. Using equation (1) and constraints (2) and (3) above for the HPSS algorithm, this additional constraint can be defined as:
$P_{\tau,\omega}$ in the middle of the stereo image (4)
As mentioned above, with respect to block 310, this additional constraint is preferably included in the iterative solution of the HPSS algorithm.
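As a point of reference, the quadratic smoothness cost in the cited Ono et al. paper is commonly written in the form below. This is offered as an orientation aid rather than as the equation of this disclosure; the index names $\tau$ (time frame) and $\omega$ (frequency bin) are assumptions made here for readability.

```latex
J(H, P) = \frac{1}{2\sigma_H^2} \sum_{\tau,\omega} \left( H_{\tau-1,\omega} - H_{\tau,\omega} \right)^2
        + \frac{1}{2\sigma_P^2} \sum_{\tau,\omega} \left( P_{\tau,\omega-1} - P_{\tau,\omega} \right)^2
```

The first sum penalizes changes in H between adjacent time frames (harmonic energy is smooth along time), and the second penalizes changes in P between adjacent frequency bins (percussive energy is smooth along frequency).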
(26) The above equations can be solved numerically using the following iteration formulae:
(27)
in which the tuning parameter has a value of $\sigma_H^2/\sigma_P^2$ and is tuned to maximize separation between harmonic and percussive components. In a preferred embodiment, this parameter has a value of 0.95, which has been found to provide an acceptable tradeoff between separation and distortion.
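The iterative character of the separation can be illustrated with a simplified diffusion-style update. The sketch below follows the additive update commonly attributed to the cited Ono et al. paper, not the iteration formulae of this disclosure: it enforces H + P = W rather than constraint (2), and the names `hpss_diffusion`, `alpha`, and `n_iter` are assumptions. Here `alpha` is a blend weight between 0 and 1, not the ratio parameter described above.

```python
import numpy as np

def hpss_diffusion(W, alpha=0.5, n_iter=10):
    """Split a non-negative spectrogram W (n_freq, n_frames) into H and P.

    Simplified sketch: H is smoothed along time, P along frequency,
    while H + P = W and H, P >= 0 are enforced at every iteration.
    """
    H = 0.5 * W
    P = 0.5 * W
    for _ in range(n_iter):
        # Second difference of H along time (axis 1), edges replicated.
        Hp = np.pad(H, ((0, 0), (1, 1)), mode="edge")
        d2H = Hp[:, :-2] - 2.0 * H + Hp[:, 2:]
        # Second difference of P along frequency (axis 0), edges replicated.
        Pp = np.pad(P, ((1, 1), (0, 0)), mode="edge")
        d2P = Pp[:-2, :] - 2.0 * P + Pp[2:, :]
        delta = alpha * d2H / 4.0 - (1.0 - alpha) * d2P / 4.0
        # Clipping keeps both components non-negative.
        H = np.clip(H + delta, 0.0, W)
        P = W - H
    return H, P

# Toy spectrogram: a steady tone (row 4) and a click (column 2).
W = np.zeros((9, 9))
W[4, :] = 1.0
W[:, 2] = 1.0
H, P = hpss_diffusion(W)
```

In this toy case the steady tone drifts toward H and the click toward P, which is the anisotropic-smoothness behavior the text describes. Fewer iterations leave the two components closer to their equal initial split.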
(28) Including constraint (4), above, the iteration formulae become the following:
(29)
(30) where $W_{\mathrm{diff}}$ is the spectrogram of the difference between the left channel and the right channel. The binary mask preferably consists of a matrix of 1's and 0's, with 1 corresponding to time-frequency bins in which $W_{\mathrm{diff}}$, multiplied by an adjustable parameter, is less than both $W_L$ and $W_R$, indicating a center-mixed component (e.g. leading vocals, bass, and drums), and 0 corresponding to bins in which this condition is false, indicating a non-center-mixed component (e.g. backing vocals and other instruments). The adjustable parameter controls the angle relative to the center of the stereo image within which components are considered center-panned. For example, every instrument can be panned across a range from -100 (left) through 0 (center) to +100 (right). Lower values of the parameter generally correspond to less attenuation of instruments at wide angles (e.g. panned near -100 or +100) and practically no attenuation of instruments panned at narrower angles. Higher values of the parameter generally correspond to more attenuation of instruments panned at all angles, except near the center, with the amount of attenuation (suppression) increasing as the panning angle increases. According to a preferred embodiment, the parameter is chosen to be 0.4, corresponding to an angle of about +/-50 degrees. This angle results in a relatively good separation between different components (e.g. vocals versus guitar).
(31) At block 316, the output of block 314 is subtracted from the input power spectrum W of block 302, leaving a residual signal (preferably after several iterations), shown as H_stereo, corresponding to what was removed from the input power spectrum W. An attenuation parameter (block 318) is then applied to the residual signal at block 320. For example, the attenuation parameter could be one or more adjustable weighting factors that the recipient adjusts to produce a preferred music-listening experience. Sample attenuation parameter settings are 1 (0 dB, no attenuation), 0.5 (-6 dB), 0.25 (-12 dB), and 0.125 (-18 dB). Setting and applying the attenuation parameter effectively emphasizes (e.g. increases the volume of) the center of the stereo image of the percussive components P relative to the non-center/non-percussive components. For a typical music recording, this will result in enhanced leading vocals, rhythm (drum), and bass relative to other components, thereby potentially improving a hearing prosthesis recipient's perception and appreciation of music.
(32) Per the above discussion of the iterative process, the P_stereo and H_stereo outputs from blocks 314 and 316, respectively, are updated iteratively. In the current preferred implementation, for example, there are ten iterations before the final P_stereo and H_stereo outputs are passed on to subsequent blocks (i.e. for relative enhancement and/or attenuation). Fewer iterations, while improving latency, typically result in poorer separation between components, making the resulting output signal difficult for a hearing-impaired person to comprehend.
(33) After the attenuation of block 320, the attenuated signal is summed at block 322 with the output of block 314 to produce an output signal 324, preferably in the same format as the original stereo input signal. The output signal 324 could, for example, be a mono signal, which would be suitable for a hearing prosthesis (e.g. a current typical cochlear implant) having a mono input. Alternatively, the output signal 324 could be a stereo signal, which may have application for bilateral hearing prostheses, for example.
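The residual, attenuation, and summing steps of blocks 316 to 322 can be sketched as follows. This is a minimal illustration, assuming the attenuation parameter is a linear amplitude factor (so that 0.5 corresponds to roughly -6 dB); the names `mix_output` and `gain_to_db` and the toy values are assumptions.

```python
import numpy as np

def mix_output(W_input, center, attenuation=0.5):
    """Combine the extracted center component with an attenuated residual."""
    # Block 316: the residual is whatever the mask did not extract.
    residual = W_input - center
    # Blocks 320 and 322: attenuate the residual, then sum.
    return center + attenuation * residual

def gain_to_db(gain):
    """Convert a linear amplitude factor to decibels."""
    return 20.0 * np.log10(gain)

W_input = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[1, 0], [0, 1]])
center = W_input * mask
out = mix_output(W_input, center, attenuation=0.25)
```

With an attenuation of 1 the output equals the input, and with 0 only the extracted center component remains, which matches the role of the weighting factors in claims 10 and 11.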
(34)
(35) As shown in
(36) At block 504, an output signal is output. The output signal is comprised of a weighted combination of the extracted center-mixed component and a residual signal comprising a non-extracted part of the stereo input signal. In one example, an extracted center-mixed component is combined with a residual signal in which one or more non-center-mixed components are attenuated (weighted less) relative to the extracted center-mixed component. The attenuation may be through one or more weighting factors, as was described above with respect to
(37) While the method 500 has been described with respect to the input signal being a stereo input signal having a broad stereo image, other channelized signals having extensive panning (e.g. a surround sound signal in which leading vocals, bass, and drum are in a center channel and backing vocals and less important or preferred instruments are panned towards one of the surround channels) would also be suitable candidates for applying a method in accordance with the concepts of the method 500 in
(38) Moreover, while the example of
(39) The methods described herein, including the methods shown in
(40)
(41) The audio cable also includes an electronics module 608 containing electronics such as volume-control electronics and isolation circuitry, for example. In accordance with a preferred embodiment, the electronics module 608 additionally includes a filter or other electronics to extract a portion of the channelized input audio signal such that the output signal includes a weighted version of the extracted portion of the channelized input audio signal. Such a filter may, for example, implement the masking function described with reference to
(42) The above discussion references several types of input files, signals, and streams that may be pre-processed in accordance with the concepts described herein. Reference was also made to the possibility of including metadata in a song recording, in order to specify a number of possible parameters, such as which instruments are played, how panning (e.g. stereo panning) is performed, etc. For example, a digital data file corresponding to a recorded (and mixed) song might consist of one or more packet headers or other data constructs that specify these parameters at the beginning of, or throughout, the song. With knowledge of how this metadata is contained in such a recording, a device receiving or playing the file (e.g. as an input signal) can potentially identify the relative placement of instruments used for panning. This identified placement can be used to improve (e.g. decrease latency and/or improve accuracy) the separation/enhancement process of one or more of the methods set forth herein. In particular, for example, the method 300 illustrated in
(43) While many of the above examples are described in the context of a stereo signal, the concepts set forth herein are applicable to other channelized signals and, unless otherwise specified, the claims are intended to encompass a full range of channelized signals beyond just stereo signals. For example, surround sound, CD (compact disc), DVD (digital video disc), Super Audio CD, and others are intended to be included within the realm of signals to which various described embodiments apply.
(44) Exemplary embodiments have been described above. It should be understood, however, that numerous variations from the embodiments discussed are possible, while remaining within the scope of the invention.