Music classifier and related methods
11240609 · 2022-02-01
Assignee
Inventors
CPC classification
H04R2460/03
ELECTRICITY
G10L25/18
PHYSICS
G10H2210/046
PHYSICS
H04R2225/41
ELECTRICITY
International classification
G06F17/00
PHYSICS
Abstract
An audio device that includes a music classifier that determines when music is present in an audio signal is disclosed. The audio device is configured to receive audio, process the received audio, and to output the processed audio to a user. The processing may be adjusted based on the output of the music classifier. The music classifier utilizes a plurality of decision making units, each operating on the received audio independently. The decision making units are simplified to reduce the processing, and therefore the power, necessary for operation. Accordingly, each decision making unit alone may be insufficient to determine music, but in combination the units may accurately detect music while consuming power at a rate that is suitable for a mobile device, such as a hearing aid.
Claims
1. A music classifier for an audio device, the music classifier comprising: a signal conditioning unit configured to transform a digitized, time-domain audio signal into a corresponding frequency domain signal including a plurality of frequency bands; a plurality of decision making units operating in parallel that are each configured to evaluate one or more of the plurality of frequency bands to determine a plurality of feature scores, each feature score corresponding to a characteristic associated with music, the plurality of decision making units including: a modulation activity tracking unit configured to output a feature score for modulation activity based on a ratio of a first value of an averaged wideband energy of the plurality of frequency bands to a second value of the averaged wideband energy of the plurality of frequency bands; and a tone detection unit configured to output feature scores for tone in each frequency band based on (i) an amount of energy in the frequency band and (ii) a variance of the energy in the frequency band based on a first order differentiation; and a combination and music detection unit configured to: asynchronously receive feature scores from the plurality of decision making units, the decision making units configured to output feature scores at different intervals; and combine the plurality of feature scores over a period of time to determine if the audio signal includes music.
2. The music classifier for the audio device according to claim 1, wherein the plurality of decision making units include a beat detection unit.
3. The music classifier for the audio device according to claim 2, wherein the beat detection unit is configured to detect, based on a correlation, a repeating beat pattern in a first frequency band that is the lowest of the plurality of frequency bands.
4. The music classifier for the audio device according to claim 2, wherein the beat detection unit is configured to detect a repeating beat pattern, based on an output of a beat detection (BD) neural network.
5. The music classifier for the audio device according to claim 4, wherein the beat detection unit is configured to select one or more frequency bands from the plurality of frequency bands and is configured to extract a plurality of features from each selected frequency band.
6. The music classifier for the audio device according to claim 5, wherein the plurality of features extracted from each selected frequency band form a feature set including an energy mean, an energy standard deviation, an energy maximum, an energy kurtosis, an energy skewness, and an energy cross-correlation vector.
7. The music classifier for the audio device according to claim 6, wherein the BD neural network receives the feature set for each selected band as a plurality of inputs.
8. The music classifier for the audio device according to claim 1, wherein the second value corresponds to a minimum of the averaged wideband energy and the first value corresponds to a maximum of the averaged wideband energy, the averaged wideband energy corresponding to an average of a sum of the energy in each of the plurality of frequency bands.
9. The music classifier for the audio device according to claim 1, wherein the combination and music detection unit is configured to apply a weight to each feature score to obtain weighted feature scores and to sum the weighted feature scores to obtain a music score, each weight having a value that depends, in part, on the interval that the corresponding feature score is output from the decision making unit.
10. The music classifier for the audio device according to claim 9, wherein the combination and music detection unit is further configured to accumulate music scores for a plurality of frames, to compute an average of the music scores for the plurality of frames, and to compare the average to a threshold.
11. The music classifier for the audio device according to claim 10, wherein the combination and music detection unit is further configured to apply a hysteresis control to a music or no-music output of the threshold comparison.
12. A method for music detection in an audio signal, the method comprising: receiving an audio signal; digitizing the audio signal to obtain a digitized audio signal; transforming the digitized audio signal into a plurality of frequency bands; applying the plurality of frequency bands to a plurality of decision making units operating in parallel, the plurality of decision making units including: a modulation activity tracking unit configured to output a feature score for modulation activity based on a ratio of a first value of an averaged wideband energy of the plurality of frequency bands to a second value of the averaged wideband energy of the plurality of frequency bands; and a tone detection unit configured to output feature scores for tone in each frequency band based on (i) an amount of energy in the frequency band and (ii) a variance of the energy in the frequency band based on a first order differentiation; and obtaining, asynchronously, a feature score from each of the plurality of decision making units, the decision making units configured to output feature scores at different intervals, and the feature score from each decision making unit corresponding to a probability that a particular music characteristic is included in the audio signal; and combining the feature scores to detect music in the audio signal.
13. The method for music detection according to claim 12, wherein the decision making units include a beat detection unit, and wherein: obtaining a feature score from the beat detection unit includes: detecting, based on a correlation, a repeating beat pattern in a first frequency band that is the lowest of the plurality of frequency bands.
14. The method for music detection according to claim 12, wherein the decision making units include a beat detection unit, and wherein: obtaining a feature score from the beat detection unit includes: detecting, based on a neural network, a repeating beat pattern in the plurality of frequency bands.
15. The method for music detection according to claim 12, wherein: obtaining a feature score from the modulation activity tracking unit includes: tracking a minimum averaged energy of a sum of the plurality of frequency bands as the second value and a maximum averaged energy of the sum of the plurality of frequency bands as the first value.
16. The method for music detection according to claim 12, wherein the combining comprises: multiplying the feature score from each of the plurality of decision making units with a respective weight to obtain a weighted score from each of the plurality of decision making units, each weight having a value that depends, in part, on the interval that the corresponding feature score is output from the decision making unit; summing the weighted scores from the plurality of decision making units to obtain a music score; accumulating music scores over a plurality of frames of the audio signal; averaging the music scores from the plurality of frames of the audio signal to obtain an average music score; and comparing the average music score to a threshold to detect music in the audio signal.
17. The method for music detection in an audio signal according to claim 12, further comprising: modifying the audio signal based on the music detection; and transmitting the audio signal.
18. A hearing aid, comprising: a signal conditioning stage configured to convert a digitized audio signal to a plurality of frequency bands; and a music classifier coupled to the signal conditioning stage, the music classifier including: a feature detection and tracking unit that includes a plurality of decision making units operating in parallel, each decision making unit configured to generate a feature score corresponding to a probability that a particular music characteristic is included in the audio signal, the plurality of decision making units including: a modulation activity tracking unit, the modulation activity tracking unit configured to output a feature score for modulation activity based on a ratio of a first value of an averaged wideband energy of the plurality of frequency bands to a second value of the averaged wideband energy of the plurality of frequency bands; and a tone detection unit configured to output feature scores for tone in each frequency band based on (i) an amount of energy in the frequency band and (ii) a variance of the energy in the frequency band based on a first order differentiation; and a combination and music detection unit configured to: asynchronously receive feature scores from the plurality of decision making units, the decision making units configured to output feature scores at different intervals; and combine the plurality of feature scores over time to detect music in the audio signal, the combination and music detection unit configured to produce a first signal indicating music while music is detected in the audio signal and configured to produce a second signal indicating no music otherwise.
19. The hearing aid according to claim 18, wherein the hearing aid includes an audio signal modifying stage coupled to the signal conditioning stage and to the music classifier, the audio signal modifying stage configured to process the plurality of frequency bands differently when a music signal is received than when a no-music signal is received.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(12) The components in the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding parts throughout the several views.
DETAILED DESCRIPTION
(13) The present disclosure is directed to an audio device (i.e., apparatus) and related method for music classification (e.g., music detection). As discussed herein, music classification (music detection) refers to identifying music content in an audio signal that may include other audio content, such as speech and noise (e.g., background noise). Music classification can include identifying music in an audio signal so that the audio can be modified appropriately. For example, the audio device may be a hearing aid that can include algorithms for reducing noise, cancelling feedback, and/or controlling audio bandwidth. These algorithms may be enabled, disabled, and/or modified based on the detection of music. For example, a noise reduction algorithm may reduce signal attenuation levels while music is detected to preserve a quality of the music. In another example, a feedback cancellation algorithm may be prevented (e.g., substantially prevented) from cancelling tones from music as it would otherwise cancel a tone from feedback. In another example, the bandwidth of audio presented by the audio device to a user, which is normally low to preserve power, may be increased when music is present to improve a music listening experience.
(14) The implementations described herein can be used to implement a computationally efficient and/or power efficient music classifier (and associated methods). This can be accomplished through the use of decision making units that can each detect a characteristic (i.e., features) corresponding to music. Alone, each decision making unit may not classify music with a high accuracy. The outputs of all the decision making units, however, may be combined to form an accurate and robust music classifier. An advantage of this approach is that the complexity of each decision making unit can be limited to conserve power without negatively affecting the overall performance of the music classifier.
(15) In the example implementations described herein, various operating parameters and techniques, such as thresholds, weights (coefficients), calculations, rates, frequency ranges, frequency bandwidths, etc., are described. These example operating parameters and techniques are given by way of example, and the specific operating parameters, values, and techniques (e.g., computation approaches) used will depend on the particular implementation. Further, the specific operating parameters and techniques for a given implementation can be determined in a number of ways, such as by using empirical measurements and data, using training data, and so forth.
(17) The audio signal modifying stage 150 may be configured to improve a quality of the digital audio signal by cancelling noise, filtering, amplifying, and so forth. The processed (e.g., improved quality) audio signal can then be transformed 151 to a time-domain digital signal and converted into an analog signal by a digital-to-analog (D/A) converter 160 for playback on an audio output device (e.g., speaker 170) to produce output audio 171 for a user.
(18) In some possible implementations, the audio device 100 is a hearing aid. The hearing aid receives audio (i.e., sound pressure waves) from an environment 111, processes the audio as described above, and presents (e.g., using a receiver of a hearing aid 170) the processed version of the audio as output audio 171 (i.e., sound pressure waves) to a user wearing the hearing aid. Algorithms implemented in the audio signal modifying stage can help a user understand speech and/or other sounds in the user's environment. Further, it may be convenient if the choice and/or adjustment of these algorithms proceed automatically based on various environments and/or sounds. Accordingly, the hearing aid may implement one or more classifiers to detect various environments and/or sounds. The output of the one or more classifiers can be used to adjust one or more functions of the audio signal modifying stage 150 automatically.
(19) One aspect of desirable operation may be characterized by the one or more classifiers providing highly accurate results in real-time (as perceived by a user). Another aspect of desirable operation may be characterized by low power consumption. For example, a hearing aid and its normal operation may define a size and/or a time between charging of a power storage unit (e.g., battery). Accordingly, it is desirable that an automatic modification of the audio signal based on real-time operation of one or more classifiers does not significantly affect the size and/or the time between charging of the battery for the hearing aid.
(20) The audio device 100 shown in
(21) The music classifier 140 disclosed herein receives, as its input, the output of a signal conditioning stage 130. The signal conditioning stage can also be used as part of the routine audio processing for the hearing aid. Accordingly, an advantage of the disclosed music classifier 140 is that it can use the same processing as other stages, thereby saving complexity and power requirements. Another advantage of the disclosed music classifier is its modularity. The audio device may deactivate the music classifier without affecting its normal operation. In a possible implementation, for example, the audio device could deactivate the music classifier 140 upon detecting a low power condition (i.e., low battery).
(22) The audio device 100 includes stages (e.g., signal conditioning 130, music classifier 140, audio signal modifying 150, signal transformation 151, other classifiers 180) that can be embodied as hardware or as software. For example, the stages may be implemented as software running on a general purpose processor (e.g., CPU, microprocessor, multi-core processor, etc.) or special purpose processor (e.g., ASIC, DSP, FPGA, etc.).
(24) As shown in
(25) The frequency bands 220 (i.e., BAND_0, BAND_1, etc.) may be processed to modify the audio signal 111 received at the audio device 100. For example, the audio signal modifying stage 150 (see
(26) As shown in
(27) The music classifier is configured to receive the frequency bands 220 from the signal conditioning stage 130 and to output a signal that indicates the presence or absence of music. For example, the signal may include a first level (e.g., a logical high voltage) indicating the presence of music and a second level (e.g., a logical low voltage) indicating the absence of music. The music classifier 140 can be configured to receive the bands continuously and to output the signal continuously so that a change in the level of the signal correlates in time to the moment that music begins or ends. As shown in
(29) Each decision making unit of the feature detection and tracking unit of the music classifier may receive one or more (e.g., all) of the bands from the signal conditioning. Each decision making unit is configured to generate at least one output that corresponds to a determination about a particular music characteristic. The output of a particular unit may correspond to a two-level (e.g., binary) value (i.e., feature score) that indicates a yes or a no (i.e., a true or a false) answer to the question, “is the feature detected at this time.” When a music characteristic has a plurality of components (e.g., tones), a particular unit may produce a plurality of outputs. In this case, each of the plurality of outputs may correspond to a detection decision (e.g., a feature score that equals a logical 1 or a logical 0) regarding one of the plurality of components. When a particular music characteristic has a temporal (i.e., time-varying) aspect, the output of a particular unit may correspond to the presence or absence of the music characteristic in a particular time window. In other words, the output of the particular unit tracks the music characteristic having the temporal aspect.
(30) Some possible music characteristics that may be detected and/or tracked are a beat, a tone (or tones), and a modulation activity. While alone each of these characteristics may be insufficient to accurately determine whether an audio signal contains music, when combined, the accuracy of the determination can be increased. For example, determining that an audio signal has one or more tones (i.e., tonality) may be insufficient to determine music because a pure (i.e., temporally constant) tone can be included in (e.g., exist in) an audio signal without being music. Determining that the audio signal also has a high modulation activity can help determine that the determined tones are likely music (and not a pure tone from another source). A further determination that the audio signal has a beat would strongly indicate the audio contains music. Accordingly, the feature detection and tracking unit 200 of the music classifier 140 can include a beat detection unit 210, a tone detection unit 240, and a modulation activity tracking unit 270.
(31) The beat detection unit 210 may first compute an instantaneous energy of the lowest frequency band (BAND_0) for each frame as:
E.sub.0[n]=X.sup.2[n, 0]
(32) where n is the current frame number, X [n, 0] is the real BAND_0 data and E.sub.0[n] is the instantaneous BAND_0 energy for the current frame. If a WOLA filter-bank of the signal conditioning stage 130 is configured to be in an even stacking mode, the imaginary part of the BAND_0 (which would otherwise be 0 with any real input) is filled with a (real) Nyquist band value. Thus, in the Even Stacking mode E.sub.0[n] is rather calculated as:
E.sub.0[n]=real{X[n, 0]}.sup.2
(33) E.sub.0[n] is then low-pass filtered 214 prior to a decimation 216 to reduce aliasing. One of the simplest and most power efficient low-pass filters 214 that can be used is the first order exponential smoothing filter:
E.sub.0LPF[n]=α.sub.bd×E.sub.0LPF[n−1]+(1−α.sub.bd)×E.sub.0[n]
(34) where α.sub.bd is the smoothing coefficient and E.sub.0LPF[n] is the low-passed BAND_0 energy. Next, E.sub.0LPF[n] is decimated 216 by a factor of M producing E.sub.b[m] where m is the frame number at the decimated rate:
(35) F.sub.S/(R×M)
where R is the number of samples in each frame, n.
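To illustrate the smoothing-and-decimation step above, the following sketch applies the first order exponential smoothing filter and then keeps every M-th sample. The function and parameter names (and the default coefficient values) are illustrative, not taken from the disclosure:

```python
def smooth_and_decimate(e0, alpha_bd=0.9, m_factor=4):
    """Low-pass filter a sequence of instantaneous BAND_0 energies with a
    first order exponential smoothing filter, then decimate by M."""
    smoothed = []
    e_lpf = 0.0
    for e in e0:
        # E_0LPF[n] = alpha_bd * E_0LPF[n-1] + (1 - alpha_bd) * E_0[n]
        e_lpf = alpha_bd * e_lpf + (1.0 - alpha_bd) * e
        smoothed.append(e_lpf)
    # Keep every M-th smoothed value: E_b[m] = E_0LPF[m * M]
    return smoothed[::m_factor]
```

The low-pass filtering before the decimation limits aliasing, as described above; the decimated output is what the beat screening then operates on.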
At this decimated rate, screening for a potential beat is carried out every N.sub.b frames, where N.sub.b is the beat detection observation period length. The screening at the reduced (i.e., decimated) rate can save power consumption by reducing the number of samples to be processed within a given period. The screening can be done in several ways. One effective and computationally efficient method is using normalized autocorrelation 218. The autocorrelation coefficients can be determined as:
(36) α.sub.b[m, τ]=Σ.sub.i(E.sub.b[m−i]×E.sub.b[m−i−τ])/Σ.sub.i(E.sub.b[m−i]).sup.2, i=0, . . . , N.sub.b−1
(37) where τ is the delay amount at the decimated frame rate and α.sub.b[m, τ] is the normalized autocorrelation coefficients at decimated frame number m and delay value τ.
(38) A beat detection (BD) decision 220 is then made. To decide that a beat is present, α.sub.b[m, τ] is evaluated over a range of τ delays and a search is then done for the first sufficiently high local α.sub.b[m, τ] maximum according to an assigned threshold. The sufficiently high criterion can provide a strong enough correlation for the finding to be considered as a beat, in which case the associated delay value, τ, determines the beat period. If a local maximum is not found or if no local maximum is found to be sufficiently strong, the likelihood of a beat being present is considered low. While finding one instance that meets the criteria might be sufficient for beat detection, multiple findings with the same delay value over several N.sub.b intervals greatly enhance the likelihood. Once a beat is detected, the detection status flag BD[m.sub.bd] is set to 1 where m.sub.bd is the beat detection frame number at the
(39) F.sub.S/(R×M×N.sub.b)
rate. If a beat is not detected, the detection status flag BD[m.sub.bd] is set to 0. Determining the actual tempo value is not explicitly required for beat detection. However, if the tempo is required, the beat detection unit may include a tempo determination that uses a relationship between τ and the tempo in beats per minute as:
(40) Tempo [bpm]=60×F.sub.S/(τ×R×M)
(41) Since typical musical beats are between 40 and 200 bpm, α.sub.b[m, τ] needs to be evaluated over only the τ values that correspond to this range and thus, unnecessary calculations can be avoided to minimize the computations. Consequently, α.sub.b[m, τ] is evaluated only at integer intervals between:
(42) τ.sub.200=60×F.sub.S/(200×R×M) and τ.sub.40=60×F.sub.S/(40×R×M)
(43) The parameters R, α.sub.bd, N.sub.b, M, the filter-bank's bandwidth, and the filter-bank's sub-band filters' sharpness are all interrelated and independent values cannot be suggested. Nevertheless, the parameter value selection has a direct impact on the number of computations and the effectiveness of the algorithm. For example, higher N.sub.b values produce more accurate results. Low M values may not be sufficient to extract the beat signature and high M values may lead to measurement aliasing jeopardizing the beat detection. The choice of α.sub.bd is also linked to R, F.sub.S and the filter-bank characteristics and a misadjusted value may produce the same outcome as a misadjusted M.
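As a concrete sketch of the screening just described, the following simplified, illustrative implementation computes normalized autocorrelation coefficients over a candidate delay range and searches for the first sufficiently high local maximum. The normalization form and the threshold value are assumptions, since the disclosure does not give the exact formula:

```python
def normalized_autocorr(e, tau):
    """alpha_b[tau] = sum(e[i] * e[i - tau]) / sum(e[i]^2); an assumed
    energy normalization so coefficients fall roughly in [0, 1]."""
    num = sum(e[i] * e[i - tau] for i in range(tau, len(e)))
    den = sum(x * x for x in e) or 1.0
    return num / den

def detect_beat(e, tau_min, tau_max, threshold=0.5):
    """Return (beat_detected, beat_period): the first local maximum of the
    normalized autocorrelation over the delay range that exceeds the
    assigned threshold determines the beat period."""
    coeffs = [normalized_autocorr(e, t) for t in range(tau_min, tau_max + 1)]
    for j in range(1, len(coeffs) - 1):
        if (coeffs[j] >= coeffs[j - 1] and coeffs[j] >= coeffs[j + 1]
                and coeffs[j] > threshold):
            return True, tau_min + j
    return False, None
```

For example, a decimated energy sequence with a repeating pulse every five frames yields a detected beat period of five, while a constant (beat-free) sequence produces no qualifying local maximum.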
(45) In a possible implementation, the plurality of features extracted 222 (e.g., for the selected bands) may include an energy mean for the band. For example, a BAND_0 energy mean (E.sub.b,μ) may be computed as:
(46) E.sub.b,μ[m]=(1/N.sub.b)×Σ.sub.iE.sub.b[m−i], i=0, . . . , N.sub.b−1
(47) where N.sub.b is the observation period (e.g., the number of previous frames) and m is the current frame number.
(48) In a possible implementation, the plurality of features extracted 222 (e.g., for the selected bands) may include an energy standard deviation for the band. For example, a BAND_0 energy standard deviation (E.sub.b,σ) may be computed as:
(49) E.sub.b,σ[m]=√((1/N.sub.b)×Σ.sub.i(E.sub.b[m−i]−E.sub.b,μ[m]).sup.2), i=0, . . . , N.sub.b−1
(50) In a possible implementation, the plurality of features extracted 222 (e.g., for the selected bands) may include an energy maximum for the band. For example, a BAND_0 energy maximum (E.sub.b_max) may be computed as:
E.sub.b_max[m]=max(E.sub.b[m−i]), i=0, . . . , N.sub.b−1
(51) In a possible implementation, the plurality of features extracted 222 (e.g., for the selected bands) may include an energy kurtosis for the band. For example, a BAND_0 energy kurtosis (E.sub.b_k) may be computed as:
(52) E.sub.b_k[m]=(1/N.sub.b)×Σ.sub.i((E.sub.b[m−i]−E.sub.b,μ[m])/E.sub.b,σ[m]).sup.4, i=0, . . . , N.sub.b−1
(53) In a possible implementation, the plurality of features extracted 222 (e.g., for the selected bands) may include an energy skewness for the band. For example, a BAND_0 energy skewness (E.sub.b_s) may be computed as:
(54) E.sub.b_s[m]=(1/N.sub.b)×Σ.sub.i((E.sub.b[m−i]−E.sub.b,μ[m])/E.sub.b,σ[m]).sup.3, i=0, . . . , N.sub.b−1
(55) In a possible implementation, the plurality of features extracted 222 (e.g., for the selected bands) may include an energy cross-correlation vector for the band. For example, a BAND_0 energy cross-correlation vector (E.sub.b_xcor) may be computed as:
Ē.sub.b_xcor[m]=[α.sub.b[m, τ.sub.40], α.sub.b[m, τ.sub.40−1], . . . , α.sub.b[m, τ.sub.200+1], α.sub.b[m, τ.sub.200]]
(56) where τ is the correlation lag (i.e., delay). The delays in the cross-correlation vector may be computed as:
(57) τ.sub.bpm=60×F.sub.S/(bpm×R×M), evaluated at bpm=40, 41, . . . , 200
(58) While the present disclosure is not limited to the set of extracted features described above, in a possible implementation, these features may form a feature set that a BD neural network 225 can use to determine a beat. One advantage of the features in this feature set is that they do not require computationally intensive mathematical calculations, which conserves processing power. Additionally, the calculations share common elements (e.g., mean, standard deviation, etc.) so that the calculations of the shared common elements only need to be performed once for the feature set, thereby further conserving processing power.
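A minimal sketch of the moment-based portion of this feature set follows; the cross-correlation vector is omitted here, and the formulas are the standard moment definitions (which the disclosure does not spell out), with the mean and deviations computed once and reused per the shared-element point above:

```python
import math

def band_feature_set(e_window):
    """Compute mean, standard deviation, maximum, kurtosis, and skewness
    for one band's energy window, reusing the shared mean/deviation terms."""
    n = len(e_window)
    mean = sum(e_window) / n
    devs = [x - mean for x in e_window]      # shared by std, kurtosis, skew
    var = sum(d * d for d in devs) / n
    std = math.sqrt(var)
    e_max = max(e_window)
    if std == 0.0:
        kurt = skew = 0.0                    # degenerate (constant) window
    else:
        kurt = sum((d / std) ** 4 for d in devs) / n
        skew = sum((d / std) ** 3 for d in devs) / n
    return {"mean": mean, "std": std, "max": e_max,
            "kurtosis": kurt, "skewness": skew}
```

Each call costs a handful of passes over the window, so the per-band overhead stays small.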
(59) The BD neural network 225 can be implemented as a long short term memory (LSTM) neural network. In this implementation, the entire cross-correlation vector (i.e., Ē.sub.b_xcor[m]) may be used by the neural network to reach a BD decision. In another possible implementation, the BD neural network 225 can be implemented as a feed-forward neural network that uses a single max value of the cross-correlation vector, namely, E.sub.max_xcor[m], to reach a BD decision. The particular type of BD neural network implemented can be based on a balance between performance and power efficiency. For beat detection, the feed-forward neural network may show better performance and improved power efficiency.
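For illustration only, a feed-forward scorer of the kind mentioned could be sketched as below. The function name, layer shapes, and weight values are hypothetical; real weights would come from offline training on labeled audio:

```python
import math

def feedforward_bd(features, w_hidden, b_hidden, w_out, b_out):
    """Minimal feed-forward beat-detection scorer: one sigmoid hidden
    layer and a sigmoid output interpreted as a beat probability."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    # Hidden layer: one sigmoid unit per row of w_hidden.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, features)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    # Output layer: weighted sum of hidden activations, squashed to (0, 1).
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
```

The appeal of this form over an LSTM, as noted above, is that a single forward pass over a few small dot products is cheap enough for an always-on hearing-aid budget.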
(60) The tone detection unit 240 may first compute an instantaneous energy for each band k as:
E.sub.inst[n, k]=|X[n, k]|.sup.2
(61) Next, the band energy data is converted 512 to log2. While a high precision log2 operation can be used, if the operation is considered too expensive, one that would approximate the results within fractions of dB may be sufficient as long as the approximation is relatively linear in its error and monotonically increasing. One possible simplification is the straight-line approximation given as:
L=E+2 m.sub.r
(62) where E is the exponent of the input value and m.sub.r is the remainder. The approximation L can then be determined using a leading bit detector, 2 shift operations, and an add operation, instructions that are commonly found on most microprocessors. The log2 estimate of the instantaneous energy, called E.sub.inst_log[n, k], is then processed through a low-pass filter 514 to remove any adjacent bands' interferences and focus on the center band frequency in band k:
E.sub.pre_diff[n, k]=α.sub.pre×E.sub.pre_diff[n−1, k]+(1−α.sub.pre)×E.sub.inst_log[n,k]
(63) where α.sub.pre is the effective cut-off frequency coefficient and the resulting output is denoted by E.sub.pre_diff[n, k] or the pre-differentiation filter energy. Next a first order differentiation 516 takes place in the form of a single difference over the current and previous frames of R samples:
Δ.sub.mag[n, k]=E.sub.pre_diff[n, k]−E.sub.pre_diff[n−1, k]
(64) and the absolute value of Δ.sub.mag is taken. The resulting output |Δ.sub.mag[n, k]| is then passed through a smoothing filter 518 to obtain an averaged |Δ.sub.mag[n, k]| over multiple time frames:
Δ.sub.mag_avg[n,k]=α.sub.post×Δ.sub.mag_avg[n−1, k]+(1−α.sub.post)×|Δ.sub.mag[n, k]|
(65) where α.sub.post is the exponential smoothing coefficient and the resulting output Δ.sub.mag_avg[n, k] is a pseudo-variance measurement of the energy in band k and frame n in the log domain. Lastly, two conditions are checked to decide 520 (i.e., determine) whether tonality is present or not: Δ.sub.mag_avg[n, k] is checked against a threshold below which the signal is considered to have a low enough variance to be tonal and, E.sub.pre_diff[n, k] is checked against a threshold to verify the observed tonal component contains enough energy in the sub-band:
TN [n, k]=(Δ.sub.mag_avg[n, k]<Tonality.sub.Th[k]) && (E.sub.pre_diff[n, k]>SBMag.sub.Th[k])
(66) where TN[n, k] holds the tonality presence status in band k and frame n at any given time. In other words, the outputs TD_0, TD_1, . . . TD_N can correspond to the likelihood that a tone within the band is present.
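The per-band chain described above (pre-differentiation smoothing 514, first order differentiation 516, magnitude averaging 518, and the two-condition decision 520) can be sketched as a single per-frame state update. The state handling, function name, and threshold values here are illustrative, not from the disclosure:

```python
def tone_detect_step(e_inst_log, state, alpha_pre=0.9, alpha_post=0.95,
                     tonality_th=0.05, sbmag_th=1.0):
    """One per-band tone-detection update. `state` carries the previous
    (E_pre_diff, Delta_mag_avg) pair; returns (TN flag, new state)."""
    e_pre_prev, d_avg_prev = state
    # Pre-differentiation smoothing (514).
    e_pre = alpha_pre * e_pre_prev + (1.0 - alpha_pre) * e_inst_log
    # First order differentiation (516) and magnitude.
    d_mag = abs(e_pre - e_pre_prev)
    # Post smoothing (518): pseudo-variance of the log-domain energy.
    d_avg = alpha_post * d_avg_prev + (1.0 - alpha_post) * d_mag
    # Decision (520): low variance AND enough sub-band energy.
    tn = (d_avg < tonality_th) and (e_pre > sbmag_th)
    return tn, (e_pre, d_avg)
```

Feeding a steady log-energy makes the pseudo-variance decay toward zero so the tonal flag asserts, while a strongly modulated input keeps the variance high and the flag clear.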
(67) One common signal that is not music but contains some tonality, exhibits similar (to some types of music) temporal modulation characteristics, and possesses similar (to some types of music) spectrum shape to music is speech. Since it is difficult to robustly distinguish speech from music based on the modulation patterns and spectrum differences, the tonality level becomes the critical point of distinction. The threshold, Tonality.sub.Th[k], must therefore be carefully selected not to trigger on speech but rather only on music. Since the value of Tonality.sub.Th[k] depends on the pre- and post-differentiation filtering amounts, namely the values selected for α.sub.pre and α.sub.post, which themselves depend on F.sub.S and the chosen filter-bank characteristics, independent values cannot be suggested. However, the optimal threshold value can be obtained through optimizations on a large database for a selected set of parameter values. While SBMag.sub.Th[k] also depends on the selected α.sub.pre value, it is far less sensitive as its purpose is merely to make sure the discovered tonality is not too low in energy to be insignificant.
(68) The modulation activity tracking unit 270 may first compute an instantaneous wideband energy as a sum of the energies of the plurality of frequency bands:
E.sub.wb_inst[n]=Σ.sub.k=0.sup.N−1|X[n, k]|.sup.2
(69) where X[n, k] is the complex WOLA (i.e., sub-band) analysis data at frame n and band k. The wideband energy is then averaged over several frames by a smoothing filter 612:
E.sub.wb[n]=α.sub.w×E.sub.wb[n−1]+(1−α.sub.w)×E.sub.wb_inst[n]
(70) where α.sub.w is the smoothing exponential coefficient and E.sub.wb[n] is the averaged wideband energy. Beyond this step the modulation activity can be tracked to measure 614 a temporal modulation activity in different ways, some being more sophisticated while others being computationally more efficient. The simplest and perhaps the most computationally efficient method includes performing minimum and maximum tracking on the averaged wideband energy. For example, the global minimum value of the averaged energy could be captured every 5 seconds as the min estimate of the energy, and the global maximum value of the averaged energy could be captured every 20 ms as the max estimate of the energy. Then, at the end of every 20 ms, the relative divergence between the min and max trackers is calculated and stored:
(71) r[m.sub.mod]=Max[m.sub.mod]/Min[m.sub.mod]
(72) where m.sub.mod is the frame number at the 20 ms interval rate, Max[m.sub.mod] is the current estimate of the wideband energy's maximum value, Min[m.sub.mod] is the current (last updated) estimate of the wideband energy's minimum value, and, r[m.sub.mod] is the divergence ratio. Next the divergence ratio is compared against a threshold to determine a modulation pattern 616:
LM[m.sub.mod]=(r[m.sub.mod]<Divergence.sub.th)
(73) The divergence value can take a wide range. A low-medium to high value indicates an event that could be music, speech, or noise. Since the variance of a pure tone's wideband energy is distinctly low, an extremely low divergence value indicates either a pure tone (of any loudness level) or a non-pure-tone signal at a level so low that it is, in all likelihood, not anything desirable. The distinctions between speech and music and between noise and music are made through tonality measurements (by the tonality detection unit) and the beat presence status (by the beat detector unit); the modulation pattern, or divergence value, does not add much in that regard. However, pure tones cannot be distinguished from music through tonality measurements and, when present, can satisfy the tonality condition for music; moreover, an absence of a beat detection does not necessarily mean a no-music condition. There is therefore an explicit need for an independent pure-tone detector. As discussed, since the divergence value can be a good indicator of whether a pure tone is present, the modulation pattern tracking unit is used exclusively as a pure-tone detector to distinguish pure tones from music when tonality is determined to be present by the tone detection unit 240. Consequently, Divergence.sub.th is set to a value small enough that below it only a pure tone or an extremely low level signal (that is of no interest) can exist, and LM[m.sub.mod], the low modulation status flag, effectively becomes a "pure-tone" or "not-music" status flag to the rest of the system. The output (MA) of the modulation activity tracking unit 270 corresponds to a modulation activity level and can be used to inhibit a classification of a tone as music.
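The modulation-tracking steps above (energy summation, exponential smoothing, min/max tracking, and the divergence test) can be sketched in Python. This is an illustrative sketch only: the exact divergence formula, the use of a single shared tracking window (the disclosure uses a separate 5-second min interval and 20 ms max interval), and all parameter values are assumptions, not values from the disclosure.

```python
import numpy as np

def wideband_energy(X_frame):
    """E_wb_inst[n]: sum of |X[n, k]|^2 over the sub-bands of one frame."""
    return float(np.sum(np.abs(X_frame) ** 2))

class ModulationActivityTracker:
    """Sketch of the modulation activity tracking unit used as a
    pure-tone detector. Parameter values are illustrative guesses."""

    def __init__(self, alpha_w=0.98, divergence_th=0.05, window=40):
        self.alpha_w = alpha_w              # smoothing coefficient alpha_w
        self.divergence_th = divergence_th  # Divergence_th (assumed value)
        self.window = window                # frames per decision interval
        self.e_wb = 0.0                     # averaged wideband energy E_wb[n]
        self.history = []

    def update(self, X_frame):
        """Feed one frame of complex sub-band data. Returns the LM
        (low-modulation) flag at the end of each interval, else None."""
        e_inst = wideband_energy(X_frame)
        # E_wb[n] = alpha_w * E_wb[n-1] + (1 - alpha_w) * E_wb_inst[n]
        self.e_wb = self.alpha_w * self.e_wb + (1 - self.alpha_w) * e_inst
        self.history.append(self.e_wb)
        if len(self.history) < self.window:
            return None
        e_max, e_min = max(self.history), min(self.history)
        self.history.clear()
        # Relative divergence between the max and min trackers (assumed form).
        r = (e_max - e_min) / max(e_max, 1e-12)
        # LM flag: True => pure tone or extremely low level signal.
        return r < self.divergence_th
```

Fed a constant-amplitude (pure-tone-like) frame stream, the smoothed energy settles and the divergence falls below threshold, raising the LM flag; a stream whose level alternates between loud and quiet blocks keeps the divergence high and the flag low.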
(75) The combination and music detection unit 300 may operate on asynchronously arriving inputs from the detection units (e.g., beat detection 210, tone detection 240, and modulation activity tracking 270), as they operate on different internal decision making (i.e., determination) intervals. The combination and music detection unit 300 also operates in an extremely computationally efficient form while maintaining accuracy. At a high level, several criteria must be satisfied for music to be detected: for example, a strong beat or a strong tone must be present in the signal, and the tone must not be a pure tone or an extremely low level signal.
(76) Since the decisions come in at different rates, the base update rate is set to the shortest interval in the system, which is the rate at which the tonality detection unit 240 operates, i.e., every R samples (the frames n). The feature scores (i.e., decisions) are weighted and combined into a music score (i.e., score) as follows:
(77) At every frame n:
B[n]=BD [m.sub.bd]
M[n]=LM[m.sub.mod]
(78) where B[n] is updated with the latest beat detection status and M[n] is updated with the latest modulation pattern status. Then, at every N.sub.MD interval:
(79)
Score[m.sub.MD]=Σ.sub.n=0.sup.N.sup.MD.sup.−1(β.sub.B×B[n]+Σ.sub.kβ.sub.Tk×TN[n,k]+β.sub.M×M[n])
MusicDetected=(Score[m.sub.MD]>Score.sub.th)
where N.sub.MD is the music detection interval length in frames, β.sub.B is the weight factor associated with beat detection, β.sub.Tk is the weight factor associated with tonality detection, and β.sub.M is the weight factor associated with pure-tone detection. The β weight factors can be determined using training and/or field use and are typically set at the factory. The values of the β weight factors may depend on several factors that are described below.
(80) First, the values of the β weight factors may depend on an event's significance. For example, a single tonality hit may not be as significant an event as a single beat detection event.
(81) Second, the values of the β weight factors may depend on the detection unit's internal tuning and overall confidence level. It is generally advantageous to allow some small percentage of failure at the lower level decision making stages and let long-term averaging correct for some of it. This avoids setting very restrictive thresholds at the low levels, which in turn increases the overall sensitivity of the algorithm. The higher the specificity of the detection unit (i.e., a lower misclassification rate), the more significant the decision should be considered, and therefore a higher weight value must be chosen. Conversely, the lower the specificity of the detection unit (i.e., a higher misclassification rate), the less conclusive the decision should be considered, and therefore a lower weight value must be chosen.
(82) Third, the values of the β weight factors may depend on the internal update rate of the detection unit compared to the base update rate. Even though B[n], TN[n, k], and M[n] are all combined at every frame n, B[n] and M[n] hold the same status pattern for many consecutive frames because the beat detector and the modulation activity tracking units update their flags at a decimated rate. For example, if BD[m.sub.bd] runs on an update interval period of 20 ms and the base frame period is 0.5 ms, every one actual BD[m.sub.bd] beat detection event produces 40 consecutive frames of beat detection events in B[n]. Thus, the weight factors must account for the multi-rate nature of the updates. In the example above, if the intended weight factor for a beat detection event has been decided to be 2, then β.sub.B should be assigned to
(83)
2/40=0.05
to take into account the repeating pattern.
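The rate normalization above amounts to dividing the intended event weight by the number of base frames over which the flag repeats. A minimal sketch, using the 20 ms / 0.5 ms example figures from the paragraph above (the function name is hypothetical):

```python
def effective_weight(intended_weight, unit_interval_ms, base_frame_ms):
    """Per-frame weight for a flag that repeats across base frames.

    A decision flag updated every unit_interval_ms but combined at a
    base_frame_ms frame rate repeats for (unit_interval_ms /
    base_frame_ms) consecutive frames, so the per-frame weight is the
    intended event weight divided by that repeat count.
    """
    repeat = unit_interval_ms / base_frame_ms
    return intended_weight / repeat

# Beat detector: 20 ms update interval, 0.5 ms base frame period,
# intended event weight of 2 => beta_B = 2 / 40 = 0.05.
beta_b = effective_weight(2.0, 20.0, 0.5)
```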
(84) Fourth, the values of the β weight factors may depend on the correlation relationship of the detection unit's decision to music. A positive β weight factor is used for detection units that support the presence of music, and a negative β weight factor is used for the ones that reject the presence of music. Therefore, the weight factors β.sub.B and β.sub.Tk hold positive weights, whereas β.sub.M holds a negated weight value.
(85) Fifth, the values of the β weight factors may depend on the architecture of the algorithm. Since M[n] must be incorporated into the summation node as an AND operation rather than an OR operation, a significantly higher weight magnitude may be chosen for β.sub.M so that it nullifies the outputs of B[n] and TN[n, k] and acts as an AND operation.
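The weighted combination described above can be sketched as follows. The β values and the score threshold are illustrative assumptions, not values from the disclosure; the large negative pure-tone weight shows the AND-like veto behavior of the fifth factor.

```python
def music_score(B, TN, M, beta_b=0.05, beta_t=0.1, beta_m=-10.0):
    """Weighted combination of per-frame feature scores over one
    detection interval.

    B, M  -- sequences of 0/1 flags (beat detected, low-modulation flag)
    TN    -- sequence of per-band 0/1 tonality flags per frame
    The beta values are illustrative; beta_m is negated so that a
    pure-tone flag vetoes beat and tone contributions (AND-like).
    """
    score = 0.0
    for b, tn, m in zip(B, TN, M):
        score += beta_b * b + beta_t * sum(tn) + beta_m * m
    return score

def music_detected(B, TN, M, score_th=1.0):
    """Threshold the combined score (score_th is an assumed value)."""
    return music_score(B, TN, M) > score_th
```

With beats and tones present and no pure-tone flag, the score accumulates well above threshold; a persistent pure-tone flag drives it strongly negative, suppressing the detection regardless of the other units.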
(86) Even in the presence of music, not every music detection period may necessarily detect music. Thus it may be desired to accumulate several periods of music detection decisions prior to declaring a music classification, to avoid potential fluttering of the music detection state. It may also be desired to remain in the music state longer if the system has been in the music state for a long time. Both objectives can be achieved very efficiently with the help of a music status tracking counter:
(87) if MusicDetected
MusicDetectedCounter=MusicDetectedCounter+1;
(88) else
MusicDetectedCounter=MusicDetectedCounter−1;
(89) end
MusicDetectedCounter=max(0, MusicDetectedCounter)
MusicDetectedCounter=min(MAX_MUSIC_DETECTED_COUNT, MusicDetectedCounter)
(90) where MAX_MUSIC_DETECTED_COUNT is the value at which MusicDetectedCounter is capped. A threshold is then assigned to MusicDetectedCounter beyond which music classification is declared:
MusicClassification=(MusicDetectedCounter≥MusicDetectedCounter.sub.th)
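The counter logic above can be sketched as a small state tracker. The cap and threshold values here are illustrative assumptions, not values from the disclosure:

```python
class MusicStateTracker:
    """Counter-based hysteresis over per-interval detection decisions."""

    def __init__(self, cap=100, threshold=20):
        self.cap = cap              # MAX_MUSIC_DETECTED_COUNT
        self.threshold = threshold  # MusicDetectedCounter_th
        self.counter = 0            # MusicDetectedCounter

    def update(self, detected):
        """Feed one MusicDetected decision; return MusicClassification."""
        # Increment on detection, decrement otherwise, then clamp.
        self.counter += 1 if detected else -1
        self.counter = min(self.cap, max(0, self.counter))
        return self.counter >= self.threshold
```

Because the counter keeps climbing (up to the cap) while music persists, a long stretch in the music state takes correspondingly longer to drain, which realizes the stay-longer objective described above.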
(91) In a second possible implementation of the combination and detection unit 300 of the music classifier 140, the weight application and combination process can be replaced by a neural network.
(92) The output of the music classifier 140 may be used in different ways, and the usage depends entirely on the application. A fairly common outcome of a music classification state is retuning of parameters in the system to better suit a music environment. For example, in a hearing aid, when music is detected, an existing noise reduction may be disabled or tuned down to avoid any potential unwanted artifacts to music. In another example, while music is detected, a feedback canceller does not react to observed tonality in the input in the same way that it would when music is not detected (i.e., when the observed tonality is due to feedback). In some implementations, the output of the music classifier 140 (i.e., MUSIC/NO-MUSIC) can be shared with other classifiers and/or stages in the audio device to help the other classifiers and/or stages perform one or more functions.
(95) The method begins by receiving 910 an audio signal (e.g., by a microphone). The receiving may include digitizing the audio signal to create a digital audio stream. The receiving may also include dividing the digital audio stream into frames and buffering the frames for processing.
(96) The method further includes obtaining 920 sub-band (i.e. band) information corresponding to the audio signal. Obtaining the band information may include (in some implementations) applying a weighted overlap-add (WOLA) filter-bank to the audio signal.
(97) The method further includes applying 930 the band information to one or more decision making units. The decision making units may include a beat detection (BD) unit that is configured to determine the presence or absence of a beat in the audio signal. The decision making units may also include a tone detection (TD) unit (i.e. tonality detection unit) that is configured to determine the presence or absence of one or more tones in the audio signal. The decision making units may also include a modulation activity (MA) tracking unit that is configured to determine the level (i.e., degree) of modulation in the audio signal.
(98) The method further includes combining 940 the results (i.e., the status, the state) of each of the one or more decision making units. The combining may include applying a weight to each output of the one or more decision making units and then summing the weighted values to obtain a music score. The combination can be understood as similar to computing a node in a neural network. Accordingly, in some (more complex) implementations the combining 940 may include applying the output of the one or more decision making units to a neural network (e.g., a deep neural network, a feed forward neural network).
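The neural-network view of the combining step can be illustrated with a single sigmoid node: the weighted sum described above is exactly such a node without the nonlinearity. The weights and bias below are illustrative, not trained values.

```python
import math

def nn_combine(features, weights, bias):
    """One sigmoid node combining decision-unit outputs into a score
    in (0, 1). A deep network would stack layers of such nodes."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# features: [beat flag, tone-band count, pure-tone flag]
p_music = nn_combine([1, 2, 0], [2.0, 1.5, -8.0], -3.0)  # beats and tones
p_tone = nn_combine([0, 2, 1], [2.0, 1.5, -8.0], -3.0)   # pure tone present
```

As with the fixed-weight combination, the strongly negative pure-tone weight pushes the node's output toward zero whenever that flag is set.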
(99) The method further includes determining 950 music (or no-music) in the audio signal from the combined results of the decision making units. The determining may include accumulating music scores from frames (e.g., for a time period, for a number of frames) and then averaging the music scores. The determining may also include comparing the accumulated and averaged music score to a threshold. For example, when the accumulated and averaged music score is above the threshold, music is considered present in the audio signal, and when it is below the threshold, music is considered absent from the audio signal. The determining may also include applying hysteresis control to the threshold comparison so that a previous music/no-music state influences the determination of the present state, to prevent the music/no-music state from fluttering back and forth.
(100) The method further includes modifying 960 the audio based on the determination of music or no-music. The modifying may include adjusting a noise reduction so that music levels are not reduced as if they were noise. The modifying may also include disabling a feedback canceller so that tones in the music are not cancelled as if they were feedback. The modifying may also include increasing a pass band for the audio signal so that the music is not filtered.
(101) The method further includes transmitting 970 the modified audio signal. The transmitting may include converting a digital audio signal to an analog audio signal using a D/A converter. The transmitting may also include coupling the audio signal to a speaker.
(102) In the specification and/or figures, typical embodiments have been disclosed. The present disclosure is not limited to such exemplary embodiments. The use of the term “and/or” includes any and all combinations of one or more of the associated listed items. The figures are schematic representations and so are not necessarily drawn to scale. Unless otherwise noted, specific terms have been used in a generic and descriptive sense and not for purposes of limitation.
(103) The disclosure describes a plurality of possible detection features and combination methods for a robust and power efficient music classification. For example, the disclosure describes a neural network based beat detector that can use a plurality of possible features extracted from a selection of (decimated) frequency band information. When specific math is disclosed (e.g., a variance calculation for a tonality measurement), it may be described as inexpensive (i.e., efficient) from a processing power (e.g., cycles, energy) standpoint. While these aspects and others have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the scope of the implementations. It should be understood that they have been presented by way of example only, not limitation, and various changes in form and details may be made. Any portion of the apparatus and/or methods described herein may be combined in any combination, except mutually exclusive combinations. The implementations described herein can include various combinations and/or sub-combinations of the functions, components, and/or features of the different implementations described.