System and a method for sound recognition
11581009 · 2023-02-14
Assignee
Inventors
- Alex Boudreau (Sherbrooke, CA)
- Michel Pearson (Québec, CA)
- Louis-Alexis Boudreault (Québec, CA)
- Shean De Montigny-Desautel (Québec, CA)
CPC classification
G10L21/02
PHYSICS
International classification
G10L25/18
PHYSICS
Abstract
A method for automatic sound recognition, comprising a) raw spectrogram generation from a sound signal spectrum; b) wide-band spectrum determination; c) wide-band continuous spectrum determination; d) tonal and time-transient spectrum determination; e) wide-band continuous spectrogram and tonal and time-transient spectrogram determination; and f) spectrogram image generation.
Claims
1. A method for automatic identification of a sound event, comprising: a) recording of a sound signal spectrum of the sound event in a time frame of interest; b) raw spectrogram generation from the sound signal spectrum in the time frame of interest; c) wide-band spectrum determination and wide-band continuous spectrum determination; determining characteristics of the sound event based on frequency tones and temporal transitions by: d) tonal and time-transient spectrum determination and e) wide-band continuous spectrogram and tonal and time-transient spectrogram determination; and f) spectrogram image generation using a tonal and time-transient spectrogram and wide-band continuous spectrogram obtained in d) and e) and combining the wide-band spectrum and the tonal and time-transient spectrum into spectrogram image frames comprising image features of the sound event; and g) identification of the sound event from the image features supplied in the spectrogram image frames generated in f) by image recognition, and returning the identification of the sound event.
2. The method of claim 1, wherein step b) comprises splitting the sound signal into filtered time signals using a fractional octave filter bank, yielding a filtered time-signal per frequency band; step c) comprises using a wide-band spectral envelope and using an exponential percentile estimator applied on the wide-band spectrum; step d) comprises subtracting the wide-band continuous spectrum from the sound signal spectrum; and step e) comprises using the tonal and time-transient spectrogram and the wide-band continuous spectrogram.
3. The method of claim 2, wherein step b) comprises using a frequency-adapted band filter time response.
4. The method of claim 2, wherein step c) comprises selecting using a cubic spline minimizing the following relation:
5. The method of claim 2, wherein step b) comprises using a frequency-adapted band filter time response, and step c) comprises selecting using a cubic spline minimizing the following relation:
6. The method of claim 2, wherein step c) comprises selecting a frequency-adapted time constant for each frequency band signal.
7. The method of claim 2, wherein step c) comprises selecting a frequency-adapted time constant for each frequency band signal, the time constant being selected to be shorter at high frequency and longer at low frequency.
8. The method of claim 2, wherein step c) comprises selecting a frequency-adapted time constant for each frequency band signal as follows:
9. The method of claim 2, wherein step c) comprises using an asymmetrical weight exponential average as a percentile estimator, expressed as follows:
α=e^(−1/(F_s·τ)), where F_s is a sampling frequency in Hertz and τ is a time constant, in seconds, selected with respect to the value x[n] of input sample n as a frequency-adapted time constant for each frequency band signal.
10. The method of claim 2, wherein step c) comprises using an asymmetrical weight exponential average as a percentile estimator, expressed as follows:
α=e^(−1/(F_s·τ)), where F_s is a sampling frequency in Hertz and τ is a time constant, in seconds, selected with respect to the value x[n] of input sample n as a frequency-adapted time constant for each frequency band signal as follows:
11. The method of claim 2, wherein step c) comprises selecting a spectral envelope by using a cubic spline minimizing the following relation:
α=e^(−1/(F_s·τ)), where F_s is a sampling frequency in Hertz and τ is a time constant in seconds selected with respect to the value x[n] of input sample n as a frequency-adapted time constant for each frequency band signal.
12. The method of claim 2, wherein step d) comprises shifting the wide-band continuous spectrum and subtracting the shifted wide-band continuous spectrum from the raw spectrum.
13. The method of claim 2, wherein step c) comprises selecting a spectral envelope by using a cubic spline minimizing the following relation:
α=e^(−1/(F_s·τ)), where F_s is a sampling frequency in Hertz and τ is a time constant in seconds selected with respect to the value x[n] of input sample n as a frequency-adapted time constant for each frequency band signal; and step d) comprises subtracting the wide-band continuous spectrum from the raw spectrum.
14. The method of claim 2, wherein step e) comprises accumulating the wide-band continuous spectrum into the wide-band continuous spectrogram and accumulating the tonal and time-transient spectrum into the tonal and time-transient spectrogram.
15. The method of claim 2, wherein step f) comprises combining the wide-band continuous spectrogram and the tonal and time-transient spectrogram into spectrogram image frames.
16. The method of claim 2, wherein step f) comprises using a first channel to store the wide-band continuous spectrogram and a second channel to store the tonal and time-transient spectrogram.
17. The method of claim 2, wherein step f) comprises selecting a first dynamic range for generating tonal and time-transient spectrogram images, and a second dynamic range for generating wide-band continuous spectrogram images.
18. The method of claim 2, wherein step c) comprises selecting a spectral envelope by using a cubic spline minimizing the following relation:
α=e^(−1/(F_s·τ)), where F_s is a sampling frequency in Hertz and τ is a time constant in seconds selected with respect to the value x[n] of input sample n as a frequency-adapted time constant for each frequency band signal; step d) comprises subtracting the wide-band continuous spectrum from the raw spectrum; and step e) comprises accumulating the wide-band continuous spectrum into the wide-band continuous spectrogram and accumulating the tonal and time-transient spectrum into the tonal and time-transient spectrogram.
19. The method of claim 2, wherein step c) comprises selecting a spectral envelope by using a cubic spline minimizing the following relation:
α=e^(−1/(F_s·τ)), where F_s is a sampling frequency in Hertz and τ is a time constant in seconds selected with respect to the value x[n] of input sample n as a frequency-adapted time constant for each frequency band signal; step d) comprises subtracting the wide-band continuous spectrum from the raw spectrum; step e) comprises accumulating the wide-band continuous spectrum into the wide-band continuous spectrogram and accumulating the tonal and time-transient spectrum into the tonal and time-transient spectrogram; and step f) comprises combining the wide-band continuous spectrogram and the tonal and time-transient spectrogram into spectrogram image frames.
20. A method for identification of a sound event, comprising a) recording of a sound signal spectrum of the sound event in a time frame of interest; b) raw spectrogram generation from the sound signal spectrum in the time frame of interest; c) wide-band spectrum determination; and wide-band continuous spectrum determination; d) tonal and time-transient spectrum determination; e) wide-band continuous spectrogram and tonal and time-transient spectrogram determination; and f) spectrogram image generation; and g) identification of the sound event from images generated in f); wherein step b) comprises using a fractional octave filter bank using a frequency-adapted band filter time response, yielding a filtered time-signal per frequency band; step c) comprises using a wide-band spectral envelope and applying an exponential percentile estimator on the wide-band spectrum; step d) comprises subtracting the wide-band continuous spectrum from the raw sound signal spectrum; step e) comprises accumulating the wide-band continuous spectrum into the wide-band continuous spectrogram and accumulating the tonal and time-transient spectrum into the tonal and time-transient spectrogram; said steps d) and e) determining characteristics of the sound event based on frequency tones and temporal transitions; step f) comprises combining the wide-band continuous spectrogram and the tonal and time-transient spectrogram into spectrogram image frames comprising image features of the sound event; and said step g) comprises the identification of the sound event from the spectrogram image frames generated in f) by image recognition, and returning the identification of the sound event.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) In the appended drawings:
DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
(24) The present invention is illustrated in further detail by the following non-limiting examples.
(25) A method according to an embodiment of an aspect of the present disclosure as illustrated for example in
(26) Audio signals recorded by field sound recorders may be transmitted to a web server for processing as described hereinabove, generating images to an artificial intelligence which returns the identification of the sound event. Alternatively, a self-contained system, such as a sound level meter equipped with an on-board processing unit performing the above steps, may be used.
(27) The time signals of the audio records are spectrally processed using a fractional octave filter bank, using a band filter time response adapted to the frequency, namely faster at high frequency and slower at low frequency. The signal is thus decomposed into N fractional-octave subbands, an octave band being a frequency band where the highest frequency is twice the lowest frequency (step 30).
(28) The obtained logarithmic repartition of the spectrum frequencies results in a fine frequency resolution at low frequency and a broader resolution at high frequency, and a logarithmic bandwidth with respect to frequency, which balances the energy content between the low and high frequency ranges.
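The logarithmic band repartition described above can be sketched as follows. This is an illustrative sketch only, assuming the common base-2 band-edge convention; the function name and frequency range are hypothetical, not taken from the patent:

```python
import math

def fractional_octave_bands(f_min, f_max, fraction):
    """Center and cutoff frequencies of a base-2 fractional-octave bank.

    Each band spans 1/fraction of an octave: f_h = f_l * 2**(1/fraction),
    so bandwidth grows in proportion to center frequency, giving the
    logarithmic repartition described above.
    """
    bands = []
    fc = f_min
    half = 2.0 ** (1.0 / (2.0 * fraction))
    while fc <= f_max:
        bands.append((fc / half, fc, fc * half))  # (f_l, f_c, f_h)
        fc *= 2.0 ** (1.0 / fraction)
    return bands

# Full octave bands (fraction = 1): upper cutoff is twice the lower cutoff.
for fl, fc, fh in fractional_octave_bands(125.0, 1000.0, 1):
    print(f"f_l={fl:7.1f}  f_c={fc:7.1f}  f_h={fh:7.1f}  f_h/f_l={fh / fl:.3f}")
```

The printout shows the bandwidth (f_h − f_l) doubling from one band to the next while the ratio f_h/f_l stays constant, which is the balance of energy content between low and high frequency ranges that the text describes.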
(29) The original audio signal is thus split into N filtered time signals, N being the number of frequency bands. The energetic content of the band-filtered time signals is determined using an exponential average (step 40), as follows:
(30) y[n]=α·y[n−1]+(1−α)·x[n]
(31) with y[n] an average result at sample n; x[n] the value of input sample n; and α an average weight determined as follows:
α=e^(−1/(F_s·τ))
(32) with F_s the sampling frequency in Hertz, and τ a time constant in seconds. A frequency-adapted time constant τ is selected for each frequency band signal, as follows:
(33)
(34) with F.sub.h an octave fraction filter upper cutoff frequency in Hertz, F.sub.l an octave fraction filter lower cutoff frequency in Hertz, and F.sub.c an octave fraction filter center frequency in Hertz.
(35) The time constant τ is thus longer at low frequency and shorter at high frequency. For instance for a 1/24 octave band filter centered on 50 Hz the time constant is 0.4 s, whereas for a 1/24 octave band filter centered on 5000 Hz the time constant is 0.0018 s.
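The exponential average with its frequency-adapted weight can be sketched as below. The recursion form y[n]=α·y[n−1]+(1−α)·x[n] is assumed from the weight definition above (the patent's equation image is not reproduced here), and the demo values are illustrative:

```python
import math

def exp_average(x, fs, tau):
    """One-pole exponential average with weight a = exp(-1/(fs*tau)),
    assuming the standard recursion y[n] = a*y[n-1] + (1-a)*x[n]."""
    a = math.exp(-1.0 / (fs * tau))
    y, out = 0.0, []
    for v in x:
        y = a * y + (1.0 - a) * v
        out.append(y)
    return out

# Step response: the average reaches 1 - 1/e (~63%) after tau seconds,
# so a long tau (low frequency band) responds slowly, a short tau quickly.
y = exp_average([1.0] * 500, fs=1000.0, tau=0.1)
print(round(y[99], 3))  # 100 samples = 0.1 s = one time constant -> 0.632
```

A band centered on 50 Hz (τ = 0.4 s in the example above) would thus need 0.4 s to reach 63% of a step change, while a band centered on 5000 Hz (τ = 0.0018 s) follows changes almost immediately.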
(36) Then, the characteristics of the recorded sound event are determined, based on frequency tones, that is frequency peaks in the spectrum, and temporal transitions, that is peaks or sharp transitions in time. A whistle or a bird call are examples of sound events with strong tonal features, while a door slam or a gunshot are examples of sound events with strong temporal transients. The method comprises monitoring the tonal emergences and the temporal emergences of a sound with respect to the wide-band continuous background noise.
(37) The wide-band continuous spectrum of the background noise is determined using a wide-band spectral envelope (step 50) and an exponential percentile estimator applied on the thus determined wide-band spectrum (step 60).
(38) A spectral envelope fitting the lower boundary of the spectral properties of the raw spectrum of the sound event in time is selected as a representation of the general shape of the spectrum tones. The spectral envelope is determined using a cubic spline by weighting frequency dips more than frequency peaks in the spectrum curve, thereby allowing identification of the wide-band component of the spectrum.
(39) The cubic spline is determined by minimizing the following relation:
(40) p·Σ_j w_j·(y_j−f(x_j))²+(1−p)·∫|f″(t)|² dt
(41) where p is a spline balance or ratio between fit and smoothness, controlling the trade-off between fidelity to the data and roughness of the function estimate; w is a weight between 0 and 1 for every value of y; and f is a spline relation.
(42) The wide-band envelope spline curve is determined using a first, very smooth, spline curve representing mostly the center of the spectrum, and a second spline curve focusing on the local minima of the spectrum for representing the wide-band background noise. The first curve is defined using a unitary weight w_1=1 for all points and a low spline balance, for example p_1=0.0001; the second curve is defined using a unitary weight w_2=1 for all points lying below the first spline curve, a very low weight, such as w_3=0.00001 for example, for every point lying above the first spline curve, and a higher spline balance p_2>p_1, for example p_2=0.001. The values of the spline weights and spline balances are selected depending on the nature of the sound spectrum used as input and the target fitting.
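The two-pass weighting logic can be sketched roughly as follows, using scipy's `UnivariateSpline` as a stand-in; note that its smoothing factor `s` is not the same parameterization as the spline balance p of the relation above, and all numeric values here are illustrative assumptions:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def wideband_envelope(freqs, spectrum_db, low_weight=1e-5):
    """Two-pass weighted spline fit of the wide-band floor (sketch).

    Pass 1: a heavily smoothed spline through all points tracks the
    center of the spectrum.  Pass 2: points above that first curve are
    given a near-zero weight, so the second fit is pulled toward the
    frequency dips, i.e. toward the wide-band background floor.
    """
    x = np.log2(freqs)  # work on a log-frequency axis
    s1 = UnivariateSpline(x, spectrum_db, w=np.ones_like(x), s=len(x) * 50.0)
    center = s1(x)
    w2 = np.where(spectrum_db <= center, 1.0, low_weight)
    s2 = UnivariateSpline(x, spectrum_db, w=w2, s=len(x) * 5.0)
    return s2(x)
```

On a spectrum made of a smooth floor plus a few tonal peaks, the returned curve tracks the floor and passes well below the peaks, which is the behavior the two-curve description above calls for.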
(43) In an embodiment of an aspect of the present disclosure, the percentiles are obtained using an asymmetrical weight exponential average as a percentile estimator, expressed as follows:
(44) y[n]=α·y[n−1]+(1−α)·x[n]
(45) where y[n] is the average result at sample n; x[n] is the value of input sample n; and α is an average weight, determined as follows:
α=e^(−1/(F_s·τ))
(46) where F_s is the sampling frequency in Hertz and τ is the time constant in seconds. The value of the time constant τ is selected with respect to the current input value x[n]. A first time constant τ_H is selected if the current input value is greater than or equal to the previous average and a second time constant τ_L is selected if the current input value is lower than the previous average, as follows:
(47) τ=τ_H if x[n]≥y[n−1]; τ=τ_L if x[n]<y[n−1]
(48) Values of τ_H and τ_L are determined according to the desired percentile p between 0 and 1 and the apparent window duration T in seconds as follows:
τ_H=p²×T
τ_L=(1−p)²×T.
(49) For instance, for a desired percentile p of 95% (L95) with a 10 s apparent window duration, τ_H=9.03 s and τ_L=0.025 s.
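The asymmetrical percentile estimator can be sketched as below. The recursion form y[n]=α·y[n−1]+(1−α)·x[n] is assumed from the weight definition above, and the signal and sampling values are illustrative, not from the patent:

```python
import math
import random

def percentile_estimator(x, fs, p, window_s):
    """Exponential percentile estimator with asymmetrical weights:
    tau_H = p^2 * T when the input is at or above the running value,
    tau_L = (1-p)^2 * T when it is below (T = apparent window)."""
    tau_h = p ** 2 * window_s
    tau_l = (1.0 - p) ** 2 * window_s
    a_h = math.exp(-1.0 / (fs * tau_h))
    a_l = math.exp(-1.0 / (fs * tau_l))
    y, out = x[0], []
    for v in x:
        a = a_h if v >= y else a_l
        y = a * y + (1.0 - a) * v
        out.append(y)
    return out

# With p = 0.95 the estimate rises slowly (long tau_H) and falls quickly
# (short tau_L), so it hugs the lower envelope of the signal -- the
# continuous background level exceeded most of the time.
random.seed(1)
noise = [random.random() for _ in range(40000)]
track = percentile_estimator(noise, fs=100.0, p=0.95, window_s=10.0)
```

On uniform noise with mean 0.5, the tracked value settles far below the mean, near the lower envelope, which matches its use here as a wide-band continuous background noise estimator.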
(51) The thus obtained wide-band continuous spectrum is accumulated into a wide-band continuous spectrogram (step 70).
(53) The temporal transients associated with the sound events are identified using the time-continuous background noise determined using the exponential percentile estimator. The identification of tonal and time-transient features is performed by comparing the current spectrum to the wide-band continuous background noise spectrum (step 60). As part of the present disclosure, it was shown that a wide-band continuous signal such as a pink noise shows a small but significant tonal and time variance, especially when the observation interval is short, in the range between about 10 ms and about 50 ms. This residual tonal and time variance implies a tonal and time emergence from the wide-band continuous background noise of approximately 10 dB. In the present method, any spectrum feature that emerges more than 10 dB from the wide-band continuous background noise spectrum is considered a tonal peak or a time transient. Thus, the spectrum of tonal and time-transient emergences is obtained by subtracting the wide-band continuous background noise spectrum, shifted up by 10 dB, from the raw spectrum (steps 65, 80 in
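The emergence computation just described amounts to a shifted subtraction; a minimal sketch (the clipping of negative values to zero is an assumption, since only positive emergences are meaningful here):

```python
import numpy as np

def tonal_transient_spectrum(raw_db, background_db, emergence_db=10.0):
    """Keep only features emerging more than `emergence_db` above the
    wide-band continuous background: subtract the background shifted up
    by 10 dB and clip negative values to zero."""
    return np.maximum(raw_db - (background_db + emergence_db), 0.0)

# Features at +0, +5, +15 and +25 dB above a flat background:
raw = np.array([0.0, 5.0, 15.0, 25.0])
print(tonal_transient_spectrum(raw, np.zeros(4)))  # emergences: 0, 0, 5, 15
```

Only the features that exceed the 10 dB threshold survive, so the residual tonal and time variance of the continuous background is suppressed.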
(54) The thus obtained tonal and time-transient spectrum is accumulated into a tonal and time-transient spectrogram (step 90).
(55) The tonal and time-transient spectrogram shows the features of sound events such as a bird call, human speech, a car pass-by, a door slam, etc. In an embodiment of an aspect of the present disclosure, the tonal and time-transient spectrogram image is generated using a 10 dB dynamic range on the raw spectrum, from 0 dB to +10 dB for example, thereby clipping strong emergences of more than 10 dB, which allows imprinting an almost binary spectrogram enhancing the contours of the tonal and time-transient features of the spectrogram. The result is an almost white fingerprint on a black background. The specific value of the desired dynamic range may be different from the 10 dB value used herein; the value of 10 dB was determined arbitrarily to produce images with contrasting image features.
(56) The wide-band continuous spectrogram allows identification of sound events in the absence of tonal or time-transient features, such as in the case of wind blowing or a distant highway for example. Although not characterized by tonal or temporal features, such types of sound events are identified by the shape of the wide-band continuous background noise. When generating the wide-band continuous background noise spectrogram image by normalizing the wide-band continuous background noise energy to the raw spectrogram with a dynamic range of 40 dB, the wide-band continuous spectrogram image is essentially black in cases of strong tonal and time-transient emergences, because it is below the 40 dB dynamic range. In cases of low or absent tonal and time-transient emergences, the wide-band continuous spectrogram image value is higher, and appears brighter. The specific value of the desired dynamic range can be different from the 40 dB value used herein. The value of 40 dB was determined arbitrarily to allow a good balance between the discrimination of the wide-band continuous spectrogram when tonal and time-transient features are present and a good representation of the wide-band continuous spectrogram when such features are absent.
(58) The obtained tonal and time-transient spectrogram and wide-band continuous spectrogram, instead of the raw spectrogram, are used for the spectrogram image generation (step 100), by generating spectrogram images composed of a short interval series of spectra, with intervals in the range between about 10 ms and about 50 ms (step 110).
(59) In step 110, the wide-band continuous spectrogram and the tonal and time-transient spectrogram are then combined into spectrogram image frames. The images are analyzed using two channels. A first channel, for example green, is used to store the wide-band continuous spectrogram and a second channel, for example blue, to store the tonal and time-transient spectrogram. The use of these colors is arbitrary and does not have an impact on the end result. Red and green may be selected for example, with the same result, as illustrated in
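The two-channel combination can be sketched as follows. The mapping of dB values onto the unit interval of each image channel is an illustrative normalization assumed here, not specified by the patent; only the 40 dB and 10 dB dynamic ranges come from the text:

```python
import numpy as np

def spectrogram_image(wideband_db, tonal_db, wb_dynamic=40.0, tt_dynamic=10.0):
    """Pack both spectrograms into one image frame: one channel holds the
    wide-band continuous spectrogram over a 40 dB dynamic range, another
    the tonal and time-transient spectrogram clipped to 10 dB, so strong
    emergences saturate to an almost binary fingerprint."""
    wb = np.clip(wideband_db / wb_dynamic + 1.0, 0.0, 1.0)  # -40..0 dB -> 0..1
    tt = np.clip(tonal_db / tt_dynamic, 0.0, 1.0)           #   0..10 dB -> 0..1
    img = np.zeros(wideband_db.shape + (3,))
    img[..., 0] = wb  # first channel (color choice is arbitrary)
    img[..., 1] = tt  # second channel
    return img
```

An image recognition model then sees the wide-band floor and the tonal/transient fingerprint as separate channels of a single frame, as the text describes.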
(60) As people in the art will now be in a position to appreciate, the present method overcomes shortcomings inherent to the short time Fourier transform (STFT) in spectral analysis by using an octave fraction filter bank. The energetic content of each band-filtered signal is determined from the root mean square (RMS) average by selecting a window duration adapted to the band frequency, shorter at high frequency and longer at low frequency (step 40), thereby preventing discontinuities in the time series while remaining effective from a computational point of view. This contrasts with using a window duration selected on the basis of the interval at which the signal is to be sampled: in the latter case, a 50 ms window RMS average, for instance, is processed every 50 ms to obtain a time series, which fails to take into account the period of the signal under analysis and may thus result in a variance problem, since a 50 ms window on a 100 Hz signal contains only 5 signal periods whereas the same window duration contains 500 periods when analyzing a 10 kHz signal; as a result, the low frequency RMS time history does not present the same variance as the high frequency RMS time history. The spectral envelope describing the general shape of the spectrum tones is selected to describe the lower boundary of the spectral properties of the original spectrum, thereby allowing identification of the wide-band component of the spectrum, or spectrum floor (steps 50, 60;
(61) In the present method, spectrogram images composed of a short interval series of spectra, with intervals in the range between about 10 ms and about 50 ms, are generated using only the tonal and time-transient and the wide-band continuous spectrograms.
(62) For combining the wide-band continuous and the tonal and time-transient spectrogram images, in the present method a first channel is used to store the wide-band continuous spectrogram and a second channel is used to store the tonal and time-transient spectrogram for analysis of the images. This contrasts with methods comprising analyzing images separately on their three constituent channels, namely red, green and blue (RGB) or hue, saturation and value (HSV), and using these three channels to store different aspects of the spectrogram to analyze. The two-channel combination is relevant for example in cases of sound events, such as wind blowing or a distant highway, which are not characterized by any tones or time-transients, and for which the tonal and time-transient spectrogram image is almost black while the wide-band continuous spectrogram image is bright and becomes significant to determine the nature of the sound.
(63) There is thus provided a method for automatic sound recognition, comprising using a fractional octave band spectrum for spectrogram generation; using a wide-band spectral envelope to determine the wide-band background spectrum; using an exponential percentile estimator on the wide-band spectrum to determine the wide-band continuous background spectrum; subtracting the wide-band continuous spectrum from the raw spectrum to obtain the tonal and time-transient spectrum; and combining the wide-band continuous spectrogram image and tonal and time-transient spectrogram image to be used in an image recognition algorithm.
(64) The use of a fractional octave-band filter bank to generate the sound spectrum results in a logarithmic repartition of frequencies and overcomes inherent problems of the short time Fourier transform (STFT). This logarithmic mapping allows a fine frequency resolution at low frequency and a broad resolution at high frequency. The obtained logarithmic bandwidth with respect to frequency allows balancing the spectrum energy between low and high frequencies and a time response adapted to the frequency band, namely slow at low frequency and fast at high frequency.
(65) The use of a frequency-adapted exponential average allows overcoming variance issues associated with a fixed duration average while still offering a fast computation time.
(66) The combined use of a wide-band spectral envelope and an exponential percentile estimator allows accurately characterizing the wide-band continuous background noise spectrum, which in turn allows accurately identifying the tonal and time-transient spectrum, which is determinant in the identification of sound events.
(67) The combination of the wide-band continuous spectrogram image and the tonal and time-transient spectrogram image in a single image results in high value data to the image classification algorithm. The tonal and time-transient spectrogram image provides a fingerprint of the dominant features of a sound event; and the wide-band continuous spectrogram image supplies relevant information for sound events that do not contain any tonal or time-transient features. The dynamic properties of both spectrogram images allow discrimination between wide-band continuous events and tonal and time-transient events. The spectrogram image processing used to generate both spectrogram images minimizes non-relevant information contained in the raw spectrogram image that may otherwise slow down or interfere with efficiency and accuracy of the image classification algorithm.
(68) The background noise is thus removed from the spectrogram image to enhance the contrast of the sound events and the spectrogram image value is improved by a selected combination and sequence of signal processing steps. The presently disclosed spectrogram image processing allows selective identification of complex sound events which are harder to identify.
(69) The scope of the claims should not be limited by the embodiments set forth in the examples, but should be given the broadest interpretation consistent with the description as a whole.