METHOD AND DEVICE FOR SPEECH/MUSIC CLASSIFICATION AND CORE ENCODER SELECTION IN A SOUND CODEC
20230215448 · 2023-07-06
Cpc classification
G10L19/20
PHYSICS
G10L15/02
PHYSICS
G10L19/12
PHYSICS
G10L25/18
PHYSICS
G10L19/008
PHYSICS
G10L19/22
PHYSICS
International classification
G10L19/22
PHYSICS
G10L15/02
PHYSICS
G10L19/12
PHYSICS
G10L19/008
PHYSICS
G10L25/18
PHYSICS
Abstract
Two-stage speech/music classification device and method classify an input sound signal and select a core encoder for encoding the sound signal. A first stage classifies the input sound signal into one of a number of final classes. A second stage extracts high-level features of the input sound signal and selects the core encoder for encoding the input sound signal in response to the extracted high-level features and the final class selected in the first stage.
Claims
1-136. (canceled)
137. A two-stage speech/music classification device for classifying an input sound signal and for selecting a core encoder for encoding the sound signal, comprising: at least one processor; and a memory coupled to the processor and storing non-transitory instructions that when executed cause the processor to implement: a first stage for classifying the input sound signal into one of a number of final classes; and a second stage for extracting high-level features of the input sound signal and for selecting the core encoder for encoding the input sound signal in response to the extracted high-level features and the final class selected in the first stage.
138. The two-stage speech/music classification device according to claim 137, wherein the first stage comprises a detector of onset/attack in the input sound signal based on relative frame energy.
139. The two-stage speech/music classification device according to claim 138, wherein the detector of onset/attack updates in every frame a cumulative sum of differences between a relative energy of the input sound signal in a current frame and a relative energy of the input sound signal in a previous frame if the relative energy of the input sound signal in the current frame is larger than the relative energy of the input sound signal in the previous frame.
140. The two-stage speech/music classification device according to claim 139, wherein the detector of onset/attack outputs a binary flag set to a first value if the cumulative sum is located within a given range to indicate detection of an onset/attack and, otherwise, is set to a second value to indicate no detection of onset/attack.
141. The two-stage speech/music classification device according to claim 137, wherein the first stage comprises an extractor of features of the input sound signal, and an outlier detector for detecting outlier features based on histograms of the extracted features.
142. The two-stage speech/music classification device according to claim 141, wherein the outlier detector calculates, for each feature, a lower bound and an upper bound, compares a value of the feature with the lower and upper bounds, and marks the feature whose value is lying outside a range defined between the lower and upper bounds as an outlier feature.
143. The two-stage speech/music classification device according to claim 141, wherein the outlier detector calculates the lower and upper bounds using a normalized version of the histogram of the feature, an index of a frequency bin containing a maximum value of the histogram for the feature, and a threshold.
144. The two-stage speech/music classification device according to claim 141, wherein the outlier detector determines a vector of the features as an outlier based on a number of detected outlier features, and wherein the outlier detector, instead of discarding the outlier vector, replaces the outlier features in the vector with feature values obtained from at least one previous frame.
145. The two-stage speech/music classification device according to claim 137, wherein the first stage comprises a non-linear feature vector transformer to transform non-normal features extracted from the input sound signal into features with a normal shape.
146. The two-stage speech/music classification device according to claim 145, wherein the non-linear feature vector transformer uses Box-Cox transformation to transform non-normal features into features with a normal shape, wherein the Box-Cox transformation performed by the non-linear feature vector transformer uses a power transform with an exponent, wherein different values of the exponent define different Box-Cox transformation curves, wherein the non-linear feature vector transformer selects a value of the exponent for the Box-Cox transformation based on a normality test, and wherein the Box-Cox transformation performed by the non-linear feature vector transformer uses a bias to ensure that all input values of the extracted features are positive.
147. The two-stage speech/music classification device according to claim 146, wherein the normality test produces a skew and kurtosis measure, and wherein the non-linear feature vector transformer applies the Box-Cox transformation only to features satisfying a condition related to the skew and kurtosis measure.
148. The two-stage speech/music classification device according to claim 137, wherein the first stage comprises an analyzer of principal components to reduce sound signal feature dimensionality and increase sound signal class discriminability, wherein the analyzer of principal components performs an orthogonal transformation to convert a set of possibly correlated features extracted from the input sound signal into a set of linearly uncorrelated variables forming the principal components.
149. The two-stage speech/music classification device according to claim 137, wherein the first stage comprises a Gaussian Mixture Model (GMM) calculator to determine a first score proportional to a probability that a given vector of features extracted from the input sound signal was generated by a speech GMM, and a second score proportional to a probability that the given vector of features was generated by a music GMM, wherein the GMM calculator combines the first and second scores by calculating a difference between these first and second scores to produce a differential score.
150. The two-stage speech/music classification device according to claim 149, wherein a negative differential score is indicative that the input sound signal is speech and a positive differential score is indicative that the input sound signal is music.
151. The two-stage speech/music classification device according to claim 149, wherein the GMM calculator uses a decision bias in the calculation of the difference between the first and second scores.
152. The two-stage speech/music classification device according to claim 151, wherein the GMM calculator predicts, in active frames of a training database, labels indicative that the input sound signal is a speech, music or noise signal, and wherein the GMM calculator uses the labels to find the decision bias.
153. The two-stage speech/music classification device according to claim 137, wherein the number of final classes comprises a first final class related to speech, a second final class related to music, and a third final class related to speech with background music.
154. The two-stage speech/music classification device according to claim 149, wherein the first stage comprises a state-dependent categorical classifier of the input sound signal into one of three final classes including SPEECH/NOISE, MUSIC and UNCLEAR, wherein the final class UNCLEAR is related to speech with background music.
155. The two-stage speech/music classification device according to claim 154, wherein when, in a current frame, the input sound signal is in an ENTRY state as determined by a state machine, the state-dependent categorical classifier selects one of the three final classes SPEECH/NOISE, MUSIC and UNCLEAR based on a weighted average of the differential scores calculated in frames in the ENTRY state preceding the current frame.
156. The two-stage speech/music classification device according to claim 155, wherein, in states of the input sound signal other than ENTRY as determined by the state machine, the state-dependent categorical classifier selects the final class SPEECH/NOISE, MUSIC or UNCLEAR based on a smoothed version of the differential score and the final class SPEECH/NOISE, MUSIC or UNCLEAR selected in the previous frame.
157. The two-stage speech/music classification device according to claim 154, wherein the state-dependent categorical classifier first initializes the final class in the current frame to the class SPEECH/NOISE, MUSIC or UNCLEAR set in a previous frame.
158. The two-stage speech/music classification device according to claim 156, wherein the state-dependent categorical classifier first initializes the final class in the current frame to the class SPEECH/NOISE, MUSIC or UNCLEAR set in the previous frame, and wherein, in the current frame, the state-dependent categorical classifier transitions from the final class SPEECH/NOISE, MUSIC or UNCLEAR set in the previous frame to another one of the final classes in response to the smoothed differential score crossing a given threshold.
159. The two-stage speech/music classification device according to claim 158, wherein the state-dependent categorical classifier transitions from the final class SPEECH/NOISE set in the previous frame to the final class UNCLEAR if a counter of ACTIVE frames is lower than a first threshold, a cumulative sum of differential frame energy is equal to zero, and the smoothed differential score is larger than a second threshold.
160. The two-stage speech/music classification device according to claim 154, wherein the state-dependent categorical classifier transitions from the final class SPEECH/NOISE set in a previous frame to the final class UNCLEAR if a short pitch flag which is a by-product of an open-loop pitch analysis of the input sound signal is equal to a given value, and a smoothed version of the differential score is larger than a given threshold.
161. The two-stage speech/music classification device according to claim 137, wherein the second stage comprises an extractor of additional high-level features of the input sound signal in a current frame, wherein the additional high-level features comprise a tonality of the input sound signal.
162. The two-stage speech/music classification device according to claim 154, wherein the second stage comprises an extractor of additional high-level features of the input sound signal in a current frame, wherein the additional high-level features comprise features selected from the group consisting of: (a) tonality of the input sound signal; (b) long-term stability of the input sound signal, wherein the extractor of additional high-level features produces a flag indicative of long-term stability of the input sound signal; (c) segmental attack in the input sound signal, wherein the extractor of additional high-level features produces an indicator of (a) a position of segmental attack in a current frame of the input sound signal or (b) absence of segmental attack; and (d) a spectral peak-to-average ratio forming a measure of spectral sharpness of the input sound signal calculated from a power spectrum of the input sound signal.
163. The two-stage speech/music classification device according to claim 137, wherein the second stage comprises a core encoder initial selector for conducting initial selection of the core encoder using (a) a relative frame energy, (b) the final class in which the input sound signal is classified by the first stage, and (c) the extracted high level features.
164. The two-stage speech/music classification device according to claim 149, wherein the second stage comprises a core encoder initial selector for conducting initial selection of the core encoder in response to the extracted high-level features and the final class selected in the first stage, and a refiner of the initial core encoder selection if a GSC core encoder is initially selected by the core encoder initial selector.
165. A two-stage speech/music classification device for classifying an input sound signal and for selecting a core encoder for encoding the sound signal, comprising: at least one processor; and a memory coupled to the processor and storing non-transitory instructions that when executed cause the processor to: classify, in a first stage, the input sound signal into one of a number of final classes; and in a second stage, extract high-level features of the input sound signal and select the core encoder for encoding the input sound signal in response to the extracted high-level features and the final class selected in the first stage.
166. A two-stage speech/music classification method for classifying an input sound signal and for selecting a core encoder for encoding the sound signal, comprising: in a first stage, classifying the input sound signal into one of a number of final classes; and in a second stage, extracting high-level features of the input sound signal and selecting the core encoder for encoding the input sound signal in response to the extracted high-level features and the final class selected in the first stage.
167. The two-stage speech/music classification method according to claim 166, comprising, in the first stage, detecting onset/attack in the input sound signal based on relative frame energy.
168. The two-stage speech/music classification method according to claim 166, comprising, in the first stage, extracting features of the input sound signal selected from the group consisting of: (a) an open-loop pitch feature; (b) a voicing measure feature; (c) a feature related to line spectral frequencies from LP analysis; (d) a feature related to residual energy from the LP analysis; (e) a short-term correlation map feature; (f) a non-stationarity feature; (g) a mel-frequency cepstral coefficients feature; (h) a power spectrum difference feature; and (i) a spectral stationarity feature.
169. The two-stage speech/music classification method according to claim 166, comprising, in the first stage, extracting features of the input sound signal, and detecting outlier features based on histograms of the extracted features.
170. The two-stage speech/music classification method according to claim 169, wherein detecting outlier features comprises calculating, for each feature, a lower bound and an upper bound, comparing a value of the feature with the lower and upper bounds, and marking the feature whose value is lying outside a range defined between the lower and upper bounds as an outlier feature.
171. The two-stage speech/music classification method according to claim 169, wherein detecting outlier features comprises determining a vector of the features as an outlier based on a number of detected outlier features and, instead of discarding the outlier vector, replacing the outlier features in the vector with feature values obtained from at least one previous frame.
172. The two-stage speech/music classification method according to claim 166, comprising, in the first stage, non-linear transformation of non-normal features extracted from the input sound signal into features with a normal shape.
173. The two-stage speech/music classification method according to claim 172, wherein the non-linear transformation comprises using Box-Cox transformation to transform non-normal features into features with a normal shape, wherein the Box-Cox transformation comprises using a power transform with an exponent, wherein different values of the exponent define different Box-Cox transformation curves, and selecting a value of the exponent for the Box-Cox transformation based on a normality test, and wherein the Box-Cox transformation comprises using a bias to ensure that all input values of the extracted features are positive.
174. The two-stage speech/music classification method according to claim 173, wherein the normality test produces a skew and kurtosis measure, and wherein the Box-Cox transformation is applied only to features satisfying a condition related to the skew and kurtosis measure.
175. The two-stage speech/music classification method according to claim 166, comprising, in the first stage, analyzing principal components to reduce sound signal feature dimensionality and increase sound signal class discriminability, wherein analyzing principal components comprises an orthogonal transformation to convert a set of possibly correlated features extracted from the input sound signal into a set of linearly uncorrelated variables forming the principal components.
176. The two-stage speech/music classification method according to claim 166, comprising, in the first stage, a Gaussian Mixture Model (GMM) calculation to determine a first score proportional to a probability that a given vector of features extracted from the input sound signal was generated by a speech GMM, and a second score proportional to a probability that the given vector of features was generated by a music GMM, wherein the GMM calculation comprises combining the first and second scores by calculating a difference between these first and second scores to produce a differential score.
177. The two-stage speech/music classification method according to claim 176, wherein a negative differential score is indicative that the input sound signal is speech and a positive differential score is indicative that the input sound signal is music.
178. The two-stage speech/music classification method according to claim 176, wherein the GMM calculation comprises using a decision bias in the calculation of the difference between the first and second scores.
179. The two-stage speech/music classification method according to claim 178, wherein the GMM calculation predicts, in active frames of a training database, labels indicative that the input sound signal is a speech, music or noise signal, and wherein the GMM calculation comprises using the labels to find the decision bias.
180. The two-stage speech/music classification method according to claim 166, wherein the number of final classes comprises a first final class related to speech, a second final class related to music, and a third final class related to speech with background music.
181. The two-stage speech/music classification method according to claim 176, comprising, in the first stage, a state-dependent categorical classification of the input sound signal into one of three final classes including SPEECH/NOISE, MUSIC and UNCLEAR, wherein the final class UNCLEAR is related to speech with background music.
182. The two-stage speech/music classification method according to claim 181, wherein when, in a current frame, the input sound signal is in an ENTRY state as determined by a state machine, the state-dependent categorical classification comprises selecting one of the three final classes SPEECH/NOISE, MUSIC and UNCLEAR based on a weighted average of the differential scores calculated in frames in the ENTRY state preceding the current frame.
183. The two-stage speech/music classification method according to claim 182, wherein, in states of the input sound signal other than ENTRY as determined by the state machine, the state-dependent categorical classification comprises selecting the final class SPEECH/NOISE, MUSIC or UNCLEAR based on a smoothed version of the differential score and the final class SPEECH/NOISE, MUSIC or UNCLEAR selected in the previous frame.
184. The two-stage speech/music classification method according to claim 181, wherein the state-dependent categorical classification comprises first initializing the final class in the current frame to the class SPEECH/NOISE, MUSIC or UNCLEAR set in a previous frame.
185. The two-stage speech/music classification method according to claim 166, comprising, in the second stage, extracting additional high-level features of the input sound signal in a current frame, wherein the additional high-level features comprise a tonality of the input sound signal.
186. The two-stage speech/music classification method according to claim 181, comprising, in the second stage, extracting additional high-level features of the input sound signal in a current frame, wherein the additional high-level features comprise at least one of the following features: (a) tonality of the input sound signal; (b) long-term stability of the input sound signal, wherein extracting additional high-level features comprises producing a flag indicative of long-term stability of the input sound signal; (c) segmental attack in the input sound signal, wherein extracting additional high-level features comprises producing an indicator of (a) a position of segmental attack in a current frame of the input sound signal or (b) absence of segmental attack; and (d) a spectral peak-to-average ratio forming a measure of spectral sharpness of the input sound signal, wherein extracting additional high-level features comprises calculating the spectral peak-to-average ratio from a power spectrum of the input sound signal.
187. The two-stage speech/music classification method according to claim 186, wherein (a) extracting the tonality of the input sound signal comprises expressing the tonality by a tonality flag reflecting both spectral stability and harmonicity in a lower frequency range of the input sound signal up to a given frequency, (b) extracting the tonality flag comprises (i) calculating the tonality flag using a correlation map forming a measure of signal stability and harmonicity in a number of first frequency bins, in the lower frequency range, of a residual energy spectrum of the input sound signal and calculated in segments of the residual energy spectrum where peaks are present, (ii) applying smoothing of the correlation map and calculating a weighted sum of the correlation map across the frequency bins within the lower frequency range of the input sound signal in the current frame to yield a single number, and (iii) setting the tonality flag by comparing the single number to an adaptive threshold, and (c) the two-stage speech/music classification method comprises, in the second stage, an initial selection of the core encoder using the following conditions: (a) if a relative frame energy is higher than a first value, the spectral peak-to-average ratio is higher than a second value, and the single number is higher than the adaptive threshold, a TCX core encoder is initially selected; (b) if condition (a) is not present and the final class in which the input sound signal is classified by the first stage is SPEECH/NOISE, an ACELP core encoder is initially selected; (c) if conditions (a) and (b) are not present and the final class in which the input sound signal is classified by the first stage is UNCLEAR, a GSC core encoder is initially selected; and (d) if conditions (a), (b) and (c) are not present, a TCX core encoder is initially selected.
188. The two-stage speech/music classification method according to claim 176, comprising, in the second stage, an initial selection of the core encoder in response to the extracted high-level features and the final class selected in the first stage, and refining the initial core encoder selection if a GSC core encoder is initially selected by the core encoder initial selection.
189. The two-stage speech/music classification method according to claim 188, wherein refining the initial core encoder selection comprises changing an initial selection of a GSC core encoder to a selection of an ACELP core encoder if (a) a ratio of an energy in a number of first frequency bins of a signal segment and a total energy of this signal segment is lower than a first value and (b) a short-term mean of the differential score is higher than a second value.
190. The two-stage speech/music classification method according to claim 188, wherein refining the initial core encoder selection comprises changing, for an input sound signal with short and stable pitch period, an initial selection of a GSC core encoder to (a) a selection of an ACELP core encoder if a smoothed version of the differential score is lower than a given value or (b) a selection of a TCX core encoder if the smoothed differential score is larger than or equal to the given value.
191. The two-stage speech/music classification method according to claim 188, wherein refining the initial core encoder selection comprises changing an initial selection of a GSC core encoder to (a) a selection of a TCX core encoder in response to long-term stability of the input sound signal and (b) an open-loop pitch larger than a given value.
192. The two-stage speech/music classification method according to claim 188, wherein refining the initial core encoder selection comprises changing, if a segmental attack is detected in the input sound signal, an initial selection of a GSC core encoder to a selection of an ACELP core encoder provided that an indicator that a change of selection of core encoder is enabled has a first value, and a transition frame counter has a second value.
193. The two-stage speech/music classification method according to claim 188, wherein refining the initial core encoder selection comprises changing, if a segmental attack is detected in the input sound signal, an initial selection of a GSC core encoder to a selection of an ACELP core encoder provided that an indicator that a change of selection of core encoder is enabled has a first value, a transition frame counter has not a second value, and an indicator identifying a segment corresponding to a position of the attack in the current frame is larger than a third value.
Description
DETAILED DESCRIPTION
[0037] In recent years, 3GPP (3rd Generation Partnership Project) started working on developing a 3D (Three-Dimensional) sound codec for immersive services called IVAS (Immersive Voice and Audio Services), based on the EVS codec (See Reference [5] of which the full content is incorporated herein by reference).
[0038] The present disclosure describes a speech/music classification technique and a core encoder selection technique in an IVAS coding framework. Both techniques are part of a two-stage speech/music classification method of which the result is core encoder selection.
[0039] Although the speech/music classification method and device are based on that in EVS (See Reference [6] and Reference [1], Section 5.1.13.6, of which the full content is incorporated herein by reference), several improvements and developments have been implemented. Also, the two-stage speech/music classification method and device are described in the present disclosure, by way of example only, with reference to an IVAS coding framework referred to throughout this disclosure as IVAS codec (or IVAS sound codec). However, it is within the scope of the present disclosure to incorporate such a two-stage speech/music classification method and device in any other sound codec.
[0041] The stereo sound processing and communication system 100 of
[0042] Still referring to
[0043] The left 103 and right 123 channels of the original analog stereo sound signal are supplied to an analog-to-digital (A/D) converter 104 for converting them into left 105 and right 125 channels of an original digital stereo sound signal. The left 105 and right 125 channels of the original digital stereo sound signal may also be recorded and supplied from a storage device (not shown).
[0044] A stereo sound encoder 106 codes the left 105 and right 125 channels of the original digital stereo sound signal, thereby producing a set of coding parameters that are multiplexed in the form of a bit-stream 107 delivered to an optional error-correcting encoder 108. The optional error-correcting encoder 108, when present, adds redundancy to the binary representation of the coding parameters in the bit-stream 107 before transmitting the resulting bit-stream 111 over the communication link 101.
[0045] On the receiver side, an optional error-correcting decoder 109 utilizes the above-mentioned redundant information in the received bit-stream 111 to detect and correct errors that may have occurred during transmission over the communication link 101, producing a bit-stream 112 with received coding parameters. A stereo sound decoder 110 converts the received coding parameters in the bit-stream 112 to create synthesized left 113 and right 133 channels of the digital stereo sound signal. The left 113 and right 133 channels of the digital stereo sound signal reconstructed in the stereo sound decoder 110 are converted to synthesized left 114 and right 134 channels of the analog stereo sound signal in a digital-to-analog (D/A) converter 115.
[0046] The synthesized left 114 and right 134 channels of the analog stereo sound signal are respectively played back in a pair of loudspeaker units, or binaural headphones, 116 and 136. Alternatively, the left 113 and right 133 channels of the digital stereo sound signal from the stereo sound decoder 110 may also be supplied to and recorded in a storage device (not shown).
[0047] For example, the stereo sound encoder 106 of
1. Two-Stage Speech/Music Classification
[0048] As indicated in the foregoing description, the present disclosure describes a speech/music classification technique and a core encoder selection technique in an IVAS coding framework. Both techniques are part of the two-stage speech/music classification method (and corresponding device) the result of which is the selection of a core encoder for coding a primary (dominant) channel (in case of Time Domain (TD) stereo coding) or a down-mixed mono channel (in case of Frequency Domain (FD) stereo coding). The basis for the development of the present technology is the speech/music classification in the EVS codec (Reference [1]). The present disclosure describes modifications and improvements that were implemented therein and that are part of a baseline IVAS codec framework.
[0049] The first stage of the speech/music classification method and device in the IVAS codec is based on a Gaussian Mixture Model (GMM). The initial model, taken from the EVS codec, has been extended, improved and optimized for the processing of stereo signals.
[0050] In summary:
[0051] The GMM model takes feature vectors as input and provides probabilistic estimates for three classes including speech, music and background noise.
[0052] The parameters of the GMM model are trained on a large collection of manually labelled vectors of features of the sound signal.
[0053] The GMM model provides probabilistic estimates for each of the three classes in every frame, for example a 20-ms frame. Sound signal processing frames, including sub-frames, are well known to those of ordinary skill in the art, but further information about such frames can be found, for example, in Reference ,
[0054] An outlier detection logic ensures proper processing of frames where one or more features of the sound signal do not fulfil the condition of normal distribution.
[0055] Individual probabilities are turned into a single, unbounded score by means of logistic regression.
[0056] The two-stage speech/music classification device has its own state machine which is used to partition the incoming signal into one of four states.
[0057] Adaptive smoothing is applied on the output score depending on the current state of the two-stage speech/music classification method and device.
[0058] Fast reaction of the two-stage speech/music classification method and device in rapidly varying content is achieved with an onset/attack detection logic based on relative frame energy.
[0059] The smoothed score is used to perform a selection among the following three categories of signal type: pure speech, pure music, speech with music.
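By way of illustration only, two of the items summarized above can be sketched in Python: the differential score of claims 149 to 151 (difference between the speech and music GMM scores, with a decision bias) and the onset/attack flag of claims 139 and 140 (cumulative sum of rising relative frame energy). The sketch rests on several assumptions that are not taken from the disclosure: diagonal covariances, log-likelihoods as the "scores", the sign convention music minus speech (chosen so that a negative score indicates speech as in claim 150), a reset of the cumulative sum when the energy does not rise, and the hypothetical range limits `low` and `high`. All function and parameter names are likewise hypothetical.

```python
import math


def diag_gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM (assumed form)."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_terms.append(log_p)
    # Log-sum-exp over the mixture components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def differential_score(x, speech_gmm, music_gmm, bias=0.0):
    """Music score minus speech score plus a decision bias.

    Negative values point to speech, positive values to music (claim 150).
    Each GMM is a (weights, means, variances) tuple; this layout is an assumption.
    """
    return (diag_gmm_log_likelihood(x, *music_gmm)
            - diag_gmm_log_likelihood(x, *speech_gmm)
            + bias)


class OnsetDetector:
    """Cumulative sum of rising relative frame energy (claims 139 and 140)."""

    def __init__(self, low=5.0, high=30.0):
        self.low = low      # hypothetical lower limit of the detection range
        self.high = high    # hypothetical upper limit of the detection range
        self.prev = None    # relative energy of the previous frame
        self.cum_sum = 0.0

    def update(self, rel_energy):
        """Process one frame; return True when an onset/attack is flagged."""
        diff = 0.0 if self.prev is None else rel_energy - self.prev
        if diff > 0.0:
            self.cum_sum += diff
        else:
            self.cum_sum = 0.0  # assumption: reset when the energy does not rise
        self.prev = rel_energy
        return self.low < self.cum_sum < self.high


# Toy single-component models in a two-dimensional feature space.
speech_gmm = ([1.0], [[0.0, 0.0]], [[1.0, 1.0]])
music_gmm = ([1.0], [[3.0, 3.0]], [[1.0, 1.0]])
score_near_speech = differential_score([0.0, 0.0], speech_gmm, music_gmm)
score_near_music = differential_score([3.0, 3.0], speech_gmm, music_gmm)

detector = OnsetDetector()
onset_flags = [detector.update(e) for e in [0.0, 0.0, 3.0, 6.0, 2.0]]
```

With these toy models, a vector at the speech mean gives a negative score and a vector at the music mean gives a positive one. The onset flag is raised only in the frame where the accumulated energy rise falls inside the assumed range.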
[0060]
[0061] Referring to
[0072] The core encoder selection technique (second stage of the two-stage speech/music classification device and method) in the IVAS codec is built on top of the first stage of the two-stage speech/music classification device and method and delivers a final output to perform selection of the core encoder from ACELP (Algebraic Code-Excited Linear Prediction), TCX (Transform-Coded eXcitation) and GSC (Generic audio Signal Coder), as described in Reference [7], of which the full content is incorporated herein by reference. Other suitable core encoders can also be implemented within the scope of the present disclosure.
[0073] In summary:
[0074] The selected core encoder is then applied to encode the primary (dominant) channel (in case of TD stereo coding) or the down-mixed mono channel (in case of FD stereo coding).
[0075] The core encoder selection uses additional high-level features calculated over a window which is generally longer than a window used in the first stage of the two-stage speech/music classification device and method.
[0076] The core encoder selection uses its own attack/onset detection logic optimized for achieving seamless switching. The output of this attack/onset detector is different from the output of the attack/onset detector of the first stage.
[0077] The core encoder is initially selected based on the output of the state-dependent categorical classifier 210 of the first stage. Such selection is then refined by examining additional high-level features and the output of the onset/attack detector of this second stage.
[0078]
[0079] Referring to
2. First Stage of the Two-Stage Speech/Music Classification Device and Method
[0083] First, it should be mentioned that the GMM model is trained using an Expectation-Maximization (EM) algorithm on a large, manually labeled database of training samples. The database contains the mono items used in the EVS codec and some additional stereo items. The total size of the mono training database is approximately 650 MB. The original mono files are converted to corresponding dual mono variants before being used as inputs to the IVAS codec. The total size of the additional stereo training database is approximately 700 MB. The additional stereo database contains real recordings of speech signals from simulated conversations, samples of music downloaded from open sources on the internet and some artificially created items. The artificially created stereo items are obtained by convolving mono speech samples with pairs of real Binaural Room Impulse Responses (BRIRs). These impulse responses correspond to some typical room configurations, e.g. small office, seminar room, auditorium, etc. The labels for the training items are created semi-automatically using the Voice Activity Detection (VAD) information extracted from the IVAS codec; this is not optimal but frame-wise manual labeling is impossible given the size of the database.
2.1 State Machine for Signal Partitioning
[0084] Referring to
[0085] The concept of the state machine in the first stage is taken from the EVS codec. No major modifications have been made in the IVAS codec. The purpose of the state machine 201 is to partition the incoming sound signal into one of four states: INACTIVE, ENTRY, ACTIVE and UNSTABLE.
[0086]
[0087] The schematic diagram of
[0088] The INACTIVE state 401, indicative of background noise, is selected as the initial state.
[0089] The state machine 201 switches from the INACTIVE state 401 to the ENTRY state 402 when a VAD flag 403 (See Reference [1]) changes from “0” to “1”. In order to produce the VAD flag used by the first stage of the two-stage speech/music classification method and device, any VAD detector or SAD (Sound Activity Detection) detector may be utilized. The ENTRY state 402 marks the first onset or attack in the input sound signal after a prolonged period of silence.
[0090] After, for example, eight frames 405 in the ENTRY state 402, the state machine 201 enters the ACTIVE state 404 which marks the beginning of a stable sound signal with sufficient energy (a given level of energy). If the energy 409 of the signal suddenly decreases while the state machine 201 is in the ENTRY state 402, the state machine 201 changes from the ENTRY state to the UNSTABLE state 407, corresponding to an input sound signal with a level of energy close to background noise. Also, if the VAD flag 403 changes from “1” to “0” while the state machine 201 is in the ENTRY state 402, the state machine 201 returns to the INACTIVE state 401. This ensures continuity of classification during short pauses.
[0091] If the energy 406 of the stable signal (ACTIVE state 404) suddenly drops closer to the level of background noise or the VAD flag 403 changes from “1” to “0”, the state machine 201 switches from the ACTIVE state 404 to the UNSTABLE state 407.
[0092] After a period of, for example, 12 frames 410 in the UNSTABLE state 407, the state machine 201 reverts to the INACTIVE state 401. If the energy 408 of the unstable signal suddenly increases or the VAD flag 403 changes from “0” to “1” while the state machine 201 is in the UNSTABLE state 407, the state machine 201 returns to the ACTIVE state 404. This ensures continuity of classification during short pauses.
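As a non-limitative illustration, the state transitions described in the foregoing paragraphs can be sketched as follows. This is only a sketch under stated assumptions: the class name `SignalStateMachine`, the boolean inputs `energy_drop` and `energy_rise` (standing in for energy tests whose thresholds are not given here), the priority of the energy test over the VAD test in the ENTRY state, and the convention that the frame causing a transition counts as the first frame of the new state are all hypothetical.

```python
from enum import Enum


class State(Enum):
    # Enum values are arbitrary labels, not the f_SM constants of the codec.
    INACTIVE = 0
    ENTRY = 1
    ACTIVE = 2
    UNSTABLE = 3


ENTRY_FRAMES = 8      # example duration of the ENTRY state given in the text
UNSTABLE_FRAMES = 12  # example duration of the UNSTABLE state given in the text


class SignalStateMachine:
    def __init__(self):
        self.state = State.INACTIVE  # INACTIVE is the initial state
        self.frames_in_state = 0
        self.prev_vad = 0

    def _enter(self, new_state):
        self.state = new_state
        self.frames_in_state = 1  # assumption: transition frame counts as frame 1

    def step(self, vad, energy_drop=False, energy_rise=False):
        """Advance by one frame and return the state of that frame."""
        vad_rose = self.prev_vad == 0 and vad == 1
        vad_fell = self.prev_vad == 1 and vad == 0
        self.prev_vad = vad

        if self.state is State.INACTIVE:
            if vad_rose:
                self._enter(State.ENTRY)
        elif self.state is State.ENTRY:
            if energy_drop:                 # assumed to take priority over VAD
                self._enter(State.UNSTABLE)
            elif vad_fell:
                self._enter(State.INACTIVE)
            elif self.frames_in_state >= ENTRY_FRAMES:
                self._enter(State.ACTIVE)
            else:
                self.frames_in_state += 1
        elif self.state is State.ACTIVE:
            if energy_drop or vad_fell:
                self._enter(State.UNSTABLE)
        else:  # State.UNSTABLE
            if energy_rise or vad_rose:
                self._enter(State.ACTIVE)
            elif self.frames_in_state >= UNSTABLE_FRAMES:
                self._enter(State.INACTIVE)
            else:
                self.frames_in_state += 1
        return self.state
```

With a VAD flag that stays at 1, this sketch remains in the ENTRY state for eight frames before reaching the ACTIVE state; a VAD flag that then stays at 0 keeps it in the UNSTABLE state for twelve frames before it reverts to INACTIVE.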
[0093] In the following description, the current state of the state machine 201 is denoted f.sub.SM. The constants assigned to the individual states may be defined as follows:
[0094] In the INACTIVE and ACTIVE states, f.sub.SM corresponds to a single constant whereas in the UNSTABLE and ENTRY states, f.sub.SM takes on multiple values depending on the progression of the state machine 201. Thus, in the UNSTABLE and ENTRY states, f.sub.SM may be used as a short-term counter.
2.2 Onset/Attack Detector
[0095] Referring to
[0096] The onset/attack detector 202 and the corresponding onset/attack detection operation 252 are adapted to the purposes and functions of the speech/music classification of the IVAS codec. The objective comprises, in particular but not exclusively, localization of both the beginnings of speech utterances (attacks) and the onsets of musical clips. These events are usually associated with abrupt changes in the characteristics of the input sound signal. Successful detection of signal onsets and attacks after a period of signal inactivity allows a reduction of the impact of past information in the process of score smoothing (described herein below). The onset/attack detection logic plays a similar role as the ENTRY state 402 of
[0097] The relative frame energy E.sub.r may be computed as the difference between the frame energy in dB and the long-term average energy. The frame energy in dB may be computed using the following relation:
where E.sub.CB(i) are the average energies per critical band (See Reference [1]). The long-term average frame energy may be computed using the following relation:
with initial value
[0098] The parameter used by the onset/attack detector 202 is a cumulative sum of differences between the relative energy of the input sound signal in a current frame and the relative energy of the input sound signal in a previous frame updated in every frame. This parameter is initialized to 0 and updated only when the relative energy in the current frame, E.sub.r(n), is greater than the relative energy in the previous frame, E.sub.r(n - 1). The onset/attack detector 202 updates the cumulative sum v.sub.run(n) using, for example, the following relation:
where n is the index of the current frame. The onset/attack detector 202 uses the cumulative sum v.sub.run(n) to update a counter of onset/attack frames, v.sub.cnt. The counter of the onset/attack detector 202 is initialized to 0 and incremented by 1 in every frame in the ENTRY state 402 where v.sub.run > 5. Otherwise, it is reset to 0.
[0099] The output of the attack/onset detector 202 is a binary flag, f.sub.att, which is set to 1 for example when 0 < v.sub.run < 3 to indicate detection of an onset/attack. Otherwise, this binary flag is set to 0 to indicate no detection of onset/attack. This can be expressed as follows:
[0100] The operation of the onset/attack detector 202 is demonstrated, as a non-limitative example, by the graph of
2.3 Feature Extractor
[0101] Referring to
[0102] In the training stage of the GMM model, the training samples are resampled to 16 kHz, normalized to -26 dBov (dBov is a dB level relative to the overload point of the system) and concatenated. Then, the resampled and concatenated training samples are fed to the encoder of the IVAS codec to collect features using the feature extractor 203. For the purpose of feature extraction, the IVAS codec may be run in a FD stereo coding mode, TD stereo coding mode or any other stereo coding mode and at any bit-rate. As a non-limitative example, the feature extractor 203 is run in a TD stereo coding mode at 16.4 kbps. The feature extractor 203 extracts the following features used in the GMM model for speech/music/noise classification:
TABLE-US-00001 Features used in the GMM model
symbol        window length   description
T.sub.OL      30 ms           open-loop pitch
R.sub.xy3     30 ms           voicing measure
LSF           25 ms           line spectral frequencies from the LP analysis
ε.sub.P       25 ms           residual energy from the LP analysis (Levinson-Durbin)
C.sub.map     20 ms           short-term correlation map
n.sub.sta     20 ms           non-stationarity
MFCC          20 ms           mel-frequency cepstral coefficients
P.sub.diff    20 ms           power spectrum difference
P.sub.sta     20 ms           spectral stationarity
[0103] With the exception of the MFCC feature, all of the above features are already present in the EVS codec (See Reference [1]).
[0104] The feature extractor 203 uses the open-loop pitch T.sub.OL and the voicing measure
[0105] The MFCC feature is a vector of N.sub.mel values corresponding to mel-frequency cepstral coefficients, which are the results of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale (See Reference [8], of which the full content is incorporated herein by reference).
[0106] The calculation of the last two features P.sub.diff and P.sub.sta uses, for example, the normalized per-bin power spectrum,
where P.sub.k is the per-bin power spectrum in the current frame calculated in the IVAS spectral analysis routine (See Reference [1]). The normalization is performed in the range ⟨k.sub.low, k.sub.high⟩ = ⟨3, 70⟩ corresponding to the frequency range of 150 - 3500 Hz.
[0107] The power spectrum difference, P.sub.diff, may be defined as
where the index (n) has been added to denote the frame index explicitly.
[0108] The spectral stationarity feature, P.sub.sta, may be calculated from the sum of ratios of the normalized per-bin power spectrum and the power differential spectrum, using the following relation:
The spectral stationarity is generally higher in frames containing frequency bins with higher amplitude and smaller spectrum difference at the same time.
2.4 Outlier Detector Based on Individual Feature Histograms
[0109] Referring to
[0110] The GMM model is trained on vectors of the features collected from the IVAS codec on the large training database. The accuracy of the GMM model is affected to a large extent by the statistical distribution of the individual features. Best results are achieved when features are distributed normally, for example when X ~ N(μ, σ) where N represents a statistical distribution having a mean μ and a variance σ.
[0111] The GMM model can represent to some extent features with non-normal distribution. If the value of one or more features is significantly different from its mean value, the vector of features is determined as an outlier. Outliers usually lead to incorrect probability estimates. Instead of discarding the vector of features, it is possible to replace the outlier features, for example with feature values from the previous frame, an average feature value across a number of previous frames, or by a global mean value over a significant number of previous frames.
[0112] The detector 204 detects outliers in the first stage 200 of the two-stage speech/music classification device on the basis of the analysis of individual feature histograms (See for example
where H(i) is the feature histogram normalized such that max(H(i)) = 1, i is a frequency bin index ranging from 0 to l=500 bins, and i.sub.max is the bin containing a maximum value of the histogram for this feature. The threshold thr.sub.H is set to 1e.sup.-4. This specific value for the threshold thr.sub.H has the following explanation. If the true statistical distribution of the feature was normal with zero mean μ and variance σ, it could be re-scaled such that its maximum value was equal to 1. In that case the probability density function (PDF) could be expressed as
[0113] By substituting f.sub.xs(x|0,σ.sup.2) with the threshold thr.sub.H and rearranging the variables the following relation is obtained:
[0114] For thr.sub.H = 1e.sup.-4 the following is obtained:
[0115] Thus, applying the threshold of 1e.sup.-4 leads to trimming the probability density function to the range of ±2.83σ around the mean value provided that the distribution was normal and scaled such that the probability density function f.sub.xs(0|0, σ.sup.2) = 1. The probability that a feature value lies outside the trimmed range is given by, for example, the following relation:
where erf(.) is the Gauss error function known from the theory of statistics.
[0116] If the variance of the feature values was σ = 1, then the percentage of the detected outliers would be approximately 0.47%. The above calculations are only approximate since the true distribution of feature values is not normal. This is illustrated by the histogram of the non-stationarity feature n.sub.sta in
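The percentage of outliers quoted above can be checked numerically. The following sketch evaluates the two-sided tail probability of a normal distribution outside ±2.83σ with the Gauss error function; the function name `tail_probability` is hypothetical and the result is independent of σ because the range is expressed in multiples of σ.

```python
import math


def tail_probability(num_sigmas):
    """Two-sided probability that a normally distributed value lies more than
    num_sigmas standard deviations away from its mean."""
    return 1.0 - math.erf(num_sigmas / math.sqrt(2.0))


# Range of +/-2.83 sigma taken from the text.
p_outlier = tail_probability(2.83)
print(f"outlier probability: {100.0 * p_outlier:.2f} %")
```

The value obtained is slightly below half a percent, in agreement with the approximate figure of 0.47% given above.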
[0117] The lower H.sub.low and the upper H.sub.high bounds are calculated for each feature used by the first stage 250/200 of the two-stage speech/music classification method and device and stored in the memory of the IVAS codec. When running the encoder of the IVAS codec, the outlier detector 204 compares the value X.sub.j(n) of each feature j in the current frame n against the bounds H.sub.low and H.sub.high of that feature, and marks the features j having a value lying outside of the corresponding ranges defined between the lower and upper bounds as an outlier feature. This can be expressed as
where F is the number of features. The outlier detector 204 comprises a counter (not shown) of outlier features, c.sub.odv, representing the number of detected outliers, using, for example, the following relation:
[0118] If the number of outlier features is equal to or higher than, for example 2, then the outlier detector 204 sets a binary flag, f.sub.out, to 1. This can be expressed as follows:
[0119] The flag f.sub.out is used for signaling that the vector of the features is an outlier. If the flag f.sub.out is equal to one, then the outlier features X.sub.j(n) are replaced, for example, with the values from the previous frame, as follows:
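As a non-limitative illustration, the outlier counting and replacement logic of the outlier detector 204 can be sketched as follows. The function name `detect_and_replace_outliers` is hypothetical, the bounds are assumed to be expressed in the same units as the feature values, and a value exactly equal to a bound is assumed to lie inside the range.

```python
def detect_and_replace_outliers(features, prev_features, bounds, min_outliers=2):
    """Sketch of the outlier logic for one frame.

    features:      feature values X_j(n) of the current frame
    prev_features: feature values X_j(n - 1) of the previous frame
    bounds:        list of (H_low, H_high) pairs, one per feature
    Returns (cleaned feature vector, f_out flag, outlier count c_odv).
    """
    # Mark each feature whose value lies outside its [H_low, H_high] range.
    is_outlier = [not (low <= x <= high)
                  for x, (low, high) in zip(features, bounds)]
    c_odv = sum(is_outlier)

    # The whole vector is flagged once enough individual features are outliers.
    f_out = 1 if c_odv >= min_outliers else 0

    if f_out == 1:
        # Replace only the outlier features with the previous-frame values.
        cleaned = [p if out else x
                   for x, p, out in zip(features, prev_features, is_outlier)]
    else:
        cleaned = list(features)
    return cleaned, f_out, c_odv
```

For example, with three features bounded by (0, 1), a frame [0.5, 2.0, -1.0] following a frame [0.4, 0.6, 0.7] contains two outlier features, so the flag is set and the frame becomes [0.5, 0.6, 0.7].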
2.5 Short-Term Feature Vector Filter
[0120] Referring to
[0121] The speech/music classification accuracy is improved with feature vector smoothing. This can be performed by applying the following short-term Infinite Impulse Response (IIR) filter used as the short-term feature vector filter 205:
where X̃.sub.j(n) represents the short-term filtered features in frame n and a.sub.m = 0.5 is a so-called forgetting factor.
[0122] Feature vector smoothing (operation 255 of filtering a short-term feature vector) is not performed in frames in the ENTRY state 402 of
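A minimal sketch of the short-term feature vector filter 205 is given below. It assumes the usual first-order recursive form in which the forgetting factor a.sub.m weights the previously filtered value and (1 - a.sub.m) weights the current value; with a.sub.m = 0.5 both weights are equal, so the result does not depend on that assumption. The function name `filter_features` and the pass-through behaviour in the ENTRY state are hypothetical.

```python
A_M = 0.5  # forgetting factor given in the text


def filter_features(current, prev_filtered, in_entry_state, a_m=A_M):
    """First-order IIR smoothing of a feature vector.

    current:        feature values X_j(n) of the current frame
    prev_filtered:  filtered values of the previous frame, or None if unavailable
    in_entry_state: True when the state machine is in the ENTRY state
    """
    # Smoothing is skipped in the ENTRY state (features pass through unchanged).
    if in_entry_state or prev_filtered is None:
        return list(current)
    return [a_m * p + (1.0 - a_m) * x
            for x, p in zip(current, prev_filtered)]
```

For instance, `filter_features([2.0, 4.0], [0.0, 0.0], False)` averages the current and previous vectors, whereas the same call with `in_entry_state=True` returns the current vector unchanged.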
[0123] In the following description, the original symbol for feature values X.sub.j(n) is used instead of X̃.sub.j(n), i.e. it is assumed that
2.6 Non-Linear Feature Vector Transformation (Box-Cox)
[0124] Referring to
[0125] As shown by the histograms of
where λ is the exponent of the power transform which varies from -5 to +5 (See
where N is the number of samples of the feature in the training database.
[0126] During the training process, the non-linear feature vector transformer 206 considers and tests all values of the exponent λ to select an optimal value of exponent λ based on a normality test. The normality test is based on the D'Agostino and Pearson's method as described in Reference [10], of which the full content is incorporated herein by reference, combining skew and kurtosis of the probability distribution function. The normality test produces the following skew and kurtosis measure r.sub.sk (S-K measure):
where s is the z-score returned by the skew test and k is the z-score returned by the kurtosis test. See Reference [11], of which the full content is incorporated herein by reference, for details about the skew test and the kurtosis test.
[0127] The normality test also returns a two-sided chi-squared probability for null hypothesis, i.e. that the feature values were drawn from a normal distribution. The optimal value of the exponent λ minimizes the S-K measure. This can be expressed by the following relation:
where the subscript j means that the above minimization process is done for each individual feature j = 1, .., F.
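The exponent search described above can be sketched as follows, under several assumptions. The Box-Cox transform is written in its standard textbook form, with a bias added to make the input positive. The candidate grid (step of 0.5 between -5 and +5) is hypothetical. The squared sample skewness is used here only as a simple stand-in for the S-K measure; it is not the D'Agostino and Pearson statistic, which also includes a kurtosis term. All function names are hypothetical.

```python
import math


def box_cox(x, lam, bias=0.0):
    """Standard Box-Cox power transform of a single value (x + bias must be > 0)."""
    y = x + bias
    if y <= 0.0:
        raise ValueError("Box-Cox requires a strictly positive input")
    if lam == 0.0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def squared_skewness(values):
    """Stand-in normality measure: squared sample skewness (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return (m3 / m2 ** 1.5) ** 2


def select_exponent(samples, candidates, measure=squared_skewness, bias=0.0):
    """Return the candidate exponent that minimizes the normality measure."""
    return min(candidates,
               key=lambda lam: measure([box_cox(x, lam, bias) for x in samples]))


# Hypothetical candidate grid from -5 to +5 in steps of 0.5.
CANDIDATES = [k / 2.0 for k in range(-10, 11)]
```

Applied to samples whose logarithm is symmetric, for example the exponentials of -2, -1, 0, 1 and 2, this search selects the exponent 0, i.e. the logarithmic transform.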
[0128] In the encoder, the non-linear feature vector transformer 206 applies the Box-Cox transformation only to selected features satisfying the following condition related to the S-K measure:
where r.sub.sk(j) is the S-K measure calculated on the jth feature before the Box-Cox transformation and
is the S-K measure after Box-Cox transformation with optimal value of exponent λ.sub.j. The optimal exponent values, λ.sub.j, and the associated biases, Δ.sub.j, of the selected features are stored in the memory of the IVAS codec.
[0129] In the following description, the original symbol for feature values X.sub.j(n) will be used instead of X.sub.box,j(n), i.e. it is assumed that
2.7 Principal Component Analyzer
[0130] Referring to
[0131] After the operation 255 of short-term feature vector filtering and the operation 256 of non-linear feature vector transformation, the principal component analyzer 207 standardizes the feature vector by removing a mean of the features and scaling them to unit variance. For that purpose, the following relation can be used:
where X̂.sub.j(n) represents the standardized feature, μ.sub.j is the mean and s.sub.j the standard deviation of feature X.sub.j across the training database and, as mentioned above, n represents the current frame.
[0132] The mean μ.sub.j and the deviation s.sub.j of feature X.sub.j may be calculated as follows:
with N representing the total number of frames in the training database.
[0133] In the following description, the original symbol for feature values X.sub.j(n) will be used instead of X̂.sub.j(n), i.e. it is assumed that:
[0134] The principal component analyzer 207 then processes the feature vector using PCA where the dimensionality is reduced, for example, from F = 15 to F.sub.PCA = 12. PCA is an orthogonal transformation to convert a set of possibly correlated features into a set of linearly uncorrelated variables called principal components (See Reference [12], of which the full content is incorporated herein by reference). In the speech/music classification method, the analyzer 207 transforms the feature vectors using, for example, the following relation:
where X(n) is an F-dimensional column feature vector and W is an F × F.sub.PCA matrix of PCA loadings whose columns are the eigenvectors of X.sup.T(n)X(n), where the superscript T indicates vector transpose. The loadings are found by means of Singular Value Decomposition (SVD) of the feature samples in the training database. The loadings are calculated in the training phase only for active frames, for example in frames where the VAD flag is 1. The calculated loadings are stored in the memory of the IVAS codec.
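The standardization and dimensionality reduction performed by the principal component analyzer 207 can be sketched as follows. The sketch assumes that the transformation is the product of the transposed loading matrix with the standardized feature vector, which is consistent with W being an F × F.sub.PCA matrix and X(n) an F-dimensional column vector. The function names and the list-of-rows matrix layout are hypothetical.

```python
def standardize(x, mean, std):
    """Remove the training-set mean and scale each feature to unit variance."""
    return [(xj - mj) / sj for xj, mj, sj in zip(x, mean, std)]


def project(x, loadings):
    """Project an F-dimensional vector onto F_PCA principal components.

    loadings is given as F rows of F_PCA values (one column per component),
    so the result is W^T x with one output value per column of W.
    """
    n_components = len(loadings[0])
    return [sum(loadings[j][c] * x[j] for j in range(len(x)))
            for c in range(n_components)]
```

In the encoder, the means, standard deviations and loadings would be read from stored tables; here they are plain arguments so that the two operations can be tried on small vectors.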
[0135] In the following description, the original symbol for the vector of features X(n) will be used instead of Y(n), i.e. it is assumed that:
2.8 Gaussian Mixture Model (GMM)
[0136] Referring to
[0137] A multivariate GMM is parameterized by a mixture of component weights, component means and covariance matrices. The speech/music classification method uses three GMMs, each trained on its own training database, i.e. a “speech” GMM, a “music” GMM and a “noise” GMM. In a GMM with K components, each component has its own mean, μ.sub.k and its covariance matrix, Σ.sub.k. In the speech/music classification method the three (3) GMMs are fixed with K=6 components. The component weights are denoted ϕ.sub.k, with the constraint that
so that the probability distribution is normalized. The probability p(X) that a given feature vector X is generated by the GMM may be calculated using the following relation:
In the above relation, calculation of the exponential function exp(...) is a complex operation. The parameters of the GMMs are calculated using an Expectation-Maximization (EM) algorithm. It is well known that an Expectation-Maximization algorithm can be used for latent variables (variables that are not directly observable and are actually inferred from the values of the other observed variables) in order to predict their values with the condition that the general form of probability distribution governing those latent variables is known.
[0138] To reduce the complexity of probability calculations, the above relation may be simplified by taking the logarithm of the inner term inside the summation term Σ, as follows:
[0139] The output of the above, simplified formula is called the “score”. The score is an unbounded variable proportional to the log-likelihood. The higher the score, the higher the probability that a given feature vector was generated by the GMM. The score is calculated by the GMM calculator 208 for each of the three GMMs. The score score.sub.S(X) on the “speech” GMM and the score score.sub.M(X) on the “music” GMM are combined into a single differential score Δ.sub.s(X) by calculating their difference, using, for example, the following relation:
Negative values of the differential score are indicative that the input sound signal is a speech signal whereas positive values are indicative that the input sound signal is a music signal. It is possible to introduce a decision bias b.sub.s in the calculation of the differential score dlp(X, b.sub.s) by adding a non-negative value to the differential score, using the following relation:
The value of the decision bias, b.sub.s, is found based on the ensemble of differential scores calculated on the training database. The process of finding the value of the decision bias b.sub.s can be described as follows.
[0140] Let X.sub.t represent a matrix of the feature vectors from the training database. Let y.sub.t be a corresponding label vector. Let the values of ground-truth SPEECH frames in this vector be denoted as +1.0 and the values in the other frames as 0. The total number of ACTIVE frames in the training database is denoted as N.sub.act.
[0141] The differential scores dlp(X, b.sub.s) may be calculated in the active frames in the training database after EM training, i.e. when the parameters of the GMM are known. It is then possible to predict labels y.sub.pred(n) in the active frames of the training database using, for example, the following relation:
where sign[.] is a signum function and dlp(X(n),b.sub.s = 0) represents the differential scores calculated under the assumption of b.sub.s = 0. The resulting values of the labels y.sub.pred(n) are either equal to +1.0 indicating SPEECH or 0 indicating MUSIC or NOISE.
[0142] The accuracy of this binary predictor can be summarized with the following four statistical measures:
where E.sub.r is the relative frame energy which is used as a sample weighting factor. The statistical measures have the following meaning: c.sub.tp is the number of true positives, i.e. the number of hits in the SPEECH class, c.sub.fp is the number of false positives, i.e. the number of incorrectly classified frames in the MUSIC class, c.sub.tn is the number of true negatives, i.e. the number of hits in the MUSIC/NOISE class and c.sub.fn is the number of false negatives, i.e. the number of incorrectly classified frames in the SPEECH class.
[0143] The above-defined statistics may be used to calculate a true positive rate, commonly referred to as the recall
and the true negative rate, commonly referred to as the specificity
The recall TPR and the specificity TNR may be combined into a single number by taking the harmonic mean of TPR and TNR using the following relation:
The result is called the harmonic balanced accuracy.
[0144] A value of the decision bias b.sub.s may be found by maximizing the above defined harmonic balanced accuracy achieved with the labels/predictors y.sub.pred(n), where b.sub.s is selected from the interval (-2, 2) in successive steps. The spacing of candidate values for the decision bias is approximately logarithmic with higher concentration of values around 0.
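The search for the decision bias can be sketched as follows. The sketch assumes that a frame is predicted as SPEECH when its biased differential score is negative, that the four counts are accumulated with the per-frame weights (assumed positive here), and that recall and specificity take their usual definitions, c.sub.tp/(c.sub.tp + c.sub.fn) and c.sub.tn/(c.sub.tn + c.sub.fp). The function names and the candidate values in the usage note are hypothetical.

```python
def harmonic_balanced_accuracy(dlp0, labels, weights, bias):
    """Harmonic mean of recall (TPR) and specificity (TNR) for a given bias.

    dlp0:    differential scores computed with a zero bias, one per active frame
    labels:  ground truth, 1 for SPEECH and 0 for MUSIC or NOISE
    weights: per-frame weighting factors (relative frame energy in the text)
    """
    c_tp = c_fp = c_tn = c_fn = 0.0
    for d, y, w in zip(dlp0, labels, weights):
        predicted_speech = (d + bias) < 0.0  # negative score indicates speech
        if predicted_speech and y == 1:
            c_tp += w
        elif predicted_speech and y == 0:
            c_fp += w
        elif not predicted_speech and y == 0:
            c_tn += w
        else:
            c_fn += w
    tpr = c_tp / (c_tp + c_fn) if (c_tp + c_fn) > 0.0 else 0.0
    tnr = c_tn / (c_tn + c_fp) if (c_tn + c_fp) > 0.0 else 0.0
    if tpr + tnr == 0.0:
        return 0.0
    return 2.0 * tpr * tnr / (tpr + tnr)


def find_decision_bias(dlp0, labels, weights, candidates):
    """Return the candidate bias maximizing the harmonic balanced accuracy."""
    return max(candidates,
               key=lambda b: harmonic_balanced_accuracy(dlp0, labels, weights, b))
```

With scores [-3.0, -0.5, 0.2, 0.4, 2.0], labels [1, 1, 1, 0, 0], unit weights and candidates [-0.5, -0.3, 0.0, 0.3, 0.5], a zero bias misses the third speech frame, while the candidate -0.3 separates the two classes and is therefore selected.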
[0145] The differential score dlp(X, b.sub.s), calculated with the found value of the decision bias b.sub.s, is limited to the range of, for example, (-30.0, +30.0). The differential score dlp(X, b.sub.s) is reset to 0 when the VAD flag is 0 or when the total frame energy, E.sub.tot, is lower than 10 dB or when the speech/music classification method is in the ENTRY state 402 and either f.sub.att or f.sub.out is 1.
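The computation of the biased, limited differential score can be sketched as follows. The sketch assumes that the differential score is the “music” score minus the “speech” score plus the decision bias, which is consistent with negative values indicating speech and positive values indicating music. The function and argument names are hypothetical.

```python
DLP_LIMIT = 30.0            # example limiting range (-30.0, +30.0) from the text
MIN_FRAME_ENERGY_DB = 10.0  # example energy threshold from the text


def differential_score(score_speech, score_music, bias,
                       vad, e_tot, in_entry_state, f_att, f_out):
    """Biased differential score with limiting and reset conditions."""
    # Reset conditions: inactive frame, very low energy, or an onset/attack
    # or outlier detected while in the ENTRY state.
    if (vad == 0 or e_tot < MIN_FRAME_ENERGY_DB
            or (in_entry_state and (f_att == 1 or f_out == 1))):
        return 0.0
    dlp = score_music - score_speech + bias
    # Limit the score to the allowed range.
    return max(-DLP_LIMIT, min(DLP_LIMIT, dlp))
```

For example, a “speech” score of -10.0, a “music” score of -4.0 and a bias of 0.5 yield a differential score of 6.5 in an active frame with sufficient energy.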
2.9 Adaptive Smoother
[0146] Referring to
[0147] The adaptive smoother 209 comprises, for example, an adaptive IIR filter to smooth the differential score dlp(X, b.sub.s) for frame n, identified as dlp(n), from the GMM calculator 208. The adaptive smoothing operation 259 can be described using the following relation:
where wdlp(n) is the resulting smoothed differential score, wght(n) is a so-called forgetting factor of the adaptive IIR filter, and n represents the frame index.
[0148] The forgetting factor is a product of three individual parameters as shown in the following relation:
[0149] The parameter wrelE(n) is linearly proportional to the relative energy of the current frame, E.sub.r(n), and may be calculated using the following relation:
The parameter wrelE(n) is limited, for example, to the interval (0.9, 0.99). The constants used in the relation above have the following interpretation. The parameter wrelE(n) reaches the upper threshold of 0.99 when the relative energy is higher than 15 dB. Similarly, the parameter wrelE(n) reaches the lower threshold of 0.9 when the relative energy is lower than -15 dB. The value of the parameter wrelE(n) influences the forgetting factor wght(n) of the adaptive IIR filter of smoother 209. Smoothing is stronger in energetically weak segments where it is expected that the features carry less relevant information about the input signal.
[0150] The parameter wdrop(n) is proportional to a derivative of the differential score dlp(n). First, a short-term mean dlp.sub.ST(n) of the differential score dlp(n) is calculated using, for example, the following relation:
[0151] The parameter wdrop(n) is set to 0 and is modified only in frames where the following two conditions are met:
[0152] Thus, the adaptive smoother 209 updates the parameter wdrop(n) only when the differential score dlp(n) has decreasing tendency and when it indicates that the current frame belongs to the SPEECH class. In the first frame, when the two conditions are met, and if dlp.sub.ST(n) > 0, the parameter wdrop(n) is set to
[0153] Otherwise, the adaptive smoother 209 steadily increases the parameter wdrop(n) using, for example, the following relation:
[0154] If the above defined two conditions are not true, the parameter wdrop(n) is reset to 0. Thus, the parameter wdrop(n) reacts to sudden drops of the differential score dlp(n) below the zero-level indicating potential speech onset. The final value of the parameter wdrop(n) is linearly mapped to the interval of, for example, (0.7, 1.0), as shown in the following relation:
[0155] Note that the value of wdrop(n) is “overwritten” in the formula above to simplify notation.
[0156] The adaptive smoother 209 calculates the parameter wrise(n) similarly as the parameter wdrop(n) with the difference that it reacts to sudden rises of the differential score dlp(n) indicating potential music onsets. The parameter wrise(n) is set to 0 but is modified in frames where the following conditions are met:
[0157] Thus, the adaptive smoother 209 updates the parameter wrise(n) only in the ACTIVE state 404 of the input sound signal (See
[0158] In the first frame, when the above three (3) specified conditions are met, and if the short-term mean dlp.sub.ST(n - 1) < 0, the third parameter wrise(n) is set to:
[0159] Otherwise, the adaptive smoother 209 steadily increases the parameter wrise(n) according to, for example, the following relation:
[0160] If the above three (3) conditions are not true, the parameter wrise(n) is reset to 0. Thus, the third parameter wrise(n) reacts to sudden rises of the differential score dlp(n) above the zero-level indicating potential music onset. The final value of the parameter wrise(n) is linearly mapped to the interval of, for example, (0.95, 1.0), as follows:
[0161] Note that the value of the parameter wrise(n) is “overwritten” in the formula above to simplify notation.
[0162]
[0163] The forgetting factor wght(n) of the adaptive IIR filter of the adaptive smoother 209 is decreased in response to strong SPEECH signal content or strong MUSIC signal content. For that purpose, the adaptive smoother 209 analyzes a long-term mean μ̅.sub.dlp(n) and a long-term variance σ̅.sub.dlp(n) of the differential score dlp(n), calculated using, for example, the following relations:
[0164] In the ENTRY state 402 (
[0165] The expression r.sub.m2v(n) corresponds to a long-term standard deviation of the differential score. The forgetting factor wght(n) of the adaptive IIR filter of the adaptive smoother 209 is decreased in frames where r.sub.m2v(n) > 15 using, for example, the following relation:
[0166] The final value of the forgetting factor wght(n) of the adaptive IIR filter of the adaptive smoother 209 is limited to the range of, for example, (0.01, 1.0). In frames where the total frame energy, E.sub.tot(n), is below 10 dB, the forgetting factor wght(n) is set to, for example, 0.92. This ensures proper smoothing of the differential score dlp(n) during silence.
[0167] The filtered, smoothed differential score, wdlp(n), is a parameter for categorical decisions of the speech/music classification method, as described below.
2.10 State-Dependent Categorical Classifier
[0168] Referring to
[0169] The operation 260 is the final operation of the first stage 250 of the two-stage speech/music classification method and comprises a categorization of the input sound signal into the following three final classes: [0170] SPEECH/NOISE (0) [0171] UNCLEAR (1) [0172] MUSIC (2)
[0173] In the above, numbers in parentheses are the numeric constants associated with the three final classes. The above set of classes is slightly different than the classes that have been discussed so far in relation to the differential score. The first difference is that the SPEECH class and the NOISE class are combined. This is to facilitate the core encoder selection mechanism (described in the following description) in which an ACELP encoder core is usually selected for coding both speech signals and background noise. A new class has been added to the set, namely the UNCLEAR final class. Frames falling into this category are usually found in speech segments with a high level of additive background music. The smoothed differential scores wdlp(n) of frames in class UNCLEAR are mostly close to 0.
[0174] Let d.sub.SMC(n) denote the final class selected by the state-dependent categorical classifier 210.
[0175] When the input sound signal is, in the current frame, in the ENTRY state 402 (See
where n.sub.ENTRY marks the beginning (frame) of the ENTRY state 402 and α.sub.k(n - n.sub.ENTRY) are the weights corresponding to the samples of dlp(n) in the ENTRY state. Thus, the number of samples used in the weighted average wdlp.sub.ENTRY(n) ranges from 1 to 8, with the weight index k running from 0 to 7, depending on the position of the current frame with respect to the beginning (frame) of the ENTRY state. This is illustrated in
TABLE-US-00002 Weights used for averaging in the ENTRY state
n - n.sub.ENTRY   α.sub.0   α.sub.1   α.sub.2   α.sub.3   α.sub.4   α.sub.5   α.sub.6   α.sub.7
0                 1
1                 0.6       0.4
2                 0.47      0.33      0.2
3                 0.4       0.3       0.2       0.1
4                 0.3       0.25      0.2       0.15      0.1
5                 0.233     0.207     0.18      0.153     0.127     0.1
6                 0.235     0.205     0.174     0.143     0.112     0.081     0.05
7                 0.2       0.179     0.157     0.136     0.114     0.093     0.071     0.05
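The weighted averaging in the ENTRY state may be sketched as follows, using the weights of Table 2. The pairing of the weight α.sub.k with the sample dlp(n - k), so that the most recent frame receives the largest weight, is an assumption of this sketch, as are the names ENTRY_WEIGHTS and wdlp_entry.

```python
# Rows are indexed by n - n_ENTRY; columns are alpha_0 ... alpha_7 (Table 2).
ENTRY_WEIGHTS = [
    [1.0],
    [0.6, 0.4],
    [0.47, 0.33, 0.2],
    [0.4, 0.3, 0.2, 0.1],
    [0.3, 0.25, 0.2, 0.15, 0.1],
    [0.233, 0.207, 0.18, 0.153, 0.127, 0.1],
    [0.235, 0.205, 0.174, 0.143, 0.112, 0.081, 0.05],
    [0.2, 0.179, 0.157, 0.136, 0.114, 0.093, 0.071, 0.05],
]


def wdlp_entry(dlp_history, frames_since_entry):
    """Weighted average of differential scores in the ENTRY state.

    dlp_history        -- [dlp(n), dlp(n-1), ...], most recent first
    frames_since_entry -- n - n_ENTRY, an integer from 0 to 7
    Assumption: alpha_k multiplies dlp(n - k).
    """
    weights = ENTRY_WEIGHTS[frames_since_entry]
    if len(dlp_history) < len(weights):
        raise ValueError("not enough dlp samples for this ENTRY position")
    return sum(a * d for a, d in zip(weights, dlp_history))
```

Each row of the table sums to 1, so the result stays on the same scale as dlp(n); in the first frame of the ENTRY state the average reduces to dlp(n) itself.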
[0176] If the absolute frame energy, E.sub.tot, is, in the current frame, lower than, for example, 10 dB, the state-dependent categorical classifier 210 sets the final class d.sub.SMC(n) to SPEECH/NOISE regardless of the differential score dlp(n). This is to avoid misclassifications during silence.
[0177] If the weighted average of differential scores in the ENTRY state wdlp.sub.ENTRY(n) is less than, for example, 2.0, the state-dependent categorical classifier 210 sets the final class d.sub.SMC(n) to SPEECH/NOISE.
[0178] If the weighted average of differential scores in the ENTRY state wdlp.sub.ENTRY(n) is higher than, for example, 2.0, the state-dependent categorical classifier 210 sets the final class d.sub.SMC(n) based on the non-smoothed differential score dlp(n) in the current frame. If dlp(n) is higher than, for example, 2.0, the final class is MUSIC. Otherwise, it is UNCLEAR.
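The three ENTRY-state rules of paragraphs [0176] to [0178] can be summarized in the following non-limitative sketch. The function name and argument names are hypothetical, the constants are the example values from the text, and the handling of a weighted average exactly equal to 2.0 (treated here as "not less than") is an assumption, since the text leaves that case open.

```python
SPEECH_NOISE, UNCLEAR, MUSIC = 0, 1, 2  # numeric constants of the final classes


def entry_state_class(e_tot_db, wdlp_entry_value, dlp,
                      energy_floor_db=10.0, score_thr=2.0):
    """Final class d_SMC(n) while the signal is in the ENTRY state."""
    # Rule 1: very low absolute frame energy -> SPEECH/NOISE, whatever dlp(n) is.
    if e_tot_db < energy_floor_db:
        return SPEECH_NOISE
    # Rule 2: low weighted average of differential scores -> SPEECH/NOISE.
    if wdlp_entry_value < score_thr:
        return SPEECH_NOISE
    # Rule 3: otherwise decide on the non-smoothed score of the current frame.
    return MUSIC if dlp > score_thr else UNCLEAR
```

The energy test comes first so that silence is never classified as MUSIC or UNCLEAR, even when the scores are high.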
[0179] In the other states (See
[0180] The decision can be changed by the state-dependent categorical classifier 210 if the smoothed differential score wdlp(n) crosses a threshold of class (See Table 3) that is different from the class selected in the previous frame. These transitions between classes are illustrated in
TABLE-US-00003 Thresholds for class transitions
TO: SPEECH, UNCLEAR, MUSIC
FROM SPEECH/NOISE: >2.1
FROM UNCLEAR: <-2.5, >2.5
FROM MUSIC: <-1.0
[0181] As mentioned herein above, the transitions between classes are driven not only by the value of the smoothed differential score wdlp(n) but also by the final classes selected in the previous frames. A complete set of rules for transitions between the classes is shown in the class transition diagram of
[0182] The arrows in
[0183] In
where
is the normalized autocorrelation function in the current frame and the upper index [k] refers to the position of the half-frame window. The normalized autocorrelation function is computed as part of the open-loop pitch analysis module of the IVAS codec (See Reference [1], Section 5.1.11.3.2).
[0184] The short pitch flag f.sub.sp may be set in the pre-selected frames as follows
where
and
[0185] In
[0186] The parameter v.sub.run(n) is defined in Section 2.2 (Onset/attack detection) of the present disclosure.
3. Core Encoder Selection
[0187]
[0188] In the second stage 350/300 of the two-stage speech/music classification method and device, the final class d.sub.SMC(n) selected by the state-dependent categorical classifier 210 is “mapped” into one of the three core encoder technologies of the IVAS codec, i.e. ACELP (Algebraic Code-Excited Linear Prediction), GSC (Generic audio Signal Coding) or TCX (Transform-Coded eXcitation). This is referred to as the three-way classification. The mapping does not guarantee that the selected technology will be used as the core encoder, since other factors, such as bit-rate or bandwidth limitations, also affect the decision. However, for common types of input sound signals the initial selection of core encoder technology is used.
[0189] Besides the class d.sub.SMC(n) selected by the state-dependent categorical classifier 210 in the first stage, the core encoder selection mechanism takes into consideration some additional high-level features.
3.1 Additional High-Level Features Extractor
[0190] Referring to
[0191] In the first stage 200/250 of the two-stage speech/music classification device and method, most features are usually calculated on short segments (frames) of the input sound signal not exceeding 80 ms. This allows for a quick reaction to events such as speech onsets or offsets in the presence of background music. However, it also leads to a relatively high rate of misclassifications. The misclassifications are mitigated to some extent by means of the adaptive smoothing described in Section 2.9 above, but for certain types of signal this is not sufficient. Therefore, as part of the second stage 300/350 of the two-stage speech/music classification device and method, the class d.sub.SMC(n) can be altered in order to select the most appropriate core encoder technology for certain types of signal. To detect such types of signal, the detector calculates additional high-level features and/or flags, usually on longer segments of the input signal.
3.1.1 Long-Term Signal Stability
[0192] Long-term signal stability is a feature of the input sound signal that can be used for successfully discriminating vocal music from opera. In the context of core encoder selection, signal stability is understood as long-term stationarity of segments with high autocorrelation. The additional high-level features extractor 301 estimates the long-term signal stability feature based on the “voicing” measure,
In the equation above,
[0193] For higher robustness, the voicing parameter
[0194] If the smoothed voicing parameter cor.sub.LT(n) is sufficiently high and the variance cor.sub.var(n) of the voicing parameter is sufficiently low, then the input signal is considered as “stable” for the purposes of core encoder selection. This is measured by comparing the values, cor.sub.LT(n) and cor.sub.var(n), to predefined thresholds and setting a binary flag using, for example, the following rules:
[0195] The binary flag, f.sub.STAB(n), is an indicator of long-term signal stability and it is used in the core encoder selection discussed later in the present disclosure.
3.1.2 Segmental Attack Detection
[0196] The extractor 301 extracts the segmental attack feature from a number, for example 32, of short segments of the current frame n as illustrated in
[0197] In each segment, the additional high-level features extractor 301 calculates the energy E.sub.ata(k) using, for example, the following relation:
where s(n) is the input sound signal in the current frame n, k is the index of the segment, and i is the index of the sample in the segment. Attack position is then calculated as the index of the segment with the maximum energy, as follows:
[0198] The additional high-level features extractor 301 estimates the strength str.sub.ata of the attack by comparing two means of the segmental energy E.sub.ata(k) of the input sound signal s(n): the mean from the attack (segment k = k.sub.ata) to the end (segment 31) of the current frame n (numerator of the relation below), and the mean from the beginning (segment 0) to ¾ (segment 24) of the current frame n (denominator of the relation below). This estimation of the strength str.sub.ata is made using, for example, the following relation:
[0199] If the value str.sub.ata is higher than, for example, 8, then the attack is considered strong enough and the segment k.sub.ata is used as an indicator for signaling the position of the attack inside the current frame n. Otherwise, indicator k.sub.ata is set to 0 indicating that no attack was identified. The attacks are detected only in GENERIC frame types, as signaled by the IVAS frame type selection logic (See Reference [1]). To reduce false attack detections, the energy E.sub.ata(k.sub.ata) of segment k = k.sub.ata where the attack was identified is compared (str.sub.3_4(k)) to energies E.sub.ata(k) of segments in the first ¾ of the current frame n (segments 2 to 21), using, for example, the following relation:
[0200] If any of the comparison values str.sub.3_4(k) for segments k = 2, ..., 21 is less than, for example, 2, k ≠ k.sub.ata, then k.sub.ata is set to 0 indicating that no attack was identified. In other words, the energy of the segment containing the attack must be at least twice as high as the energy of other segments in the first ¾ of the current frame.
[0201] The mechanism described above ensures that attacks are detected mainly in the last ⅓ of the current frame which makes them suitable for encoding either with the ACELP technology or the GSC technology.
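The segmental attack detection for GENERIC frames, as described in paragraphs [0196] to [0200], may be sketched as follows. This is an illustration under stated assumptions rather than the codec implementation: the function names are hypothetical, segments are assumed to be of equal length, the range "segment 0 to segment 24" is taken as inclusive, the comparison value is taken as the ratio E.sub.ata(k.sub.ata)/E.sub.ata(k), and a small constant eps guards against division by zero.

```python
def segment_energies(frame, n_seg=32):
    """Energy E_ata(k) of each of n_seg equal-length segments of the frame."""
    seg_len = len(frame) // n_seg
    return [sum(x * x for x in frame[k * seg_len:(k + 1) * seg_len])
            for k in range(n_seg)]


def detect_attack_generic(energies, strength_thr=8.0, ratio_thr=2.0,
                          last_head_seg=24, first_cmp=2, last_cmp=21,
                          eps=1e-12):
    """Return the attack position k_ata, or 0 if no attack is identified."""
    # Attack position: index of the segment with maximum energy.
    k_ata = max(range(len(energies)), key=energies.__getitem__)

    # Strength: mean energy from the attack to the end of the frame, over
    # the mean energy of segments 0..last_head_seg (inclusive, assumption).
    tail = energies[k_ata:]
    head = energies[:last_head_seg + 1]
    str_ata = (sum(tail) / len(tail)) / (sum(head) / len(head) + eps)
    if str_ata <= strength_thr:
        return 0

    # False-detection check: the attack segment must be at least ratio_thr
    # times stronger than every segment first_cmp..last_cmp (k != k_ata).
    for k in range(first_cmp, last_cmp + 1):
        if k != k_ata and energies[k_ata] / (energies[k] + eps) < ratio_thr:
            return 0
    return k_ata
```

With a frame that is quiet up to segment 27 and 10 times louder in amplitude from segment 28 onwards, the sketch reports k.sub.ata = 28; with a flat frame it reports 0.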
[0202] For unvoiced frames, classified as UNVOICED_CLAS, UNVOICED_TRANSITION or ONSET by the IVAS FEC classification module (See Reference [1]), the additional high-level features extractor 301 estimates the strength str.sub.ata of the attack by comparing the energy E.sub.ata(k.sub.ata) of the attack segment k = k.sub.ata (numerator of the below relation) to the mean (denominator of the below relation) of the energy E.sub.ata(k) in the 32 segments preceding the attack, using, for example, the relation:
[0203] In the above relation, negative indices in the denominator refer to the values of segmental energies E.sub.ata(k) in the previous frame. If the strength str.sub.ata, calculated with the formula above, is higher than, for example, 16 the attack is sufficiently strong and k.sub.ata is used for signaling the position of the attack inside the current frame. Otherwise, k.sub.ata is set to 0 indicating that no attack was identified. In case the last frame was classified as UNVOICED_CLAS by the IVAS FEC classification module, then the threshold is set to, for example, 12 instead of 16.
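The strength test for unvoiced frames of paragraphs [0202] and [0203] may be sketched as below. The function and argument names are hypothetical; the sketch assumes that the segmental energies of the previous frame are kept in a buffer so that negative indices can be resolved, and it adds a small constant eps to avoid division by zero. The additional long-term energy condition of the following paragraph is not covered.

```python
def detect_attack_unvoiced(prev_energies, cur_energies,
                           last_frame_unvoiced=False,
                           n_prev=32, eps=1e-12):
    """Return k_ata for an unvoiced frame, or 0 if the attack is too weak.

    prev_energies -- E_ata(k) of the previous frame (at least n_prev values)
    cur_energies  -- E_ata(k) of the current frame
    """
    cur = list(cur_energies)
    k_ata = max(range(len(cur)), key=cur.__getitem__)

    # Concatenate both frames so that "negative indices" refer to the
    # previous frame, then take the n_prev segments just before the attack.
    history = list(prev_energies) + cur
    pos = len(prev_energies) + k_ata
    window = history[pos - n_prev:pos]

    str_ata = cur[k_ata] / (sum(window) / len(window) + eps)

    # Example thresholds: 16 in general, 12 if the last frame was
    # classified as UNVOICED_CLAS.
    thr = 12.0 if last_frame_unvoiced else 16.0
    return k_ata if str_ata > thr else 0
```

An attack 14 times stronger than the preceding mean is therefore accepted only when the previous frame was classified as UNVOICED_CLAS.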
[0204] For unvoiced frames, classified as UNVOICED_CLAS, UNVOICED_TRANSITION or ONSET by the IVAS FEC classification module (See Reference [1]), there is another condition to be satisfied to consider the detected attack as sufficiently strong. The energy E.sub.ata(k) of the attack must be sufficiently high when compared to a long-term mean energy
with
is higher than 20. Otherwise, k.sub.ata is set to 0 indicating that no attack was identified.
[0205] In the case an attack has been already detected in the previous frame, k.sub.ata is reset to 0 in the current frame n preventing attack smearing effects.
[0206] For the other frame types (excluding UNVOICED and GENERIC as described above), the additional high-level features extractor 301 compares the energy E.sub.ata(k.sub.ata) of the segment k = k.sub.ata containing an attack against energies E.sub.ata(k) in the other segments in accordance with, for example, the following ratio:
and if any of the comparison values str.sub.other(k) for k = 2, ..., 21, k ≠ k.sub.ata is lower than, for example, 1.3, then the attack is considered weak and k.sub.ata is set to 0. Otherwise, segment k.sub.ata is used for signaling the position of the attack inside the current frame.
[0207] Thus, the final output of the additional high-level features extractor 301 regarding segmental attack detection is the index k = k.sub.ata of the segment containing the attack or k.sub.ata = 0. If the index is positive, an attack is detected. Otherwise, no attack is identified.
3.1.3 Signal Tonality Estimation
[0208] Tonality of the input sound signal in the second stage of the two-stage speech/music classification device and method is expressed as a tonality binary flag reflecting both spectral stability and harmonicity in the lower frequency range of the input signal up to 4 kHz. The additional high-level features extractor 301 calculates this tonality binary flag from a correlation map, S.sub.map(n, k), which is a by-product of the tonal stability analysis in the IVAS encoder (See Reference [1]).
[0209] The correlation map is a measure of both signal stability and harmonicity. The correlation map is calculated from the first, for example, 80 bins of the residual energy spectrum in the logarithmic domain, E.sub.dB,res(k), k = 0,..,79 (See Reference [1]). The correlation map is calculated in segments of the residual energy spectrum where peaks are present. These segments are defined by the parameter i.sub.min(p) where p = 1, ...,N.sub.min is the segment index and N.sub.min is the total number of segments.
[0210] Let’s define the set of indices belonging to a particular segment x as
Then, the correlation map may be calculated as follows
[0211] The correlation map M.sub.cor(PK(p)) is smoothed with an IIR filter and summed across the bins in the frequency range k = 0, ..., 79 to yield a single number, using, for example, the following two relations:
where n denotes the current frame and k denotes the frequency bin. The weight β(n) used in the equation above is called the soft VAD parameter. It is initialized to 0 and may be updated in each frame as
where f.sub.VAD(n) is the binary VAD flag from the IVAS encoder (See Reference [1]). The weight β(n) is limited to the range of, for example, (0.05, 0.95). The extractor 301 sets the tonality flag f.sub.ton by comparing S.sub.mass with an adaptive threshold, thr.sub.mass. The threshold thr.sub.mass is initialized to, for example, 0.65 and incremented or decremented in steps of, for example, 0.01 in each frame. If S.sub.mass is higher than 0.65, then the threshold thr.sub.mass is increased by 0.01, otherwise it is decreased by 0.01. The threshold thr.sub.mass is upper limited to, for example, 0.75 and lower limited to, for example, 0.55. This adds a small hysteresis to the tonality flag f.sub.ton.
[0212] The tonality flag, f.sub.ton, is set to 1 if S.sub.mass is higher than thr.sub.mass. Otherwise, it is set to 0.
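The adaptive threshold and the tonality flag of paragraphs [0211] and [0212] may be sketched as follows, using the example constants from the text. The class name and attribute names are hypothetical. The sketch assumes that the flag is evaluated against the threshold carried over from the previous frames and that the threshold is updated afterwards; the text does not specify this order.

```python
class TonalityFlag:
    """Tonality flag f_ton with an adaptive threshold thr_mass."""

    def __init__(self, thr_init=0.65, step=0.01,
                 thr_min=0.55, thr_max=0.75, pivot=0.65):
        self.thr = thr_init      # adaptive threshold thr_mass
        self.step = step
        self.thr_min = thr_min
        self.thr_max = thr_max
        self.pivot = pivot       # fixed level that drives the adaptation

    def update(self, s_mass):
        """Process one frame and return f_ton (1 or 0)."""
        # Flag: compare S_mass with the current adaptive threshold
        # (assumption: comparison precedes the threshold update).
        f_ton = 1 if s_mass > self.thr else 0
        # Adaptation as stated in the text: increase the threshold when
        # S_mass is above the pivot, otherwise decrease it.
        if s_mass > self.pivot:
            self.thr += self.step
        else:
            self.thr -= self.step
        # Keep the threshold within its example limits.
        self.thr = min(max(self.thr, self.thr_min), self.thr_max)
        return f_ton
```

One instance is kept per encoder channel and `update()` is called once per frame, so the threshold moves by at most 0.01 per frame within the range 0.55 to 0.75.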
3.1.4 Spectral Peak-to-Average Ratio
[0213] Another high-level feature used in the core encoder selection mechanism is the spectral peak-to-average ratio. This feature is a measure of spectral sharpness of the input sound signal s(n). The extractor 301 calculates this high-level feature from the power spectrum of the input signal s(n) in logarithmic domain, S.sub.LT(n, k), k = 0, ..., 79, for example in the range from 0 to 4 kHz. However, the power spectrum S.sub.LT(n, k) is first smoothed with an IIR filter using, for example, the following relation:
where n denotes the current frame and k denotes the frequency bin. The spectral peak-to-average ratio is calculated using, for example, the following relation:
3.2 Core Encoder Initial Selector
[0214] Referring to
[0215] The initial selection of the core encoder by the selector 302 is based on (a) the relative frame energy E.sub.r, (b) the final class d.sub.SMC(n) selected in the first stage of the two-stage speech/music classification device and method and (c) the additional high-level features r.sub.p2a(n), S.sub.mass, and thr.sub.mass as described herein above. The selection mechanism used by the core encoder initial selector 302 is depicted in the schematic diagram of
[0216] Let d.sub.core ∈ {0,1,2} denote the core encoder technology selected by the mechanism in
3.3 Core Encoder Selection Refiner
[0217] Referring to
[0218] The core encoder selection refiner 303 may change the core encoder technology when d.sub.core = 1, i.e. when the GSC core encoder is initially selected for core coding. This situation can happen for example for musical items classified as MUSIC with low energy below 400 Hz. The affected segments of the input signal may be identified by analyzing the following energy ratio:
where E.sub.bin(k), k = 0, ... ,127 is the power spectrum per frequency bin k of the input signal in linear domain and E.sub.tot is the total energy of the signal segment (frame).
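As an illustration of the energy ratio, the following sketch divides the energy of the lowest 8 frequency bins (0 to 400 Hz, i.e. 50 Hz per bin) by the total energy. The function name is hypothetical, and the sketch assumes that E.sub.tot is expressed in the same linear domain as E.sub.bin(k) and equals the sum over all bins; the scaling actually used for rat.sub.LF may differ.

```python
def low_frequency_energy_ratio(e_bin, n_low_bins=8):
    """Fraction of the frame energy located in the lowest n_low_bins bins.

    e_bin -- power spectrum per frequency bin, linear domain (e.g. 128 bins)
    Assumption: the total energy is the linear sum of all bins.
    """
    e_tot = sum(e_bin)
    if e_tot <= 0.0:
        return 0.0  # silent frame: no meaningful ratio
    return sum(e_bin[:n_low_bins]) / e_tot
```

For a flat spectrum of 128 bins the ratio is 8/128 = 0.0625; a musical item with little content below 400 Hz yields a value close to 0.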
[0219] The summation in the numerator extends over the first 8 frequency bins of the energy spectrum corresponding to a frequency range of 0 to 400 Hz. The core encoder selection refiner 303 calculates and analyzes the energy ratio rat.sub.LF in frames previously classified as MUSIC with a reasonably high accuracy. The core encoder technology is changed from GSC to ACELP under, for example, the following condition:
[0220] For signals with very short and stable pitch period, GSC is not the optimal core coder technology. Therefore, as a non-limitative example, when f.sub.sp = 1, the core encoder selection refiner 303 changes the core encoder technology from GSC to ACELP or TCX as follows:
[0221] Highly correlated signals with low energy variation are another type of signals for which the GSC core encoder technology is not suitable. For these signals, the core encoder selection refiner 303 switches the core encoder technology from GSC to TCX. As a non-limitative example, this change of core encoder is made when the following conditions are met:
where
is the absolute pitch value from the first half-frame of the open-loop pitch analysis (See Reference [1]) in current frame n.
[0222] Finally, in a non-limitative example, the core encoder selection refiner 303 may change the initial core encoder selection from GSC to ACELP in frames where an attack is detected, provided the following condition is fulfilled:
The flag f.sub.no_GSC is an indicator that the change of the core encoder technology is enabled.
[0223] The condition above ensures that this change of core encoder from GSC to ACELP happens only in segments with rising energy. If the condition above is fulfilled and, at the same time, a transition frame counter TC.sub.cnt has been set to 1 in the IVAS codec (Reference [1]), then the core encoder selection refiner 303 changes the core encoder to ACELP. That is:
Additionally, when the core encoder technology is changed to ACELP the frame type is set to TRANSITION. This means that the attack will be encoded with the TRANSITION mode of the ACELP core encoder.
[0224] If an attack is detected by the segmental attack detection procedure of the additional high-level features detection operation 351, as described in section 3.1.2 above, then the index (position) of this attack, k.sub.ata, is further examined. If the position of the detected attack is in the last sub-frame of frame n, then the core encoder selection refiner 303 changes the core encoder technology to ACELP, for example when the following conditions are fulfilled:
Additionally, when the core encoder technology is changed to ACELP the frame type is set to TRANSITION and a new attack “flag” f.sub.ata is set as follows
[0225] This means that the attack will be encoded with the TRANSITION mode of the ACELP core encoder.
[0226] If the position of the detected attack is not located in the last sub-frame but at least beyond the first quarter of the first sub-frame, then the core encoder selection is not changed and the attack will be encoded with the GSC core encoder. Similarly to the previous case, a new attack “flag” f.sub.ata may be set as follows:
[0227] The parameter k.sub.ata is intended to reflect the position of the detected attack, so the attack flag f.sub.ata is somewhat redundant. However, it is used in the present disclosure for consistency with other documents and with the source code of the IVAS codec.
[0228] Finally, core encoder selection refiner 303 changes the frame type from GENERIC to TRANSITION in speech frames for which the ACELP core coder technology has been selected during the initial selection. This situation happens only in active frames where the local VAD flag has been set to 1 and in which an attack has been detected by the segmental attack detection procedure of the additional high-level features detection operation 351, described in section 3.1.2, i.e. where k.sub.ata > 0.
[0229] The attack flag is then similar as in the previous situation. That is:
4. Example Configuration of Hardware Component
[0230]
[0231] The IVAS codec, including the two-stage speech/music classification device may be implemented as a part of a mobile terminal, as a part of a portable media player, or in any similar device. The IVAS codec, including the two-stage speech/music classification device (identified as 1500 in
[0232] The input 1502 is configured to receive the input sound signal s(n), for example the left and right channels of an input stereo sound signal in digital or analog form in the case of the encoder of the IVAS codec. The output 1504 is configured to supply an encoded multiplexed bit-stream in the case of the encoder of the IVAS codec. The input 1502 and the output 1504 may be implemented in a common module, for example a serial input/output device.
[0233] The processor 1506 is operatively connected to the input 1502, to the output 1504, and to the memory 1508. The processor 1506 is realized as one or more processors for executing code instructions in support of the functions of the various elements and operations of the above described IVAS codec, including the two-stage speech/music classification device and method as shown in the accompanying figures and/or as described in the present disclosure.
[0234] The memory 1508 may comprise a non-transient memory for storing code instructions executable by the processor 1506, specifically, a processor-readable memory storing non-transitory instructions that, when executed, cause a processor to implement the elements and operations of the IVAS codec, including the two-stage speech/music classification device and method. The memory 1508 may also comprise a random access memory or buffer(s) to store intermediate processing data from the various functions performed by the processor 1506.
[0235] Those of ordinary skill in the art will realize that the description of the IVAS codec, including the two-stage speech/music classification device and method, is illustrative only and is not intended to be in any way limiting. Other embodiments will readily suggest themselves to such persons with ordinary skill in the art having the benefit of the present disclosure. Furthermore, the disclosed IVAS codec, including the two-stage speech/music classification device and method, may be customized to offer valuable solutions to existing needs and problems of encoding and decoding sound, for example stereo sound.
[0236] In the interest of clarity, not all of the routine features of the implementations of the IVAS codec, including the two-stage speech/music classification device and method are shown and described. It will, of course, be appreciated that in the development of any such actual implementation of the IVAS codec, including the two-stage speech/music classification device and method, numerous implementation-specific decisions may need to be made in order to achieve the developer’s specific goals, such as compliance with application-, system-, network- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the field of sound processing having the benefit of the present disclosure.
[0237] In accordance with the present disclosure, the elements, processing operations, and/or data structures described herein may be implemented using various types of operating systems, computing platforms, network devices, computer programs, and/or general purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general purpose nature, such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used. Where a method comprising a series of operations and sub-operations is implemented by a processor, computer or a machine and those operations and sub-operations may be stored as a series of non-transitory code instructions readable by the processor, computer or machine, they may be stored on a tangible and/or non-transient medium.
[0238] Elements and processing operations of the IVAS codec, including the two-stage speech/music classification device and method as described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.
[0239] In the IVAS codec, including the two-stage speech/music classification device and method, the various processing operations and sub-operations may be performed in various orders and some of the processing operations and sub-operations may be optional.
[0240] Although the present disclosure has been described hereinabove by way of non-restrictive, illustrative embodiments thereof, these embodiments may be modified at will within the scope of the appended claims without departing from the spirit and nature of the present disclosure.
REFERENCES
[0241] The present disclosure mentions the following references, of which the full content is incorporated herein by reference:
[0242] [1] 3GPP TS 26.445, v.12.0.0, “Codec for Enhanced Voice Services (EVS); Detailed Algorithmic Description”, September 2014.
[0243] [2] M. Neuendorf, M. Multrus, N. Rettelbach, G. Fuchs, J. Robillard, J. Lecompte, S. Wilde, S. Bayer, S. Disch, C. Helmrich, R. Lefebvre, P. Gournay, et al., “The ISO/MPEG Unified Speech and Audio Coding Standard - Consistent High Quality for All Content Types and at All Bit Rates”, J. Audio Eng. Soc., vol. 61, no. 12, pp. 956-977, December 2013.
[0244] [3] F. Baumgarte, C. Faller, “Binaural cue coding - Part I: Psychoacoustic fundamentals and design principles”, IEEE Trans. Speech Audio Processing, vol. 11, pp. 509-519, November 2003.
[0245] [4] Tommy Vaillancourt, “Method and system using a long-term correlation difference between left and right channels for time domain down mixing a stereo sound signal into primary and secondary channels”, PCT Application WO2017/049397A1.
[0246] [5] 3GPP SA4 contribution S4-170749, “New WID on EVS Codec Extension for Immersive Voice and Audio Services”, SA4 meeting #94, June 26-30, 2017, http://www.3gpp.org/ftp/tsg_sa/WG4_CODEC/TSGS4_94/Docs/S4-170749.zip
[0247] [6] V. Malenovsky, T. Vaillancourt, W. Zhe, K. Choo and V. Atti, “Two-stage speech/music classifier with decision smoothing and sharpening in the EVS codec”, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, 2015, pp. 5718-5722.
[0248] [7] T. Vaillancourt and M. Jelinek, “Coding generic audio signals at low bitrates and low delay”, U.S. Pat. No. 9,015,038 B2.
[0249] [8] K. S. Rao and A. K. Vuppala, Speech Processing in Mobile Environments, Appendix A: MFCC features, Springer International Publishing, 2014.
[0250] [9] G. E. P. Box and D. R. Cox, “An analysis of transformations”, Journal of the Royal Statistical Society, Series B, vol. 26, pp. 211-252, 1964.
[0251] [10] R. D′Agostino and E. S. Pearson, “Tests for departure from normality”, Biometrika, vol. 60, pp. 613-622, 1973.
[0252] [11] R. B. D′Agostino, A. J. Belanger and R. B. D′Agostino Jr., “A suggestion for using powerful and informative tests of normality”, American Statistician, vol. 44, pp. 316-321, 1990.
[0253] [12] I. Jolliffe, Principal component analysis. New York: Springer Verlag, 2002.