Method and apparatus for audio signal processing evaluation
11636844 · 2023-04-25
CPC classification
G10L13/02 (PHYSICS)
G10L21/02 (PHYSICS)
International classification
G10L13/02 (PHYSICS)
G10L21/02 (PHYSICS)
Abstract
A method and an apparatus for audio signal processing evaluation are provided. The audio signal processing is performed on a synthesized audio signal to generate a processed audio signal. The synthesized audio signal is generated by adding a secondary signal to a main signal, where the main signal is merely a speech signal. The audio signal processing is related to removing the secondary signal from the synthesized audio signal. Sound characteristics of the processed audio signal and the main signal are obtained, respectively. The sound characteristics include text content, and the text content is generated by performing speech-to-text on the processed audio signal and the main signal. The audio signal processing is evaluated according to a comparison result between the sound characteristics of the processed audio signal and the main signal. The comparison result includes correctness of the text content of the processed audio signal relative to the main signal.
Claims
1. An audio signal processing evaluation method, comprising: performing an audio signal processing on a synthesized audio signal so as to generate a processed audio signal, wherein the synthesized audio signal is generated by adding a secondary signal to a main signal, the main signal is merely a speech signal, and the audio signal processing is related to filtering the secondary signal from the synthesized audio signal; obtaining sound characteristics of the processed audio signal and the main signal respectively, wherein the sound characteristics comprise text content and a voiceprint feature, and the text content is generated by performing speech-to-text on the processed audio signal and the main signal; and evaluating the audio signal processing based on a comparison result between the sound characteristics of the processed audio signal and the main signal, wherein the comparison result comprises correctness of the text content of the processed audio signal corresponding to the main signal, the comparison result further comprises a voiceprint similarity of the voiceprint feature between the processed audio signal and the main signal, the voiceprint similarity is related to a distance between characteristic vectors of the processed audio signal and the main signal, and the characteristic vectors are converted from the voiceprint feature; wherein evaluating the audio signal processing further comprises: comparing a character difference between the text content of the processed audio signal and the text content of the main signal, wherein the character difference is related to whether corresponding characters in the text content are the same; determining a text correctness rate of the processed audio signal relative to the main signal based on the character difference, wherein the correctness of the text content is related to the text correctness rate; and determining a completeness of the processed audio signal based on
I = d₂·e^(−α·d₁), wherein d₂ is the text correctness rate, d₁ is the distance, and α is an adjustment parameter.
2. The audio signal processing evaluation method as described in claim 1, wherein the text correctness rate is a ratio of a number of identical characters to a total number of characters of the text content.
3. The audio signal processing evaluation method as described in claim 1, wherein evaluating the audio signal processing comprises: determining that the higher the voiceprint similarity and the higher the correctness of the text content, the better the evaluation result; and determining that the lower the voiceprint similarity or the lower the correctness of the text content, the poorer the evaluation result.
4. The audio signal processing evaluation method as described in claim 3, wherein evaluating the audio signal processing comprises: determining that the closer the distance, the higher the voiceprint similarity; and determining that the farther the distance, the lower the voiceprint similarity.
5. The audio signal processing evaluation method as described in claim 4, wherein evaluating the audio signal processing comprises: determining that the closer the distance and the higher the text correctness rate, the higher the voiceprint similarity; and determining that the farther the distance or the lower the text correctness rate, the lower the voiceprint similarity.
6. An audio signal apparatus for processing evaluation, comprising: a storage, storing program code; and a processor, coupled to the storage, loading the program code to be configured for: performing an audio signal processing on a synthesized audio signal so as to generate a processed audio signal, wherein the synthesized audio signal is generated by adding a secondary signal to a main signal, the main signal is merely a speech signal, and the audio signal processing is related to filtering the secondary signal from the synthesized audio signal; obtaining sound characteristics of the processed audio signal and the main signal respectively, wherein the sound characteristics comprise text content and a voiceprint feature, and the text content is generated by performing speech-to-text on the processed audio signal and the main signal; evaluating the audio signal processing based on a comparison result of the sound characteristics between the processed audio signal and the main signal, wherein the comparison result comprises a correctness rate of the text content of the processed audio signal corresponding to the main signal, and the comparison result further comprises a voiceprint similarity of the voiceprint feature between the processed audio signal and the main signal; comparing a character difference between the text content of the processed audio signal and the text content of the main signal, wherein the character difference is related to whether corresponding characters in the text content are the same; determining a text correctness rate of the processed audio signal relative to the main signal based on the character difference, wherein the correctness of the text content is related to the text correctness rate; and determining a completeness of the processed audio signal based on
I = d₂·e^(−α·d₁), wherein d₂ is the text correctness rate, d₁ is the distance, and α is an adjustment parameter.
7. The audio signal apparatus for processing evaluation as described in claim 6, wherein the text correctness rate is a ratio of a number of identical characters to a total number of characters of the text content.
8. The audio signal apparatus for processing evaluation as described in claim 6, wherein the processor is further configured for: determining that the higher the voiceprint similarity and the higher the correctness of the text content corresponds to a better evaluation result; and determining that the lower the voiceprint similarity or the lower the correctness of the text content corresponds to a poorer evaluation result.
9. The audio signal apparatus for processing evaluation as described in claim 8, wherein the processor is further configured for: determining that the closer the distance, the higher the voiceprint similarity; and determining that the farther the distance, the lower the voiceprint similarity.
10. The audio signal apparatus for processing evaluation as described in claim 9, wherein the processor is further configured for: determining that the closer the distance and the higher the text correctness rate, the higher the voiceprint similarity; and determining that the farther the distance or the lower the text correctness rate, the lower the voiceprint similarity.
11. An audio signal processing evaluation method, comprising: performing an audio signal processing on a synthesized audio signal so as to generate a processed audio signal, wherein the synthesized audio signal is generated by adding a secondary signal to a main signal, the main signal is merely a speech signal, and the audio signal processing is related to filtering the secondary signal from the synthesized audio signal; obtaining sound characteristics of the processed audio signal and the main signal respectively, wherein the sound characteristics comprise text content and a voiceprint feature, and the text content is generated by performing speech-to-text on the processed audio signal and the main signal; and evaluating the audio signal processing based on a comparison result between the sound characteristics of the processed audio signal and the main signal, wherein the comparison result comprises correctness of the text content of the processed audio signal corresponding to the main signal, the comparison result further comprises a voiceprint similarity of the voiceprint feature between the processed audio signal and the main signal, the voiceprint similarity is related to a distance between characteristic vectors of the processed audio signal and the main signal, and the characteristic vectors are converted from the voiceprint feature; wherein evaluating the audio signal processing further comprises: comparing a character difference between the text content of the processed audio signal and the text content of the main signal, wherein the character difference is related to whether corresponding characters in the text content are the same; determining a text correctness rate of the processed audio signal relative to the main signal based on the character difference, wherein the correctness of the text content is related to the text correctness rate; determining that the higher the voiceprint similarity and the higher the correctness of the text content, the better the evaluation result; determining that the lower the voiceprint similarity or the lower the correctness of the text content, the poorer the evaluation result; determining that the closer the distance, the higher the voiceprint similarity, or determining that the closer the distance and the higher the text correctness rate, the higher the voiceprint similarity; determining that the farther the distance or the lower the text correctness rate, the lower the voiceprint similarity; and determining a completeness of the processed audio signal based on
I = d₂·e^(−α·d₁), wherein d₂ is the text correctness rate, d₁ is the distance, and α is an adjustment parameter.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the disclosure and, together with the description, serve to explain the principles of the disclosure.
DESCRIPTION OF THE EMBODIMENTS
(5) Reference will now be made in detail to the exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
(7) The storage 110 may be any type of fixed or removable random access memory (RAM), read only memory (ROM), flash memory, traditional hard disk drive (HDD), solid-state drive (SSD), or similar components. In one embodiment, the storage 110 is configured to record program codes, software modules (for example, a synthesization module 111, an audio signal processing module 113, a feature extraction module 115, and an evaluation module 117), configuration, data or files (for example, audio signal, sound characteristics, and evaluation results), which will be detailed in subsequent embodiments.
(8) The processor 150 is coupled to the storage 110. The processor 150 may be a central processing unit (CPU), a graphics processing unit (GPU), or other programmable general-purpose or special-purpose microprocessors, digital signal processor (DSP), programmable controller, field programmable gate array (FPGA), application-specific integrated circuit (ASIC), neural network accelerator, or other similar components, or a combination of the above components. In one embodiment, the processor 150 is configured to execute all or part of the operations of the audio signal apparatus for processing evaluation 100, and may load and execute various program codes, software modules, files, and data recorded by the storage 110.
(9) Hereinafter, various components, modules, and signals in the audio signal apparatus for processing evaluation 100 will be used to describe the method according to the embodiment of the disclosure. Each process of the method may be adjusted accordingly depending on the implementation situation, and is not limited thereto.
(11) In one embodiment, the synthesization module 111 may, for example, superimpose the two signals S.sup.M and S.sup.S in the spectral domain or use other synthesis techniques. In another embodiment, the audio signal apparatus for processing evaluation 100 may simultaneously play the main signal S.sup.M and the secondary signal S.sup.S through a built-in or external speaker, and further record the two signals, so as to obtain the synthesized audio signal S.sup.C.
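As an illustrative, non-limiting sketch, the simplest synthesis variant (time-domain superposition of a main signal and a secondary signal) may be expressed as follows. The signals, sample rate, and mixing gain below are hypothetical examples, not values from the disclosure:

```python
import numpy as np

def synthesize(main_signal: np.ndarray, secondary_signal: np.ndarray,
               gain: float = 1.0) -> np.ndarray:
    """Superimpose a secondary signal onto the main (speech) signal.

    A simple time-domain addition; the disclosure also contemplates
    spectral-domain superposition or play-and-record synthesis.
    """
    n = min(len(main_signal), len(secondary_signal))
    return main_signal[:n] + gain * secondary_signal[:n]

# Illustrative stand-ins: a tone for the speech signal, noise for the secondary signal.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)  # 1 s at 16 kHz
s_main = np.sin(2 * np.pi * 220.0 * t)            # stand-in for S^M
s_sec = 0.1 * rng.standard_normal(t.shape)        # stand-in for S^S
s_comb = synthesize(s_main, s_sec)                # synthesized signal S^C
```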
(12) On the other hand, in one embodiment, the audio signal processing performed by the audio signal processing module 113 on the synthesized audio signal S.sup.C is related to filtering the secondary signal S.sup.S from the synthesized audio signal S.sup.C. For example, one of the purposes of the audio signal processing is to restore the main signal S.sup.M or eliminate noise. Noise reduction and suppression (or audio source separation) technology may, for example, generate a signal with a phase opposite to that of the noise audio wave, or eliminate the noise (i.e. the secondary signal S.sup.S) from the synthesized audio signal S.sup.C with independent component analysis (ICA). The embodiments of the disclosure are not limited thereto.
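The opposite-phase idea can be sketched numerically. This is a hypothetical illustration with an idealized (perfect) noise estimate; a practical noise canceller must estimate the noise from observations rather than read it directly:

```python
import numpy as np

def antiphase_cancel(synth: np.ndarray, noise_estimate: np.ndarray) -> np.ndarray:
    """Add an opposite-phase copy of the estimated noise to the mixture.

    With a perfect estimate, S^C + (-S^S) recovers S^M exactly; in
    practice the estimate is imperfect and the noise is only attenuated.
    """
    return synth + (-noise_estimate)

rng = np.random.default_rng(1)
s_main = np.sin(2 * np.pi * np.arange(8000) * 440.0 / 16000.0)  # stand-in for S^M
s_sec = 0.2 * rng.standard_normal(8000)                         # stand-in for S^S
s_proc = antiphase_cancel(s_main + s_sec, s_sec)                # processed signal S^P
```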
(13) It is worth noting that audio signal processing based on different technologies may produce output signals that differ in frequency, waveform, or amplitude for the same input signal. To evaluate multiple audio signal processing technologies, the audio signal processing module 113 may integrate the audio signal processing technologies, and use different audio signal processing technologies to process the synthesized audio signal S.sup.C. Further, to understand the filtering ability of a specific audio signal processing on different secondary signals S.sup.S, different secondary signals S.sup.S may be added separately.
(14) In one embodiment, the audio signal apparatus for processing evaluation 100 may play the main signal S.sup.M and the processed audio signal S.sup.P through the built-in or external speaker and further record the two signals S.sup.M and S.sup.P respectively for subsequent analysis.
(15) The feature extraction module 115 may obtain sound characteristics F.sup.P and F.sup.M for the processed audio signal S.sup.P and the main signal S.sup.M respectively (step S230). Specifically, one basis of the evaluation is whether the voiceprint feature of the main speech is still preserved after the audio signal is processed, and whether semantic recognition is improved. In one embodiment, the sound characteristics F.sup.P and F.sup.M include the voiceprint feature. The feature extraction module 115, for example, uses linear predictive coefficients (LPC), cepstrum coefficients, mel-frequency cepstrum coefficients (MFCC), or other feature parameter extraction methods to obtain the voiceprint feature. The voiceprint feature may be used to distinguish the voices of different people. It can be seen that one of the judgment bases for the evaluation is whether, after listening to the processed audio signal S.sup.P, the listener can recognize it as the same person corresponding to the main signal S.sup.M.
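As a minimal sketch of this step, a toy cepstrum-based feature vector and a Euclidean distance between characteristic vectors can be computed as follows. This stands in for the LPC/MFCC extraction named above and is an editorial illustration, not the disclosed implementation:

```python
import numpy as np

def cepstral_voiceprint(signal: np.ndarray, n_coeffs: int = 13) -> np.ndarray:
    """Toy cepstrum-based characteristic vector (stand-in for LPC/MFCC).

    Real cepstrum: inverse FFT of the log magnitude spectrum; the first
    coefficients roughly describe the spectral envelope ("voiceprint").
    """
    spectrum = np.abs(np.fft.rfft(signal)) + 1e-12  # avoid log(0)
    cepstrum = np.fft.irfft(np.log(spectrum))
    return cepstrum[:n_coeffs]

def voiceprint_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two characteristic vectors."""
    return float(np.linalg.norm(a - b))

t = np.arange(16000) / 16000.0
f_m = cepstral_voiceprint(np.sin(2 * np.pi * 220.0 * t))   # features of the main signal
f_p = cepstral_voiceprint(np.sin(2 * np.pi * 220.0 * t))   # identical processed signal
f_p2 = cepstral_voiceprint(np.sin(2 * np.pi * 440.0 * t))  # a different voice/tone
```

A smaller distance indicates a higher voiceprint similarity, consistent with the comparison described later.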
(17) In one embodiment, the sound characteristics F.sup.P and F.sup.M include text content. The feature extraction module 115 may perform speech-to-text on the processed audio signal S.sup.P and the main signal S.sup.M so as to generate text content F.sub.2.sup.P and text content F.sub.2.sup.M (step S232). The speech-to-text may be based on, for example, feature extraction, acoustic models, pronunciation dictionaries, language models, decoders, or combinations thereof, so as to output word strings with the largest or relatively large probability. The text content is the speech content in the audio signals (expressed in text form). The text content may be used to understand semantics. It can be seen that one of the judgment bases of the evaluation is whether, after listening to the processed audio signal S.sup.P, the listener can recognize the correct content corresponding to the main signal S.sup.M.
(18) In one embodiment, the sound characteristics F.sup.P and F.sup.M include both the voiceprint feature and the text content.
(19) The evaluation module 117 may evaluate the audio signal processing performed by the audio signal processing module 113 based on a comparison result between the sound characteristics of the processed audio signal S.sup.P and the main signal S.sup.M (step S250). In one embodiment, for the voiceprint feature, the comparison result includes voiceprint similarity, and the evaluation module 117 may compare the voiceprint similarity of the voiceprint feature between the processed audio signal S.sup.P and the main signal S.sup.M; that is: whether the voiceprint feature of the processed audio signal S.sup.P is the same or similar to the voiceprint feature of the main signal S.sup.M.
(20) According to different feature extraction techniques, the method of voiceprint comparison may be different. Referring to
(21) In one embodiment, for the text content, the comparison result includes correctness of the text content of the processed audio signal S.sup.P corresponding to the main signal S.sup.M, such as whether corresponding characters in the text content of the two signals S.sup.P and S.sup.M are the same.
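The character-wise comparison and the text correctness rate of claims 2 and 7 (a ratio of identical characters to the total number of characters) can be sketched as below. The position-wise comparison is a naive reading chosen for illustration; an edit-distance-based character error rate is an equally plausible realization:

```python
def text_correctness_rate(processed_text: str, main_text: str) -> float:
    """Ratio of positions with identical characters to the total number
    of characters of the main (reference) text content.

    A naive position-wise comparison of corresponding characters; an
    alignment-based metric would tolerate insertions and deletions.
    """
    if not main_text:
        return 1.0 if not processed_text else 0.0
    same = sum(1 for p, m in zip(processed_text, main_text) if p == m)
    return same / len(main_text)

# One character out of eleven differs, so the rate is 10/11.
rate = text_correctness_rate("hello world", "hello werld")
```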
(22) Referring to
(23) In one embodiment, the comparison result includes both the correctness of the text content and the voiceprint similarity. The evaluation module 117 may determine that the higher the voiceprint similarity and the higher the correctness of the text content, the better the evaluation result (that is, the better the audio signal processing result). Also, the evaluation module 117 may determine that the lower the voiceprint similarity or the lower the correctness of the text content, the poorer the evaluation result (that is, the poorer the audio signal processing result).
(24) For example, the evaluation module 117 may calculate completeness I (step S253):
I = d₂·e^(−α·d₁)  (1)
α is an adjustable parameter (a constant for a given evaluation), d₁ is the distance between the characteristic vectors of the two signals S.sup.P and S.sup.M, and d₂ is the text correctness rate. Assuming that the text correctness rate d₂ is between 0 and 1, the completeness I will also be between 0 and 1. The completeness I is related to the evaluation result: the larger the value, the better the evaluation result (for example, the characteristics of the two signals S.sup.P and S.sup.M are closer), and the smaller the value, the poorer the evaluation result (for example, the characteristics of the two signals S.sup.P and S.sup.M are farther apart).
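Under a reading of the completeness formula as I = d₂·e^(−α·d₁), with d₂ the text correctness rate in [0, 1] and d₁ the non-negative voiceprint distance (a reconstruction from the surrounding definitions; the garbled source formula leaves the exponent's subscript uncertain), step S253 can be computed as:

```python
import math

def completeness(text_correctness: float, distance: float,
                 alpha: float = 1.0) -> float:
    """Completeness I = d2 * exp(-alpha * d1).

    d2 (text correctness) lies in [0, 1]; the exponential maps the
    voiceprint distance d1 >= 0 into (0, 1], so I stays in [0, 1].
    alpha controls how strongly the distance penalizes the score.
    """
    return text_correctness * math.exp(-alpha * distance)

i_good = completeness(text_correctness=0.95, distance=0.1)  # close match: high I
i_poor = completeness(text_correctness=0.50, distance=2.0)  # poor match: low I
```

Higher text correctness and smaller voiceprint distance both push I toward 1, matching the stated relationship between completeness and the evaluation result.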
(25) In this way, when speech-related audio signal processing is applied to reduce noise in a conversation, it can be determined whether the processing simultaneously preserves the voiceprint feature of the main speech and improves semantic recognition.
(26) It should be noted that the quantitative method of the evaluation result is not limited to Formula (1) of completeness I, and can be adjusted by the user according to actual needs.
(27) In summary, in the audio signal processing evaluation method and apparatus of the embodiment of the disclosure, the sound characteristics of the main signal and the processed audio signal are analyzed, and the quality of the audio signal processing is determined based on the correctness of text/recognition and the voiceprint similarity. Accordingly, an objective result may be provided.
(28) The disclosure has been described in the above embodiments, but the embodiments are not intended to limit the disclosure. It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the disclosure without departing from the scope or spirit of the disclosure. In view of the foregoing, it is intended that the disclosure cover modifications and variations of this disclosure provided they fall within the scope of the following claims and their equivalents.