HIERARCHICAL GENERATED AUDIO DETECTION SYSTEM

Abstract

Disclosed is a hierarchical generated audio detection system comprising an audio preprocessing module, a CQCC feature extraction module, an LFCC feature extraction module, a first-stage lightweight coarse-level detection model and a second-stage fine-level deep identification model. The audio preprocessing module preprocesses collected audio or video data to obtain an audio clip with a length not exceeding a predetermined limit. The audio clip is input into the CQCC feature extraction module and the LFCC feature extraction module respectively to obtain a CQCC feature and an LFCC feature. The CQCC feature or the LFCC feature is input into the first-stage lightweight coarse-level detection model for first-stage screening, which separates first-stage real audio from first-stage generated audio. The CQCC feature or the LFCC feature of the first-stage generated audio is then input into the second-stage fine-level deep identification model, which separates second-stage real audio from second-stage generated audio; the second-stage generated audio is identified as generated audio.

Claims

1. A hierarchical generated audio detection system, wherein the hierarchical generated audio detection system is a two-stage generated audio detection system, the system comprising: an audio preprocessing module; a CQCC (Constant Q Cepstral Coefficients) feature extraction module; and an LFCC (Linear Frequency Cepstrum Coefficients) feature extraction module, wherein the hierarchical generated audio detection system includes a first-stage lightweight coarse-level detection model and a second-stage fine-level deep identification model, and wherein performing a generated audio detection by the hierarchical generated audio detection system comprises: performing data preprocess of collected audio or video data by the audio preprocessing module so as to obtain an audio clip with a length not exceeding a predetermined limit; inputting the audio clip into the CQCC feature extraction module and the LFCC feature extraction module respectively so as to obtain CQCC feature and LFCC feature; inputting the CQCC feature or LFCC feature into the first-stage lightweight coarse-level detection model for first-stage screening so as to screen out the first-stage real audio and the first-stage generated audio, inputting the CQCC feature or LFCC feature of the first-stage generated audio into the second-stage fine-level deep identification model so as to identify the second-stage real audio and the second-stage generated audio, wherein the second-stage generated audio is a generated audio; wherein the first-stage lightweight coarse-level detection model is a lightweight convolutional model, which is constructed by convolutional neural network; and wherein the second-stage fine-level deep identification model adopts a single model system with a higher complexity or the integration of multiple models.

2. The hierarchical generated audio detection system according to claim 1, wherein inputs of the first-stage lightweight coarse-level detection model comprise: LFCC feature and a splicing feature composed of a first-order difference and a second-order difference of the LFCC feature; CQCC feature and a splicing feature composed of a first-order difference and a second-order difference of the CQCC feature.

3. The hierarchical generated audio detection system according to claim 1, wherein inputs of the second-stage fine-level deep identification model comprise: LFCC feature and a splicing feature composed of a first-order difference and a second-order difference of the LFCC feature; CQCC feature and a splicing feature composed of a first-order difference and a second-order difference of the CQCC feature.

4. The hierarchical generated audio detection system according to claim 1, wherein a particular structure of the lightweight convolution model includes 11 layers, including 3 layers of 2D convolutional layers, 7 layers of bottleneck residual block, and 1 layer of average pooling layer; real and false audio are obtained by average pooling layer and mapping to two dimensions subsequently; finally, the probability that the input audio belongs to the real and false audio is obtained through softmax operation.

5. The hierarchical generated audio detection system according to claim 4, wherein a particular method for performing the first-stage screening so as to screen out the first-stage real audio and the first-stage generated audio is as follows: for open audio data set, computing ROC (Receiver Operating Characteristic) curve to obtain the first-stage discrimination threshold; if the first-stage lightweight coarse-level detection model identifies that the input audio is generated with a probability greater than the first-stage discrimination threshold, the input audio is deemed to be the first-stage generated audio; if the first-stage lightweight coarse-level detection model identifies that the input audio is generated with a probability less than the first-stage discrimination threshold, the input audio is deemed to be the first-stage real audio, and no secondary identification is required.

6. The hierarchical generated audio detection system according to claim 1, wherein a particular structure of the second-stage fine-level deep identification model comprises two layers of two-dimensional convolution, one layer of linear mapping, one layer of position coding module, twelve layers of transformer coding layer and the last output mapping layer.

7. The hierarchical generated audio detection system according to claim 6, wherein a particular method for identifying the second-stage real audio and the second-stage generated audio is as follows: for open audio data set, computing ROC curve to obtain the second-stage discrimination threshold; if the second-stage fine-level deep identification model identifies that the first-stage generated audio is generated with a probability greater than the second-stage discrimination threshold, the first-stage generated audio is deemed to be generated audio; if the second-stage fine-level deep identification model identifies that the first-stage generated audio is generated with a probability less than the second-stage discrimination threshold, the first-stage generated audio is deemed to be real audio.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0034] The FIGURE is a structural block diagram of a hierarchical generated audio detection system provided in embodiments of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0035] Exemplary embodiments will be described here in detail, and examples thereof are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. Implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; on the contrary, they are merely examples of apparatus and methods consistent with some aspects of the present invention as detailed in the appended claims.

Embodiment 1

[0036] As shown in the FIGURE, a hierarchical generated audio detection system provided by the embodiments of the present disclosure comprises the following modules.

[0037] The modules are: an audio preprocessing module, a first audio feature extraction module, a second audio feature extraction module, a first-stage lightweight coarse-level detection model and a second-stage fine-level deep identification model. The first-stage lightweight coarse-level detection model is a lightweight convolutional model, typically constructed from the widely used MobileNet, which is characterized by a simple structure, few parameters and little computation, so it can quickly screen a large amount of data.

[0038] An embodiment of the present disclosure adopts a lightweight coarse-level detection model because the disclosure as a whole targets massive data. Applying a deep model directly to massive data for identification would incur a prohibitive amount of computation. Therefore, the present disclosure uses the lightweight model, which requires less computation, for coarse-level detection, and performs secondary identification with the fine-level deep identification model only on audio that the coarse-level detection flags as generated.
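The saving that motivates the two-stage design can be illustrated with simple arithmetic. The following is a minimal sketch only: the function names, the per-clip cost units and the flagged fraction are hypothetical and do not come from the disclosure.

```python
def cascade_cost(n_clips, light_cost, deep_cost, flagged_fraction):
    """Total compute when every clip passes through the lightweight model
    and only the flagged fraction is re-examined by the deep model."""
    if not 0.0 <= flagged_fraction <= 1.0:
        raise ValueError("flagged_fraction must lie in [0, 1]")
    return n_clips * light_cost + n_clips * flagged_fraction * deep_cost


def deep_only_cost(n_clips, deep_cost):
    """Total compute when the deep model is applied to every clip directly."""
    return n_clips * deep_cost


# Hypothetical numbers: the deep model costs 4 units per clip, the light
# model 1 unit, and a quarter of the clips are flagged by the first stage.
two_stage = cascade_cost(1000, 1.0, 4.0, 0.25)
single_stage = deep_only_cost(1000, 4.0)
```

Under these assumed numbers the cascade needs half the compute of the deep-only approach. In general the cascade is cheaper only while the flagged fraction stays below 1 - light_cost / deep_cost.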

[0039] In some embodiments, the particular structure of the lightweight convolutional model includes 11 layers: 3 2D convolutional layers, 7 bottleneck residual blocks and 1 average pooling layer. The kernel sizes and strides of the 3 2D convolutional layers are, respectively: a 13×9 convolution kernel (stride 7×5), a 9×7 convolution kernel (stride 5×4) and a 7×5 convolution kernel (stride 4×1).
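To see how strongly such large strides shrink the feature map, the standard output-size formula for a strided convolution can be applied to the first layer. This is a sketch under stated assumptions: the disclosure does not give the padding, the axis order or the number of input frames, so zero padding, a (time, feature) axis order and a 2000-frame by 60-dimension input are all hypothetical choices.

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Output length along one axis of a strided convolution."""
    if input_size + 2 * padding < kernel:
        raise ValueError("kernel does not fit inside the padded input")
    return (input_size + 2 * padding - kernel) // stride + 1


def conv2d_output_shape(input_shape, kernel, stride, padding=(0, 0)):
    """Apply the one-axis formula independently to both axes."""
    return tuple(
        conv_output_size(n, k, s, p)
        for n, k, s, p in zip(input_shape, kernel, stride, padding)
    )


# Assumed input: 2000 frames x 60 feature dimensions, no padding.
# Kernel 13x9 and stride 7x5 are the first-layer values given in the text.
first_layer = conv2d_output_shape((2000, 60), kernel=(13, 9), stride=(7, 5))
```

With these assumptions the first layer alone reduces the map from 2000×60 to 284×11, which is consistent with the stated goal of keeping the first-stage computation small.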

[0040] After the average pooling layer, the output is mapped via a linear mapping to two dimensions, which represent real and false audio respectively. Finally, the probability that the input audio is real or false is obtained through a softmax operation.
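The pooling, two-dimensional linear mapping and softmax described above can be sketched in a few lines. The helper names, the toy activations and the identity weights are hypothetical placeholders; a trained model would supply learned weights.

```python
import math


def global_average_pool(feature_maps):
    """Average each channel (a flat list of activations) down to one value."""
    return [sum(channel) / len(channel) for channel in feature_maps]


def linear(vector, weights, bias):
    """One output per weight row: dot(row, vector) + bias."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Numerically stable softmax (subtract the maximum before exponentiating)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical channels, mapped to two logits (real, false).
pooled = global_average_pool([[1.0, 3.0], [2.0, 2.0]])
logits = linear(pooled, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
p_real, p_false = softmax(logits)
```

The two outputs always sum to one, so either value can be compared against a discrimination threshold.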

[0041] Generally, the second-stage fine-level deep identification model still outputs only a real-or-false decision. However, under certain circumstances, multi-class identification can also be performed to distinguish different types of generated audio or different properties of generated audio objects. Common single models include SENet, LCNN and Transformer.

[0042] In some embodiments, the particular structure of the second-stage fine-level deep identification model comprises two two-dimensional convolutional layers, one linear mapping layer, one position coding module, twelve Transformer coding layers and a final output mapping layer. The probability that the audio is real or false is computed through a softmax function.

[0043] The audio preprocessing module preprocesses collected audio or video data to obtain an audio clip with a length not exceeding the limit. The particular methods comprise:

[0044] normalizing the collected audio data into monophonic audio with a sampling rate of 16 kHz, stored in WAV format; then performing silence detection on the normalized audio, culling purely silent clips, and saving the remaining audio as audio clips with a length not exceeding the limit;

[0045] for audio from video, first using a tool to extract the audio track, and then normalizing the extracted audio data into monophonic audio with a sampling rate of 16 kHz, stored in WAV format; then performing silence detection on the normalized audio, culling purely silent clips, and saving the remaining audio as audio clips with a length not exceeding the limit.
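The silence culling and length limiting steps can be sketched as follows. This is an illustrative sketch, not the disclosed implementation: it assumes the audio is already mono 16 kHz samples in a Python list, uses a simple frame-energy test as the silence detector, and the frame length and energy threshold are hypothetical values.

```python
def drop_silent_frames(samples, frame_len=400, energy_threshold=1e-4):
    """Keep only frames whose mean energy exceeds the threshold.
    frame_len=400 is 25 ms at 16 kHz (an assumed framing)."""
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > energy_threshold:
            kept.extend(frame)
    return kept


def split_into_clips(samples, max_len):
    """Cut the audio into consecutive clips of at most max_len samples."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]


# Toy signal: silence, a burst of sound, then silence again.
signal = [0.0] * 800 + [0.5] * 400 + [0.0] * 400
voiced = drop_silent_frames(signal)
clips = split_into_clips(voiced, max_len=16000 * 20)  # 20 s limit, as in Embodiment 2
```

A production system would likely use a more robust silence detector; the energy test here only shows where the culling sits in the pipeline.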

[0046] In some embodiments, the first audio feature extraction module is a CQCC feature extraction module or an LFCC feature extraction module.

[0047] In some embodiments, the second audio feature extraction module is a CQCC feature extraction module or an LFCC feature extraction module.

[0048] The methods may further comprise: inputting the audio clip into the CQCC feature extraction module and the LFCC feature extraction module respectively to obtain CQCC feature and LFCC feature.

[0049] The input of the first-stage lightweight coarse-level detection model further comprises:

[0050] inputting the LFCC feature and the splicing feature composed of the first-order difference and the second-order difference of the LFCC feature, or the CQCC feature and the splicing feature composed of the first-order difference and the second-order difference of the CQCC feature, into the first-stage lightweight coarse-level detection model for the first-stage screening to screen out the first-stage real audio and the first-stage generated audio. The particular method is as follows: for an open audio data set, an ROC curve is computed to obtain a first-stage discrimination threshold, such as 0.5. If the first-stage lightweight coarse-level detection model identifies that the input audio is generated with a probability greater than the first-stage discrimination threshold, the input audio is deemed to be first-stage generated audio. If the model identifies that the input audio is generated with a probability less than the first-stage discrimination threshold, the input audio is deemed to be first-stage real audio. No second-stage identification is required for first-stage real audio, whereas first-stage generated audio undergoes second-stage identification;

[0051] inputting the LFCC feature of the first-stage generated audio and the splicing feature composed of the first-order difference and the second-order difference of the LFCC feature, or the CQCC feature and the splicing feature composed of the first-order difference and the second-order difference of the CQCC feature, into the second-stage fine-level deep identification model to screen out the second-stage real audio and the second-stage generated audio, the second-stage generated audio being identified as generated audio. The particular method is as follows: for an open audio data set, an ROC curve is computed to obtain a second-stage discrimination threshold. If the second-stage fine-level deep identification model identifies that the first-stage generated audio is generated with a probability greater than the second-stage discrimination threshold, the first-stage generated audio is deemed to be generated audio. If the model identifies that the first-stage generated audio is generated with a probability less than the second-stage discrimination threshold, the first-stage generated audio is deemed to be real audio.
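The text states that each discrimination threshold is obtained from an ROC curve but does not say which operating point is chosen. The sketch below assumes, purely for illustration, the equal-error-rate point (where the false-positive rate is closest to the false-negative rate); the function names and the toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, false-positive rate, true-positive rate) triples.
    labels: 1 = generated (positive class), 0 = real.
    A clip counts as 'generated' when its score is greater than the threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one real and one generated example")
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def eer_threshold(scores, labels):
    """Assumed criterion: pick the threshold where FPR and FNR are closest."""
    best = min(roc_points(scores, labels), key=lambda p: abs(p[1] - (1.0 - p[2])))
    return best[0]


# Toy generated-probability scores on a labelled data set.
toy_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1]
threshold = eer_threshold(toy_scores, toy_labels)
```

Other criteria, such as fixing a maximum miss rate for the first stage so that few generated clips are skipped, would fit the same ROC-based procedure.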

Embodiment 2

[0052] The first-stage lightweight coarse-level detection model is constructed from MobileNetV2, with a model structure of 11 layers: 3 2D convolutional layers, 7 bottleneck residual blocks and 1 average pooling layer. The model has about 5M parameters. The first-stage lightweight coarse-level detection model takes as input the LFCC feature spliced with its first-order and second-order differences (60 dimensions in total). Each input clip has a fixed length of 20 seconds (zero-padded if shorter than 20 seconds and truncated if longer). The model input contains only one channel, while the output contains two nodes, representing real and false respectively.
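The fixed-length input and the spliced difference features can be sketched as below. Two points are assumptions rather than statements from the text: the differences are taken as simple frame-to-frame differences with a zero first frame (windowed regression deltas are a common alternative), and the 60 dimensions are read as 20 base coefficients plus 20 first-order and 20 second-order differences.

```python
def pad_or_truncate(samples, target_len):
    """Zero-pad short audio and cut long audio to exactly target_len samples."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


def difference(frames):
    """Frame-to-frame difference along time; the first frame gets zeros."""
    if not frames:
        return []
    result = [[0.0] * len(frames[0])]
    for previous, current in zip(frames, frames[1:]):
        result.append([c - p for c, p in zip(current, previous)])
    return result


def splice_with_differences(frames):
    """Concatenate each frame with its first- and second-order differences."""
    first = difference(frames)
    second = difference(first)
    return [x + d1 + d2 for x, d1, d2 in zip(frames, first, second)]


# Toy example with 2 coefficients per frame; 20 coefficients would give 60 dims.
toy_frames = [[1.0, 2.0], [3.0, 5.0], [6.0, 9.0]]
spliced = splice_with_differences(toy_frames)
fixed = pad_or_truncate([0.1, 0.2], target_len=4)
```

For 16 kHz audio the 20-second target corresponds to 320000 samples.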

[0053] The second-stage fine-level deep identification model is constructed from a Transformer model. From the bottom layer to the top layer, the fine-level deep identification model includes two two-dimensional convolutional layers, one linear mapping layer, one position coding module, twelve Transformer coding layers and a final output mapping layer. The model has about 20M parameters in total. Each convolutional layer has a stride of 2, so the two convolutional layers together downsample the sequence by a factor of 4. The fine-level deep identification model takes as input the LFCC feature spliced with its first-order and second-order differences (60 dimensions in total). The final output mapping layer has two outputs, indicating real and false respectively.
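The effect of the two stride-2 convolutional layers on the sequence length can be checked numerically. This sketch assumes padding such that each layer yields ceil(n / stride) frames and uses a hypothetical 2000-frame input; neither value is given in the text.

```python
import math


def downsampled_length(n_frames, n_layers=2, stride=2):
    """Sequence length after n_layers strided convolutions,
    assuming each layer outputs ceil(n / stride) frames."""
    for _ in range(n_layers):
        n_frames = math.ceil(n_frames / stride)
    return n_frames


frames_in = 2000  # hypothetical number of input frames
frames_out = downsampled_length(frames_in)

# Self-attention compares every frame with every other frame, so the number
# of frame pairs falls with the square of the downsampling factor.
pair_reduction = (frames_in ** 2) / (frames_out ** 2)
```

The fourfold shorter sequence therefore cuts the number of attention pairs in each Transformer coding layer by a factor of 16.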

[0054] Identification proceeds in two stages. In the first stage, the lightweight convolutional model coarsely screens the massive audio: audio with a generation probability less than 0.5 is skipped directly, and audio with a generation probability greater than 0.5 undergoes secondary identification by the fine-level deep identification model. For audio that undergoes secondary identification, the secondary identification result is the final identification result.
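The two-stage routing can be sketched as a small function. The two models are passed in as callables that return a generation probability; the stub models below are hypothetical stand-ins for the trained networks, and treating a probability exactly equal to the threshold as real is an assumption, since the text only covers the greater-than and less-than cases.

```python
def detect(features, light_model, deep_model,
           first_threshold=0.5, second_threshold=0.5):
    """Return 'real' or 'generated' for one clip using the two-stage cascade."""
    # Stage 1: cheap screening; clips judged real are skipped directly.
    if light_model(features) <= first_threshold:
        return "real"
    # Stage 2: only flagged clips reach the deep model, whose result is final.
    if deep_model(features) > second_threshold:
        return "generated"
    return "real"


# Hypothetical stub models that read precomputed probabilities.
deep_calls = []


def stub_light(features):
    return features["p_light"]


def stub_deep(features):
    deep_calls.append(features)
    return features["p_deep"]


skipped = detect({"p_light": 0.2, "p_deep": 0.9}, stub_light, stub_deep)
confirmed = detect({"p_light": 0.8, "p_deep": 0.9}, stub_light, stub_deep)
overturned = detect({"p_light": 0.8, "p_deep": 0.1}, stub_light, stub_deep)
```

Only the second and third clips invoke the deep model, which is where the computational saving comes from.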

Embodiment 3

[0055] Generated audio comes in diverse types, typically including playback, neural synthesis, splicing and so on. To classify and identify massive data by type, this embodiment uses a hierarchical, multi-class generated audio detection system.

[0056] The first-stage lightweight coarse-level detection model is constructed from MobileNetV2, with a model structure of 11 layers: 3 2D convolutional layers, 7 bottleneck residual blocks and 1 average pooling layer. The model has about 5M parameters. The first-stage lightweight coarse-level detection model takes as input the LFCC feature spliced with its first-order and second-order differences (60 dimensions in total). Each input clip has a fixed length of 20 seconds (zero-padded if shorter than 20 seconds and truncated if longer). The model input contains only one channel, while the output contains two nodes, indicating real and false respectively.

[0057] The second-stage fine-level deep identification model is constructed from a Transformer model. From the bottom layer to the top layer, the fine-level deep identification model includes two two-dimensional convolutional layers, one linear mapping layer, one position coding module, twelve Transformer coding layers and a final output mapping layer. The model has about 20M parameters in total. Each convolutional layer has a stride of 2, so the two convolutional layers together downsample the sequence by a factor of 4. The fine-level deep identification model takes as input the LFCC feature spliced with its first-order and second-order differences (60 dimensions in total). The final output mapping layer has four outputs, indicating real audio, replay, splicing and neural synthesis respectively.

[0058] Identification proceeds in two stages. In the first stage, the lightweight model coarsely screens the massive audio: audio with a generation probability less than 0.5 is skipped directly, and audio with a generation probability greater than 0.5 undergoes secondary identification by the fine-level deep identification model. In the secondary identification, the authenticity and the generation type of the audio are identified simultaneously.
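Reading authenticity and generation type from the four-way second-stage output can be sketched as follows. The class order follows the order in which the text lists the outputs; taking the most probable class as the decision is an assumption, since the embodiment does not state the decision rule, and the function name is hypothetical.

```python
CLASSES = ("real", "replay", "splicing", "neural synthesis")


def interpret(probabilities):
    """Map the four second-stage probabilities to an authenticity flag
    and a generation type, using the most probable class (assumed rule)."""
    if len(probabilities) != len(CLASSES):
        raise ValueError("expected one probability per class")
    best = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    label = CLASSES[best]
    return {"is_generated": label != "real", "type": label}


# Hypothetical softmax outputs for two clips flagged by the first stage.
spliced_clip = interpret([0.1, 0.2, 0.6, 0.1])
genuine_clip = interpret([0.7, 0.1, 0.1, 0.1])
```

A single forward pass thus answers both questions: whether the clip is generated and, if so, by which kind of method.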

[0059] The terms used in the present invention are intended solely to describe particular embodiments and are not intended to limit the invention. The singular forms “one”, “the” and “this” used in the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term “and/or” used herein refers to and includes any or all possible combinations of one or more associated listed items.

[0060] It should be understood that although the terms first, second, third, etc. may be used to describe information in the present invention, such information should not be limited to those terms. Those terms are only used to distinguish the same type of information from one another. For example, without departing from the scope of the present invention, the first information may also be referred to as the second information, and similarly vice versa. Depending on the context, the word "if" as used herein can be interpreted as "while" or "when" or "in response to certain cases".

[0061] Embodiments of the disclosed subject matter and functional operations described in this specification may be implemented in digital electronic circuits, tangible computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described herein may be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by the data processing device or to control the operation of the data processing device. Alternatively or additionally, program instructions may be encoded on artificially generated propagated signals, such as machine-generated electrical, optical or electromagnetic signals, which are generated to encode and transmit information to a suitable receiver for execution by the data processing device. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

[0062] The processing and logic flow described herein can be executed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating according to input data and generating output. The processing and logic flow can also be executed by an application specific logic circuit, such as FPGA (field programmable gate array) or ASIC (application specific integrated circuit), and the apparatus can also be implemented as an application specific logic circuit.

[0063] Computers suitable for executing computer programs comprise, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, the central processing unit receives instructions and data from read-only memory and/or random access memory. The basic components of a computer comprise a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, the computer further comprises one or more mass storage devices for storing data, such as magnetic disk, magneto-optical disk or optical disk, or the computer is operatively coupled with the mass storage device to receive data from or transmit data to it, or both. However, this device is not a necessity for a computer. Additionally, the computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, just to name a few.

[0064] Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, for example, semiconductor memory devices (such as EPROM, EEPROM and flash memory devices), magnetic disks (such as internal HDD or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and memory may be supplemented by or incorporated into an application specific logic circuit.

[0065] Although this specification contains many particular embodiments, these should not be construed to limit the scope of any invention or the scope of protection claimed, but are intended primarily to describe the characteristics of specific embodiments of a particular invention. Some of the features described in multiple embodiments in this specification may also be implemented in combination in a single embodiment. On the other hand, features described in a single embodiment may also be implemented separately in multiple embodiments or in any suitable subcombination. In addition, although features may function in certain combinations as described above and even initially claimed as such, one or more features from the claimed combination may in some cases be removed from the combination, and the claimed combination can be directed to a sub-combination or a variant of the sub-combination.

[0066] Similarly, although operations are described in a particular order in the drawings, this should not be construed as requiring these operations to be performed in the particular order or sequence as shown, or requiring all illustrated operations to be performed to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. In addition, the separation of various system modules and components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or encapsulated into multiple software products.

[0067] Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions described in the claims can be executed in different orders and still achieve the desired results. In addition, the processes described in the drawings do not have to be in the particular order or sequential order as shown to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.

[0068] The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the scope of protection of the invention.

CPC classification

G10L25/30; G10L25/54; G10L25/24

International classification

G10L25/24; G10L25/30; G10L25/57
