METHOD FOR DETECTING AN AUDIO ADVERSARIAL ATTACK WITH RESPECT TO A VOICE COMMAND PROCESSED BY AN AUTOMATIC SPEECH RECOGNITION SYSTEM, CORRESPONDING DEVICE, COMPUTER PROGRAM PRODUCT AND COMPUTER-READABLE CARRIER MEDIUM

Abstract

A method and device for detecting an audio adversarial attack with respect to a voice command processed by an automatic speech recognition system is described. The method is implemented by a detection device connected to the automatic speech recognition system and includes obtaining an audio signal associated with the voice command, performing a phonetic transcription of the audio signal, according to a phonetic transcription scheme, delivering a first character string; obtaining a transcript resulting from the processing, by the automatic speech recognition system, of the audio signal, performing a phonetic transcription of the transcript, according to the phonetic transcription scheme, delivering a second character string, computing a similarity score between the first character string and the second character string, and delivering a piece of data representative of a detection of an audio adversarial attack, as a function of a result of a comparison between the similarity score and a predetermined threshold.

Claims

1. A method for detecting an audio adversarial attack with respect to a voice command (VC) processed by an automatic speech recognition system (ASR), the method being implemented by a detection device connected to the automatic speech recognition system, wherein the method comprises: obtaining an audio signal (AS) associated with the voice command; performing a phonetic transcription of the audio signal, according to a phonetic transcription scheme, delivering a first character string (CS1); obtaining a transcript resulting from the processing, by the automatic speech recognition system, of the audio signal; performing a phonetic transcription of the transcript, according to the phonetic transcription scheme, delivering a second character string (CS2); computing a similarity score (SS) between the CS1 and the CS2; and delivering a piece of data representative of a detection of an audio adversarial attack, as a function of a result of a comparison between the SS and a predetermined threshold.

2. The method according to claim 1, wherein the method further comprises performing a homogenization process on the CS1 and on the CS2, before computing the similarity score between the CS1 and the CS2.

3. The method according to claim 2, wherein the homogenization process comprises removing, from the CS1 and from the CS2, space characters and/or symbols associated with a silence according to the phonetic transcription scheme.

4. The method according to claim 1, wherein delivering a piece of data representative of a detection of an audio adversarial attack further takes into account a result of a comparison between the CS1 and the CS2 based on at least one additional metric.

5. The method according to claim 4, wherein the comparison based on at least one additional metric belongs to the group comprising: a comparison of the number of syllables; a comparison of the number of silences; a comparison of the number of segments; and a comparison of the number of words.

6. The method according to claim 1, wherein obtaining the audio signal and performing a phonetic transcription of the audio signal, and obtaining the transcript and performing a phonetic transcription of the transcript are processed in parallel by the detection device.

7. The method according to claim 1, wherein the method further comprises transmitting the piece of data representative of a detection of an audio adversarial attack to a communication device in charge of executing an action associated with the voice command.

8. The method according to claim 1, wherein computing the SS between the first character string and the second character string is performed by using an algorithm belonging to the group comprising: a Levenshtein distance calculation algorithm; a Needleman-Wunsch algorithm; a Smith-Waterman algorithm; a Jaro distance calculation algorithm; a Jaro-Winkler distance calculation algorithm; a QGrams distance calculation algorithm; and a Chapman Length Deviation algorithm.

9. The method according to claim 1, wherein the phonetic transcription scheme belongs to the group comprising: an ARPABET phonetic transcription scheme; a SAMPA phonetic transcription scheme; and an X-SAMPA phonetic transcription scheme.

10. A detection device for detecting an audio adversarial attack with respect to a voice command processed by an automatic speech recognition system, the detection device being connected to the automatic speech recognition system, wherein the detection device comprises at least one processor configured to: obtain an audio signal associated with the voice command; perform a phonetic transcription of the audio signal, according to a phonetic transcription scheme, delivering a first character string; obtain a transcript resulting from the processing, by the automatic speech recognition system, of the audio signal; perform a phonetic transcription of the transcript, according to the phonetic transcription scheme, delivering a second character string; compute a similarity score between the first character string and the second character string; and deliver a data representative of a detection of an audio adversarial attack, as a function of a result of a comparison between the similarity score and a predetermined threshold.

11. (canceled)

12. (canceled)

13. (canceled)

14. A non-transitory computer-readable medium comprising a computer program product recorded thereon, the computer program product comprising instructions which, when the program is executed by a processor, cause the processor to carry out the steps of: obtaining an audio signal associated with the voice command; performing a phonetic transcription of the audio signal, according to a phonetic transcription scheme, delivering a first character string; obtaining a transcript resulting from the processing by the automatic speech recognition system, of the audio signal; performing a phonetic transcription of the transcript, according to the phonetic transcription scheme, delivering a second character string; computing a similarity score between the first character string and the second character string; and delivering a piece of data representative of a detection of an audio adversarial attack, as a function of a result of a comparison between the similarity score and a predetermined threshold.

15. The detection device of claim 10, wherein the at least one processor is further configured to perform a homogenization process on the first character string and on the second character string, before computing the similarity score between the first character string and the second character string.

16. The detection device of claim 15, wherein the homogenization process comprises removing, from the first character string and from the second character string, space characters and/or symbols associated with a silence according to the phonetic transcription scheme.

17. The detection device of claim 10, wherein delivering a piece of data representative of a detection of an audio adversarial attack further takes into account a result of a comparison between the first character string and the second character string based on at least one additional metric.

18. The detection device of claim 17, wherein the comparison based on at least one additional metric belongs to the group comprising: a comparison of the number of syllables; a comparison of the number of silences; a comparison of the number of segments; and a comparison of the number of words.

19. The detection device of claim 10, wherein obtaining the audio signal and performing a phonetic transcription of the audio signal, and obtaining the transcript and performing a phonetic transcription of the transcript are processed in parallel by the detection device.

20. The detection device of claim 10, wherein the at least one processor is further configured to transmit the piece of data representative of a detection of an audio adversarial attack to a communication device in charge of executing an action associated with the voice command.

21. The detection device of claim 10, wherein computing the similarity score between the first character string and the second character string is performed by using an algorithm belonging to the group comprising: a Levenshtein distance calculation algorithm; a Needleman-Wunsch algorithm; a Smith-Waterman algorithm; a Jaro distance calculation algorithm; a Jaro-Winkler distance calculation algorithm; a QGrams distance calculation algorithm; and a Chapman Length Deviation algorithm.

22. The detection device of claim 10, wherein the phonetic transcription scheme belongs to the group comprising: an ARPABET phonetic transcription scheme; a SAMPA phonetic transcription scheme; and an X-SAMPA phonetic transcription scheme.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0030] Embodiments of the present disclosure can be better understood with reference to the following description and drawings, given by way of example and not limiting the scope of protection, and in which:

[0031] FIG. 1 is a flow chart for illustrating the general principle of the proposed technique for detecting an audio adversarial attack with respect to a voice command processed by an automatic speech recognition system, according to an embodiment of the present disclosure;

[0032] FIGS. 2a and 2b show an example of how the proposed technique makes it possible to differentiate between a situation where a voice command is not targeted by an audio adversarial attack (FIG. 2a) and a situation where the same voice command is targeted by an audio adversarial attack (FIG. 2b), according to an embodiment of the present disclosure;

[0033] FIG. 3 is a schematic block diagram illustrating an example of a detection device for detecting an audio adversarial attack with respect to a voice command processed by an automatic speech recognition system, according to an embodiment of the present disclosure; and

[0034] FIGS. 4a, 4b and 4c show different configurations for the location of a detection device, according to various embodiments of the present disclosure.

[0035] The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the disclosure.

DETAILED DESCRIPTION

[0036] The present disclosure relates to a method for detecting an audio adversarial attack with respect to a voice command processed by an automatic speech recognition system. As will be described more fully hereafter with reference to the accompanying figures, the proposed technique is easy to implement and machine-learning-system-agnostic (i.e. independent of the machine learning architecture on which the automatic speech recognition system is based), and it makes it possible to determine effectively and at a low computational cost whether or not a voice command has been hacked and turned into an adversarial example. The detection may be achieved within a very short period of time, thus making it possible to prevent a malicious command associated with an adversarial example from being executed. This objective is reached, according to the general principle of the disclosure, by comparing character strings resulting from the phonetic transcriptions of a voice command, before and after it has been processed by a machine-learning-based automatic speech recognition system.

[0037] This disclosure may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein. Accordingly, while the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the disclosure to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure as defined by the claims. In the drawings, like or similar elements are designated with identical reference signs throughout the several views thereof.

[0038] While not explicitly described, the present embodiments and variants may be employed in any combination or sub-combination.

[0039] FIG. 1 is a flow chart for describing a method for detecting an audio adversarial attack with respect to a voice command VC processed by a machine-learning-based automatic speech recognition system ASR (such as for example a neural-network-based automatic speech recognition system), according to an embodiment of the present disclosure. The method is implemented by a detection device connected to the automatic speech recognition system ASR, either directly or through a communication device such as a communication device intended to execute the voice command for example. The detection device, which is further detailed in one embodiment later in this document, includes at least one processor adapted and configured for carrying out the steps described hereafter.

[0040] At step 11, the detection device obtains an audio signal AS associated with the voice command VC. The audio signal AS corresponds to the signal provided as an input of the automatic speech recognition system ASR for the processing of the voice command VC. The audio signal AS may for example be obtained from a microphone connected to or embedded in the detection device itself, or it may be received from a communication device intended to process the voice command VC along with the automatic speech recognition system ASR. By “audio signal associated with the voice command”, it is meant here that the generation of the audio signal AS is linked to the voice command VC. In the typical case where the voice command VC is not subjected to an audio adversarial attack, the audio signal AS corresponds to a recording of the voice command VC (possibly along with benign background noise). However, in case of an audio adversarial attack, the audio signal AS corresponds to a mix between the voice command VC and a more or less imperceptible malicious noise specifically designed by an attacker to mislead the machine-learning-based automatic speech recognition system. At this stage, such an attack has not been detected yet.

[0041] At step 12, the detection device performs a phonetic transcription of the audio signal AS, according to a phonetic transcription scheme. More particularly, according to an embodiment, the audio signal AS is sampled into audio samples that are then automatically converted to phonemes by using a phoneme dictionary associated with the considered phonetic transcription scheme. For example, ARPABET, SAMPA, or X-SAMPA may be used as phonetic transcription schemes suitable for processing the audio signal AS. During the processing carried out at step 12, no semantic or syntactic constraints are taken into consideration. According to an embodiment, this processing relies only on basic signal processing operations, and doesn't involve the use of a machine-learning-based system. Step 12 delivers a character string, referred to as a first character string CS1.

[0042] At step 13, the detection device obtains a transcript T resulting from the processing, by the automatic speech recognition system, of the audio signal AS. Depending on where the detection device is located, this transcript T may be obtained directly from the automatic speech recognition system, or it may be received through a communication device. In the case where the voice command VC is not the target of an audio adversarial attack, the output of the automatic speech recognition system is normally representative of a word for word transcript (or at least of a rather close word for word transcript) of the voice command VC as originally spoken by the user of the automatic speech recognition system. However, in case of an audio adversarial attack, the machine-learning-based system ruling the automatic speech recognition system is misled and outputs a transcript T that is not representative of the voice command VC. Depending on the attack, the transcript T may even be representative of a totally different command than the original one.

[0043] At step 14, a phonetic transcription of the transcript T delivered by the automatic speech recognition system is performed by the detection device, using the same phonetic transcription scheme as the one used at step 12. Phonetic transcriptions performed at steps 12 and 14 differ in that the one carried out at step 12 takes an audio signal (the audio signal AS) as an input whereas the one carried out at step 14 takes a text (the transcript T) as an input. However, as indicated above, both rely on the same phonetic transcription scheme. Step 14 delivers a character string, referred to as a second character string CS2.
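By way of illustration only, the text-side transcription of step 14 may be sketched as a dictionary lookup. The miniature dictionary `TOY_ARPABET` and the function name `transcript_to_phonemes` below are hypothetical and not part of the disclosure; an actual embodiment would rely on a complete phoneme dictionary associated with the chosen scheme.

```python
# Hypothetical miniature pronunciation dictionary (ARPABET symbols, stress
# marks omitted). It only covers the words used in this illustration.
TOY_ARPABET = {
    "call": ["K", "AO", "L"],
    "the": ["DH", "AH"],
    "school": ["S", "K", "UW", "L"],
    "hello": ["HH", "AH", "L", "OW"],
}


def transcript_to_phonemes(transcript, dictionary=TOY_ARPABET):
    """Map each word of a text transcript T to phoneme symbols (step 14)."""
    phonemes = []
    for word in transcript.lower().split():
        if word not in dictionary:
            # A real system would need a fallback for unknown words.
            raise KeyError(f"no pronunciation known for {word!r}")
        phonemes.extend(dictionary[word])
    # Symbols are separated by spaces; this is the second character string CS2.
    return " ".join(phonemes)


cs2 = transcript_to_phonemes("call the school")
print(cs2)  # K AO L DH AH S K UW L
```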

[0044] The group of steps 11 and 12 on the one hand and the group of steps 13 and 14 on the other hand may be processed one after the other, in either order. However, according to an embodiment, considering the time needed by the automatic speech recognition system to process the audio signal, the group of steps 13 and 14 may be processed after the group of steps 11 and 12. According to a preferred embodiment, these two groups of steps (or at least some of their steps) are processed in parallel in order to save computing time.
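The parallel processing of the two groups of steps may, for example, be sketched with a thread pool. The two branch functions below are hypothetical placeholders returning fixed strings; they stand in for the actual audio-side and transcript-side processing, which are not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def phonetize_audio(audio_signal):
    # Placeholder (assumption) for steps 11-12: audio signal -> CS1.
    return "K AO L SIL DH AH S K UW L"


def phonetize_asr_transcript(audio_signal):
    # Placeholder (assumption) for steps 13-14: ASR transcript -> CS2.
    return "K AO L DH AH S K UW L"


def run_both_branches(audio_signal):
    """Run both groups of steps concurrently and return (CS1, CS2)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_cs1 = pool.submit(phonetize_audio, audio_signal)
        future_cs2 = pool.submit(phonetize_asr_transcript, audio_signal)
        # result() blocks until the corresponding branch has finished.
        return future_cs1.result(), future_cs2.result()


cs1, cs2 = run_both_branches(b"raw-audio-bytes")
```

Threads are used here on the assumption that the transcript branch mostly waits on the automatic speech recognition system; a process pool may be preferable if both branches are compute-bound.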

[0045] At step 15, a similarity score SS between character strings CS1 and CS2 is computed. Various string-matching algorithms may be used to compute the similarity score SS, such as, for example, a Levenshtein distance calculation algorithm, a Needleman-Wunsch algorithm, a Smith-Waterman algorithm, a Jaro distance calculation algorithm, a Jaro-Winkler distance calculation algorithm, a QGrams distance calculation algorithm, a Chapman Length Deviation algorithm, etc.
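As an illustration of one of the listed algorithms, the Levenshtein distance may be computed with the classic dynamic-programming recurrence. The sketch below is a generic textbook implementation, not code taken from the disclosure.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```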

[0046] According to an embodiment, a homogenization process is carried out on both character strings CS1 and CS2, before computing the similarity score SS. More particularly, the homogenization process may consist of removing, from the character strings CS1 and CS2, particular characters (including, for example, characters representative of specific annotations that are not part of the phoneme dictionary associated with the phonetic transcription scheme) and/or sequences of characters having a special meaning according to the phonetic transcription scheme. For example, according to an embodiment, the homogenization process comprises removing, from the first character string CS1 and from the second character string CS2, space characters and/or symbols associated with a silence according to the phonetic transcription scheme. Such a homogenization process may prove useful in alleviating the differences in form that may result from the fact that character strings CS1 and CS2 are delivered by different phonetic transcription processes that, though relying on the same phonetic transcription scheme, may not behave exactly the same. Furthermore, it makes it possible to take into account the fact that silences and speech interruptions that may be present in the original voice command may be ignored and/or lost during the processing performed by the machine-learning-based system of the automatic speech recognition system.
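A minimal sketch of such a homogenization is given below, assuming that phoneme symbols are separated by spaces and that silences are written "SIL" as in ARPABET. The function name `homogenize` and its parameters are illustrative assumptions.

```python
def homogenize(phonetic_string, silence_symbol="SIL"):
    """Remove silence symbols and space characters from a phonetic string."""
    # Work on whole symbols rather than on raw substrings, so that a phoneme
    # sequence such as "S IH L" is never mistaken for a silence marker.
    symbols = phonetic_string.split()
    kept = [symbol for symbol in symbols if symbol != silence_symbol]
    # Joining without a separator removes the space characters.
    return "".join(kept)


print(homogenize("SIL K AO L SIL DH AH"))  # KAOLDHAH
```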

[0047] At step 16, the computed similarity score SS is compared to a predetermined threshold, and a piece of data representative of whether or not an audio adversarial attack is detected is delivered as a function of the result of this comparison. Indeed, the similarity score makes it possible to quantify or at least estimate how much the voice command has been altered when processed by the automatic speech recognition system ASR. When no audio adversarial attack is going on, the transcript outputted from the automatic speech recognition system ASR is normally a rather close word for word transcript of the original voice command VC, and character strings CS1 and CS2 are thus quite similar. By contrast, in the presence of an audio adversarial attack, the command corresponding to the transcript outputted by the automatic speech recognition system is highly likely to be quite different from the original command, which results in character strings CS1 and CS2 being quite different too. According to an embodiment, the similarity score is a mathematical distance (such as the Levenshtein distance for example), and an audio adversarial attack with respect to the voice command is assumed to be going on if the computed distance is above the predetermined threshold. The piece of data representative of a detection of an audio adversarial attack may take the form of a boolean representing an attack status, which is set to true if an attack is detected and false otherwise.
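For the embodiment in which the similarity score is a distance, the decision of step 16 may be sketched as a single comparison. The threshold value of 20 used below is an arbitrary assumption chosen for illustration; the disclosure does not specify a value. The distances 12 and 29 are those given for FIGS. 2a and 2b.

```python
def attack_status(distance, threshold):
    """Return True if an attack is assumed, i.e. the distance between the
    two character strings is above the predetermined threshold."""
    return distance > threshold


ASSUMED_THRESHOLD = 20  # illustrative value only, not taken from the disclosure

print(attack_status(12, ASSUMED_THRESHOLD))  # False
print(attack_status(29, ASSUMED_THRESHOLD))  # True
```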

[0048] According to an embodiment, in order to enhance adversarial attack detection, step 16 for delivering a piece of data representative of a detection of an audio adversarial attack may take into account at least one additional metric, in addition to the similarity score previously described. For example, the result of a comparison of a number of syllables, a number of silences, a number of segments (i.e. portions of speech between silences) and/or a number of words may also be taken into account. Comparisons based on these additional metrics may be performed between character strings CS1 and CS2 themselves, possibly before homogenization (e.g. for a comparison based on the number of syllables, segments and/or silences), or at a higher level, between the audio signal AS inputted in the automatic speech recognition system and the transcript T outputted from the automatic speech recognition system for example (e.g. for a comparison based on the number of syllables and/or words). According to an embodiment, such comparisons based on at least one additional metric are performed after the above-described comparison between the similarity score and a predetermined threshold, and only if said comparison based on the similarity score has not resulted in the detection of an adversarial attack. In such a case, a piece of data representative of the presence of an audio adversarial attack (attack status set to true) can still be delivered, if the comparisons based on the additional metrics highlight a different number of syllables, silences, segments and/or words between the compared items.
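The two-stage decision described above may be sketched as follows, using the number of silences and the number of segments as additional metrics. The sketch assumes non-homogenized strings whose symbols are separated by spaces, with "SIL" marking a silence; all function names are hypothetical.

```python
def count_silences(phonetic_string, silence_symbol="SIL"):
    """Count silence symbols in a space-separated phonetic string."""
    return phonetic_string.split().count(silence_symbol)


def count_segments(phonetic_string, silence_symbol="SIL"):
    """Count portions of speech delimited by silences."""
    segments = 0
    in_segment = False
    for symbol in phonetic_string.split():
        if symbol == silence_symbol:
            in_segment = False
        elif not in_segment:
            # First phoneme after a silence (or at the start) opens a segment.
            segments += 1
            in_segment = True
    return segments


def detect_attack(distance, threshold, cs1, cs2):
    """First compare the distance to the threshold; only if no attack is
    detected that way, fall back to the additional metrics."""
    if distance > threshold:
        return True
    return (count_silences(cs1) != count_silences(cs2)
            or count_segments(cs1) != count_segments(cs2))
```

Which metrics are combined, and whether a single mismatch is enough to set the attack status to true, are design choices left open here.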

[0049] According to an embodiment, the method further comprises transmitting the piece of data representative of a detection of an audio adversarial attack to a communication device initially intended to execute the action associated with the original voice command VC. In that way, the communication device may be warned when an attack is detected, and therefore be in position to block the execution of the malicious command which has replaced the original command as an effect of the adversarial attack.

[0050] FIGS. 2a and 2b illustrate more precisely an example of how the technique described in relation with FIG. 1 makes it possible to detect an audio adversarial attack. More particularly, FIG. 2a describes a situation in which no audio adversarial attack is going on, whereas FIG. 2b describes a situation in which an audio adversarial attack is going on, with respect to the same voice command VC. In both examples, the voice command VC as spoken by a user is the following: “the more she is engaged in her proper duties”. It should be understood that this sentence is only used as an illustrative and non-limitative example to describe the general principle of the proposed technique, which of course remains the same with another sentence that may be considered as more representative of a command, such as for example “call the school”, or “set a timer to five minutes”. In the examples of FIGS. 2a and 2b, ARPABET is used as the phonetic transcription scheme to generate character strings CS1 and CS2, and the Levenshtein distance is used as a similarity score to compare character strings CS1 and CS2 (the greater the distance, the less similar character strings CS1 and CS2 are).

[0051] In the situation depicted on FIG. 2a, since no audio adversarial attack is going on, no malicious noise has been added to the voice command VC. The automatic speech recognition system ASR thus processes an audio signal AS which corresponds to the voice command VC, and delivers as a result a word for word transcript T of the voice command VC. The ARPABET phonetic transcriptions of the audio signal AS on the one hand and of the transcript T on the other hand respectively deliver character strings CS1 and CS2, which go through a homogenization process where spaces and symbol “SIL” (the ARPABET abbreviation for a silence) are deleted. The Levenshtein distance D between homogenized character strings CS1 and CS2 is then computed, giving a result of D=12 in the example of FIG. 2a.

[0052] In the situation depicted on FIG. 2b, an audio adversarial attack is going on, and a malicious noise PT is added by an attacker to the original voice command VC, without the user noticing it. As a result, the audio signal AS doesn't correspond to the voice command VC, but instead to a mix between voice command VC and malicious noise PT. However, malicious noise PT may have been designed so that the audio signal AS sounds the same as the original voice command VC to a human ear. Because of the presence of this malicious noise PT, the automatic speech recognition system ASR is misled and outputs a transcript T corresponding to the command “hello”, which no longer has anything to do with the original voice command VC as spoken by the user (here again, the outputted command “hello” is only used as an illustrative and non-limitative example that may sound quite harmless, but it should be understood that the malicious noise may have been specifically constructed so that the fooled machine-learning-based system, e.g. a deep neural network, outputs another command that may cause serious security problems, such as “open the front door” for example). Similarly to operations already described in relation with FIG. 2a, ARPABET phonetic transcriptions of the audio signal AS on the one hand and of the transcript T on the other hand are performed, respectively delivering character strings CS1 and CS2, which then go through a homogenization process, before the Levenshtein distance D between homogenized character strings CS1 and CS2 is finally computed, giving a result of D=29 in the example of FIG. 2b. As one can note, the distance computed in the example of FIG. 2b is significantly higher than the distance computed in the example of FIG. 2a, thus demonstrating how such a distance can be used as a detection criterion for detecting an audio adversarial attack.

[0053] The examples of FIGS. 2a and 2b thus illustrate how the proposed technique makes it possible to detect an audio adversarial attack in a simple and efficient manner, which is furthermore not dependent on the machine learning architecture used by the automatic speech recognition system, simply by computing a similarity score between well-targeted character strings, and comparing this similarity score to a predetermined threshold.

[0054] FIG. 3 shows a schematic block diagram illustrating an example of a detection device DD for detecting an audio adversarial attack with respect to a voice command processed by an automatic speech recognition system, according to an embodiment of the present disclosure. As illustrated in relation with FIGS. 4a, 4b and 4c, the detection device DD may be deployed locally or located in a cloud infrastructure. In some embodiments, the detection device DD is connected to (as a standalone device, as depicted for example on FIG. 4a) or embedded into (as a component, as depicted for example on FIG. 4b) a communication device CD configured for processing voice commands together with a machine-learning-based automatic speech recognition system ASR. The communication device CD may be for example a smartphone, a tablet, a computer, a speaker, a set-top box, a television set, a home gateway, etc., embedding voice recognition features. The automatic speech recognition system ASR may be implemented as a component of the communication device CD itself (as depicted on FIG. 4b), or, alternatively, be located in the cloud and accessible over a communication network, as a mutualised resource shared between a plurality of communication devices (as depicted on FIG. 4a or 4c, for example). In another embodiment, depicted on FIG. 4c, the detection device DD is implemented on a cloud infrastructure service, alongside a remote automatic speech recognition service for example. Whatever the embodiment considered, the detection device DD is connected to an automatic speech recognition system, either directly or indirectly through a communication device.

[0055] Referring back to FIG. 3, the detection device DD includes a processor 301, a storage unit 302, an input device 303, an output device 304, and an interface unit 305 which are connected by a bus 306. Of course, constituent elements of the device DD may be connected by a connection other than a bus connection using the bus 306.

[0056] The processor 301 controls operations of the detection device DD. The storage unit 302 stores at least one program to be executed by the processor 301, and various data, including for example parameters used by computations performed by the processor 301, intermediate data of computations performed by the processor 301 such as the first and second character strings obtained as an output of the phonetic transcriptions steps, and so on. The processor 301 is formed by any known and suitable hardware, or software, or a combination of hardware and software. For example, the processor 301 is formed by dedicated hardware such as a processing circuit, or by a programmable processing unit such as a CPU (Central Processing Unit) that executes a program stored in a memory thereof.

[0057] The storage unit 302 is formed by any suitable storage or means capable of storing the program, data, or the like in a computer-readable manner. Examples of the storage unit 302 include non-transitory computer-readable storage media such as semiconductor memory devices, and magnetic, optical, or magneto-optical recording media loaded into a read and write unit. The program causes the processor 301 to perform a method for detecting an audio adversarial attack with respect to a voice command processed by an automatic speech recognition system according to an embodiment of the present disclosure as described previously. More particularly, the program causes the processor 301 to perform phonetic transcriptions of the audio signal provided as an input of the automatic speech recognition system on the one hand and of the transcript delivered as an output of the automatic speech recognition system on the other hand, and to compute a similarity score between the two character strings resulting from these phonetic transcriptions.

[0058] The input device 303 is formed for example by a microphone.

[0059] The output device 304 is formed for example by a processing unit configured to decide whether or not an audio adversarial attack is considered as detected, as a function of the result of the comparison between the computed similarity score and a predetermined threshold.

[0060] The interface unit 305 provides an interface between the detection device DD and an external apparatus and/or system. The interface unit 305 is typically a communication interface allowing the detection device to communicate with an automatic speech recognition system and/or with a communication device, as already presented in relation with FIGS. 4a, 4b and 4c. The interface unit 305 may be used to obtain the audio signal provided as an input of the automatic speech recognition system and the transcript delivered as an output of the automatic speech recognition system. The interface unit 305 may also be used to transmit an attack status to the automatic speech recognition system and/or to a communication device expected to execute a voice command.

[0061] Although only one processor 301 is shown on FIG. 3, it must be understood that such a processor may include different modules and units embodying the functions carried out by device DD according to embodiments of the present disclosure. These modules and units may also be embodied in several processors 301 communicating and co-operating with each other.

[0062] While the disclosure has been described with reference to exemplary embodiments, it is to be understood that the disclosure can be embodied in various forms, and is not to be limited to the examples discussed above. More particularly, the proposed technique may be applied to voice data that are not necessarily voice commands as such, in the field of speech-to-text systems for example.

CPC classification

G10L15/22; G10L15/34; G10L15/08; G10L2015/223 (PHYSICS)

International classification

G10L15/08; G10L15/22; G10L15/34 (PHYSICS)