METHOD AND APPARATUS FOR RECOGNIZING ACOUSTIC ANOMALIES

20220358952 · 2022-11-10

    Abstract

    A method for detecting anomalies has the following steps:

    Obtaining a long-term recording having a plurality of first audio segments associated to respective first time windows; analyzing the plurality of the first audio segments to obtain, for each of the plurality of the first audio segments, a first characteristic vector describing the respective first audio segment; obtaining a further recording having one or more second audio segments associated to respective second time windows; analyzing the one or more second audio segments to obtain one or more second characteristic vectors describing the one or more second audio segments; matching the one or more second characteristic vectors with the plurality of the first characteristic vectors to recognize at least one anomaly, like a temporal, sound or spatial anomaly.

    Claims

    1. A method for recognizing acoustic anomalies, comprising: obtaining a long-term recording having a plurality of first audio segments associated to respective first time windows; analyzing the plurality of the first audio segments to obtain, for each of the plurality of the first audio segments, a first characteristic vector describing the respective first audio segment; obtaining a further recording having one or more second audio segments associated to respective second time windows; analyzing the one or more second audio segments to obtain one or more characteristic vectors describing the one or more second audio segments; and matching the one or more second characteristic vectors with the plurality of the first characteristic vectors to recognize at least one anomaly when compared to an acoustic normal situation for this environment.

    2. The method in accordance with claim 1, wherein the anomaly comprises a sound, temporal and/or spatial anomaly; and/or wherein the anomaly comprises a sound anomaly in combination with a temporal anomaly or a sound anomaly in combination with a spatial anomaly or a temporal anomaly in combination with a spatial anomaly.

    3. The method in accordance with claim 1, the method, when analyzing, comprising the sub-step of identifying a repetition pattern in the plurality of the first time windows.

    4. The method in accordance with claim 3, wherein identifying is performed using repeating, identical or similar first characteristic vectors belonging to different first audio segments.

    5. The method in accordance with claim 3, wherein, when identifying, grouping of identical or similar first characteristic vectors to form one or more groups is performed.

    6. The method in accordance with claim 1, the method comprising recognizing an order of first characteristic vectors belonging to different first audio segments or recognizing an order of groups of identical or similar first characteristic vectors.

    7. The method in accordance with claim 3, the method comprising identifying a repetition pattern in the one or more second time windows; and/or the method comprising recognizing an order of second characteristic vectors belonging to different second audio segments or recognizing an order of groups of identical or similar second characteristic vectors.

    8. The method in accordance with claim 7, the method comprising the sub-step of matching the repetition pattern of the first audio segments and/or order in the first audio segments with the repetition pattern of the second audio segments and/or order in the second audio segments in order to recognize a temporal anomaly.

    9. The method in accordance with claim 1, wherein matching comprises the sub-step of identifying a second characteristic vector, which differs from the first characteristic vectors analyzed, in order to recognize a sound anomaly.

    10. The method in accordance with claim 1, wherein the characteristic vector comprises one dimension, more dimensions or a reduced dimension space; and/or wherein the method comprises the step of reducing the dimensions of the characteristic vector.

    11. The method in accordance with claim 1, the method comprising the step of determining a respective position for the respective first audio segments.

    12. The method in accordance with claim 11, the method comprising the step of determining a respective position for the respective second audio segments, and the method comprising the sub-step of matching the position associated to the respective first audio segment with the position associated to the corresponding respective second audio segment in order to recognize a spatial anomaly.

    13. The method in accordance with claim 1, the method comprising the step of determining a probability of occurrence of the respective first audio segment and outputting the probability of occurrence with the respective first characteristic vector, or the method comprising the step of determining a probability of occurrence of the respective first audio segment and outputting the probability of occurrence with the respective first characteristic vector and a first time window.

    14. The method in accordance with claim 1, wherein the plurality of the first audio segments and/or the plurality of the first audio segments in their order describe an acoustic normal state in the application scenario and/or represent a reference; and/or wherein the one anomaly is recognized when one or more second characteristic vectors deviate from the plurality of the first characteristic vectors.

    15. The method in accordance with claim 1, wherein the long-term recording comprises at least a duration of 10 minutes or at least 1 hour or at least 24 hours; and/or wherein the further recording comprises a time window or, in particular, a time window of less than 5 minutes, less than 1 minute, or less than 10 seconds.

    16. A non-transitory digital storage medium having stored thereon a computer program for performing a method for recognizing acoustic anomalies, comprising: obtaining a long-term recording having a plurality of first audio segments associated to respective first time windows; analyzing the plurality of the first audio segments to obtain, for each of the plurality of the first audio segments, a first characteristic vector describing the respective first audio segment; obtaining a further recording having one or more second audio segments associated to respective second time windows; analyzing the one or more second audio segments to obtain one or more characteristic vectors describing the one or more second audio segments; and matching the one or more second characteristic vectors with the plurality of the first characteristic vectors to recognize at least one anomaly when compared to an acoustic normal situation for this environment, when said computer program is run by a computer.

    17. An apparatus for recognizing acoustic anomalies, comprising: an interface for obtaining a long-term recording having a plurality of first audio segments associated to respective first time windows, and for obtaining a further recording having one or more second audio segments associated to respective second time windows; and a processor configured for analyzing the plurality of the first audio segments to obtain, for each of the plurality of the first audio segments, a first characteristic vector describing the respective first audio segment, and configured for analyzing the one or more second audio segments to obtain one or more characteristic vectors describing the one or more second audio segments, and configured for matching the one or more second characteristic vectors with the plurality of the first characteristic vectors to recognize at least one anomaly when compared to an acoustic normal situation for this environment.

    18. The apparatus in accordance with claim 17, the apparatus comprising a microphone or a microphone array connected to the interface.

    19. The apparatus in accordance with claim 17, the apparatus comprising an output interface for outputting a probability of occurrence of the respective first audio segment having the respective first characteristic vector or for outputting a probability of occurrence of the respective first audio segment having the respective first characteristic vector and a first time window.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0029] Embodiments of the present invention will be discussed below referring to the appended drawings, in which:

    [0030] FIG. 1 is a schematic flow chart for illustrating the method in accordance with a basic embodiment;

    [0031] FIG. 2 shows a schematic table for illustrating different types of anomalies; and

    [0032] FIG. 3 is a schematic block circuit diagram for illustrating an apparatus in accordance with another embodiment.

    DETAILED DESCRIPTION OF THE INVENTION

    [0033] Before discussing the following embodiments of the present invention making reference to the appended drawings, it is pointed out that elements and structures of equal effect are provided with equal reference numbers so that the description thereof is mutually applicable or interchangeable.

    [0034] FIG. 1 shows a method 100 subdivided into two phases 110 and 120.

    [0035] The first phase 110, which is referred to as the adjusting phase, comprises two basic steps, indicated by the reference numerals 112 and 114. Step 112 comprises a long-term recording of the acoustic normal state in the application scenario. The analysis apparatus 10 (cf. FIG. 3) is exemplarily set up in the target environment so that a long-term recording 113 of the normal state is captured. This long-term recording may exemplarily have a duration of 10 minutes, 1 hour, or 1 day (generally greater than 1 minute, greater than 30 minutes, greater than 5 hours or greater than 24 hours, and/or up to 10 hours, up to 1 day, up to 3 days or up to 10 days, including the time windows defined by these upper and lower limits).

    [0036] This long-term recording 113 is then subdivided, for example into time regions of equal duration, like 1 second or 0.1 second, or into dynamic time regions. Every time region comprises an audio segment. In step 114, which is generally referred to as analyzing, these audio segments are examined separately or in combination. When analyzing, a so-called characteristic vector 115 (first characteristic vector) is determined for each audio segment. Expressed generally, this means that a conversion from a digital recording 113 to one or more characteristic vectors 115 takes place, for example by means of deep neural networks, wherein each characteristic vector 115 “encodes” the sound at a certain point in time. Characteristic vectors 115 can, for example, be determined by an energy spectrum for a certain frequency range or, generally, a time-frequency spectrum.
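The subdivision into time regions of equal duration and the derivation of an energy-spectrum characteristic vector can be sketched as follows. This is only a minimal illustration under stated assumptions, not the embodiment's implementation: the function names `split_into_segments` and `band_energy_vector`, the number of bands, and the use of a naive discrete Fourier transform are all hypothetical choices made for this example.

```python
import cmath
import math


def split_into_segments(samples, sample_rate, window_seconds):
    """Cut a recording into consecutive audio segments of equal duration."""
    size = int(round(sample_rate * window_seconds))
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    # A trailing partial window is dropped so that all segments have equal length.
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def band_energy_vector(segment, n_bands=4):
    """Characteristic vector: spectral energy summed over n_bands frequency bands.

    Assumption for illustration: equally wide bands between 0 Hz and half the
    sample rate; bins left over after the last full band are ignored.
    """
    n = len(segment)
    half = n // 2
    power = []
    # Naive DFT, O(n^2); adequate for a short illustrative segment.
    for k in range(half):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(segment))
        power.append(abs(coeff) ** 2 / n)
    band = max(1, half // n_bands)
    return [sum(power[b * band:(b + 1) * band]) for b in range(n_bands)]
```

For a synthetic recording consisting of one second of a low tone followed by one second of a higher tone, the two resulting vectors have their largest entry in different bands, which is the property the later matching step relies on.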

    [0037] It is to be pointed out here that, optionally, it is possible to reduce the dimensionality of the characteristic space of the characteristic vectors 115 by means of statistical methods (like principal component analysis). In step 114, optionally, typical or dominant noises can be identified by means of unsupervised learning methods (like clustering). Here, time sections or audio segments comprising similar characteristic vectors 115 and correspondingly comprising a similar sound are grouped together. No semantic classification of a noise (like “car” or “airplane”) is necessary here. This means that so-called unsupervised learning using frequencies of repeating or similar audio segments takes place. In accordance with another embodiment, it would also be conceivable for unsupervised learning of the temporal order and/or typical repetition patterns of certain noises to take place in step 114.
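Grouping audio segments with similar characteristic vectors, without any semantic labels, can be illustrated with a very simple clustering scheme. The sketch below uses so-called leader clustering, which is chosen here only because it is short; the text does not name a specific clustering algorithm, and the function name `leader_cluster` and the `radius` parameter are assumptions of this example.

```python
import math


def leader_cluster(vectors, radius):
    """Group vectors without labels (leader clustering, an illustrative choice).

    Each vector joins the first existing cluster whose centre lies within
    `radius`; otherwise it opens a new cluster and becomes its centre.
    Returns (labels, centres), with one integer label per input vector.
    """
    centres = []
    labels = []
    for v in vectors:
        for idx, centre in enumerate(centres):
            if math.dist(v, centre) <= radius:
                labels.append(idx)
                break
        else:
            # No existing centre is close enough: start a new group.
            centres.append(list(v))
            labels.append(len(centres) - 1)
    return labels, centres
```

The integer labels play the role of the groups of identical or similar characteristic vectors; they carry no meaning such as “car” or “airplane”.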

    [0038] The result of clustering is a composition of audio segments or noises which are normal or typical of this region. Exemplarily, a probability of occurrence may be associated to each audio segment. Additionally, a repetition pattern or order, i.e. a combination of several audio segments, which is typical or normal for the current environment can be identified. A probability can be associated here to each grouping, each repetition pattern or each series of different audio segments.
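One way to associate probabilities with groups and with orders of audio segments is to count relative frequencies over the sequence of group labels from the adjusting phase. The following sketch assumes that each audio segment has already been assigned a label; the function names and the restriction to pairs of consecutive segments are assumptions of this example rather than part of the described method.

```python
from collections import Counter


def occurrence_probabilities(labels):
    """Relative frequency of each group label in the reference sequence."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def transition_probabilities(labels):
    """Estimate P(next label | current label) from consecutive label pairs."""
    pair_counts = Counter(zip(labels, labels[1:]))
    from_counts = Counter(labels[:-1])
    return {(a, b): n / from_counts[a] for (a, b), n in pair_counts.items()}
```

Pairs that never occur in the reference sequence are simply absent from the returned dictionary, so a later lookup can treat them as having probability zero.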

    [0039] At the end of the adjusting phase, audio segments or grouped audio segments are known and described as characteristic vectors 115 typical of this environment. In a next step or next phase 120, this learned knowledge is applied correspondingly. Phase 120 comprises three basic steps 122, 124, and 126.

    [0040] In step 122, an audio recording 123 is recorded. It is typically much shorter than the audio recording 113. However, it may also be a continuous audio recording. This audio recording 123 is then analyzed in a downstream step 124, which is comparable in content to step 114. Again, the digital audio recording 123 is converted to characteristic vectors. Once these second characteristic vectors 125 are present, they can be compared to the characteristic vectors 115.

    [0041] The comparison of step 126 is performed with the goal of determining anomalies. Very similar characteristic vectors and very similar orders of characteristic vectors hint at the fact that there is no anomaly. Deviations from patterns determined before (repetition patterns, typical orders etc.) or deviations from the audio segments determined before characterized by other/new characteristic vectors hint at an anomaly. These are recognized in step 126.

    [0042] In step 126, different types of anomalies can be recognized. Examples of these are:

    [0043] Sound anomaly (new sound unheard so far),

    [0044] Temporal anomaly (sound already heard occurs at an “unsuitable” time, is repeated too fast or occurs in a wrong order with other sounds),

    [0045] Spatial anomaly (sound heard already occurs at “unfamiliar” spatial position, or the corresponding source follows an unfamiliar spatial motion pattern).

    [0046] These anomalies will be discussed in detail referring to FIG. 2.

    [0047] Optionally, a probability can be output for each of the three types of anomalies at a time x. This is illustrated by the arrows 126z, 126k, and 126r (one arrow per type of anomaly) in FIG. 3.

    [0048] It is to be pointed out here that, when comparing the characteristic vectors, the vectors are frequently not identical, but only similar. This means that, in accordance with embodiments, threshold values can be defined for when characteristic vectors, or groups of characteristic vectors, count as similar, so that the result also presents a threshold value for an anomaly. This threshold value application can follow outputting the probability distribution or occur in combination with it, for example in order to allow more precise temporal recognition of anomalies.
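The threshold-based comparison can be illustrated by measuring, for each second characteristic vector, the distance to its nearest first characteristic vector. This is a sketch under assumptions: the Euclidean distance, the function names, and the single global `threshold` are choices made for this example, and the reference set is assumed to be non-empty.

```python
import math


def nearest_distance(vector, reference_vectors):
    """Distance from `vector` to the closest reference (first) vector."""
    return min(math.dist(vector, ref) for ref in reference_vectors)


def flag_sound_anomalies(second_vectors, first_vectors, threshold):
    """Indices of second vectors that are not similar to any first vector.

    A vector counts as similar when its nearest reference vector lies within
    `threshold`; everything farther away is flagged as a candidate anomaly.
    """
    return [i for i, v in enumerate(second_vectors)
            if nearest_distance(v, first_vectors) > threshold]
```

A larger threshold tolerates more variation in familiar sounds at the cost of missing subtle deviations, so in practice the value would have to be tuned for the environment.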

    [0049] In accordance with further embodiments, it is also possible to recognize spatial anomalies. Here, step 114, in the adjusting phase 110, may also comprise unsupervised learning of typical spatial positions and/or movements of certain noises. Typically, in such a case, instead of the microphone 18 illustrated in FIG. 3, there are two microphones or a microphone array having at least two microphones. In such a situation, in the second phase 120, spatial localization of the current dominant sound sources/audio segments is also possible using a multi-channel recording. The basic technology may be beamforming, for example.
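To give an idea of how a two-microphone recording carries spatial information, the sketch below estimates the time delay between two channels by maximizing their cross-correlation. Note that this is a simpler stand-in for illustration and not beamforming itself; the function name `estimate_delay` and the `max_lag` parameter are hypothetical.

```python
def estimate_delay(channel_a, channel_b, max_lag):
    """Estimate by how many samples channel_b trails channel_a.

    Tries every lag in [-max_lag, max_lag] and returns the one with the
    largest cross-correlation. A positive result means the sound reached
    microphone A first; a negative result means it reached microphone B first.
    """
    n = min(len(channel_a), len(channel_b))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(n):
            u = t + lag
            if 0 <= u < n:
                score += channel_a[t] * channel_b[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Given the microphone spacing and the sample rate, such a delay could be mapped to a direction of arrival, which in turn could serve as the position associated to an audio segment.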

    [0050] Referring to FIGS. 2a-2c, three different anomalies will be discussed. FIG. 2a illustrates a temporal anomaly. Respective audio segments ABC for both phase 1 and phase 2 are plotted along the time axis t. In phase 1, it was recognized that a normal situation or normal order is present such that the audio segments ABC occur in the order of ABC. In addition, a repetition pattern was recognized so that, after the first group ABC, another group ABC may follow.

    [0051] When precisely this pattern ABCABC is recognized in phase 2, it can be assumed that there is no anomaly, or at least no temporal anomaly. If, however, the pattern ABCAABC illustrated here is recognized, there is a temporal anomaly since a further audio segment A is arranged between the two groups ABC. This audio segment A or abnormal audio segment A is provided with a double frame.
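The ABCABC versus ABCAABC example can be reproduced with a small sketch that learns which successions of audio segments occur in phase 1 and flags any succession in phase 2 that was never observed. Restricting the learned order to directly consecutive pairs, and the function names used, are assumptions of this example.

```python
def learn_transitions(sequence):
    """Set of (current, next) label pairs observed in the reference sequence."""
    return set(zip(sequence, sequence[1:]))


def temporal_anomalies(sequence, known_transitions):
    """Indices of segments that follow their predecessor in an unseen order."""
    return [i + 1
            for i, pair in enumerate(zip(sequence, sequence[1:]))
            if pair not in known_transitions]


known = learn_transitions("ABCABC")             # pairs A-B, B-C and C-A
print(temporal_anomalies("ABCABC", known))      # []
print(temporal_anomalies("ABCAABC", known))     # [4]
```

Index 4 (counting from zero) is the additional segment A, i.e. the segment drawn with a double frame in FIG. 2a.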

    [0052] A sound anomaly is illustrated in FIG. 2b. In phase 1, the audio segments ABCABC were again recorded along the time axis t (cf. FIG. 2a). The sound anomaly shows in that another audio segment, in this case the audio segment D, occurs in phase 2. This audio segment D is of increased length, i.e. it extends over two time regions and is therefore illustrated as DD. The sound anomaly is provided with a double frame in the order of types of the audio segments. This sound anomaly may, for example, be a sound never heard during the learning phase. Exemplarily, this may be a thunder sound, which differs from the previous elements ABC as regards loudness/intensity and as regards length.

    [0053] A spatial anomaly is illustrated in FIG. 2c. In the initial learning phase, two audio segments A and B were recognized at two different positions, position 1 and position 2. During phase 2, both elements A and B were recognized again, wherein localization determined that both the audio segment A and the audio segment B are located at position 1. This means that the presence of audio segment B at the position 1 is a spatial anomaly.
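The spatial example with segments A and B can be sketched by remembering at which positions each sound was observed during the learning phase and flagging known sounds that appear elsewhere. Representing positions as discrete identifiers and the function names below are assumptions of this example.

```python
def learn_positions(observations):
    """Map each sound label to the set of positions where it was observed."""
    known = {}
    for label, position in observations:
        known.setdefault(label, set()).add(position)
    return known


def spatial_anomalies(observations, known_positions):
    """Observations of a known sound at a position not seen for that sound.

    Labels that are entirely unknown are skipped here, since they would be
    sound anomalies rather than spatial anomalies.
    """
    return [(label, position)
            for label, position in observations
            if label in known_positions
            and position not in known_positions[label]]


reference = learn_positions([("A", 1), ("B", 2)])
print(spatial_anomalies([("A", 1), ("B", 1)], reference))  # [('B', 1)]
```

The output corresponds to FIG. 2c: audio segment B at position 1 is reported, while audio segment A at position 1 is considered normal.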

    [0054] Referring to FIG. 3, an apparatus 10 for sound analysis will be discussed. The apparatus 10 basically comprises the input interface 12, like a microphone interface, and a processor 14. The processor 14 receives the one or more (present at the same time) audio signals from the microphone 18 or the microphone array 18′ and analyzes the same. Here, it basically performs steps 114, 124, and 126 discussed in connection with FIG. 1. The result to be output (cf. output interface 16) for each phase is a set of characteristic vectors representing the normal state, or, in phase 2, an output of the recognized anomalies, for example associated to a certain type and/or associated to a certain point in time.

    [0055] Additionally, at the interface 16, a probability of anomalies or probability of anomalies at certain points in time or, generally, a probability of characteristic vectors at certain points in time can be determined.

    [0056] In accordance with embodiments, the apparatus 10 or the audio system is configured to recognize (simultaneously) different types of anomalies, like at least two anomalies, for example. The following fields of application are conceivable:

    [0057] Security monitoring of buildings and facilities

    [0058] Detection of burglary (like glass breaking)/damage (vandalism)

    [0059] Predictive maintenance

    [0060] Recognizing the onset of abnormal machine behavior due to unfamiliar sounds

    [0061] Monitoring public spaces/events (sports events, music events, demonstrations, rallies, etc.)

    [0062] Recognizing danger noises (explosion, gunshot, cries for help)

    [0063] Traffic monitoring

    [0064] Recognizing certain vehicle noises (like spinning wheels, speeders)

    [0065] Logistics monitoring

    [0066] Monitoring construction sites, recognizing accidents (collapse, cries for help)

    [0067] Health

    [0068] Acoustic monitoring of the normal everyday life of elderly/ill people

    [0069] Recognizing people falling/crying for help

    [0070] Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method such that a block or device of an apparatus also corresponds to a respective method step or a feature of a method step. Analogously, aspects described in the context of or as a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, some or several of the most important method steps may be executed by such an apparatus.

    [0071] Depending on certain implementation requirements, embodiments of the invention may be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray disc, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, a hard drive or another magnetic or optical memory having electronically readable control signals stored thereon, which cooperate or are capable of cooperating with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer-readable.

    [0072] Some embodiments according to the invention include a data carrier comprising electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.

    [0073] Generally, embodiments of the present invention can be implemented as a computer program product with program code, the program code being operative for performing one of the methods when the computer program product runs on a computer.

    [0074] The program code may, for example, be stored on a machine-readable carrier.

    [0075] Other embodiments comprise the computer program for performing one of the methods described herein, wherein the computer program is stored on a machine-readable carrier.

    [0076] In other words, an embodiment of the inventive method is, therefore, a computer program comprising program code for performing one of the methods described herein, when the computer program runs on a computer.

    [0077] A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the computer-readable medium are typically tangible and/or non-transitory.

    [0078] A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example via the Internet.

    [0079] A further embodiment comprises processing means, for example a computer, or a programmable logic device, configured or adapted to perform one of the methods described herein.

    [0080] A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.

    [0081] A further embodiment according to the invention comprises an apparatus or a system configured to transfer a computer program for performing at least one of the methods described herein to a receiver. The transmission can, for example, be performed electronically or optically. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.

    [0082] In some embodiments, a programmable logic device (for example a field-programmable gate array, FPGA) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field-programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, in some embodiments, the methods are performed by any hardware apparatus. This can be universally applicable hardware, such as a computer processor (CPU), or hardware specific for the method, such as ASIC.

    [0083] The apparatus described herein may be implemented, for example, using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.

    [0084] The apparatus described herein, or any component of the apparatus described herein may be implemented at least partly in hardware and/or software (computer program).

    [0085] The methods described herein may be implemented, for example, using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.

    [0086] The methods described herein, or any component of the methods described herein may be performed at least partly by hardware and/or software.

    [0087] While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
