Multi-channel acoustic event detection and classification method

11830519 · 2023-11-28

Assignee

Inventors

CPC classification

International classification

Abstract

A method for multi-channel acoustic event detection and classification of weak signals operates in two stages: a first stage detects the power and probability of events within a single channel, and accumulated events in the single channel trigger a second stage, in which a power-probability image is generated and classified using tokens of neighbouring channels.

Claims

1. A method for a multi-channel acoustic event detection and classification, comprising the following steps of: specifying a time window from raw acoustic signals, received from a multi-channel acoustic device in a synchronized fashion and stored in a channel database, computing a power of each channel of channels for a specified window size, computing a classification probability of the raw acoustic signals for the time window, computing a cross product of the power and the classification probability and storing the cross product as a third dimension of a power-probability image to enrich an information capacity, wherein a first dimension, a second dimension and the third dimension of the power-probability image are respectively the power, the classification probability and the cross product of the power and the classification probability, applying a convolutional neural network trained to detect spectrograms of acoustic events, denoted as a phoneme classifier, on the each channel independently, counting high-probability events exceeding a given threshold independently for the each channel using probability information from the power-probability image to detect possible channels with the high-probability events, recording the channels having a certain number of the high-probability events, exceeding the given threshold, to an event channel stack, cropping a region of interest around every event of interest, wherein the every event of interest is determined by a user in the each channel in the event channel stack, operating a power-probability classifier on accumulated results of phoneme classifier probabilities along with the power for a certain type of event classified by the phoneme classifier, reporting an event when the power-probability classifier generates a result exceeding a threshold for the event to be declared.

2. The method according to claim 1, comprising utilizing a synthetic activity generator to create possible event scenarios for a training along with actual data.

3. The method according to claim 1, wherein the power of the each channel for the specified window size is computed by: normalizing the power using a ratio of low-frequency components to high-frequency components, clipping the power from a top and a bottom and quantizing to a power quantization level in between, storing a quantized power in the power-probability image.

4. The method according to claim 1, wherein a machine learning technique for computing the classification probability of the raw acoustic signals for the time window is the convolutional neural network.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

(1) FIG. 1 shows a block diagram of the invention.

(2) FIG. 2 shows spectrograms of a variety of events.

(3) FIG. 3 shows a sample power-probability image.

(4) FIG. 4 shows noise background sample images.

(5) FIGS. 5, 6 and 7 show sample power-probability images for digging.

(6) FIG. 8 shows a sample network structure.

(7) FIG. 9 shows a standard neural net and the same net after applying dropout, respectively.

DETAILED DESCRIPTION OF THE EMBODIMENTS

(8) Examining the power and probability of a channel independently creates false alarms. The most common false alarm source is highway regions, which manifest themselves as digging activity due to bumps or microphones being close to the road. Considering several channels together enables the system to adapt to contextual changes such as a vehicle passing by. In this way the system learns abnormal paint-strokes in the power-probability image.

(9) As given in FIG. 1, the present invention evaluates the events in each channel independently using a lightweight phoneme classifier. Channels with a certain number of events are further analysed by a context-based power-probability classifier that utilizes several neighbouring channels/microphones around the putative event. This approach enables real-time operation and reduces false alarms drastically.

(10) The proposed system uses three memory units:
Channel database: Raw acoustic signals received from a multi-channel acoustic device in a synchronized fashion.
Power-probability image: Stores the power and probability token of each channel computed for a window. The image height defines the largest possible time duration an event can span, while the image width indicates the number of channels/microphones. This image is shifted row-wise, while fresh powers and probabilities are inserted at the first row every time. The image contains the power, the probability, and the cross product of these two features.
Event-channel stack: Stores the indices of channels whose individual voting exceeds a threshold and indicates a possible event.
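The row-wise shifting of the power-probability image can be sketched as follows. This is a minimal illustration only; the class and parameter names are not from the patent, and the three depth channels follow the (power, probability, power×probability) layout described above.

```python
import numpy as np

class PowerProbabilityImage:
    """Rolling image: rows = time steps, columns = channels/microphones,
    depth = 3 (power, probability, power * probability)."""

    def __init__(self, n_rows, n_channels):
        # n_rows bounds the longest event duration; n_channels is the
        # number of microphones in the array.
        self.img = np.zeros((n_rows, n_channels, 3), dtype=np.float32)

    def push(self, power, prob):
        """Shift the image one row down and insert fresh tokens at row 0."""
        self.img = np.roll(self.img, 1, axis=0)
        self.img[0, :, 0] = power
        self.img[0, :, 1] = prob
        self.img[0, :, 2] = power * prob
```

On each new time window, `push` is called once per event type with that window's per-channel power and phoneme probability, so the oldest row falls off the bottom of the image.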

(11) The proposed system uses two networks trained offline:
Phoneme classifier: A network that classifies acoustic features such as spectrograms using short time windows for a single channel.
Power-probability classifier: A network that classifies events using the multi-channel power, probability, and their cross product.

(12) The online flowchart of the system is as follows: A time window is specified that can summarize the smallest acoustic event. The power is computed for the specified window size. The power is normalized using the ratio of low-frequency components to high-frequency components, clipped from top and bottom ([−30, 20] dB), and quantized to the power quantization level number (20) in between. The quantized power is stored in the power-probability image. The classification probability of the signal for the time window is computed using machine learning. Convolutional neural networks (CNN) are utilized for this purpose, while other machine learning techniques can also be used instead. The computed classification probability is stored in the power-probability image for the event of interest. Notice that there is a different power-probability image for every event to be declared, such as walking, digging, excavation, vehicle. The cross product of power and probability is computed and stored as a third dimension of the image, to enrich the information capacity of the system. High-probability events which exceed a given threshold are counted for every channel independently from the power-probability image, using probability information only. This voting scheme allows detecting possible channels with events. Every channel's probabilities are treated as a queue, such that old events are popped out of the queue using a time-to-live. Channels which have a certain number of events with high probability are recorded to the event channel stack. For every event in the event channel stack, and for every event of interest determined by the user, a region of interest is cropped around the channel. A channel width of 12 generates an image with a width of 25. For a sampling rate of 5 Hz and a time span of 60 seconds, the power-probability image becomes 25×300. A convolutional neural network (CNN) trained for a certain action is applied to the image for that channel region.
An event is reported in case the power-probability classifier generates a result exceeding the threshold for the event.
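Three of the steps above — power quantization, per-channel voting, and region-of-interest cropping — can be sketched as below. The clipping range ([−30, 20] dB), the 20 quantization levels, and the half-width of 12 channels (giving a 25-channel-wide crop) come from the text; the function names and the thresholds used in the test are illustrative assumptions.

```python
import numpy as np

DB_MIN, DB_MAX = -30.0, 20.0  # clipping range from the text, in dB
Q_LEVELS = 20                 # power quantization level number

def quantize_power(power_db):
    """Clip power to [-30, 20] dB and quantize to 20 levels (0..19)."""
    clipped = np.clip(power_db, DB_MIN, DB_MAX)
    step = (DB_MAX - DB_MIN) / Q_LEVELS
    return np.minimum(np.floor((clipped - DB_MIN) / step), Q_LEVELS - 1)

def vote_channels(prob_image, prob_threshold, vote_threshold):
    """Count high-probability events per channel (columns of the
    probability plane) and return indices of channels whose vote
    count reaches the vote threshold."""
    votes = (prob_image > prob_threshold).sum(axis=0)
    return np.flatnonzero(votes >= vote_threshold)

def crop_roi(pp_image, center_channel, half_width=12):
    """Crop a region of interest of width 2*half_width + 1 = 25
    channels around a voted channel."""
    lo = center_channel - half_width
    hi = center_channel + half_width + 1
    return pp_image[:, lo:hi, :]
```

With a 5 Hz window rate and a 60-second span, the cropped input to the power-probability classifier has shape 300×25×3, matching the 25×300 image mentioned above.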

(13) The offline flowchart of the system is as follows: The acoustic phoneme-based classifier is trained. A short time window, such as 1.5 seconds, is utilized to detect the acoustic phonemes. Spectrograms of acoustic events are shown in FIG. 2. A convolutional neural network is trained to detect these spectrograms. This network is denoted as the phoneme classifier and is applied on each channel independently. (The results of this network are stored in the image database to be further evaluated later on.) This network is a generic one, such that it classifies all possible events, i.e. digging, walking, excavation, vehicle, noise. The power-probability classifier operates on the accumulated results of the phoneme classifier probabilities along with the power for a certain type of event. A synthetic activity generator is utilized to create possible event scenarios for training along with actual data.
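A synthetic activity generator of the kind mentioned above could be sketched as follows. This is a toy illustration under stated assumptions: the background clutter levels, the digging period, and the function name are invented for the example, not taken from the patent; only the image shape (300 rows × 25 channels × 3 planes) and the periodic vertical digging stroke follow the description.

```python
import numpy as np

def synthetic_power_prob(n_rows=300, n_cols=25, period=10, dig_col=12,
                         seed=None):
    """Generate a toy training scenario: low-level clutter background
    plus a periodic 'digging' stroke in one channel column."""
    rng = np.random.default_rng(seed)
    power = rng.uniform(0.0, 0.2, size=(n_rows, n_cols))  # background power
    prob = rng.uniform(0.0, 0.1, size=(n_rows, n_cols))   # background prob
    # Periodic digging strikes: high power/probability every `period` rows.
    power[::period, dig_col] = 1.0
    prob[::period, dig_col] = 0.9
    # Stack as (power, probability, power * probability).
    return np.stack([power, prob, power * prob], axis=-1)
```

Mixing such synthetic images with actual recordings enlarges the training set with event scenarios that are rare in the field data.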

(14) The power-probability image is a three-channel input. The first channel is the normalized, quantized power. The second channel is the phoneme probability. The third channel is the cross product of power and probability: (Power, Probability, Power*Probability).

(15) The power, probability and cross product result for a microphone array spread over 51.5 km can be found in FIG. 3. The following portion displays the statistics of the last 20 km. A digging activity at 46 km reveals itself in the cross product image Pow*Prob. The cross product feature is clean in terms of clutter. Feature engineering along with the machine learning technique detects the digging pattern robustly.

(16) The devised technique can be visualized as an expert inspecting an art piece to detect modifications to an original painting that deviate from the inherent scene acoustics. In FIGS. 4-7, several examples of non-activity background and actual events are provided. An event creates a perturbation of the background power-probability image. The digging timing is not synchronous with the passing cars; hence horizontal digging strokes fall asynchronously with the diagonal lines of vehicles. The network therefore learns this periodic pattern that occurs vertically, considering the power and probability of the neighbouring channels.

(17) FIG. 8 shows a sample network structure. Dropout is used after the fully connected layers in this structure. Dropout reduces overfitting, so the prediction is effectively averaged over an ensemble of models. FIG. 9 shows a standard neural net and the same net after applying dropout, respectively.
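The dropout behaviour described above can be illustrated with the standard inverted-dropout formulation. This sketch is generic, not the patent's specific network: at training time each unit is zeroed with probability `p_drop` and survivors are rescaled by 1/(1−p_drop), so at test time the layer output can be used unchanged.

```python
import numpy as np

def dropout(x, p_drop=0.5, seed=None, train=True):
    """Inverted dropout: randomly zero units during training and
    rescale the survivors; pass the input through at test time."""
    if not train:
        return x
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= p_drop   # keep each unit with prob 1 - p_drop
    return x * mask / (1.0 - p_drop)
```

Because each training pass samples a different mask, the network is trained as an implicit ensemble of thinned sub-networks whose predictions are averaged at test time.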