MULTIMODAL AND REAL-TIME METHOD FOR FILTERING SENSITIVE MEDIA

20170289624 · 2017-10-05

Abstract

A multimodal and real-time method for filtering sensitive content, receiving as input a digital video stream, the method including segmenting the digital video into video fragments along the video timeline; extracting features containing significant information from the digital video input on sensitive media; reducing the semantic difference between each of the low-level video features and the high-level sensitive concept; classifying the video fragments, generating a high-level label (positive or negative) with a confidence score for each fragment representation; performing high-level fusion to properly match the possible high-level labels and confidence scores for each fragment; and predicting the sensitive moments by combining the labels of the fragments along the video timeline, indicating the moments when the content becomes sensitive.

Claims

1. A multimodal and real-time method for filtering sensitive content, receiving as input a digital video stream, comprising: segmenting digital video into video fragments along a video timeline; extracting features containing significant information from the digital video input on sensitive media; reducing a semantic difference between each of low-level video features, and a high-level sensitive concept; classifying the video fragments, and generating a high-level label (positive or negative), with a confidence score for each fragment representation; performing high-level fusion to properly match the possible high-level labels and confidence scores for each fragment; and predicting sensitive moments by combining labels of the fragments along the video timeline, indicating the moments when the content becomes sensitive.

2. A multimodal and real-time method for filtering sensitive content according to claim 1, wherein the extracting the features comprises extracting visual, auditory or textual features from frames, audio, or text extracted from the digital video, respectively.

3. A multimodal and real-time method for filtering sensitive content according to claim 1, wherein the reducing the semantic difference between each of the low-level video features, and the high-level sensitive concept, in an offline operation comprises: analyzing dominant components by transforming the features into feature vectors; projecting the feature vector into another vector space after the transformations were learned in the analyzing; building one codebook for later reference by splitting a space of low-level descriptions into various regions where each region is associated with a visual/auditory/textual word, and storing these words in the codebook; mid-level coding to quantify each low-level feature vector extracted from the frames/audio/text with respect to its similarity to the words that compose the codebook; and grouping fragments by aggregating the quantization obtained from the encoding, and summarizing how the visual/auditory/textual words are being manifested.

4. A multimodal and real-time method for filtering sensitive content according to claim 1, wherein the reducing the semantic difference between each of the low-level video features, and the high-level sensitive concept, in an online operation comprises data projection, mid-level coding and grouping of fragments, in which a previously learned projection transformation and a codebook are read in the data projection and in the mid-level coding, respectively.

5. A multimodal and real-time method for filtering sensitive content according to claim 1, wherein the classifying the video fragments, in an offline operation, comprises generating a prediction model which applies a supervised machine learning technique to deduce an ideal video fragment classification model, and this learned model is stored as a prediction model for later use in an online operation.

6. A multimodal and real-time method for filtering sensitive content according to claim 1, wherein the classifying the video fragments, in an online operation, comprises predicting a video segment class, wherein labels for each unknown video segment are predicted with a confidence score based on a prediction model that was previously learned/estimated in an offline operation.

7. A multimodal and real-time method for filtering sensitive content according to claim 1, wherein the performing the high-level fusion comprises: temporally aligning N video segment classifiers along the video timeline; representing an N-dimensional vector, which builds an N-dimensional vector for each instant of interest of a target video, wherein, within this vector, every i-th component (with i belonging to the natural interval [1 . . . N]) holds a classification confidence score of the i-th fragment classifier in relation to a video segment whose reference moment coincides with the instant of interest; in an offline operation, generating a late fusion model, which receives a training dataset groundtruth and employs a supervised machine learning technique to generate a good late fusion model, the learned late fusion model being stored for later use; and in an online operation, predicting an N-dimensional vector class, which retrieves the late fusion model and predicts the labels for each N-dimensional vector with a proper confidence score; wherein a classification score noise suppressing uses any kind of denoising function to flatten a classification score along the video timeline, and then a classification score fusing combines scores of adjacent video instants of interest that belong to a same sensitive class according to decision thresholds.

8. A multimodal and real-time method for filtering sensitive content according to claim 1, which detects sensitive digital video content in real time using a low memory and computational footprint on low-powered devices comprising smartphones, tablets, smart glasses, virtual reality devices/displays, smart TVs and other video devices.

9. A multimodal and real-time method for filtering sensitive content according to claim 1, wherein said video fragments have a fixed or varied temporal size, and may or may not have temporal overlap.

10. A multimodal and real-time method for filtering sensitive content according to claim 3, wherein chosen parameters of algebraic processing in the analyzing the dominant components are learned/estimated from a learning/training dataset, and then stored in a projection transformation dataset for later use.

11. A multimodal and real-time method for filtering sensitive content according to claim 1, wherein a machine learning method comprises one of: support vector machines (SVM), Random Forests, and decision trees.

12. A multimodal and real-time method for filtering sensitive content according to claim 1, wherein, in a case of missing fragments, the confidence score has a complete uncertainty value that may be interpolated.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0073] The objectives and advantages of the present invention will become clearer through the following detailed description of the example and non-limitative figures presented at the end of this document, wherein:

[0074] FIG. 1 is a flowchart that depicts a sample embodiment of the present invention on a smartphone.

[0075] FIG. 2 is a flowchart that depicts a sample embodiment of the present invention on a Smart TV.

[0076] FIG. 3 is a flowchart that depicts the overview operation of the present invention.

[0077] FIG. 4 is a flowchart that depicts the offline operation of the present invention, which corresponds to the training phase of the method.

[0078] FIG. 5 is a flowchart that depicts the online operation (connected) according to an embodiment of the proposed invention, which corresponds to the execution phase (regular use) of the method.

[0079] FIG. 6 is a flowchart that depicts the high-level fusion solution according to an embodiment of the proposed invention.

DETAILED DESCRIPTION OF THE INVENTION

[0080] The following description is presented to enable any person skilled in the art to make and to use the embodiments, and is provided in the context of particular applications and their requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features herein disclosed.

[0081] The detailed description of the present invention follows a top-down approach. Hence, we start with the disclosure of two sample embodiments (FIGS. 1 and 2), to clarify the purpose of the invention. In the sequence, we depict an overview of the proposed method (FIG. 3), and then we delve into the details of the offline and online operations for each type of extracted low-level feature (visual, auditory, and textual) (FIGS. 4 and 5). In the end, we explain the high-level fusion technique (FIG. 6).

[0082] FIG. 1 is a flowchart that depicts a possible embodiment of the present invention. The solid arrow represents the sequence of events (2) within the embodiment execution.

[0083] The action starts with a user 1 using her smartphone 2, where a system implementing the proposed invention was previously deployed in the form of a scanning app 3. The app locates all the video files stored in the device and additional videos that can be stored in memory cards 4, and starts scanning them to identify which files present sensitive content (e.g., violence, pornography).

[0084] The progress of the file scanning can be checked by means of a progress bar 5, and the sensitive videos are iteratively listed 6. One can note that the smartphone 2 may stay offline during the entire scanning process (which is shown by means of the flight mode icon 7 and the absence of wireless connections 8). This means that the scanning and sensitive content detection processes are performed locally, with no need for additional processing steps on external or remote machines, despite possible memory and processing restrictions of the smartphone.

[0085] In the process of sensitive video detection, visual, auditory and/or textual features are extracted from the video files, to support the app execution.

[0086] FIG. 2 is a flowchart that depicts another possible embodiment of the present invention. The solid arrows represent the sequence of events (3) within the embodiment execution.

[0087] The action starts with a user 1 operating her Smart TV 10 with a regular remote control 9. The equipment runs under a kid's safe mode (represented by a proper status icon 11). When activated, the kid's safe mode provides—as a background service—a real-time analysis of every video stream that the TV is requested to play. The kid's safe mode can be activated and deactivated by means of the TV settings 12 or by a "Safe mode" button/function on the remote control 9, and it runs locally, with no need for connections to remote servers or external databases.

[0088] The user 1 chooses to watch a regular web video stream 13, which leads to the video content being played 14. Given that the TV is under safe mode, the video content is always analyzed before being displayed, in a manner that is transparent to the user. Without any human supervision or awareness, whenever the video becomes sensitive, the visual and auditory contents are censored on demand 15.

[0089] In the process of sensitive content detection, visual, auditory and/or textual features are extracted from the video streams, to support the service execution.

[0090] FIG. 3 is a flowchart that depicts the overview operation of the method of the present invention. Each rectangular box is an activity, and the arrows represent the precedence of activities. Some activities are interleaved by black icons 16, 17, 18 and 19, which highlight the type of data that flows between the activities. Dashed arrows represent a simple flow of data, and a parallelogram represents output.

[0091] Regardless of being offline or online, the method operation 100 starts from the Digital Video file or stream 16, which is segmented into video snippets along the video timeline. These snippets may have a fixed or varied temporal length, and they may or may not present temporal overlap.
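
As an illustration only, the snippet segmentation can be sketched as a sliding window over the video timeline. The window length and stride below are arbitrary example values, not parameters prescribed by the method.

    # Hypothetical sketch of Video Snippet Segmentation (110): fixed-length,
    # possibly overlapping snippets over the video timeline. Values are examples.
    def segment_timeline(duration_s, snippet_len_s=5.0, stride_s=2.5):
        """Return (start, end) pairs, in seconds, covering the video timeline."""
        snippets = []
        start = 0.0
        while start < duration_s:
            end = min(start + snippet_len_s, duration_s)
            snippets.append((start, end))
            start += stride_s  # stride < snippet_len_s yields temporal overlap
        return snippets

    # Example: a 20-second video split into 5-second snippets with 50% overlap.
    print(segment_timeline(20.0))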

[0092] Once the snippets are produced by the Video Snippet Segmentation activity 110, the Feature Extraction activity 120 is performed, in order to generate a set of features (a.k.a. "feature vectors") which contains the significant information from the input data (i.e., the Digital Video 16) regarding sensitive media. Each one of the snippets is subject to three types of low-level Feature Extraction 120:

[0093] 1. Visual Feature Extraction 122 regards the processes that analyze the Frames 17 of the video snippets, previously extracted by Frame Extraction 121. These processes include any type of global or local still-image description method, interest point descriptor, or space-temporal video description solution that may be available to the method embodiment.

[0094] 2. Auditory Feature Extraction 124 is related to the processes that analyze the Audio 18 of the video snippets, previously extracted by Audio Extraction 123. These processes include any type of audio description solution (e.g., MFCC) that may be available to the method embodiment.

[0095] 3. Textual Feature Extraction 126 concerns the processes that analyze any Text 19 that may be associated to a video snippet (e.g., subtitles and closed caption), previously extracted by Text Extraction 125. These processes include any type of text description solution (e.g., stem frequency, etc.) that may be available to the method embodiment.

[0096] The activities of feature extraction 120, and more specifically items 122, 124, and 126 conclude the low-level stage of the proposed method (Low-level Feature Extraction stage, in FIG. 3).

[0097] In the sequence, each possible process of low-level feature extraction follows an independent path through the Video Snippet Mid-level Aggregate Representation 130, which is responsible for reducing the semantic gap that exists between each one of the low-level video features and the high-level sensitive concept. In doing so, it constitutes the mid-level stage of the method operation (Mid-level Video Snippet Representation stage, in FIG. 3). More details on the mid-level representation (130) are given in FIGS. 4 and 5, which will be further explained.

[0098] The Video Snippet Classification activity 140, in turn, outputs a high-level label (positive or negative), with a confidence score, for each snippet representation. It thus starts the high-level stage of the proposed method (High-level Snippet Classification stage, in FIG. 3).

[0099] Given that each snippet may have various representations—and therefore, various high-level labels and confidence scores—the High-level Fusion activity 150 is responsible for taking the labels of the snippets and combining them along the video timeline, in order to obtain the moments when the content becomes sensitive. In the end, the Sensitive Moment Prediction 160 outputs the prediction of the sensitive video moments, which concludes the High-level Fusion stage in FIG. 3. More details on the high-level fusion are given in FIG. 6.

[0100] It is noteworthy to mention that the present method does not work exclusively for a particular type of sensitive content (e.g., only for pornography, or only for violence). It works for any concept of interest.

Offline or Disconnected Execution (Training/Learning Phase)

[0101] FIG. 4 is a flowchart that depicts the offline operation of the proposed method, which corresponds to the training phase of the method. Each rectangular box is an activity, and the solid arrows represent the precedence of activities. Dashed arrows represent a simple flow of data, and cylinders represent any sort of storage device. The flowchart is generically represented to deal with visual, auditory and textual information, since the step sequences are similar, and particularities regarding each type of information will be properly described when necessary.

[0102] The depicted operation is offline, which means that it aims at training the method. This Training Phase (offline operation) must be done before the regular execution (online operation), in order to generate a mathematical model that is able to predict the class of the unknown videos that will be analyzed during the regular execution. The offline operation thus starts by taking known and previously labeled training samples, which are stored as either Positive Video Snippets 20, or Negative Video Snippets 21.

[0103] Following, video information (frames, audio and/or text) is extracted in the Frame/Audio/Text Extraction activities 121, 123, 125. At this point, if the video information is visual, any type of frames may be used (e.g., I, P or B frames, or even all the frames taken at a chosen or random frame rate). Moreover, the frame resolution may simply be maintained, or it may be reduced, either for the sake of further savings in computational time or for any other reason.

[0104] In the sequence, the Visual/Audio/Text Feature Extraction activities 122, 124, 126 are performed. In case of visual information (frames), it provides the low-level description of the extracted frames, by means of any type of global or local still-image descriptor, interest point descriptor, or space-temporal video descriptor. Typical examples from the literature may include (but are not limited to) Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), Histogram of Oriented Gradients (HOG), Space Temporal Interest Points (STIP), etc. In case of auditory information (audio), it provides the low-level description of the audio snippets, and solutions to perform it may include (but are not limited to) Mel-frequency Cepstral Coefficients (MFCC), brightness, tonality, loudness, pitch, etc.
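
As a non-limiting illustration of such low-level descriptors, the sketch below computes a HOG descriptor for a single frame with OpenCV and MFCCs for an audio snippet with librosa. These particular libraries and parameter values are assumptions made for the example, not requirements of the method.

    # Illustrative low-level description of a frame (HOG) and an audio snippet (MFCC).
    # OpenCV and librosa are assumed to be available; parameters are example values.
    import cv2
    import librosa

    def visual_features(frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        gray = cv2.resize(gray, (64, 128))        # default HOG detection window size
        hog = cv2.HOGDescriptor()
        return hog.compute(gray).ravel()          # one HOG feature vector per frame

    def auditory_features(audio_path):
        signal, sr = librosa.load(audio_path, sr=None)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
        return mfcc.T                             # one 13-D vector per audio frame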

[0105] As a result of the feature extraction, the extracted information (visual, auditory and/or textual) is translated into feature vectors, which are susceptible to the application of diverse algebraic transformations that may enhance the quality of the data (e.g., decorrelate the feature vector components, etc.). This is the aim of the Dominant Component Analysis activity 131, which analyses the numeric behavior of the feature vector components, and estimates algebraic transformations that may improve further separations of the data samples. An example of doing so is the application of Principal Component Analysis (PCA), but it is not limited to that. As a result of this step, the parameters of the chosen algebraic transformation are learned (a.k.a. estimated) from the training dataset, and they need to be stored for further use (which leads to the Projection Transformation data 22).

[0106] Once the parameters of the algebraic transformation are learned, the feature vectors are projected onto another vector space, a task that is related to the Data Projection activity 132. Besides that, for the sake of saving computational time, it is common (but not an indispensable requirement) to project the feature vectors onto another space that presents fewer components than the original one (i.e., the feature vectors are converted to smaller vectors, a.k.a. dimensionality reduction).
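
A minimal sketch of the Dominant Component Analysis 131 and Data Projection 132 activities follows, assuming PCA as the chosen transformation and scikit-learn as the implementation; the number of components and the storage path are arbitrary examples.

    # Illustrative Dominant Component Analysis (131) and Data Projection (132) with PCA.
    # scikit-learn and joblib are assumed; 64 components is an arbitrary example.
    from sklearn.decomposition import PCA
    from joblib import dump, load

    def learn_projection(training_vectors, n_components=64, path="projection.joblib"):
        pca = PCA(n_components=n_components, whiten=True)
        pca.fit(training_vectors)              # offline: learn the transformation
        dump(pca, path)                        # store as Projection Transformation 22
        return pca

    def project(feature_vectors, path="projection.joblib"):
        pca = load(path)                       # online: reuse the learned transformation
        return pca.transform(feature_vectors)  # smaller, decorrelated vectors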

[0107] Prior to the mid-level aggregate representation of the low-level features, there is the necessity to construct the Codebook 23, for posterior reference. Such task is linked to the Codebook Construction activity 133, and usually there may be a codebook for each type of video information (visual, auditory and textual). The basic idea is to somehow split the space of low-level descriptions into multiple regions, where each region is associated to a visual/auditory/textual word. Thus, by storing these visual/auditory/textual words, we have a representative codebook 23. Strategies to construct the codebook may vary a lot. For instance, they may comprise (but are not limited to) unsupervised learning techniques, such as k-means clustering, or other clustering methods (e.g., k-medians), etc. In a different fashion, other solution developers manage to use even simpler strategies, such as randomly sampling the description space in order to select k representatives. Additionally, more sophisticated strategies can also be used, such as the application of an Expectation-Maximization (EM) algorithm to establish a Gaussian Mixture Model (GMM) on the low-level description space. In addition, content-aware approaches may be employed, where the codebook construction is done by the selection of a controlled number of representative feature vectors from each known problem class.
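
One possible realization of the Codebook Construction activity 133 is sketched below, assuming k-means clustering over the projected low-level descriptions; the codebook size and storage path are example values only.

    # Illustrative Codebook Construction (133) by k-means clustering.
    # Each cluster center plays the role of a visual/auditory/textual word.
    from sklearn.cluster import KMeans
    from joblib import dump

    def build_codebook(projected_vectors, n_words=256, path="codebook.joblib"):
        kmeans = KMeans(n_clusters=n_words, random_state=0, n_init=10)
        kmeans.fit(projected_vectors)
        dump(kmeans.cluster_centers_, path)    # store the Codebook 23
        return kmeans.cluster_centers_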

[0108] Once the codebook 23 is obtained, the next step comprises the Mid-level Encoding activity 134. This step aims at quantifying every low-level feature vector extracted from the frames/audio/text (previously on activities 122, 124, 126), with respect to its similarity to the words that compose the codebook 23. Techniques to do that may include (but are not limited to) hard- or soft-coding, and Fisher Vectors.

[0109] The following step, Snippet Pooling 135, aggregates the quantization obtained in the previous encoding step, by summarizing—in a single feature vector for each video snippet—how often the visual/auditory/textual words are being manifested. Strategies to do that may include (but are not limited to) sum, average or max pooling.
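
A hedged sketch of the Mid-level Encoding 134 and Snippet Pooling 135 activities is given below, assuming hard assignment against the codebook followed by average pooling; soft-coding, Fisher Vectors, or sum/max pooling could be substituted.

    # Illustrative Mid-level Encoding (134, hard assignment) and Snippet Pooling (135, average).
    import numpy as np

    def encode_hard(projected_vectors, codebook):
        # Assign each low-level vector to its nearest codebook word.
        dists = np.linalg.norm(projected_vectors[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        one_hot = np.zeros((len(projected_vectors), len(codebook)))
        one_hot[np.arange(len(projected_vectors)), nearest] = 1.0
        return one_hot

    def pool_snippet(encoded_vectors):
        # Summarize how often each word is manifested within the snippet.
        return encoded_vectors.mean(axis=0)    # single mid-level vector per snippet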

[0110] The steps 131-135 are considered sub-tasks of Video Snippet Mid-level Aggregate Representation 130.

[0111] Finally, from the mid-level aggregate representation of each training video snippet—whose labels are known in advance—a supervised machine learning technique can be employed to deduce a “good” video snippet classification model (i.e., a mathematical model that is able to predict, with high accuracy and enriched by a confidence score, the label of unknown video snippets). That is related to the Prediction Model Generation activity 141, and the learned/estimated Prediction Model 24 must be stored for further use (regular, online operation/execution). Usually, there may be a Prediction Model 24 for each type of video information (visual, auditory and textual). Many machine learning solutions may be applied to this last classification process. Alternatives may comprise (but are not limited to) Support Vector Machines (SVM), including the many SVM variations regarding the type of kernel function that is used to learn the data separation hyperplane, Random Forests, decision trees, etc.
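
As a non-authoritative example of the Prediction Model Generation activity 141, the sketch below trains an SVM with probability outputs on the pooled snippet representations; Random Forests or decision trees could be used instead, and the storage path is an assumption.

    # Illustrative Prediction Model Generation (141) with an SVM; other learners also apply.
    from sklearn.svm import SVC
    from joblib import dump

    def train_snippet_classifier(snippet_vectors, labels, path="prediction_model.joblib"):
        # labels: 1 for positive (sensitive) snippets, 0 for negative ones.
        model = SVC(kernel="rbf", probability=True)  # probability=True enables confidence scores
        model.fit(snippet_vectors, labels)
        dump(model, path)                            # store as Prediction Model 24
        return model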

Online or Connected Execution (Regular Use, or Execution Phase)

[0112] FIG. 5 is a flowchart that describes the online operation of the proposed method, which corresponds to the execution phase (regular use) of the method. Each rectangular box is an activity, and the solid arrows represent the precedence of activities. Dashed arrows represent a simple flow of data, and cylinders represent any sort of storage device. The flowchart is generically represented to deal with visual, auditory and textual information, since the step sequences are similar, and particularities regarding each type of information will be properly described when necessary.

[0113] The described operation is online, which means that it represents a regular use of the method, when an Unknown Digital Video 25 is presented for analysis. As mentioned, at this point, the training phase or offline operation (depicted in FIG. 4) has already been done.

[0114] In the Low-level Feature Extraction stage, the video is first segmented into video snippets, along the video timeline (Video Snippet Segmentation activity 110). As mentioned in the Method Overview (FIG. 3), these snippets may have fixed or varied temporal length, and they may or may not present temporal overlap.

[0115] In the sequence, Frame/Audio/Text Extraction (activities 121, 123, 125) and Visual/Audio/Text Feature Extraction (activities 122, 124, 126) must be performed in the same way as in the offline operation (please refer to FIG. 4).

[0116] Thereafter, Data Projection 132, Mid-level Encoding 134, and Snippet Pooling 135—which are also the same as performed in the offline operation (see FIG. 4)—are executed one after the other. These steps 132, 134 and 135 are sub-tasks of Video Snippet Mid-level Aggregate Representation 130 (FIG. 3), and constitute the Mid-level Video Snippet Representation stage. Please notice that, at this stage, the previously learned (during the offline operation, training phase, FIG. 4) Projection Transformation 22 and Codebook 23 are read/retrieved by activities Data Projection 132 and Mid-level Encoding 134, respectively.

[0117] In the end, in the High-level Video Snippet Classification stage, the labels of each unknown video snippet are predicted, with a confidence score, based on the Prediction Model 24 that was previously learned/estimated in the offline operation (FIG. 4). The prediction task is thus related to the Video Snippet Class Prediction activity 142, and it depends on the machine learning technique used to generate the Prediction Model 24. Alternatives may comprise (but are not limited to) Support Vector Machines (SVM), Random Forests, decision trees, etc.
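
A minimal sketch of the Video Snippet Class Prediction activity 142 follows, assuming the SVM model from the offline example above; the confidence score is simply taken from the predicted class probability.

    # Illustrative Video Snippet Class Prediction (142): label plus confidence score.
    from joblib import load

    def predict_snippet(snippet_vector, path="prediction_model.joblib"):
        model = load(path)                               # Prediction Model 24, learned offline
        proba = model.predict_proba([snippet_vector])[0]
        label = int(model.classes_[proba.argmax()])      # 1 = sensitive, 0 = non-sensitive
        confidence = float(proba.max())
        return label, confidence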

[0118] Although it is not illustrated in FIG. 5, the online operation of the proposed method continues to the next, final steps (as depicted in FIG. 3). Given that each snippet may have various representations—and therefore, various high-level labels and confidence scores provided by the previous step (Video Snippet Class Prediction 142)—the High-level Fusion activity 150 is responsible for soundly combining them into a single answer. Then, in the end, the Sensitive Moment Prediction 160 outputs the prediction of the moments when the content becomes sensitive (i.e., pornography, violence, adult content or any other concept of interest that was previously trained and modeled by the offline operation of the proposed method). More details on the high-level fusion are given in FIG. 6.

[0119] In the online operation, when the proposed method detects sensitive content within an Unknown Digital Video, many actions can be taken in order to avoid the presentation of undesirable content, for instance (but not limited to) substituting the set of video frames with sensitive content by completely black frames, blurring the sensitive video frames, or displaying an alert/warning.
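
For instance, a simple frame-level censoring step could blur or blank a frame flagged as sensitive; the OpenCV calls and the kernel size below are illustrative assumptions, not part of the claimed method.

    # Illustrative censoring of a sensitive frame: blur it, or replace it by a black frame.
    import cv2
    import numpy as np

    def censor_frame(frame_bgr, mode="blur"):
        if mode == "blur":
            return cv2.GaussianBlur(frame_bgr, (51, 51), 0)  # heavy Gaussian blur
        return np.zeros_like(frame_bgr)                      # completely black frame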

High-Level Fusion Solution

[0120] FIG. 6 is a flowchart that describes the high-level fusion solution 150 of the method of the present invention. Each rectangular box is an activity, and the solid arrows represent the precedence of activities. Dashed arrows represent a simple flow of data, and cylinders represent any sort of storage device. The diamond, in turn, represents a conditional branch, which provides two different paths on the flow: one for the offline method operation, and another for the online operation. A parallelogram represents input/output.

[0121] As shown by means of items 26 to 29, the fusion starts from the class predictions of diverse video snippets, which are grouped according to the low-level feature extraction method that was employed to describe them. Therefore, item 26, for instance, may refer to the output predictions of a visual-based video snippet classifier that relied on SIFT (Low-level Feature 1) to describe the video content at the low-level stages of the proposed method. Similarly, item 27 may refer to the outputs of an auditory-based classifier that relied on MFCC (Low-level Feature 2), while item 28 may refer to a visual, SURF-based video snippet classifier (Low-level Feature 3). Finally, item 29 may refer to a textual-based classifier (Low-level Feature N). The number N of fused classifiers may be even or odd, ranging from a single classifier to a large number of classifiers. Moreover, the nature of the employed low-level features may be any of the possible ones (either visual, auditory, or textual), regardless of their order, majority or even absence (no use of textual features, for instance).

[0122] In the sequence, the outputs of the N video snippet classifiers 26 to 29 may be aligned along the video timeline, as a manner to organize how the different classifiers evaluated the sensitiveness of the video content. This is the task related to the optional Snippet Temporal Alignment activity 151, which presumes that the video snippets have a reference time (i.e., an instant within the original video timeline of which the snippet is most representative). A snippet reference time may be the first instant of the snippet, but alternatives may consider the most central or even the last instant.

[0123] Next, an N-dimensional vector is constructed for every instant of interest of the target video (e.g., for every second of video). Within this vector, every i-th component (with i belonging to the natural interval [1 . . . N]) must hold the classification confidence score of the i-th snippet classifier, regarding the video snippet whose reference time coincides with the instant of interest. In the case of missing snippets, the confidence score may be assumed as a value of complete uncertainty (e.g., 0.5, in the case of a normalized confidence score, which varies from zero—i.e., no confidence at all—to one—i.e., total confidence), or it may be interpolated. Such task of N-dimensional vector representation is related to the N-dimensional Vector Representation activity 152.
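
A hedged sketch of the N-dimensional Vector Representation activity 152 is given below: for each instant of interest, one confidence score per classifier is collected, and missing snippets receive the complete-uncertainty value 0.5, as described above. The data layout (one dictionary per classifier) is an assumption for the example.

    # Illustrative N-dimensional Vector Representation (152).
    # scores_per_classifier: list of N dicts mapping an instant (in seconds) to a
    # normalized confidence score in [0, 1]; missing instants default to 0.5.
    import numpy as np

    def build_vectors(scores_per_classifier, instants):
        vectors = []
        for t in instants:
            vectors.append([clf.get(t, 0.5) for clf in scores_per_classifier])
        return np.array(vectors)               # shape: (len(instants), N)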

[0124] In the offline operation (training/learning phase) of the method, various training video samples and their respective classified snippets have their classification scores combined into these N-dimensional score vectors. Considering that each N-dimensional vector represents an instant of interest within the target video, the labels of such vectors are deductible from the training dataset groundtruth 30, 31, as long as the training dataset is annotated at frame level. Therefore, the Late Fusion Model Generation activity 153 receives the training dataset groundtruth (represented by the Positive Groundtruth and Negative Groundtruth storages, respectively 30 and 31), and employs a supervised machine learning technique to generate a good late fusion model: i.e., a mathematical model that is able to predict, with high accuracy and enriched by a confidence score, the label of an unknown N-dimensional vector. The learned Late Fusion Model 32 must be stored for further use (during regular, online use/execution). At this point, many machine learning solutions may be applied, for instance (but not limited to) SVM, Random Forests, and decision trees.

[0125] Concerning the online operation, an unknown video sample and its respective video snippets have their classification scores properly combined into the N-dimensional score vectors (on activity 152). At this point, it is important to mention that the order in which the outputs of the video snippet classifiers are combined must be the same that was adopted in the offline fusion operation.

[0126] Thereafter, the N-dimensional Vector Class Prediction activity 154 retrieves the Late Fusion Model 32, and predicts the labels of each N-dimensional vector, with a proper confidence score. Given that each N-dimensional vector represents an instant of interest within the unknown video, the predicted labels actually predict every instant of interest of the video.

[0127] Notwithstanding, giving a classification confidence score for every video instant of interest may generate a very noisy answer in time, with interleaving positive and negative segments at an unsound rate that may change too often and too fast with regard to the actual occurrence of enduring and relevant sensitive events. Hence, in the Classification Score Noise Suppression activity 155, any kind of denoising function can be used to flatten the classification score along the video timeline. Strategies to do that may include (but are not limited to) the use of Gaussian blurring functions.
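
One possible denoising function for the Classification Score Noise Suppression activity 155 is a one-dimensional Gaussian filter over the score timeline, as sketched below with SciPy; the sigma value is an arbitrary example.

    # Illustrative Classification Score Noise Suppression (155) with Gaussian blurring.
    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def suppress_noise(scores_over_time, sigma=2.0):
        # scores_over_time: one confidence score per instant of interest, in temporal order.
        return gaussian_filter1d(np.asarray(scores_over_time, dtype=float), sigma=sigma)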

[0128] Next, the Classification Score Fusion activity 156 aims at combining the scores of adjacent video instants of interest that belong to the same sensitive class, according to decision thresholds. The inherent idea, therefore, is to substitute the sequences of diverse scores by a single and representative one, which may persist for a longer time, thus better characterizing the sensitive or non-sensitive video moments. Strategies to do that may comprise (but are not limited to) assuming a score threshold t, and then substituting all the time adjacent scores equal to or greater than t by their maximum (or average) value, and all the time adjacent scores smaller than t by their minimum (or average) value.
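
A minimal sketch of the Classification Score Fusion activity 156 under the thresholding strategy described above: adjacent scores on the same side of a threshold t are replaced by their maximum (at or above t) or their minimum (below t).

    # Illustrative Classification Score Fusion (156): merge adjacent instants of the same class.
    import numpy as np

    def fuse_scores(scores, t=0.5):
        scores = np.asarray(scores, dtype=float)
        fused = scores.copy()
        i = 0
        while i < len(scores):
            j = i
            positive = scores[i] >= t
            while j < len(scores) and (scores[j] >= t) == positive:
                j += 1                          # extend the run of same-class instants
            run = scores[i:j]
            fused[i:j] = run.max() if positive else run.min()
            i = j
        return fused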

[0129] Finally, the Sensitive Moment Prediction 160 outputs the prediction of the moments when the content becomes sensitive (i.e., pornography, violence, adult content or any other concept of interest that was previously trained and modeled by the offline operation of the proposed method).

Experiments and Results

[0130] In the context of the experiments using the proposed method of the present invention, we report the results for pornography classification on the Pornography-2K dataset. It comprises nearly 140 hours of video, in 1000 pornographic and 1000 non-pornographic videos, which vary from six seconds to 33 minutes.

[0131] To evaluate the results of our experiments, we apply a 5×2-fold cross-validation protocol. It consists of randomly splitting the Pornography-2K dataset five times into two folds, balanced by class. Each time, the training and testing sets are switched, so that 10 analyses are conducted for every model employed.
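
For reference, the 5×2-fold protocol can be sketched as five repetitions of a stratified two-fold split in which the roles of training and testing are swapped; the scikit-learn calls and the accuracy metric below are illustrative assumptions, not the exact evaluation code used in the experiments.

    # Illustrative 5x2-fold cross-validation: 5 random class-balanced 2-fold splits,
    # with training and testing sets switched, yielding 10 evaluations per model.
    # X and y are assumed to be NumPy arrays of samples and labels.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.base import clone

    def five_by_two_cv(model, X, y):
        accuracies = []
        for repetition in range(5):
            skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=repetition)
            for train_idx, test_idx in skf.split(X, y):  # each fold is used once for training
                clf = clone(model)
                clf.fit(X[train_idx], y[train_idx])
                accuracies.append(clf.score(X[test_idx], y[test_idx]))
        return np.mean(accuracies)                       # 10 analyses in total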

[0132] The method of the present invention has a classification accuracy of 96% for pornography, and the analysis takes about one second per analyzed frame on a mobile platform (which has computational/hardware restrictions). The method does not require the analysis of all frames in the video in order to reach its high accuracy: only one frame must be analyzed per second. For instance, if the video has 30 or 60 frames per second, the required rate of frames to be analyzed is 1 every 30 or 60 frames. Therefore, in real-time execution, the analysis time is always less than the video duration.

[0133] Regarding violence classification, as mentioned, there is a lack of a common definition of violence, an absence of standard datasets, and the existing methods were developed for very specific types of violence (e.g., gunshot injury, war violence, car chases). Consequently, the results are not directly comparable. For this reason, the proposed method was tested on a benchmarking initiative dedicated to evaluating new methods for the automated detection of violent scenes in Hollywood movies and web videos, called the MediaEval Violent Scenes Detection (VSD) task, which provides a common ground truth and standard evaluation protocols. The proposed method obtained a classification accuracy of 87% for violence.

[0134] These results represent an efficient and effective classification of diverse sensitive media on mobile platforms.

Applications

[0135] There are many applications for the method of the present invention:

[0136] detecting, via surveillance cameras, inappropriate or violent behavior;

[0137] blocking undesired content from being uploaded to (or downloaded from) general purpose websites (e.g., social networks, online learning platforms, content providers, forums), or from being viewed on some places (e.g., schools, workplaces);

[0138] preventing children from accessing adult content on personal computers, smartphones, tablets, smart glasses, Virtual Reality devices, or smart TVs; and

[0139] preventing improper content from being distributed over phones via sexting, for instance.

[0140] Although the present invention has been described in connection with certain preferred embodiments, it should be understood that it is not intended to limit the invention to those particular embodiments. Rather, it is intended to cover all alternatives, modifications and equivalents possible within the spirit and scope of the invention as defined by the appended claims.