SYSTEM AND METHODS TO CREATE MULTI-FACETED INDEX FOR INSTRUCTIONAL VIDEOS

Abstract

Features are extracted from the visual and audio modalities of a video to infer the locations of figures, tables, equations, graphs and flowcharts. These locations are designated as video anchor points and are highlighted on the video timeline to enable quick navigation and provide a quick summary of the video.

A voice-based mechanism navigates to a point-of-interest in the video.

In bandwidth-constrained settings, videos are often played at a very low resolution (quality), and users often need to increase the video resolution manually to understand content presented in the figures. Using the automatic identification of the aforementioned anchor points, the resolution can be changed dynamically while streaming a video, which provides a better viewing experience.

Claims

1. An indexing system for video content including a plurality of anchor points of selected interest to a user of the video content, comprising: an anchor point identifying processor configured to identify occurrence of selected ones of the anchor points based on high-level visual elements, the high-level visual elements are identified from multi-modal inputs, such as audio and visual data from a video and textual data obtained from the audio and visual data; a label assignation processor configured to assign labels to the identified anchor points; and an index generator configured to form and display in association with the video content a multi-faceted index, comprising a relative representation of position of a selected one of the anchor points within the video content, and a 2-dimensional timeline identifying a plurality of categories of the anchor points.

2. The system of claim 1, comprising a resolution changer configured to adjust displayed resolution of a one of the anchor points selected for display by the user via the index.

3. The system of claim 1, wherein the identifying processor comprises a segmentation of text and non-text regions.

4. The system of claim 3, wherein the identifying processor comprises optical character recognition of the text region for the anchor points.

5. The system of claim 3, wherein the identifying processor comprises feature extraction and classification of the non-text region for the anchor points.

6. The system of claim 5, wherein the feature extraction and classification comprises SIFT and SURF extraction.

7. The system of claim 5, wherein the feature extraction and classification comprises convolutional neural network classification.

8. The system of claim 1, wherein the index generator comprises a timeline relative presentation of anchor point position in the video content.

9. The system of claim 1, wherein the index generator comprises a topic table.

10. The system of claim 1, including a voice based navigation processor configured to navigate to an identified anchor point.

11. The system of claim 1, wherein the label assignation processor references associated topic discussion to determine the assigned topic.

12. The system of claim 11, wherein the label assignation processor comprises at least one of stop-words removal, time specific relevance, many to one stemming and majority vote inversion.

13. The system of claim 1, further comprising: an index generator further configured to provide an overlay or interactive summary of anchor points over the video, wherein the anchor points provide a link to relevant section of the video.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] FIG. 1 is a process diagram for localizing anchor points in a video;

[0009] FIG. 2 is a flowchart illustrating a process to assign labels to discovered anchor points;

[0010] FIG. 3 shows multi-faceted indexing of video content with time location indexing; and

[0011] FIGS. 4A and 4B show alternative multi-faceted indexing of anchor points in a video with a 2D timeline.

DETAILED DESCRIPTION

[0012] The proposed embodiments comprise at least three different components, namely spatial and temporal localization of anchor points (e.g. diagrams, figures, tables, flowcharts, equations, charts/graphs, code snippets and the like), voice-based video navigation, and anchor point assisted video streaming. The pipeline/block diagram of the proposed system is shown in FIG. 1.

[0013] The anchor points are typically portions of the video having an increased interest or special content to the student viewing the video, and thus are preferably accessible to the student in a more convenient and expeditious manner. They can be identified in several ways.

[0014] In FIG. 1 the identifier blocks may comprise a single or several processors, either hardware or software, for effecting the stated functions. More particularly, an anchor point identifying processor comprises several of the illustrated blocks below.

[0015] Frame Extraction:

[0016] First, a conversion formatting tool such as “FFmpeg” is used to extract 10 all the frames from an educational video.
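By way of a non-limiting illustration, the frame extraction step might be sketched as follows. The sketch only assembles an FFmpeg command line of the common form "ffmpeg -i input output_pattern"; the function name, the file names and the "frame_%06d.png" naming pattern are assumptions introduced for illustration and are not part of the described embodiments.

```python
from pathlib import Path


def build_frame_extraction_command(video_path: str, out_dir: str) -> list[str]:
    """Return an FFmpeg argument list that writes decoded frames as PNG files."""
    # frame_%06d.png is a hypothetical naming pattern; FFmpeg expands %06d
    # into a zero-padded frame counter (frame_000001.png, frame_000002.png, ...).
    pattern = str(Path(out_dir) / "frame_%06d.png")
    return ["ffmpeg", "-i", video_path, pattern]


command = build_frame_extraction_command("lecture.mp4", "frames")
print(" ".join(command))
# To execute: create the output directory first, then call
# subprocess.run(command, check=True) on a machine with FFmpeg installed.
```

Building the argument list separately from running it keeps the sketch usable on machines where FFmpeg is not installed.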

[0017] Segmentation of Text/Non-Text Regions:

[0018] Next, a text localization algorithm is used 12 to segment out the text and non-text regions in each frame. Once this segmentation is performed each of these streams is processed separately 14, 16 along with the speech-to-text transcript (if available) to determine the locations of these anchor points in the video. Deep/shallow features are extracted 14 from the non-text regions.

[0019] Optical character recognition (“OCR”) is performed 16 on the text regions to determine if there are any indications of the presence of any anchor points in this frame or in nearby frames. Verbal or printed cues such as “In this figure”, “In this table”, “look at the table” etc. are looked for as typical indicators. Co-reference resolution is also performed on the text to properly connect pronouns or other referring expressions to the right anchors. For example, when the teacher says “look at this” and points towards the figure shown on a slide, it can be automatically determined that the teacher is referring to the figure in that slide, and that the figure should be an anchor point.

[0020] Similar processing is performed on the speech-to-text transcript (if available) to determine the presence of the same kinds of cues as are looked for in the printed text.
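As an illustrative sketch only, cue detection over OCR output or a transcript could be approximated with a regular expression. The pattern below is a hypothetical one modeled on the example phrases given above; it does not perform co-reference resolution, and the function name is an assumption.

```python
import re

# Hypothetical cue pattern modeled on phrases such as "In this figure"
# and "look at the table"; a real system would use a richer cue list.
CUE_PATTERN = re.compile(
    r"\b(?:in|look at|see)\s+(?:this|the)\s+"
    r"(figure|table|graph|equation|flowchart)\b",
    re.IGNORECASE,
)


def find_anchor_cues(text: str) -> list[str]:
    """Return the anchor categories named by cue phrases, in order of appearance."""
    return [match.group(1).lower() for match in CUE_PATTERN.finditer(text)]


print(find_anchor_cues("In this figure we plot latency. Now look at the table."))
```

The same function can be applied to both the OCR text and the speech-to-text transcript, since both are plain strings at this stage.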

[0021] Feature extraction and classification includes determining 22 to which category (figure, equation, graph or table) a non-text region belongs. First, a large dataset of anchor images is collected along with their category labels. Different kinds of features are extracted from the training images and classifiers are built 14, 24 on top of those to automatically figure out the category of an unlabeled image. Machine statistical comparison techniques that are well known are employed to determine the video content category.

[0022] Shallow Models:

[0023] In this scenario, SIFT (scale invariant feature transform) and SURF (speeded up robust features) are extracted from the training images to create a bag-of-words model on the features. For example, 256 clusters in the bag-of-words model can be used. Then a support vector machine (SVM) classifier is trained using the 256 dimensional bag-of-features from the training data. For each un-labelled image (non-text region) the SIFT/SURF features are extracted and represented using the bag-of-words model created using the training data. The image is then fed into the SVM classifier to find out the category of the video content.
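The bag-of-words representation described above can be sketched, under simplifying assumptions, as a histogram of nearest cluster centers. The sketch uses toy two-dimensional descriptors and two clusters in place of real SIFT/SURF descriptors and 256 clusters, normalizes the histogram (an assumption, since the text does not say whether counts are normalized), and omits the SVM training step.

```python
import math


def nearest_cluster(descriptor, centroids):
    """Index of the centroid closest to the descriptor (Euclidean distance)."""
    distances = [math.dist(descriptor, c) for c in centroids]
    return distances.index(min(distances))


def bag_of_words_histogram(descriptors, centroids):
    """Normalized histogram of visual-word counts for one non-text region."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest_cluster(d, centroids)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(centroids)


# Toy data: two cluster centers and four local descriptors (hypothetical values).
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [1.0, 1.0]]
print(bag_of_words_histogram(descriptors, centroids))
```

With 256 cluster centers the returned list would be the 256-dimensional feature vector that is passed to the classifier.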

[0024] Deep Models:

[0025] Convolutional neural networks (CNNs) are used to classify non-text regions. CNNs have been extremely effective in automatically learning features from images. CNNs process an image through different operations such as convolution, max-pooling etc. to create representations that are analogous to those formed in human brains. CNNs have recently been very successful in many computer vision tasks, such as image classification, object detection, segmentation etc. Motivated by that, a CNN for classification is used 22 to determine the anchor points. An existing convolutional neural network called “AlexNet” is fine-tuned on the collected training images to create an end-to-end anchor point classification system. During fine-tuning, the weights of the top layers of the CNN are modified while the weights of the lower layers are kept similar to their initial values.

[0026] Decision Making Engine:

[0027] Once the non-text regions have been classified into one of these classes, and the presence of cues has been determined by processing the written text and the speech-to-text transcript, the results are combined 26 using rule-based systems to make the final prediction about the spatial and temporal presence of the anchor points. A multi-faceted index 28 (FIGS. 3, 4A and 4B) is determined and formatted for display from the final predictions.
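The text does not specify the rules used by the decision making engine. Purely as an illustration, one possible rule is sketched below: a confident classification is accepted outright, and a moderately confident one is accepted only when a textual or spoken cue supports it. The record fields, the thresholds 0.8 and 0.5, and the rule itself are assumptions.

```python
from dataclasses import dataclass


@dataclass
class FrameEvidence:
    time_s: float      # timestamp of the frame in seconds
    category: str      # classifier output for the non-text region
    confidence: float  # classifier score in [0, 1]
    ocr_cue: bool      # cue phrase found in on-screen text
    speech_cue: bool   # cue phrase found in the transcript


def is_anchor_point(e: FrameEvidence, high: float = 0.8, low: float = 0.5) -> bool:
    """Hypothetical rule combining classifier confidence with cue evidence."""
    if e.confidence >= high:
        return True
    return e.confidence >= low and (e.ocr_cue or e.speech_cue)


evidence = FrameEvidence(795.0, "table", 0.6, ocr_cue=False, speech_cue=True)
print(is_anchor_point(evidence))
```

Keeping the thresholds as parameters makes it straightforward to tune the rule on labeled videos.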

[0028] Voice-Based Video Navigation:

[0029] The resulting multi-faceted index is enriched with anchor points and helps in navigating to a point of interest in a video. Combined with voice-based interfaces on mobile devices, the index is very helpful for differently abled people, e.g. individuals with motor impairment (hand tremor) who have difficulty navigating to a desired point of interest using a traditional video timeline. Thus a student can use voice-based retrieval or navigation 32 to locate a desired anchor point in the video. For example, a user may specify a voice query, “go to flowchart where heap sort concept was discussed”, or a textual search query, “heap sort video with a flowchart”.

[0030] Apart from localization of a figure/table in a given video, the proposed embodiments help in tagging those images with concept-specific information (derived from visual and textual transcript information). For instance, if a table is located at a given time (say 13:15) in a 20-minute video, a set of labels is assigned to this table using the process described in FIG. 2.

[0031] FIG. 2 shows a step by step process to determine and assign topics or topic labels to the discovered anchor points.

[0032] Stop-Words Removal:

[0033] The video transcript 40 and any visual text from OCR 42 are processed in “stop words removal” 44 so that all the non-important words such as “A”, “An”, “The” are filtered out. This step filters all the keywords which do not have relevance to the core video content (i.e. concepts).

[0034] Time-Specific Relevance:

[0035] The thus filtered video content is processed in an “estimate time-specific relevance” step 46 to locate 48 the determined anchor points of the video. This step finds the time interval in the video during which a particular anchor point has been discussed.

[0036] Many to One Stemming:

[0037] The keywords identified in a specific time interval as described above are converted into their root form (e.g. “courses” to “course”) 50.

[0038] Majority Vote Inversion:

[0039] Sometimes stemming may lead to unreadable keyword forms and therefore, instead of the root form, the most frequently occurring form in the video is selected 52.

[0040] After all the above text processing steps, a set of representative keywords forming the topic labels of every anchor point in a video is determined 54.
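The four text processing steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the stop-word list is a tiny sample, naive_stem is a toy suffix stripper standing in for a real stemmer, the transcript is assumed to be available as (time in seconds, word) pairs, and ranking stems by frequency is an assumed way of choosing the representative keywords.

```python
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "this", "to", "we", "and"}


def naive_stem(word: str) -> str:
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def topic_labels(timed_words, start_s, end_s, top_k=3):
    """timed_words: iterable of (time_s, word). Returns up to top_k labels."""
    # Stop-words removal and time-specific relevance: keep only content
    # words spoken or shown inside the interval [start_s, end_s].
    window = [w.lower() for t, w in timed_words
              if start_s <= t <= end_s and w.lower() not in STOP_WORDS]
    # Many-to-one stemming: group surface forms under a shared stem.
    forms = defaultdict(Counter)
    for w in window:
        forms[naive_stem(w)][w] += 1
    # Rank stems by total frequency, most frequent first.
    ranked = sorted(forms.items(), key=lambda kv: -sum(kv[1].values()))
    # Majority vote inversion: report the most common surface form per stem.
    return [counter.most_common(1)[0][0] for _, counter in ranked[:top_k]]


words = [(10, "The"), (11, "sorting"), (12, "sorting"), (13, "sorts"),
         (14, "heap"), (15, "heaps"), (16, "heap"), (17, "heap"), (100, "table")]
print(topic_labels(words, 0, 20))
```

The toy stemmer also shows why majority vote inversion is useful: it reduces “courses” to the unreadable stem “cours”, so a readable surface form is reported in its place.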

[0041] For example, if there is a flowchart discovered in a video, the assigned labels from the above process can describe which associated algorithm (or topic) has been discussed using the flowchart. These labels give extra information to the associated anchor points and make voice-based retrieval much easier. Proposed embodiments support exemplary queries such as the following:

[0042] “Navigate to the flowchart which discusses heap sort algorithm.”

[0043] “Open the instance of the video which talks about binary search algorithm.”

[0044] “Navigate to the table which compares probability to gender employment.”

[0045] The anchor point responsive to such queries is located and displayed for the student.
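Once a voice query has been transcribed to text (speech recognition is outside the scope of this sketch), matching it to an anchor point can be illustrated with simple keyword overlap. The data structure, the scoring and the example time stamps below are assumptions made for illustration; the embodiments do not prescribe a particular matching method.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AnchorPoint:
    time_s: int                               # position in the video, in seconds
    category: str                             # e.g. "flowchart", "table"
    labels: set = field(default_factory=set)  # topic labels from the FIG. 2 process


def find_anchor(query: str, anchors):
    """Return the best-matching anchor for a transcribed query, or None."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    best, best_score = None, 0
    for anchor in anchors:
        if anchor.category not in words:
            continue  # the query must name the anchor category
        score = 1 + len(anchor.labels & words)  # category hit plus label overlap
        if score > best_score:
            best, best_score = anchor, score
    return best


# Hypothetical index entries.
anchors = [
    AnchorPoint(120, "flowchart", {"heap", "sort"}),
    AnchorPoint(300, "flowchart", {"binary", "search"}),
    AnchorPoint(795, "table", {"probability", "gender"}),
]
hit = find_anchor("Navigate to the flowchart which discusses heap sort algorithm.", anchors)
print(hit.time_s if hit else None)
```

The returned time stamp is what a player would seek to in order to display the anchor point.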

[0046] Anchor points assisted video rate streaming is included in the embodiments for enhanced clarity of display to the student.

[0047] Current adaptive video streaming algorithms remain agnostic to the content of the video and, as a result, often offer a poor quality of experience, especially in low-bandwidth conditions. These streaming algorithms consider the bandwidth and CPU constraints for adaptation purposes (i.e. segment size, duration, etc.). The proposed embodiments provide methods to automatically localize salient portions of the video (in terms of anchor points), which can be used as one of the constraints for adaptation. For example, one optimization that can be performed is to send 30 (FIG. 1) the segments that contain one or more anchor points at a higher resolution, thus improving the quality of user experience.
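One way such an anchor-aware constraint could enter a per-segment resolution decision is sketched below. The policy (pick the highest resolution the bandwidth allows, then step up one level when the segment contains an anchor point) and the bitrate ladder values are hypothetical and are not taken from the described embodiments.

```python
def pick_resolution(bandwidth_kbps, has_anchor, ladder):
    """Choose a frame height for one segment.

    ladder: list of (height_px, required_kbps) sorted by ascending height.
    Hypothetical policy: take the highest rung the bandwidth allows, then
    move up one rung for segments that contain an anchor point.
    """
    index = 0  # fall back to the lowest rung if nothing fits
    for i, (_, required) in enumerate(ladder):
        if required <= bandwidth_kbps:
            index = i
    if has_anchor:
        index = min(index + 1, len(ladder) - 1)
    return ladder[index][0]


# Illustrative bitrate ladder (assumed values).
LADDER = [(240, 400), (360, 800), (720, 2500), (1080, 5000)]
print(pick_resolution(1000, False, LADDER), pick_resolution(1000, True, LADDER))
```

A production player would also have to account for buffer occupancy, which this sketch ignores.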

[0048] With reference to FIG. 3, a selected video entitled “Layer-Based Names” is indexed by a plurality of topics as:

TABLE-US-00001
Topic                            Time
Reference Models                 0:11
OSI 7 layer Reference Models     1:03
Internet Reference Models        3:39
Internet Reference Models(2)     5:30
Internet Reference Models(3)     7:09
Standards Bodies                 8:08
Layer-Based Names                11:20
Layer-Based Names(2)             12:37
Layer-Based Names(3)             13:28
A Note About Layers              14:32
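A time location index of this kind lends itself to a simple lookup from playback time to topic. The sketch below uses the topic start times listed above; the function names and the choice of a binary search are assumptions made for illustration.

```python
from bisect import bisect_right

# Topic start times as listed in the index for "Layer-Based Names".
TOPICS = [
    ("Reference Models", "0:11"),
    ("OSI 7 layer Reference Models", "1:03"),
    ("Internet Reference Models", "3:39"),
    ("Internet Reference Models(2)", "5:30"),
    ("Internet Reference Models(3)", "7:09"),
    ("Standards Bodies", "8:08"),
    ("Layer-Based Names", "11:20"),
    ("Layer-Based Names(2)", "12:37"),
    ("Layer-Based Names(3)", "13:28"),
    ("A Note About Layers", "14:32"),
]


def to_seconds(stamp: str) -> int:
    """Convert an "M:SS" time stamp to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


STARTS = [to_seconds(stamp) for _, stamp in TOPICS]


def topic_at(stamp: str):
    """Return the topic being shown at the given time, or None before the first topic."""
    i = bisect_right(STARTS, to_seconds(stamp)) - 1
    return TOPICS[i][0] if i >= 0 else None


print(topic_at("7:12"))
```

Because the start times are sorted, the lookup stays fast even for long videos with many topics.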

[0049] With reference to FIG. 4A, for a video relating to computer networks, a 2D timeline that identifies different categories of anchor points 60 is graphically shown. For example, the video contains three diagrams and two tables that may be easily identified by relative position in the content with access to the 2D timeline 62 so that a user can quickly go to the diagrams and tables without having to search through the entire content. A list of relevant topic labels 65 is also shown. FIG. 4B is similar to FIG. 4A in which the video topic of “Internet Reference Models” is shown to include two anchor points 64 via the 2D timeline 66 comprising a single diagram and a single label table. In this example, the diagram is shown at time 7:12 and is displayed to a user. Other topics are available in the list of topics 68.
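The data behind such a 2D timeline can be sketched as one row per anchor category, with each anchor placed at its relative position along the video. The anchor times and the video duration below are hypothetical, as is the choice to express positions as fractions of the duration.

```python
from collections import defaultdict


def timeline_rows(anchors, duration_s):
    """Group anchors into rows for a 2D timeline.

    anchors: iterable of (category, time_s).
    Returns {category: [fraction, ...]}, where each fraction is the anchor's
    relative position along the video (0.0 = start, 1.0 = end).
    """
    rows = defaultdict(list)
    for category, time_s in anchors:
        rows[category].append(round(time_s / duration_s, 3))
    return {category: sorted(positions) for category, positions in rows.items()}


# Hypothetical anchors in a 20-minute (1200 s) video.
example = [("diagram", 432), ("table", 600), ("diagram", 60)]
print(timeline_rows(example, 1200))
```

A user interface could draw each row as a separate track and place a marker at each fraction of the track width.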

[0050] The exemplary embodiments have been described with reference to the preferred embodiments. Obviously, modifications and alterations will occur to others upon reading and understanding the preceding detailed description. It is intended that the exemplary embodiment be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

CPC classification

G11B27/105; G06V10/70; G11B27/28; G09B5/065; G11B27/34; G06F18/21; G06V10/40; G06V30/224; G09B5/08

International classification

G09B5/06; G06K9/18; G11B27/34; G10L15/22; G11B27/28; G09B5/08; G11B27/031; G06K9/46; G06K9/62