INTELLIGENT CATALOGING METHOD FOR ALL-MEDIA NEWS BASED ON MULTI-MODAL INFORMATION FUSION UNDERSTANDING
20220270369 · 2022-08-25
Inventors
- Dingguo YU (Hangzhou, CN)
- Suiyu ZHANG (Hangzhou, CN)
- Liping FANG (Hangzhou, CN)
- Yongjiang QIAN (Hangzhou, CN)
- Yaqi WANG (Hangzhou, CN)
- Xiaoyu MA (Hangzhou, CN)
CPC classification
- G06V20/46 (PHYSICS)
- Y02D10/00 (GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS)
- G06V20/49 (PHYSICS)
International classification
Abstract
The present disclosure provides an intelligent cataloging method for all-media news based on multi-modal information fusion understanding, which obtains multi-modal fusion features through unified representation and fusion understanding of the video information, voice information, subtitle bar information, and character information in all-media news, and uses these fusion features to realize automatic slicing, automatic cataloging description, and automatic scene classification of the news. The beneficial effect of the present disclosure is that it realizes a complete, automatic, comprehensive cataloging process for all-media news, improves the accuracy and generalization of the cataloging method, and greatly reduces manual cataloging time by generating stripping marks, news cataloging descriptions, news classification labels, news keywords, and news characters based on the fusion of the video, audio, and text modalities.
Claims
1. An intelligent cataloging method for all-media news based on multi-modal information fusion understanding, comprising the following steps: 1) obtaining original news video, segmenting shot fragments, and locating scene key frames; 2) inferring scene classification labels from the scene key frames obtained in step 1) and merging adjacent shot fragments with similar scene labels to generate multiple slice fragments; 3) performing visual feature extraction on the slice fragments obtained in step 2) and generating news description text; 4) performing voice recognition on the slice fragments obtained in step 2) to obtain voice text; 5) extracting image frame recognition of the slice fragments obtained in step 2) to obtain subtitle bar text; 6) recognizing facial features in the slice fragments obtained in step 2) and matching the facial features in a news character database to obtain character information text; and 7) inputting the news description text obtained in step 3), the voice text obtained in step 4), the subtitle bar text obtained in step 5), and the character information text obtained in step 6) into a generative model of multi-modal fusion for processing to generate news keywords and comprehensive cataloging descriptions, and performing outputting after sorting and assembly to complete intelligent cataloging for the news.
2. The intelligent cataloging method for all-media news based on multi-modal information fusion understanding according to claim 1, wherein in step 1), a process of obtaining original news video, segmenting shot fragments and locating scene key frames specifically comprises: processing the original news video into a set of static image frames, calculating a histogram difference between each frame and the previous frame, setting a window range and a window moving step, taking a frame with the maximum difference in a window as a shot boundary frame, taking all frames between two shot boundary frames as the shot fragment, and extracting a middle frame of each shot fragment as the scene key frame of the fragment.
3. The intelligent cataloging method for all-media news based on multi-modal information fusion understanding according to claim 1, wherein in step 2), a process of inferring scene classification labels from the scene key frames obtained in step 1) and merging adjacent shot fragments with similar scene labels to generate multiple slice fragments specifically comprises: A) extracting visual features of each scene key frame through a trained residual network for a news scene classification task, and obtaining the scene classification labels for news scenes with the highest matching degree by inferring; B) based on the scene classification labels of each fragment obtained in step A), merging adjacent same scenes; and C) taking shot boundary marks remaining after processing in step B) as slice marks of the news video, and taking a frame sequence between adjacent shot boundary marks as the slice fragment to generate the multiple slice fragments.
4. The intelligent cataloging method for all-media news based on multi-modal information fusion understanding according to claim 1, wherein in step 7), the news description text obtained in step 3) is taken as a main feature, and the voice text obtained in step 4), the subtitle bar text obtained in step 5), and the character information text obtained in step 6) are taken as auxiliary features to input into the generative model of multi-modal fusion.
5. The intelligent cataloging method for all-media news based on multi-modal information fusion understanding according to claim 1, wherein in step 7), a process of inputting into a generative model of multi-modal fusion for processing specifically comprises: inputting the news description text, the voice text, the subtitle bar text, and the character information text into an embedding layer trained through news corpus text to transform the text into semantic feature vectors, then mapping these vectors to a unified semantic space through a unified mapping layer respectively, then passing the vectors in the unified semantic space to a news semantic fusion layer for fusion understanding to obtain news fusion features with redundant information eliminated, and finally using the news fusion features to generate the comprehensive cataloging descriptions and a criticality of the news keywords through a trained text decoding layer.
6. The intelligent cataloging method for all-media news based on multi-modal information fusion understanding according to claim 5, wherein in step 7), the generative model of multi-modal fusion uses the following formulas:
text embedding: V_x = x_1·v_1 + x_2·v_2 + … + x_n·v_n, in the formula, x is the one-hot encoding of the embedded text based on an embedded dictionary, and n is the dimension of the embedded dictionary; if x_i is a non-zero bit of x, v_i is the dictionary row vector corresponding to that text; and V_x is the vector after the text is embedded; unified mapping:
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION OF THE EMBODIMENTS
[0078] An intelligent cataloging method for all-media news based on multi-modal information fusion understanding includes: an automated process of multi-modal information intelligent cataloging for all-media news reports, and a process of fusing multi-modal news information to generate news keywords and comprehensive cataloging descriptions. The automated process of multi-modal information intelligent cataloging for all-media news reports includes: a fast video slicing and classification method for news scenes, a process of performing automatic video description for news reports, voice recognition for the news reports, news subtitle bar recognition, and news character matching on the segmented fragments, and a process of fusing multi-modal news information to generate comprehensive cataloging information.
[0079] The process of fusing the multi-modal news information to generate the news keywords and the comprehensive cataloging descriptions includes: news fragment image information, news fragment voice information, news fragment subtitle bar information, and news fragment character information are taken as input; multi-modal features in the news content are converted into semantic text and mapped to a unified semantic space for fusion; and the news keywords and comprehensive news cataloging descriptions are generated based on the news features in the unified space.
[0080] The fast video slicing and classification method for news scenes includes: shot boundary frames and news scene key frames are quickly located based on the difference between frames; visual features are extracted based on news scene key frame images for fast scene classification label determination; and adjacent shot fragments with a high overlap rate of scene classification labels are merged to obtain video slice (strip) fragments that meet the requirements of news cataloging.
[0081] As shown in the accompanying figure, the method specifically includes the following steps:
[0082] Step 1: the original news video is processed into a set of static image frames, a histogram difference between each frame and the previous frame is calculated, a window range and a window moving step are set, and a queue N of candidate shot boundary frames is set, which is initially empty. 10 frames are taken as the window range and 8 frames as the step. Starting from the initial frame of the video, the following process is repeated: the frame with the maximum difference is searched for in the current window, and the distance between this frame and the frame last added to the queue N is determined; if the distance is greater than a preset minimum shot length, the frame is added to the queue N. All frames between two shot boundary frames are taken as a shot fragment (the i-th shot fragment is recorded as D_i, where i is the fragment number starting from 1), and the middle frame of each shot fragment is extracted as the scene key frame of the fragment (the scene key frame of fragment D_i is recorded as k_i).
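The window search in Step 1 can be sketched as follows. The per-frame histograms, window range (10 frames), and step (8 frames) follow the text; the histogram metric (L1 distance) and the minimum shot length value (12 frames) are illustrative assumptions, since the text leaves them unspecified.

```python
# Sketch of the Step 1 shot-boundary search over per-frame histograms.
# frame_hists stands in for the grayscale histograms of the decoded frames.

def hist_diff(h1, h2):
    """L1 distance between two histograms of equal length (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def find_shot_boundaries(frame_hists, window=10, step=8, min_shot_len=12):
    """Return indices of shot boundary frames (the queue N in the text)."""
    # Difference of each frame to its predecessor; frame 0 has no predecessor.
    diffs = [0.0] + [hist_diff(frame_hists[i], frame_hists[i - 1])
                     for i in range(1, len(frame_hists))]
    boundaries = []  # queue N, initially empty
    start = 0
    while start < len(diffs):
        win = range(start, min(start + window, len(diffs)))
        peak = max(win, key=lambda i: diffs[i])  # max-difference frame in window
        # Only accept the peak if it is far enough from the last boundary.
        if not boundaries or peak - boundaries[-1] > min_shot_len:
            boundaries.append(peak)
        start += step
    return boundaries
```

The middle frame between two consecutive boundaries would then serve as the scene key frame k_i of fragment D_i.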
[0083] Step 2: a news scene classification image data set is built, common scene labels in news reports such as "studio", "conference site", and "outdoor connection" are assigned to the images, and the residual network for the news scene classification task is trained. Visual features of each scene key frame k_i from step 1 are extracted through the trained residual network, and the scene classification labels with the highest matching degree are obtained by inference.
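A minimal sketch of the label-inference half of Step 2. The residual network itself is not reproduced; `logits` stands in for its raw class scores on one scene key frame, and the label list matches the examples in the text. Keeping every label whose softmax probability clears `keep_prob` (in addition to the top label) is an assumption made so that the overlap-based merge in Step 3 has label sets to work on.

```python
import math

SCENE_LABELS = ["studio", "conference site", "outdoor connection"]

def infer_scene_labels(logits, keep_prob=0.3):
    """Return the best-matching scene label and the kept label set."""
    # Numerically stable softmax over the (hypothetical) network scores.
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = SCENE_LABELS[probs.index(max(probs))]   # highest matching degree
    kept = {lbl for lbl, p in zip(SCENE_LABELS, probs) if p >= keep_prob}
    kept.add(best)
    return best, kept
```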
[0084] Step 3: based on the scene classification labels of each fragment obtained in step 2, adjacent fragments with the same scene are merged. The process specifically includes: if the overlap rate between the scene classification label of k_i and that of k_{i−1} is greater than a preset threshold (set to 0.5 in the present disclosure), the shot boundary mark between fragments D_i and D_{i−1} is deleted, and the union of the two scene classification labels is taken as the new classification label of the merged fragment.
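The merge in Step 3 can be sketched as below. The text does not define how the "overlap rate" of two label sets is computed; intersection-over-union is one plausible reading and is used here, with the 0.5 threshold from the text.

```python
def overlap_rate(a, b):
    """Intersection-over-union of two label sets (assumed definition)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def merge_fragments(fragments, threshold=0.5):
    """fragments: list of (start_frame, end_frame, label_set) per shot D_i."""
    merged = [fragments[0]]
    for start, end, labels in fragments[1:]:
        p_start, p_end, p_labels = merged[-1]
        if overlap_rate(labels, p_labels) > threshold:
            # Delete the shot boundary: extend the fragment, union the labels.
            merged[-1] = (p_start, end, p_labels | labels)
        else:
            merged.append((start, end, labels))
    return merged
```

The boundaries that survive this pass are the slice marks of Step 4.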
[0085] Step 4: shot boundary marks remaining after processing in step 3 are taken as slice marks of the news video, and a frame sequence between adjacent shot boundary marks is taken as the slice fragment.
[0086] Step 5: based on the slice fragments from step 4, a video description of each fragment is generated through a trained news video cataloging description model. The training process of the news video cataloging description model specifically includes: the news video is manually sliced into single-scene fragments, each fragment is given a manual cataloging description, the fragments are taken as input features, the description text corresponding to each fragment is taken as the target output, and the model is iteratively trained with the goal of reducing the difference between the actual output and the target output. The inference process of the news video cataloging description model specifically includes: a fragment is input into the model, its visual features are extracted through a convolutional neural network module in the model, and these visual features are then passed to an LSTM network module of the model to generate natural language text describing the news content.
[0087] Step 6: based on an audio stream of the slice fragments in step 4, audio features are extracted and converted through voice recognition technology to generate voice text.
[0088] Step 7: image frames are extracted from the slice fragments of step 4 at an interval equal to the number of frames generated in one second (that is, one frame per second). The subtitle bar text is then extracted from these frames through a convolutional neural network trained for the in-image text recognition task. Finally, the extracted text is compared and deduplicated, and the final subtitle recognition text is output.
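The sampling and deduplication logic of Step 7 can be sketched as follows. `recognize_text` is a hypothetical stand-in for the trained text-recognition network; it is injected as a callable so the one-frame-per-second sampling and repeat-dropping can be shown on their own.

```python
def extract_subtitles(num_frames, fps, recognize_text):
    """Sample one frame per second, OCR it, and deduplicate the results."""
    texts = []
    for i in range(0, num_frames, fps):        # one frame per second
        line = recognize_text(i).strip()
        # Compare against already-collected lines and drop repeats,
        # keeping first-occurrence order.
        if line and line not in texts:
            texts.append(line)
    return texts
```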
[0089] Step 8: image frames are extracted from the slice fragments of step 4 at an interval equal to the number of frames generated in one second (that is, one frame per second). Facial features are then extracted from these frames through a convolutional neural network trained for the facial recognition task, and the extracted facial features are matched against the facial features in the news character database. If the similarity reaches a preset threshold (set to 0.72 in the present disclosure), the character is regarded as successfully matched, and finally the information text of the successfully matched, non-repeating characters is output.
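A sketch of the matching half of Step 8. Feature extraction by the facial-recognition network is not reproduced; faces and the news character database are plain feature vectors here, compared by cosine similarity (an assumed metric, since the text only says "similarity") against the 0.72 threshold given in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_characters(face_feats, character_db, threshold=0.72):
    """character_db: {name: feature_vector}. Returns non-repeating matches."""
    matched = []
    for feat in face_feats:
        # Best database hit for this detected face.
        name, sim = max(((n, cosine(feat, v)) for n, v in character_db.items()),
                        key=lambda t: t[1])
        if sim >= threshold and name not in matched:
            matched.append(name)
    return matched
```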
[0090] Step 9: the description text obtained in step 5 is taken as the main feature, and the news scene classification labels obtained in step 2, the voice text obtained in step 6, the subtitle text obtained in step 7, and the character information text obtained in step 8 are taken as auxiliary features. Redundant information is eliminated through the generative model of multi-modal fusion for the news content shown in the figure, which uses the following formulas:
text embedding: V_x = x_1·v_1 + x_2·v_2 + … + x_n·v_n,
[0091] In the formula, x is the one-hot encoding of the embedded text based on an embedded dictionary, and n is the dimension of the embedded dictionary; if x_i is a non-zero bit of x, v_i is the dictionary row vector corresponding to that text; and V_x is the vector after the text is embedded.
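For one-hot x, the text-embedding formula reduces to selecting the dictionary row of the non-zero bit; summing over several non-zero bits covers multi-hot input. The dictionary values below are purely illustrative.

```python
def embed(x, V):
    """x: one-hot (or multi-hot) encoding over an n-entry embedded dictionary;
    V: list of n dictionary row vectors v_i. Returns V_x = sum_i x_i * v_i."""
    dim = len(V[0])
    out = [0.0] * dim
    for xi, vi in zip(x, V):
        if xi:  # zero bits contribute nothing to the sum
            out = [o + xi * c for o, c in zip(out, vi)]
    return out
```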
[0092] unified mapping: u = f(A·x + b), that is, u_i = f(a_{i,1}x_1 + a_{i,2}x_2 + … + a_{i,k}x_k + b_i), i = 1, …, m
[0093] In the formula, A, b, and f(−) represent the weight matrix, the offset vector, and the activation function of the mapping layer respectively; k is the dimension of the input vector x; m is the vector dimension of the unified domain after mapping; a_{i,j} is the weight coefficient in the i-th row and j-th column of the matrix A; and b_i is the i-th coefficient of the vector b.
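The unified-mapping layer described above is an affine map followed by an activation. The tanh activation and the sizes below are illustrative assumptions; A is an m × k matrix mapping a k-dimensional modality vector into the m-dimensional unified semantic space.

```python
import math

def unified_map(x, A, b, f=math.tanh):
    """u_i = f(sum_j a_ij * x_j + b_i) for each row of the m x k matrix A."""
    return [f(sum(aij * xj for aij, xj in zip(row, x)) + bi)
            for row, bi in zip(A, b)]
```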
[0094] semantic fusion: R = f(A·Σ_i(w_i·x_i) + b)
[0095] In the formula, x_i is the vector of mode i in the unified semantic space, and w_i is the news semantic weight coefficient corresponding to x_i; and A, b, and f(−) represent the weight matrix, the offset vector, and the activation function of the final layer of the fusion layer respectively.
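A sketch of the semantic-fusion layer: the modality vectors x_i in the unified space are combined with their news semantic weights w_i, and the weighted sum is passed through the final affine layer with activation. The weighted-sum combination rule is an assumption consistent with the symbol definitions above; weights and sizes are illustrative.

```python
import math

def fuse(modal_vecs, weights, A, b, f=math.tanh):
    """R = f(A * sum_i(w_i * x_i) + b) over the unified-space modality vectors."""
    dim = len(modal_vecs[0])
    # Weighted sum of the modality vectors in the unified semantic space.
    s = [sum(w * v[d] for w, v in zip(weights, modal_vecs)) for d in range(dim)]
    # Final fusion layer: affine map plus activation.
    return [f(sum(aij * sj for aij, sj in zip(row, s)) + bi)
            for row, bi in zip(A, b)]
```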
[0096] Text decoding: the process is implemented by stacking multiple LSTM networks:
[0097] L_1 = LSTM_1(R)
[0098] L_{i+1} = LSTM_{i+1}(L_i)
[0099] C(L_i) = f(L_i; W, b)
[0100] Output_text = [O_{L_1}, O_{L_2}, O_{L_3}, …]
[0101] Output_criticality = [C(L_1), C(L_2), C(L_3), …],
[0102] In the formula, R is the feature vector after fusion; LSTM_i(−) is the function representation of the i-th LSTM network, which has feature output L_i and text output O_{L_i}; C(−) maps each feature output L_i to a criticality through the function f with weight matrix W and offset vector b; Output_text is the generated text, and Output_criticality is the criticality of the corresponding news keywords.
[0103] Step 10: information related to cataloging knowledge in steps 1 to 9 is assembled, and output as data of a structure including {“original video id”, “fragment sequence id”, “start and end time of fragment”, “automatic description text”, “automatic character recognition”, “automatic scene classification”, “subtitle recognition text”, “voice recognition text”, and “automatic news keyword” }, and stored in a database. Steps 1 to 10 fully realize the automated process of intelligent cataloging for the news video.
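The Step 10 assembly can be sketched as below: the cataloging knowledge from steps 1 to 9 is collected into one structured record per slice fragment. The field names follow the structure listed in the text; the function name and the `results` layout are hypothetical, and the values are placeholders.

```python
def assemble_record(video_id, seq_id, start, end, results):
    """Assemble one structured cataloging record for a slice fragment."""
    return {
        "original video id": video_id,
        "fragment sequence id": seq_id,
        "start and end time of fragment": (start, end),
        "automatic description text": results.get("description", ""),
        "automatic character recognition": results.get("characters", []),
        "automatic scene classification": results.get("scene_labels", []),
        "subtitle recognition text": results.get("subtitles", []),
        "voice recognition text": results.get("voice", ""),
        "automatic news keyword": results.get("keywords", []),
    }
```

Each record would then be stored in the cataloging database.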
[0104] As shown in