SIGN LANGUAGE VIDEO SEGMENTATION METHOD BY GLOSS FOR SIGN LANGUAGE SENTENCE RECOGNITION, AND TRAINING METHOD THEREFOR
20220415009 · 2022-12-29
Assignee
Inventors
- Han Mu PARK (Seongnam-si, KR)
- Jin Yea JANG (Suwon-si, KR)
- Yoon Young JEONG (Seoul, KR)
- Sa Im SHIN (Seoul, KR)
CPC classification
- G06V10/84
International classification
- G06V10/42
- G06V10/774
- G06V10/84
Abstract
Provided are a method for segmenting a sign language video by gloss to recognize a sign language sentence, and a method for training therefor. According to an embodiment, a sign language video segmentation method receives an input of a sign language sentence video and segments the inputted sign language sentence video by gloss. Accordingly, a method is suggested that segments a sign language sentence video by gloss, analyzes various gloss sequences from a linguistic perspective, understands meanings robustly despite variations in sentences, and translates sign language into appropriate Korean sentences.
Claims
1. A sign language video segmentation method comprising the steps of: receiving an input of a sign language sentence video; and segmenting the inputted sign language sentence video by gloss.
2. The method of claim 1, wherein the step of segmenting comprises the steps of: segmenting the inputted sign language sentence video into a plurality of video segments; with respect to the segmented video segments, estimating segmentation probability distributions indicating whether the video segments should be segmented; and confirming whether the video segments are segmented and confirming a segmentation position, based on the estimated segmentation probability distributions of the video segments, and generating a video sequence which is segmented by gloss.
3. The method of claim 2, wherein the video segment overlaps other video segments in part.
4. The method of claim 2, wherein the step of estimating comprises estimating the segmentation probability distribution expressing whether each video segment should be segmented, by a probability distribution according to time.
5. The method of claim 4, wherein the step of estimating comprises estimating the segmentation probability distribution by using an AI model.
6. The method of claim 5, wherein the AI model is trained by using training video segments which are generated from a training sign language sentence video dataset in which a start point and an end point of a sign language sentence video and start points and end points of respective glosses are specified by labels.
7. The method of claim 5, wherein the AI model is trained by using training video segments which are generated from a virtual sign language sentence video, the virtual sign language sentence video being generated by connecting gloss videos constituting a training gloss video dataset established on a gloss basis.
8. The method of claim 2, wherein the step of generating comprises the steps of: detecting whether each video segment is segmented and a segmentation position, based on the estimated segmentation probability distribution; and generating a video sequence on a gloss basis with reference to the detected segmentation position.
9. The method of claim 8, wherein the step of detecting comprises the steps of: generating one probability distribution by collecting the estimated respective segmentation probability distributions of the video segments; and detecting whether the video segments are segmented and the segmentation position from the collected probability distribution.
10. A sign language video segmentation system comprising: an input unit configured to receive an input of a sign language sentence video; and a segmentation unit configured to segment the inputted sign language sentence video by gloss.
11. A sign language video segmentation method comprising the steps of: training an AI model which receives an input of a sign language sentence video and outputs a video sequence which is segmented by gloss; and segmenting an inputted sign language sentence video by gloss by using the trained AI model.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0022] For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts.
DETAILED DESCRIPTION
[0028] Hereinafter, the present disclosure will be described in more detail with reference to the accompanying drawings.
[0029] Since sign language is a separate language with a grammatical system different from that of the Korean language, there may be various combinations of gloss sequences corresponding to the meaning of the same Korean sentence. In order to translate gloss sequences of various combinations, a sign language sentence video is required to be analyzed on a gloss basis.
[0030] To achieve this, a technique for segmenting a sign language video by gloss is needed. Accordingly, embodiments of the present disclosure suggest a method for segmenting a sign language video by gloss, by automatically extracting gloss sections, which are semantic units in the video, in order to recognize a sign language sentence. This may be approached by detecting boundaries between glosses by utilizing an AI model.
[0031] In addition, embodiments of the present disclosure may provide self-generation of a plurality of training data from a real sign language sentence video dataset, and generation of virtual training data by virtually connecting gloss videos, as solutions for acquiring a large amount of training data for an AI model for segmenting a sign language sentence video.
[0033] The gloss-based sign language video segmentation unit 100 receives an input of a sign language sentence video, and generates a video sequence that is segmented by gloss, from the inputted sign language sentence video.
[0034] The gloss-based sign language video segmentation unit 100 which performs the above-described function may include a segment generation unit 110, a segmentation determination unit 120, and a segmentation position confirmation unit 130.
[0035] The segment generation unit 110 receives the input of the sign language sentence video, and generates a plurality of short video segments by dividing the sign language sentence video into video segments of equal length.
[0036] The length of the video segments generated by the segment generation unit 110 may be arbitrarily determined, and each of the video segments may partially overlap its neighboring video segments.
[0037] The range of partial overlap is not limited to directly adjacent video segments; that is, a video segment may also overlap video segments that are not directly adjacent to it, as in the sketch below.
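For illustration, a minimal Python sketch of this segment generation step is given below. It assumes the video arrives as a `(T, H, W, C)` frame array; the segment length of 16 frames and the stride of 4 are illustrative values, not taken from the disclosure. Because the stride is smaller than the segment length, each segment overlaps not only its direct neighbors but also segments farther away.

```python
import numpy as np

def make_segments(frames: np.ndarray, seg_len: int = 16, stride: int = 4):
    """Split a (T, H, W, C) frame array into fixed-length, partially
    overlapping video segments. With stride < seg_len, each segment
    overlaps its neighbors and, for small strides, segments farther away.
    seg_len and stride are illustrative, not values from the disclosure."""
    segments, starts = [], []
    for start in range(0, len(frames) - seg_len + 1, stride):
        segments.append(frames[start:start + seg_len])
        starts.append(start)
    return np.stack(segments), starts

# Example: a 100-frame video yields 22 segments starting at frames 0, 4, 8, ...
video = np.zeros((100, 224, 224, 3), dtype=np.uint8)
segments, starts = make_segments(video)
print(segments.shape, starts[:4])  # (22, 16, 224, 224, 3) [0, 4, 8, 12]
```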
[0038] The segmentation determination unit 120 is an AI model that receives the video segments generated at the segment generation unit 110, determines whether each of the video segments should be segmented, and outputs a segmentation probability distribution expressing the result of the determination as a probability distribution over time.
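The disclosure specifies only the input/output behavior of this model. The following PyTorch sketch is one plausible instantiation under that constraint: a small, assumed 3D-convolutional backbone that maps one video segment to a probability distribution over its frames, where the probability at each frame indicates a gloss boundary there. The architecture and layer sizes are assumptions, not the patented design.

```python
import torch
import torch.nn as nn

class SegmentationDeterminationModel(nn.Module):
    """Illustrative stand-in for the segmentation determination unit 120:
    maps a video segment to a per-frame gloss-boundary probability
    distribution. The 3D-conv backbone is an assumption; the disclosure
    fixes only the input (video segment) and output (distribution over time)."""
    def __init__(self, seg_len: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((seg_len, 1, 1)),  # pool space, keep time
        )
        self.head = nn.Conv1d(16, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, T, H, W) -> boundary distribution over time: (B, T)
        feats = self.backbone(x).squeeze(-1).squeeze(-1)  # (B, 16, T)
        return torch.softmax(self.head(feats).squeeze(1), dim=-1)

model = SegmentationDeterminationModel()
probs = model(torch.randn(2, 3, 16, 112, 112))
print(probs.shape, probs.sum(dim=-1))  # (2, 16); each row sums to 1
```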
[0040] The AI model of the segmentation determination unit 120 may be already trained by using training data which is provided through the segmentation determination model training unit 200, which will be described below.
[0041] The segmentation position confirmation unit 130 may confirm whether the video segments are segmented, based on the segmentation probability distributions regarding the video segments estimated by the segmentation determination unit 120, and may confirm a segmentation position regarding a video segment that is confirmed to be segmented. Accordingly, a video sequence which is segmented by gloss is generated at the segmentation position confirmation unit 130.
[0042] The segmentation position confirmation unit 130 performing the above-described function may include a probability distribution collection unit 131, a segmentation position detection unit 132, and a gloss video generation unit 133.
[0043] The probability distribution collection unit 131 generates one probability distribution by collecting all of the segmentation probability distributions regarding the respective video segments, which are estimated by the segmentation determination unit 120.
[0044] The segmentation position detection unit 132 may detect segmentation positions from the one probability distribution collected by the probability distribution collection unit 131. Depending on the inputted sign language video, there may be two or more segmentation positions; of course, there may also be only one segmentation position, or none at all.
[0045] The gloss video generation unit 133 may segment the sign language sentence video with reference to the segmentation positions detected by the segmentation position detection unit 132, and may generate/output the video sequence which is segmented by gloss.
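As a concrete illustration of units 131 to 133, the sketch below averages the overlapping per-segment distributions onto a single timeline, treats peaks above a threshold as gloss boundaries, and slices the sentence video at those boundaries. The averaging rule, the SciPy peak picker, and the `height` threshold are assumptions; the disclosure does not fix how the collected distribution is formed or how positions are detected from it.

```python
import numpy as np
from scipy.signal import find_peaks

def collect_distribution(seg_probs, starts, num_frames, seg_len=16):
    """Probability distribution collection unit 131 (sketch): overlay each
    segment's per-frame distribution onto one timeline, averaging overlaps."""
    timeline = np.zeros(num_frames)
    counts = np.zeros(num_frames)
    for probs, start in zip(seg_probs, starts):
        timeline[start:start + seg_len] += probs
        counts[start:start + seg_len] += 1
    return timeline / np.maximum(counts, 1)

def segment_by_gloss(frames, timeline, height=0.1):
    """Units 132 and 133 (sketch): peaks above `height` become boundaries,
    and the video is sliced there. Zero, one, or several peaks may be found."""
    boundaries, _ = find_peaks(timeline, height=height)
    cuts = [0, *boundaries.tolist(), len(frames)]
    return [frames[a:b] for a, b in zip(cuts, cuts[1:])]
```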
[0046] Referring back to FIG. 1, the present disclosure will be described further.
[0047] The segmentation determination model training unit 200 is a means for training the AI model which estimates the segmentation probability distribution at the segmentation determination unit 120 of the gloss-based sign language video segmentation unit 100, and may include a training data conversion unit 210 and a training data generation unit 220.
[0048] The training data conversion unit 210 generates training video segments of a defined length from a ‘training sign language sentence video dataset’ in which the start and end points of the sign language sentence video and of each gloss are specified by labels.
[0049] The training data conversion unit 210 generates the training video segments in such a manner that, when a gloss change point lies at the center of a video segment, that video segment is given the highest segmentation probability.
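One way to realize this labeling rule is sketched below, under the assumption (not stated in the disclosure) that the target distribution is Gaussian-shaped. A Gaussian bump is placed around each gloss change point that falls inside the window; because the bump is truncated at the window edges, its total mass, and hence the segment's overall segmentation probability, is highest when the change point lies at the window's center.

```python
import numpy as np

def make_training_pair(frames, start, seg_len, change_points, sigma=2.0):
    """Training data conversion unit 210 (sketch): cut one training segment
    from a labeled sentence video and build its target distribution. The
    Gaussian shape and sigma are assumptions; the disclosure only requires
    the highest probability when a gloss change point is centered."""
    segment = frames[start:start + seg_len]
    t = np.arange(seg_len)
    target = np.zeros(seg_len)
    for cp in change_points:
        if start <= cp < start + seg_len:
            target += np.exp(-0.5 * ((t - (cp - start)) / sigma) ** 2)
    return segment, target  # all-zero target: no boundary in this segment
```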
[0050] The training data generation unit 220 generates a virtual sign language sentence video by connecting gloss videos of a ‘training gloss video dataset’ which is established on the gloss basis. In this case, the gloss videos may be directly connected without modification, or may be connected naturally through modification such as motion blending.
[0051] Furthermore, additional virtual sign language sentence videos may be generated by randomly changing the order of the gloss videos constituting the generated virtual sign language sentence video, as sketched below.
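A Python sketch of this virtual data generation follows, assuming the per-gloss dataset is a mapping from gloss labels to frame arrays. Direct concatenation without motion blending is shown; `shuffle=True` realizes the random reordering described above, and the recorded change points serve as the labels needed by the training data conversion unit 210.

```python
import random
import numpy as np

def make_virtual_sentence(gloss_dataset, glosses, shuffle=False):
    """Training data generation unit 220 (sketch): concatenate per-gloss
    clips into a virtual sentence video, optionally in random order, and
    record the gloss change points (frame indices) as labels."""
    order = list(glosses)
    if shuffle:
        random.shuffle(order)
    clips = [gloss_dataset[g] for g in order]
    change_points, t = [], 0
    for clip in clips[:-1]:
        t += len(clip)
        change_points.append(t)  # frame where the next gloss begins
    return np.concatenate(clips, axis=0), change_points
```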
[0052] The virtual sign language sentence video generated at the training data generation unit 220 may be processed at the training data conversion unit 210 in the same way as a real sign language sentence video, and may be utilized for training.
[0053] Up to now, the gloss-based sign language video segmentation and training methods for recognition of the sign language sentence have been described in detail.
[0054] In the above-described embodiments, a method for segmenting a sign language sentence video by gloss by utilizing an AI model is suggested.
[0055] Accordingly, a method is suggested that segments a sign language sentence video by gloss, analyzes various gloss sequences from a linguistic perspective, understands meanings robustly despite variations in sentences, and translates sign language into appropriate Korean sentences.
[0056] In addition to a method for self-generating a plurality of training data for the AI model from a single sign language video, a method of augmenting the training data by virtually connecting gloss videos is suggested.
[0057] Accordingly, a large amount of training data for the AI model that segments the sign language sentence video by gloss may be acquired.
[0059] The sign language video segmentation system according to an embodiment may be implemented as a computing system including a communication unit 310, an output unit 320, a processor 330, an input unit 340, and a storage unit 350.
[0060] The communication unit 310 is a communication means for communicating with an external device and for accessing an external network. The output unit 320 is a display that displays results of execution by the processor 330, and the input unit 340 is a user input means for transmitting user commands to the processor 330.
[0061] The processor 330 is configured to perform the functions of the means constituting the sign language video segmentation system described above.
[0062] The storage unit 350 provides a storage space necessary for operations and functions of the processor 330.
[0063] The technical concept of the present disclosure may be applied to a computer-readable recording medium which records a computer program for performing the functions of the apparatus and the method according to the present embodiments. In addition, the technical idea according to various embodiments of the present disclosure may be implemented in the form of a computer readable code recorded on the computer-readable recording medium. The computer-readable recording medium may be any data storage device that can be read by a computer and can store data. For example, the computer-readable recording medium may be a read only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical disk, a hard disk drive, or the like. A computer readable code or program that is stored in the computer readable recording medium may be transmitted via a network connected between computers.
[0064] In addition, while preferred embodiments of the present disclosure have been illustrated and described, the present disclosure is not limited to the above-described specific embodiments. Various changes can be made by a person skilled in the art without departing from the scope of the present disclosure claimed in claims, and also, changed embodiments should not be understood as being separate from the technical idea or prospect of the present disclosure.