Method and apparatus for generating story from plurality of images by using deep learning network
11544531 · 2023-01-03
Assignee
Inventors
- Byoung-tak Zhang (Seoul, KR)
- Min-Oh Heo (Seoul, KR)
- Taehyeong Kim (Anyang-si, KR)
- Seon Il Son (Gimpo-si, KR)
- Kyung-Wha Park (Seoul, KR)
CPC classification
- G11B27/28 (PHYSICS)
- G06F18/214 (PHYSICS)
- G06V10/454 (PHYSICS)
- G06V30/194 (PHYSICS)
International classification
- G06V10/44 (PHYSICS)
Abstract
Disclosed herein are a visual story generation method and apparatus for generating a story from a plurality of images by using a deep learning network. The visual story generation method includes: extracting features from a plurality of respective images by using the first extraction unit of a deep learning network; generating the structure of a story based on the overall feature of the plurality of images by using the second extraction unit of the deep learning network; and generating the story by using outputs of the first and second extraction units.
Claims
1. A visual story generation method for generating a story from a plurality of images by using a deep learning network, the visual story generation method comprising: extracting features from a plurality of respective images by using a first extraction unit of a deep learning network, the first extraction unit being implemented by at least one processor; generating a structure of a story based on an overall feature of the plurality of images by using a second extraction unit of the deep learning network, the second extraction unit being implemented by the at least one processor; and generating the story by using outputs of the first and second extraction units, wherein generating the structure of the story comprises: inputting the extracted features of the plurality of respective images to the second extraction unit including bidirectional long short-term memory (LSTM) of two or more layers; extracting, by the second extraction unit, the overall feature of the plurality of images; understanding, by the second extraction unit, context based on the overall feature; and generating, by the second extraction unit, the structure of the story based on the understood context.
2. The visual story generation method of claim 1, wherein generating the story comprises generating the story based on the generated structure of the story and generating sentences by connecting pieces of information between sentences included in the story.
3. The visual story generation method of claim 2, wherein generating the story is performed by applying a cascading mechanism such that a hidden value output by each sentence generator included in a story generation module implemented by the at least one processor and configured to generate the sentences is input to a subsequent sentence generator.
4. The visual story generation method of claim 1, wherein extracting the features from the plurality of respective images comprises extracting features from the plurality of respective images by using a convolution neural network.
5. A non-transitory computer-readable storage medium having stored thereon a program that performs the method set forth in claim 1.
6. A visual story generation apparatus for generating a story from a plurality of images by using a deep learning network, the visual story generation apparatus comprising: an input/output unit configured to receive a plurality of images from the outside, and to output a story generated from the plurality of images; a storage unit configured to store a program for generating a story from a plurality of images; and at least one processor that implements a deep learning network by executing the program, wherein the at least one processor implements: a first extraction unit configured to extract features of the plurality of respective images; a second extraction unit configured to generate a structure of the story based on an overall feature of the plurality of images; and a story generation module configured to generate the story by using outputs of the first extraction unit and the second extraction unit, wherein the second extraction unit includes bidirectional long short-term memory (LSTM) of two or more layers, and wherein the second extraction unit receives the features of the plurality of respective images extracted by the first extraction unit, extracts the overall feature of the plurality of images, understands context based on the overall feature, and generates the structure of the story based on the understood context.
7. The visual story generation apparatus of claim 6, wherein the story generation module generates the story based on the generated structure of the story and generates sentences by connecting pieces of information between sentences included in the story.
8. The visual story generation apparatus of claim 7, wherein: the story generation module comprises a plurality of sentence generators; and a cascading mechanism is applied to the story generation module such that a hidden value output by each of the plurality of sentence generators is input to a subsequent sentence generator.
9. The visual story generation apparatus of claim 6, wherein the first extraction unit is implemented using a convolution neural network.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The above and other objects, features, and advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
DETAILED DESCRIPTION
(7) Various embodiments will be described in detail below with reference to the accompanying drawings. The following embodiments may be modified into various different forms and practiced. In order to more clearly illustrate the features of the embodiments, detailed descriptions of matters that are well known to those having ordinary skill in the art to which the following embodiments pertain will be omitted. In the drawings, portions unrelated to the following description are omitted. Throughout the specification, similar reference symbols are assigned to similar portions.
(8) Throughout the specification and the claims, when one component is described as being “connected” to another component, the one component may be “directly connected” to the other component or “electrically connected” to the other component through a third component. Furthermore, when any portion is described as including any component, this does not mean that the portion excludes other components; rather, the portion may further include other components, unless explicitly described to the contrary.
(9) The embodiments will be described in detail below with reference to the accompanying drawings.
(11) Referring to
(12) When visual cues 1 for a plurality of images are input to the deep learning network 100, the deep learning network 100 generates and outputs a corresponding story. The following description will be given on the assumption that when N sequential images V are input to the deep learning network 100, N sentences S corresponding thereto are generated.
(13) The first extraction unit 10 extracts the features of a plurality of respective images, and transmits the extracted N features X to the second extraction unit 20. According to an embodiment, the first extraction unit 10 may be implemented using a CNN suitable for the learning of 2D data.
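The per-image feature extraction performed by the first extraction unit can be sketched as a small CNN. This is a minimal sketch assuming PyTorch; the layer sizes, the feature dimension of 256, and the class name `FirstExtractionUnit` are illustrative assumptions, not details taken from this patent.

```python
import torch
import torch.nn as nn

class FirstExtractionUnit(nn.Module):
    """Hypothetical sketch of the first extraction unit: a small CNN that
    maps each input image to a fixed-size feature vector X_i."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # global average pooling
        )
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, images):                  # images: (N, 3, H, W)
        pooled = self.conv(images).flatten(1)   # (N, 64)
        return self.fc(pooled)                  # (N, feature_dim): features X

extractor = FirstExtractionUnit()
images = torch.randn(5, 3, 64, 64)             # N = 5 sequential images
X = extractor(images)
print(X.shape)                                 # torch.Size([5, 256])
```

In practice a pretrained CNN backbone would likely replace the toy convolution stack; only the interface (N images in, N feature vectors X out) matters here.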
(14) Meanwhile, although an illustration is made in
(15) The second extraction unit 20 is a component configured to generate the structure of a story based on the overall feature of the plurality of images. For this purpose, the second extraction unit 20 may include the bidirectional long short-term memory (LSTM) of two or more layers. Although an example in which the second extraction unit 20 includes the bidirectional LSTM of two layers is shown in
(16) The second extraction unit 20 receives the N features X from the first extraction unit 10, and outputs information C related to the overall feature of the plurality of images. In
(17) The respective pieces of information that are output by the first layer 21 and the second layer 22 will be described in detail below.
(18) The first layer 21 extracts the overall feature of the plurality of images, and transmits the extracted overall feature to the second layer 22 and the aggregation unit 30.
(19) The second layer 22 receives the overall feature of the plurality of images, understands context indicated by the plurality of images, generates the structure of a story based on the understood context, and transmits the structure of the story to the aggregation unit 30.
(20) As described above, the outputs of the first layer 21 and the second layer 22 are combined into the information C and then input to the aggregation unit 30. Furthermore, the features X output by the first extraction unit 10 are also input to the aggregation unit 30.
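The two-layer bidirectional LSTM of the second extraction unit, with its two output streams combined into the information C, can be sketched as follows, again assuming PyTorch. Treating the first layer's output as the overall feature and the second layer's output as the story structure, and concatenating them into C, are illustrative modeling choices; all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SecondExtractionUnit(nn.Module):
    """Hypothetical sketch of the second extraction unit: two stacked
    bidirectional LSTM layers. Layer 1 stands in for the overall-feature
    extraction; layer 2 stands in for story-structure generation."""
    def __init__(self, feature_dim=256, hidden_dim=128):
        super().__init__()
        self.layer1 = nn.LSTM(feature_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.layer2 = nn.LSTM(2 * hidden_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, X):                        # X: (1, N, feature_dim)
        overall, _ = self.layer1(X)              # (1, N, 2*hidden): overall feature
        structure, _ = self.layer2(overall)      # (1, N, 2*hidden): story structure
        C = torch.cat([overall, structure], dim=-1)  # combined information C
        return C

unit = SecondExtractionUnit()
X = torch.randn(1, 5, 256)                       # N = 5 image features
C = unit(X)
print(C.shape)                                   # torch.Size([1, 5, 512])
```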
(21) The aggregation unit 30 aggregates the N features X and the N pieces of information C, and outputs N vectors H to the story generation module 40.
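The aggregation step can then be sketched very simply. The patent does not specify the aggregation operation, so per-image concatenation of X and C below is a hypothetical choice; the dimensions match the earlier sketches.

```python
import torch

# Hypothetical aggregation: concatenate each image feature X_i with its
# corresponding information C_i to obtain the vector H_i.
N, feat_dim, c_dim = 5, 256, 512
X = torch.randn(N, feat_dim)   # N features from the first extraction unit
C = torch.randn(N, c_dim)      # N pieces of information from the second
H = torch.cat([X, C], dim=-1)  # N vectors H for the story generation module
print(H.shape)                 # torch.Size([5, 768])
```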
(22) The story generation module 40 generates a plurality of sentences based on the received N vectors H. The plurality of sentences generated by the story generation module 40 as described above is not only based on the structure of the story output by the second layer 22 of the second extraction unit 20, but also reflects the features of the plurality of respective images output by the first extraction unit 10 and the overall feature of the plurality of images output by the first layer 21 of the second extraction unit 20.
(23) As described above, according to an embodiment, a story that is highly related to the images and is naturally developed overall may be generated by taking into consideration both the overall feature of the plurality of images and features unique to the respective images via two information channels.
(24) Meanwhile, according to an embodiment, in order to increase the coherence of the plurality of sentences generated by the story generation module 40, a cascading mechanism may be applied to the story generation module 40.
(25) Applying the cascading mechanism refers to generating sentences by connecting pieces of information between the sentences generated by the story generation module 40. For this purpose, the hidden values of a plurality of sentence generators included in the story generation module 40 may be sequentially connected.
(26) For example, the hidden value of the first sentence generator included in the story generation module 40 is initialized to 0, and a hidden value output from each of the sentence generators is input to a subsequent sentence generator.
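The cascading mechanism of paragraphs (24)-(26) can be sketched as follows, assuming PyTorch. For brevity a single toy GRU-based generator is reused for all N sentences (the patent describes a plurality of sentence generators), and the vocabulary size, dimensions, greedy decoding, and the additive way of injecting H_i are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentenceGenerator(nn.Module):
    """Toy sentence generator: a GRU decoder that emits a fixed number of
    greedily chosen token ids. All sizes are illustrative assumptions."""
    def __init__(self, input_dim=768, hidden_dim=128, vocab=1000, max_len=10):
        super().__init__()
        self.init_fc = nn.Linear(input_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab)
        self.max_len = max_len
        self.hidden_dim = hidden_dim

    def forward(self, h_vec, prev_hidden):
        # Cascading mechanism: the previous generator's final hidden value
        # seeds this generator, together with its own input vector H_i
        # (additive combination is one hypothetical choice).
        hidden = prev_hidden + self.init_fc(h_vec)
        tokens, inp = [], torch.zeros(1, self.hidden_dim)
        for _ in range(self.max_len):
            hidden = self.gru(inp, hidden)
            tokens.append(self.out(hidden).argmax(dim=-1))
            inp = hidden
        return torch.stack(tokens, dim=1), hidden

gen = SentenceGenerator()
H = torch.randn(5, 768)                    # one vector H_i per sentence
hidden = torch.zeros(1, 128)               # first hidden value initialized to 0
sentences = []
for i in range(5):
    tokens, hidden = gen(H[i:i+1], hidden) # hidden cascades to the next generator
    sentences.append(tokens)
print(len(sentences), sentences[0].shape)  # 5 torch.Size([1, 10])
```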
(27) As described above, the effect of increasing the coherence of the overall story may be expected by connecting the pieces of information between the plurality of sentences.
(28) The above-described deep learning network 100 shown in
(29) Referring to
(30) The input/output unit 210 is a component configured to receive a plurality of images from the outside and to output a story generated from the plurality of images. For example, the input/output unit 210 may include wired/wireless communication ports such as a USB port and a Wi-Fi module, input devices such as a keyboard and a mouse, and output devices such as a monitor.
(31) The control unit 220 is a component including at least one processor, such as a central processing unit (CPU), and is configured to implement a deep learning network and to perform the operations required to generate a story from a plurality of images by driving the deep learning network. The control unit 220 may perform these operations by executing a program stored in the storage unit 230.
(32) The storage unit 230 is a component configured to store a file, a program, etc. and may be constructed via various types of memory. In particular, the storage unit 230 may store a program configured to generate a story from a plurality of images, and the control unit 220 may implement a deep learning network by executing the program. Furthermore, the storage unit 230 may store a plurality of sequential images that is used as the input of the deep learning network.
(33) A visual story generation method for generating a story from a plurality of images by using the above-described deep learning network 100 and the above-described visual story generation apparatus 200 will be described below.
(34) Referring to
(35) At step 302, the structure of a story based on the overall feature of the plurality of images is generated using the second extraction unit 20 of the deep learning network 100.
(36) Referring to
(37) At step 402, the first layer 21 of the second extraction unit 20 extracts the overall feature of the plurality of images, and transmits the extracted overall feature to the second layer 22.
(38) At step 403, the second layer 22 understands the context indicated by the plurality of images based on the extracted overall feature, and, at step 404, generates the structure of a story based on the understood context.
(39) The structure of the story output at step 404 is aggregated with the overall feature of the plurality of images extracted at step 402 and the features of the plurality of respective images extracted at step 301 by the aggregation unit 30, and an aggregation result is transmitted to the story generation module 40.
(40) At step 303, the story generation module 40 generates a story by using the output of the first extraction unit 10 and the second extraction unit 20.
(41) The story generation module 40 may apply a cascading mechanism in order to maintain the coherence of sentences that are generated.
(42) In greater detail, pieces of information between the sentences generated by the story generation module 40 may be connected by sequentially connecting the hidden values of a plurality of sentence generators included in the story generation module 40.
(43) For example, the hidden value of the first sentence generator included in the story generation module 40 is initialized to 0, and a hidden value output from each of the sentence generators is input to the next sentence generator.
(44) As described above, a story that is highly related to the images and has natural overall development may be generated by taking into consideration both the overall feature of the plurality of images and features unique to the respective images via two information channels.
(45) Furthermore, the effect of increasing the coherence of the overall story may be expected by connecting the pieces of information between the plurality of sentences.
(47) In area 510 of
(48) When the sentences shown in area 510 and the sentences shown in area 520 are compared with each other, the sentences shown in area 520 reflect the context indicated by the plurality of images overall, and the flow of a story develops accordingly, whereas the sentences shown in area 510 read as though individually written, without continuity therebetween.
(50) Referring to the pluralities of images and sentences shown in
(51) The term “unit” used herein means software or a hardware component such as a field-programmable gate array (FPGA) or application-specific integrated circuit (ASIC), and a “unit” performs a specific role. However, a “unit” is not limited to software or hardware. A “unit” may be configured to be present in an addressable storage medium, and also may be configured to run one or more processors. Accordingly, as an example, a “unit” includes components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments in program code, drivers, firmware, microcode, circuits, data, a database, data structures, tables, arrays, and variables.
(52) Components and functions provided in “unit(s)” may be coupled into a smaller number of components and “unit(s)” or divided into a larger number of components and “unit(s).”
(53) In addition, components and “unit(s)” may be implemented to run one or more CPUs in a device or secure multimedia card.
(54) The visual story generation method according to the embodiment described in conjunction with
(55) Furthermore, the visual story generation method according to the embodiment described in conjunction with
(56) Accordingly, the visual story generation method according to the embodiment described in conjunction with
(57) In this case, the processor may process instructions within a computing apparatus. Examples of the instructions include instructions stored in memory or a storage device in order to display graphic information for providing a graphical user interface (GUI) on an external input/output device, such as a display connected to a high-speed interface. As another embodiment, a plurality of processors and/or a plurality of buses may be appropriately used along with a plurality of pieces of memory. Furthermore, the processor may be implemented as a chipset composed of chips including a plurality of independent analog and/or digital processors.
(58) Furthermore, the memory stores information within the computing device. As an example, the memory may include a volatile memory unit or a set of the volatile memory units. As another example, the memory may include a non-volatile memory unit or a set of the non-volatile memory units. Furthermore, the memory may be another type of computer-readable medium, such as a magnetic or optical disk.
(59) In addition, the storage device may provide a large storage space to the computing device. The storage device may be a computer-readable medium, or may be a configuration including such a computer-readable medium. For example, the storage device may also include devices within a storage area network (SAN) or other elements, and may be a floppy disk device, a hard disk device, an optical disk device, a tape device, flash memory, or a similar semiconductor memory device or array.
(60) The above-described embodiments are intended merely for illustrative purposes. It will be understood that those having ordinary knowledge in the art to which the present invention pertains can easily make modifications and variations without changing the technical spirit and essential features of the present invention. Therefore, the above-described embodiments are illustrative and are not limitative in all aspects. For example, each component described as being in a single form may be practiced in a distributed form. In the same manner, components described as being in a distributed form may be practiced in an integrated form.
(61) According to at least any one of the above-described embodiments, a story that is highly related to the images and is naturally developed overall may be generated by taking into consideration both the overall feature of a plurality of images and features unique to the respective images via two information channels.
(62) Furthermore, the coherence of a story may be maintained by connecting pieces of information between generated sentences by applying a cascading mechanism to the story generation module configured to generate sentences.
(63) The effects which may be acquired by the disclosed embodiments are not limited to the above-described effects, and other effects that have not been described above will be clearly understood by those having ordinary knowledge in the art, to which the disclosed embodiments pertain, from the foregoing description.
(64) The scope of the present invention should be defined by the attached claims, rather than the detailed description. Furthermore, all modifications and variations which can be derived from the meanings, scope and equivalents of the claims should be construed as falling within the scope of the present invention.