SYSTEM AND METHOD FOR PROVIDING INTERACTIVE STORYTELLING

20220103874 · 2022-03-31

    Abstract

    A system for providing interactive storytelling includes an output device configured to output storytelling content to a user, wherein the storytelling content includes one or more of audio data or visual data, a playback controller configured to provide storytelling content to the output device, one or more sensors configured to generate measurement data by capturing an action of the user, an abstraction device configured to generate extracted characteristics by analyzing the measurement data, and an action recognition device configured to determine a recognized action by analyzing a time behavior of the measurement data and/or the extracted characteristics. The playback controller is additionally configured to interrupt provision of storytelling content, to trigger the abstraction device and/or the action recognition device to determine a recognized action, and to continue provision of storytelling content based on the recognized action. A corresponding method, a computer program product, and a computer-readable storage medium are also disclosed.

    Claims

    1. A system for providing interactive storytelling, comprising: an output device configured to output storytelling content to a user, wherein the storytelling content includes one or more of audio data or visual data, a playback controller configured to provide the storytelling content to the output device, one or more sensors configured to generate measurement data by capturing an action of the user, an abstraction device configured to generate extracted characteristics by analyzing the measurement data, and an action recognition device configured to determine a recognized action by analyzing a time behavior of the measurement data and/or the extracted characteristics, wherein the playback controller is additionally configured to interrupt provision of the storytelling content, to trigger the abstraction device and/or the action recognition device to determine a recognized action, and to continue provision of the storytelling content based on the recognized action.

    2. The system according to claim 1, additionally comprising a comparator configured to determine a comparison result by comparing the recognized action with a predetermined action, wherein the comparison result is input to the playback controller.

    3. The system according to claim 1, additionally comprising a cache memory configured to store the measurement data and/or the extracted characteristics, wherein the action recognition device uses the measurement data and/or extracted characteristics stored in the cache memory when analyzing the respective time behavior.

    4. The system according to claim 1, wherein the one or more sensors comprise one or more of a camera, a microphone, a gravity sensor, an acceleration sensor, a pressure sensor, a light intensity sensor, or a magnetic field sensor.

    5. The system according to claim 1, wherein the one or more sensors comprise a microphone, the measurement data comprise audio recordings, and the extracted characteristics comprise one or more of a melody, a noise, a sound, or a tone.

    6. The system according to claim 1, wherein the one or more sensors comprise a camera, the measurement data comprise pictures, and the extracted characteristics comprise a model of the user or a model of a part of the user.

    7. The system according to claim 1, wherein the abstraction device and/or the action recognition device comprise a Neural Network.

    8. The system according to claim 7, wherein the Neural Network is trained using a training optimizer, wherein the training optimizer is based on a fitness criterion optimized by gradient descent on an objective function.

    9. The system according to claim 1, wherein a data optimizer is connected between the abstraction device and the action recognition device, wherein the data optimizer is based on energy minimization using a Gauss-Newton algorithm, and wherein the data optimizer improves data output by the abstraction device.

    10. The system according to claim 1, additionally comprising a memory storing data supporting the playback controller at providing the storytelling content, wherein the playback controller is configured to load data stored in the memory, and wherein the playback controller is additionally configured to output loaded data to the output device as the storytelling content or to adapt loaded data to the recognized action.

    11. The system according to claim 1, wherein the output device comprises one or more of a display, a sound generator, a vibration generator, or an optical indicator.

    12. The system according to claim 1, wherein the system is optimized for being executed on a mobile device.

    13. A method for providing interactive storytelling, comprising: providing, by a playback controller, storytelling content to an output device, wherein the storytelling content includes one or more of audio data or visual data, outputting, by the output device, the storytelling content to a user, interrupting provision of the storytelling content, capturing, by one or more sensors, an action of the user, thereby generating measurement data, analyzing the measurement data by an abstraction device, thereby generating extracted characteristics, analyzing, by an action recognition device, a time behavior of the measurement data and/or the extracted characteristics, thereby determining a recognized action, and continuing provision of the storytelling content based on the recognized action.

    14. A computer program product comprising executable instructions which, when executed by a hardware processor, cause the hardware processor to execute the method according to claim 13.

    15. A non-transitory computer-readable storage medium comprising executable instructions which, when executed by a hardware processor, cause the hardware processor to execute the method according to claim 13, wherein the executable instructions are optimized for being executed on a mobile device.

    16. The system according to claim 3, wherein the cache memory is configured to store the measurement data and/or the extracted characteristics for a predetermined time.

    17. The system according to claim 7, wherein the Neural Network is a Convolutional Neural Network (CNN), a Long Short Term Memory (LSTM), and/or a Transformer Network.

    18. The system according to claim 8, wherein the training optimizer is based on an Adam optimizer.

    19. The system according to claim 12, wherein the system is optimized for being executed on a smartphone or a tablet.

    20. The non-transitory computer-readable storage medium according to claim 15, wherein the executable instructions are optimized for being executed on a smartphone or a tablet.

    Description

    BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

    [0048] In connection with the explanation of the preferred embodiments of the disclosure with the aid of the drawings, generally preferred embodiments and further developments of the teaching will be explained. In the drawings:

    [0049] FIG. 1 shows a block diagram of an embodiment of a system according to the present disclosure,

    [0050] FIG. 2 shows a flow diagram of an embodiment of a method according to the present disclosure, and

    [0051] FIG. 3 shows a picture of a user of the system with an overlaid model of the user.

    DETAILED DESCRIPTION

    [0052] FIG. 1 shows a block diagram of an embodiment of a system 1 according to the present disclosure. The system 1 is implemented on a smartphone and comprises an output device 2, a playback controller 3, two sensors 4, 5, an abstraction device 6, and an action recognition device 7. The playback controller 3 is connected to a memory 8, which stores data used for providing storytelling content. In this example, memory 8 stores storytelling phrases, i.e., bits of storytelling content, each of which is followed by an anticipated action. The storytelling phrases may be a few tens of seconds long, e.g., 20 to 90 seconds. The playback controller 3 loads data from memory 8 and uses the loaded data for providing storytelling content to the output device 2. The storytelling content comprises audio and visual data, in this case a recording of a narrator reading a text, sounds, music, and pictures (or videos) illustrating the read text. To this end, the output device 2 comprises a loudspeaker and a video display. The output device 2 outputs the storytelling content to a user 9.

    [0053] At the end of a storytelling phrase, the playback controller triggers the abstraction device 6 and the action recognition device 7 (indicated with two arrows) and the user 9 is asked to perform a particular action, e.g., stretching high to reach a kitten in a tree, climbing up a ladder, making a meow sound, singing a calming song for the kitten, etc. It is also possible that the playback controller triggers the abstraction device 6 and the action recognition device 7 while or before outputting a storytelling phrase to the output device 2. By continuously monitoring the user 9, the system can react more directly to an action performed by the user. The system can even react to an unexpected action, e.g., by outputting “Why are you waving at me all the time?”

    [0054] The sensors 4, 5 are configured to capture the action performed by the user. Sensor 4 is a camera of the smartphone and sensor 5 is a microphone of the smartphone. Measurement data generated by the sensors 4, 5 while capturing the action of the user are input to a cache memory 10 and to the abstraction device 6. The abstraction device 6 analyzes received measurement data and extracts characteristics of the measurement data. The extracted characteristics are input to the cache memory 10 and to the action recognition device 7. The cache memory 10 stores received measurement data and received extracted characteristics. In order to support analysis of the time behavior, the cache memory 10 may store the received data for predetermined periods or together with a time stamp.
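
    The time-stamped storage described for the cache memory 10 can be sketched as follows. This is only a minimal illustration, not the implementation of the disclosure: the class name `TimedCache`, its methods, and the retention value are hypothetical, and entries are assumed to arrive in chronological order.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class TimedCache:
    """Keeps (timestamp, item) pairs and drops entries older than `retention_s`."""

    def __init__(self, retention_s: float) -> None:
        self.retention_s = retention_s
        self._entries: Deque[Tuple[float, Any]] = deque()

    def put(self, timestamp: float, item: Any) -> None:
        # Assumption: timestamps arrive in non-decreasing order.
        self._entries.append((timestamp, item))
        # Evict everything that fell out of the retention period.
        while self._entries and timestamp - self._entries[0][0] > self.retention_s:
            self._entries.popleft()

    def window(self, now: float, length_s: float) -> List[Any]:
        # Items whose timestamp lies within the most recent `length_s` seconds.
        return [item for t, item in self._entries if now - length_s <= t <= now]

    def __len__(self) -> int:
        return len(self._entries)
```

    An action recognition stage could then call `window(now, 2.0)` to obtain the entries of the most recent two seconds for time-behavior analysis.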

    [0055] A data optimizer 11 is connected between the abstraction device 6 and the action recognition device 7. The data optimizer 11 is based on a Gauss-Newton algorithm. Depending on the anticipated action captured by the sensors 4, 5, the action recognition device 7 can access the data stored in the cache memory 10 and/or data optimized by data optimizer 11. This optimized data might be provided via the cache memory 10 or via the abstraction device 6. The action recognition device 7 analyzes the time behavior of the extracted characteristics and/or the time behavior of the measurement data in order to determine a recognized action. The recognized action is input to a comparator 12, which classifies the recognized action based on an anticipated action stored in an action memory 13. If the recognized action is similar to the anticipated action, the comparison result is input to the playback controller 3. The playback controller will provide storytelling content considering the comparison result.

    [0056] The abstraction device 6 and the action recognition device 7 can be implemented using a Neural Network. An implementation of the system using a CNN (Convolutional Neural Network) or an LSTM (Long Short Term Memory) produced good results. It should be noted that the following examples just show Neural Networks that have proven to provide good results. However, it should be understood that the present disclosure is not limited to these specific Neural Networks.

    [0057] Regarding the abstraction device 6 and with reference to analyzing measurement data of a camera, i.e., pictures, the Neural Network is trained to mark a skeleton of a person in a picture. This skeleton forms characteristics according to the present disclosure and a model of the user. The Neural Network learns to associate an input picture with multiple output feature maps or pictures. Each keypoint is associated with a picture with values in the range [0 . . . 1] at the position of the keypoint (for example eyes, nose, shoulders, etc.) and 0 everywhere else. Each body part (e.g., upper arm, lower arm) is associated with a colored picture encoding its location (brightness) and its direction (colors) in a so-called PAF (Part Affinity Field). These output feature maps are used to detect and localize a person and determine its skeleton pose. The basic concept of such a skeleton extraction is disclosed in Z. Cao et al.: “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields,” CVPR, Apr. 14, 2017, https://arxiv.org/pdf/1611.08050.pdf and Z. Cao et al.: “OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, May 30, 2019, https://arxiv.org/pdf/1812.08008.pdf.
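
    The keypoint feature maps described above can be illustrated with a small sketch that builds one such map and reads the keypoint position back from it. The Gaussian shape of the peak, the threshold, and the function names `make_heatmap` and `locate_keypoint` are assumptions made for illustration; the disclosure only states that values lie in the range 0 to 1 at the keypoint and are 0 elsewhere. Part Affinity Fields are not covered here.

```python
import math
from typing import List, Optional, Tuple

Heatmap = List[List[float]]


def make_heatmap(width: int, height: int, keypoint: Tuple[int, int],
                 sigma: float = 1.5) -> Heatmap:
    """Training target: 1.0 at the keypoint, decaying towards 0 elsewhere."""
    kx, ky = keypoint
    return [
        [math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def locate_keypoint(heatmap: Heatmap, threshold: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return (x, y) of the strongest response, or None if nothing exceeds the threshold."""
    best_value, best_xy = threshold, None
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy
```

    One such map per keypoint (eyes, nose, shoulders, etc.) would yield the set of positions from which a skeleton can be assembled.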

    [0058] As operation of the Neural Networks might result in the need of high computing power, the initial topology can be selected to suit a smartphone. This may be done by using the so-called “MobileNet” architecture, which is based on “Separable Convolutions.” This architecture is described in A. Howard et al.: “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” Apr. 17, 2017, https://arxiv.org/pdf/1704.04861.pdf; M. Sandler et al.: “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” Mar. 21, 2019, https://arxiv.org/pdf/1801.04381.pdf; A. Howard et al.: “Searching for MobileNetV3,” Nov. 20, 2019, https://arxiv.org/pdf/1905.02244.pdf.
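
    Why separable convolutions suit a smartphone can be seen from a simple weight count. The sketch below compares a standard convolution with a depthwise separable one (a depthwise convolution followed by a 1x1 pointwise convolution); the layer size used in the example (3x3 kernel, 128 input and 128 output channels) is an arbitrary assumption, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard convolution: every output channel sees every input channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depthwise separable convolution."""
    depthwise = kernel * kernel * c_in   # one spatial filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution mixing the channels
    return depthwise + pointwise


if __name__ == "__main__":
    std = standard_conv_params(3, 128, 128)
    sep = separable_conv_params(3, 128, 128)
    print(std, sep, round(std / sep, 2))
```

    For this layer the standard convolution needs 147,456 weights and the separable variant 17,536, i.e., roughly 8.4 times fewer.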

    [0059] When training the Neural Network, an Adam optimizer with a batch size between 24 and 90 might be used. The Adam optimizer is described in D. Kingma, J. Ba: “ADAM: A Method for Stochastic Optimization,” conference paper at ICLR 2015, https://arxiv.org/pdf/1412.6980.pdf. For providing data augmentation, mirroring, rotations of ±xx degrees (e.g., ±40°), and/or scaling might be used.
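
    For orientation, the Adam update rule from the cited paper can be written in a few lines. The sketch below applies it to a toy quadratic objective rather than to a Neural Network; the objective, the learning rate of 0.05, and the function names are assumptions chosen only to keep the example small.

```python
import math
from typing import List, Tuple


def adam_step(params: List[float], grads: List[float], m: List[float], v: List[float],
              t: int, lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[List[float], List[float], List[float]]:
    """One Adam update; returns new (params, m, v). `t` is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minimize_quadratic(steps: int = 1000) -> List[float]:
    """Toy objective f(w) = (w0 - 3)^2 + (w1 + 1)^2, minimized with Adam."""
    w, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        grads = [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]
        w, m, v = adam_step(w, grads, m, v, t, lr=0.05)
    return w
```

    In an actual training run, the gradients would come from backpropagation over a batch of pictures instead of the closed-form expression used here.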

    [0060] During inference, a data optimizer based on the Gauss-Newton algorithm can be used. This data optimizer allows extrapolation and smoothing of the results of the abstraction device.
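
    As a rough illustration of energy minimization with a Gauss-Newton step, the sketch below post-processes one keypoint coordinate over time. The energy (a data term pulling each estimate towards its measurement plus a term penalizing frame-to-frame changes), the weight `lam`, and the treatment of missing detections as `None` are assumptions and are not taken from the disclosure. Because both residuals are linear in the unknowns, a single Gauss-Newton step already reaches the minimum; nonlinear terms would require several iterations.

```python
from typing import List, Optional, Sequence


def solve_tridiagonal(lower: Sequence[float], diag: Sequence[float],
                      upper: Sequence[float], rhs: Sequence[float]) -> List[float]:
    """Thomas algorithm; lower[0] and upper[-1] do not influence the result."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def gauss_newton_smooth(observed: Sequence[Optional[float]], lam: float = 4.0) -> List[float]:
    """Minimize E(x) = sum_t w_t (x_t - z_t)^2 + lam * sum_t (x_{t+1} - x_t)^2.

    w_t is 0 where the measurement z_t is missing (None) and 1 otherwise.
    At least one measurement must be present.
    """
    n = len(observed)
    w = [0.0 if z is None else 1.0 for z in observed]
    z = [0.0 if value is None else value for value in observed]
    x = [0.0] * n  # starting estimate
    # J^T r: gradient of E/2 at the current estimate.
    grad = [w[i] * (x[i] - z[i]) for i in range(n)]
    for i in range(n - 1):
        diff = x[i + 1] - x[i]
        grad[i] -= lam * diff
        grad[i + 1] += lam * diff
    # J^T J: tridiagonal Gauss-Newton matrix.
    diag = [w[i] + lam * ((1 if i > 0 else 0) + (1 if i < n - 1 else 0)) for i in range(n)]
    off = [-lam] * n
    delta = solve_tridiagonal(off, diag, off, [-g for g in grad])
    return [x[i] + delta[i] for i in range(n)]
```

    For the input `[1.0, None, 3.0]` with `lam=4.0`, the missing middle sample is filled with 2.0 and the outer samples are pulled to 1.8 and 2.2.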

    [0061] The extracted characteristics (namely the skeletons) or the results output by the data optimizer can be input to the action recognition device for estimating the performed action. Actions are calculated based on snippets of time, e.g., 40 extracted characteristics generated in the most recent two seconds. The snippets can be cached in cache memory 10 and input to the action recognition device for time series analysis. A Neural Network suitable for such an analysis is described in S. Bai et al.: “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling,” Apr. 19, 2018, https://arxiv.org/pdf/1803.01271.pdf.
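
    The core operation of the convolutional sequence models discussed in the cited paper is a causal, dilated convolution over time. The sketch below shows that operation on a one-dimensional series and estimates how many past samples a stack of such layers can see. The kernel size, the dilation schedule, and the simplification to one convolution per level are assumptions; the cited architecture additionally uses residual blocks.

```python
from typing import List, Sequence


def causal_dilated_conv(series: Sequence[float], kernel: Sequence[float],
                        dilation: int) -> List[float]:
    """y[t] = sum_k kernel[k] * series[t - k * dilation], with zeros before the start."""
    out = []
    for t in range(len(series)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # never look into the future, pad the past with zeros
                acc += weight * series[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input samples that influence one output of the stacked layers."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

    With a kernel size of 3 and dilations 1, 2, 4, 8, and 16, the receptive field is 63 samples, which would cover a snippet of 40 extracted characteristics.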

    [0062] FIG. 2 shows a flow diagram of an embodiment of a method according to the present disclosure. In stage 14, storytelling content is provided to an output device 2 by the playback controller 3, wherein the storytelling content includes one or more of audio data or visual data. In stage 15, the output device 2 outputs the storytelling content to the user 9. In stage 16, provision of storytelling content is interrupted. In stage 17, an action of the user 9 is captured by one or more sensors 4, 5, thereby generating measurement data. The measurement data are analyzed in stage 18 by an abstraction device 6, thereby generating extracted characteristics. In stage 19, the action recognition device 7 analyzes the time behavior of the measurement data and/or the extracted characteristics, thereby determining a recognized action. In stage 20, provision of storytelling content is continued based on the recognized action.
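
    The sequence of stages 14 to 20 can be sketched as a simple control loop. This is a hedged illustration only: the `Phrase` structure, the `run_story` function, the string labels for actions, and the exact-match comparison are hypothetical, and the sensing, abstraction, and recognition stages are replaced by a caller-supplied `recognize` function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Phrase:
    text: str                # storytelling content output in stages 14 and 15
    anticipated_action: str  # action the user is asked to perform
    on_match: str            # key of the next phrase if the action matches
    on_mismatch: str         # key of the next phrase otherwise


def run_story(phrases: Dict[str, Phrase], start: str,
              recognize: Callable[[], str], output: Callable[[str], None],
              max_steps: int = 20) -> List[str]:
    """Play phrases until a key without a phrase is reached; returns the visited keys."""
    visited: List[str] = []
    key = start
    for _ in range(max_steps):
        if key not in phrases:
            break
        phrase = phrases[key]
        visited.append(key)
        output(phrase.text)        # stages 14-15: provide and output content
        recognized = recognize()   # stages 16-19: playback paused, action determined
        matched = recognized == phrase.anticipated_action  # simplified comparison
        key = phrase.on_match if matched else phrase.on_mismatch  # stage 20
    return visited
```

    A story in which the user is first asked to sing and then to stretch could be modeled with two phrases whose `on_mismatch` key points back to the same phrase, so that the request is repeated until the anticipated action is recognized.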

    [0063] FIG. 3 shows a picture captured by a camera of an embodiment of the system according to the present disclosure. The picture shows a user 9 who stands in front of a background 21 and performs an action. A skeleton 22, which forms the extracted characteristics or a model of the user 9, is overlaid on the picture.

    [0064] Referring now to all figures, the system 1 can be used in different scenarios. One scenario is an audiobook with picture and video elements designed for children and supporting their need for movement. The storytelling content might refer to a well-known hero of the children. When using such a system, the playback controller 3 might provide, for instance, a first storytelling phrase telling that a kitten climbed up a tree, is not able to come down again, and is very afraid of this situation. The child is asked to sing a calming song for the kitten. After telling this, the playback controller might interrupt provision of storytelling content and trigger the abstraction device and the action recognition device to determine a recognized action. Sensor 5 (a microphone) generates measurement data reflecting the utterance of the child. The abstraction device 6 analyzes the measurement data and the action recognition device 7 determines what action is performed by the captured utterance. The recognized action is compared with an anticipated action. If the action is a song and might be calming for the kitten, the next storytelling phrase might tell that the kitten starts to relax and that the child should continue a little more.

    [0065] The next storytelling phrase might ask the child to stretch high to help the kitten down. Sensor 4 (a camera) captures the child and provides the measurement data to the abstraction device 6 and the action recognition device 7. If the recognized action is not an anticipated action, the next storytelling phrase provided by the playback controller might ask the child to try again. If the recognized action is “stretching high,” for example, the next storytelling phrase might ask the child to try a little higher. If the child also performs this anticipated action, the next storytelling phrase might tell that the kitten is saved. The different steps might be illustrated by suitable animations. This short story shows how the system according to the present disclosure might operate.

    [0066] Many modifications and other embodiments of the disclosure set forth herein will come to mind to the one skilled in the art to which the disclosure pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the disclosure is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

    LIST OF REFERENCE SIGNS

    [0067] 1 system
    [0068] 2 output device
    [0069] 3 playback controller
    [0070] 4 sensor
    [0071] 5 sensor
    [0072] 6 abstraction device
    [0073] 7 action recognition device
    [0074] 8 memory (for storytelling content)
    [0075] 9 user
    [0076] 10 cache memory
    [0077] 11 data optimizer
    [0078] 12 comparator
    [0079] 13 action memory
    [0080] 14-20 stages of the method
    [0081] 21 background
    [0082] 22 extracted characteristics (skeleton)