System & Method for Body Language Interpretation
20210350118 · 2021-11-11
CPC classification
G06V10/454 (PHYSICS)
G10L17/26 (PHYSICS)
Abstract
A system and method for reading and interpreting a wide range of nonverbal communicative cues, including facial expression, pose, gesture, posture, and voice intonation. The output of this system is a scale between zero and one, with the scale indicating the interpretation of the nonverbal communication, accompanied by text describing the interpretation. The system determines how a person intends to react and whether the person's pronouncements are true or false.
Claims
1. A method for evaluating the veracity of statements made by a human subject, comprising: (a) obtaining said statements from said human subject; (b) while said human subject is providing said statements, obtaining digital images of said human subject; (c) providing said digital images to a computing system; (d) using said computing system to determine a body pose of said subject from said digital images; (e) using said computing system to identify images of body parts of said subject; (f) using said computing system to assign, based on said body pose and said images of said body parts, values for stress, disagreement, comfort, nervousness, insecurity, and anxiety; and (g) providing said values to a human operator.
2. The method for evaluating the veracity of statements as recited in claim 1, wherein said body parts include a face, hands, arms, feet, and legs.
3. The method for evaluating the veracity of statements as recited in claim 1, wherein said digital images include video images.
4. The method for evaluating the veracity of statements as recited in claim 1, wherein said digital images include still images.
5. A method for evaluating the veracity of statements made by a human subject, comprising: (a) questioning said human subject in order to obtain said statements from said human subject, said statements including responses to said questions; (b) while said human subject is providing said statements, obtaining digital images of said human subject; (c) providing said digital images to a computing system; (d) using said computing system to determine a body pose of said subject from said digital images; (e) using said computing system to identify images of body parts of said subject; (f) using said computing system to assign, based on said body pose and said images of said body parts, values for stress, disagreement, comfort, nervousness, insecurity, and anxiety; and (g) providing said values to a human operator.
6. The method for evaluating the veracity of statements as recited in claim 5, wherein said body parts include a face, hands, arms, feet, and legs.
7. The method for evaluating the veracity of statements as recited in claim 5, wherein said digital images include video images.
8. The method for evaluating the veracity of statements as recited in claim 5, wherein said digital images include still images.
Description
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
REFERENCE NUMERALS IN THE DRAWINGS
[0014] 10 body language interpretation system
[0015] 12 subject
[0016] 14 CCTV
[0017] 16 camera
[0018] 18 handheld device
[0019] 20 device interface
[0020] 22 computing system
[0021] 24 display
[0022] 26 printer
[0023] 28 cloud server
[0024] 30 keyboard
[0025] 32 mouse
[0026] 34 device interface
[0027] 36 system memory
[0028] 38 central processing units
[0029] 40 graphical processing units
[0030] 42 video memory
[0031] 44 system bus
[0032] 46 output peripheral interface
[0033] 48 network interface
[0034] 50 user input interface
[0035] 52 memory interface
[0036] 54 memory interface
[0037] 56 hard disk drive
[0038] 58 processing system
[0039] 60 data preprocessing
[0040] 62 feature extractor
[0041] 64 fully connected layers
[0042] 66 pose representation
[0043] 68 body part identification
[0044] 70 convolutional and pooling layers
[0045] 72 convolutional and pooling layers
[0046] 74 convolutional and pooling layers
[0047] 76 recurrent neural networks
[0048] 78 feature vector
DETAILED DESCRIPTION OF THE INVENTION
[0049] An innovative objective of the present invention is to identify all nonverbal cues and interpret them using cameras, including facial expression, pose, gesture, and posture. The output of the inventive system is preferably a scale between zero and one, indicating the interpretation of the body language. The types of body language preferably recognized by the inventive system include: stress, confidence, disagreement, discomfort, concentration, insecurity, fear, concern, nervousness, and anxiety. Each characteristic spans a spectrum with two ends. For example, whether a person is extremely comfortable or extremely uncomfortable, this body language is presented as a probability of discomfort: if the probability is 1, the person is extremely uncomfortable, and if it is zero, the person is extremely comfortable.
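As an illustrative sketch (not part of the claimed system), the bipolar scale described above can be expressed as a small function. The intermediate thresholds here are hypothetical assumptions; only the endpoint meanings come from the description:

```python
def interpret_discomfort(probability: float) -> str:
    """Map a discomfort probability in [0, 1] to a textual interpretation.

    The endpoints follow the spectrum described above: 1.0 means
    extremely uncomfortable, 0.0 means extremely comfortable.
    The intermediate thresholds are illustrative assumptions.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between zero and one")
    if probability >= 0.8:
        return "extremely uncomfortable"
    if probability >= 0.6:
        return "somewhat uncomfortable"
    if probability > 0.4:
        return "neutral"
    if probability > 0.2:
        return "somewhat comfortable"
    return "extremely comfortable"
```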
[0050] The significance of the developed system is in deciding how people intend to react and whether their conscious pronouncements (such as verbal statements or written statements) are true or false. Because nonverbal communication conveys 65% of the information transmitted in interpersonal communication, the system can be used against criminals, terrorists, and spies to gather all types of information. The proposed system allows the observation of people during their interactions and interviews to validate their responses.
[0051] The inventive system consists of several components. The input to the system preferably comprises static images and videos. Images and videos are obtained from camera live streams and files. The output of the inventive system is a set of scores indicating the level of each interpretation (such as stress, comfort, etc.). The inventive system preferably also outputs text describing the meaning of the body language.
[0052] The following sections describe some of the components:
[0053] 1. Input Device: The input will be static images and videos obtained from smartphones, surveillance cameras, camcorders, files, and the like.
[0054] 2. Computing System: The preferred computing system is a device that processes all images and videos. It consists of several components including: RAM, ROM, HDD, CPUs, GPUs, video memory, user input interface, network interface, and output peripheral interface. These components are connected internally by the system bus. The computing system is connected to a cloud server via the network interface. When the computational power of the computing system is insufficient, the computing system sends the data to the cloud server for processing.
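The local-versus-cloud routing described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `local_capacity` measure of how many frames the local system can handle and a `send_to_cloud` stand-in for the network-interface call:

```python
def route_workload(frames: list, local_capacity: int, send_to_cloud) -> str:
    """Decide where to process a batch of frames.

    `local_capacity` is a hypothetical measure of how many frames the
    local computing system can handle; `send_to_cloud` stands in for
    the network-interface call to the cloud server.
    """
    if len(frames) <= local_capacity:
        return "local"   # within local computational power
    send_to_cloud(frames)  # offload to the cloud server
    return "cloud"
```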
[0055] 3. Processing System: The processing system is deployed on the computing system and is executed by the computing system components, the cloud server, and its peripherals. The processing system includes three main components: the Data Preprocessing Component, the Feature Extractor, and the Fully Connected Layers.
[0056] The Data Preprocessing Component cleans the input images and videos. Its functions include denoising, adjusting the light, cropping, and separating the human from other objects in the scene.
[0057] The Feature Extractor extracts features from the images and videos that make it possible to recognize the body language. To improve the accuracy of the system, the present invention preferably uses several feature extractors. The first feature extractor finds a representation of the human pose. The body pose conveys a great deal of information about body language. The pose of a human subject is estimated, and the latent representation of poses is used to interpret the body language of the subject. The latent representation of the pose is used in the classifier. The second feature extractor identifies body parts (faces, hands, arms, feet, and legs) and passes the images of these regions through several convolutional and pooling layers. This feature extractor identifies facial expression, hand gesture, etc. The third feature extractor takes the whole image or video frame as the input and passes it through several convolutional and pooling layers to extract features. The fourth feature extractor is specific to videos. It passes the frames of the videos through several convolutional and pooling layers, and then passes their outputs through recurrent neural networks to aggregate their features. The outputs of these four feature extractors are vectors that are stacked together to form a larger vector.
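The stacking step at the end of the paragraph above amounts to concatenating the four extractor outputs into one larger vector. A minimal sketch, with plain Python lists standing in for the real feature tensors:

```python
def stack_features(pose_vec, parts_vec, image_vec, video_vec):
    """Concatenate the outputs of the four feature extractors
    (pose representation, body-part features, whole-image features,
    video features) into one larger feature vector, as described above.
    Pure-Python lists stand in for the real tensors."""
    return list(pose_vec) + list(parts_vec) + list(image_vec) + list(video_vec)
```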
[0058] The Fully Connected Layers map all extracted features to the body language. This component includes several layers of fully connected neural networks. The input is a vector of features (the larger vector from the feature extractor) and the output is a vector of probabilities. Each element of the output vector indicates the probability of the associated body language meaning. For example, one element indicates the probability that the person is stressed.
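For illustration only, a single fully connected layer followed by a logistic sigmoid (so every output lands in (0, 1) and can be read as a probability) can be sketched in pure Python. The weights and biases here are placeholders, not trained parameters:

```python
import math

def fully_connected(features, weights, biases):
    """One dense layer followed by a logistic sigmoid.

    `weights` is a list of rows, one row per output element (e.g. one
    per body-language cue such as stress or discomfort); `biases` has
    one bias per output. Each output is a probability in (0, 1)."""
    outputs = []
    for row, bias in zip(weights, biases):
        # Weighted sum of the feature vector plus bias.
        z = sum(w * x for w, x in zip(row, features)) + bias
        # Sigmoid squashes the activation into (0, 1).
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs
```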
[0059] Deep neural networks are used to find the latent representations of hand gestures, arms and legs position, and facial expressions. The latent representations are fed into the classifier for merging with other components.
[0060] As for voice intonation, deep neural networks (recurrent and convolutional neural networks) will be used for feature extraction from an individual's speech. The features will be merged with the pose, gesture, and other components for interpreting body language.
[0061] A fully connected neural network will be used to merge the latent representations of all components and interpret the body language cues in images, videos, and voice. The output will be a set of values between zero and one indicating the probability of possible interpretations (such as stress, discomfort, disagreement, etc.), together with text describing these interpretations.
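The final text output described above can be sketched as a mapping from the probability vector to human-readable lines. The 0.5 reporting threshold and the phrasing are illustrative assumptions, not part of the disclosure:

```python
def describe_scores(scores: dict, threshold: float = 0.5) -> list:
    """Turn the probability vector into accompanying text output.

    `scores` maps interpretation names (stress, discomfort, ...) to
    probabilities in [0, 1]. Interpretations at or above `threshold`
    are reported, strongest first; the 0.5 default is an assumption."""
    lines = []
    for name, p in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if p >= threshold:
            lines.append(f"elevated {name} (probability {p:.2f})")
    return lines
```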
[0062] Those skilled in the art will realize that the present invention can be implemented in a wide variety of ways.
[0065] Memory interfaces 52, 54 write and read data to storage devices such as hard disk drive 56. User input interface 50 provides access to typical user input devices such as a mouse 32 and keyboard 30. Device interface 34 provides access for other devices.
[0066] Network interface 48 provides bidirectional data exchange with cloud server 28. When the local computational power is exceeded, the inventive system preferably transfers some of its computational needs to cloud server 28.
[0069] The body part identification module identifies body parts (faces, hands, arms, feet, legs, etc.) and passes the image of each specific region through several convolutional and pooling layers 74. Additional convolutional and pooling layers 70 process the whole image or video frame and extract features. The fourth set of convolutional and pooling layers 72 is specific to video. It passes the frames of the videos through several convolutional and pooling layers and passes the outputs through recurrent neural networks 76 to aggregate the features. The outputs of the four extractors 66, 68, 70, 72 are vectors that are stacked together to create feature vector 78.
[0070] The inventive system can be applied in many methods. An exemplary method is disclosed in the following scenario. A human subject is interviewed and asked to give verbal or written responses to questions. The inventive system monitors the human subject during the interview and is used to validate the responses the subject has given or call them into doubt. The process can be described as follows:
[0071] (1) A human subject is asked to respond to questions;
[0072] (2) While the human subject is giving responses, input devices are gathering still images, video images, and/or audio of the human subject;
[0073] (3) The data gathered are processed through a computing system in order to provide a set of scores indicating the status of the human subject as to stress, disagreement, comfort, nervousness, insecurity, and anxiety; and
[0074] (4) The set of scores is displayed to a human system operator.
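The four-step process above can be sketched as a single pipeline function. The three callables are hypothetical stand-ins for the input devices, the computing system, and the display; none of these names appear in the disclosure:

```python
def evaluate_interview(capture_images, score_images, display):
    """Sketch of the exemplary method: gather images while the subject
    responds, score them on the computing system, and display the
    scores to the human operator."""
    images = capture_images()        # steps (1)-(2): gather input data
    scores = score_images(images)    # step (3): compute the score set
    display(scores)                  # step (4): show scores to operator
    return scores
```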
[0075] The preceding description contains significant detail regarding the novel aspects of the present invention. It should not be construed, however, as limiting the scope of the invention but rather as providing illustrations of the preferred embodiments of the invention. Thus, the scope of the invention should be fixed by the claims ultimately drafted, rather than by the examples given.