ELECTRONIC DEVICE, SYSTEM AND METHOD FOR PREDICTING THE PERFORMANCE OF AN INDIVIDUAL HUMAN DURING A VISUAL PERCEPTION TASK
20220225917 · 2022-07-21
Inventors
- Jonas Ambeck-Madsen (Brussels, BE)
- Nilli LAVIE (London, GB)
- Josef Schoenhammer (London, GB)
- Luke PALMER (London, GB)
CPC classification
B60W2540/221
PERFORMING OPERATIONS; TRANSPORTING
B60W2420/40
PERFORMING OPERATIONS; TRANSPORTING
B60W50/0097
PERFORMING OPERATIONS; TRANSPORTING
G06V20/56
PHYSICS
International classification
A61B5/00
HUMAN NECESSITIES
A61B5/1455
HUMAN NECESSITIES
B60W40/08
PERFORMING OPERATIONS; TRANSPORTING
B60W50/00
PERFORMING OPERATIONS; TRANSPORTING
Abstract
The invention relates to an electronic device (1) for predicting the visual perceptual task performance of an individual human. The electronic device is configured to: receive an output of a first sensor device configured to measure the working memory load at the frontal cortex of the human, and predict the visual perceptual task performance as a function of said sensor output. The invention further relates to a corresponding system and method.
Claims
1. An electronic device for predicting the visual perceptual task performance of an individual human, configured to: receive an output of a first sensor device configured to measure the working memory load at the frontal cortex of the human, and predict the visual perceptual task performance as a function of said sensor output.
2. The electronic device according to claim 1, wherein the first sensor device comprises at least one functional near-infrared spectroscopy (fNIRS) sensor configured to be placeable on the human's head, specifically over the frontal part of the cortex.
3. The electronic device according to claim 1, wherein measuring the working memory load comprises measuring a change of concentration levels of oxygenated (HbO2) and/or deoxygenated haemoglobin (HHb) elicited by neuronal activation in the underlying brain tissue at the frontal cortex of the human.
4. The electronic device according to claim 1, configured to predict a decrease of the visual perceptual task performance, in case the measured working memory load increases.
5. The electronic device according to claim 1, configured to receive data records representing a perceptual load of a visual and dynamic scene perceivable by the human, and predict the visual perceptual task performance additionally as a function of said data records.
6. The electronic device according to claim 5, configured to predict a decrease of the visual perceptual task performance, in case the measured working memory load increases and at the same time the perceptual load does not increase.
7. A system for predicting the visual perceptual task performance of an individual human, comprising: a sensor system configured to produce data records of working memory load at the frontal cortex of the human and/or of perceptual load values of a visual and dynamic scene perceivable by the human, and an electronic device according to claim 1.
8. The system according to claim 7, wherein the sensor system comprises: a first sensor device configured to measure the working memory load at the frontal cortex of the human, and/or a second sensor device configured to sense the perceptual load of a visual and dynamic scene perceivable by the human.
9. The system according to claim 8, wherein the second sensor device comprises a scene sensor for sensing the visual scene and is configured to: extract a set of scene features from the sensor output, the set of scene features representing static and/or dynamic information of the visual scene, and determine the perceptual load of the set of extracted scene features based on a predetermined load model, wherein the load model is predetermined based on reference video scenes each being labelled with a load value.
10. The system according to claim 7, wherein the load model comprises a mapping function between sets of scene features extracted from the reference video scenes and the load values.
11. The system according to claim 7, wherein the load model is at least one of a regression model and a classification model between the sets of scene features extracted from the reference video scenes and the load values.
12. A vehicle comprising: an electronic device according to claim 1.
13. A method of predicting the visual perceptual task performance of an individual human, comprising the steps of: receiving an output of a first sensor device configured to measure the working memory load at the frontal cortex of the human, and predicting the visual perceptual task performance as a function of said sensor output.
14. The system according to claim 7, wherein the load model is configured to map a set of scene features to a perceptual load value.
15. A vehicle comprising: a system according to claim 7.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DESCRIPTION OF THE EMBODIMENTS
[0101] Reference will now be made in detail to exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
[0103] The electronic device 1 is connected to or comprises a data storage 2. Said data storage may be used to store data records recorded by a first sensor device (e.g. one or several fNIRS sensors) and data records of perceptual load values of a visual and dynamic scene perceived by a human, e.g. the driver. It may additionally store e.g. a load model. As described in the following, said load model may be used to determine the perceptual load of the visual and dynamic scene.
[0104] The electronic device 1 may additionally carry out further functions in the system 30, e.g. in the vehicle 10. For example, the electronic device may also act as the general purpose ECU (electronic control unit) of the vehicle.
[0105] The electronic device 1 may comprise an electronic circuit, a processor (shared, dedicated, or group), a combinational logic circuit, a memory that executes one or more software programs, and/or other suitable components that provide the described functionality. In the example of a vehicle 10, the sensor 5 may be installed in the vehicle cabin, in order to measure the working memory load of the driver. In the example of the test system 10, it measures the working memory load of the test person.
[0106] The electronic device 1 may be connected to a sensor 5 in particular including at least one fNIRS sensor. The sensor 5 is configured to measure the working memory load, e.g. at the frontal cortex of a human. For example, a plurality of fNIRS sensors may be placed on the head of a person, specifically over the frontal part of the cortex. Using a 3D computer model of the head, the relevant brain areas may be projected to locations on the head surface. These locations can then be found on the scalp by their relative distance to anatomical landmarks (i.e., nasion, left and right pre-auricular points, and inion). To validate the positioning of the fNIRS sensors on the person's head, the positions may be measured with a 3D digitizer and projected on the brain in the 3D computer model.
[0107] The fNIRS signal may be recorded by the electronic device. To obtain the fNIRS signal changes related to neuronal processing of the working memory task, the signal may be filtered in several steps to remove low-frequency changes (e.g. due to any slow movement of the fNIRS sensors on the scalp), mid-frequency changes (due to heart rate activity and respiration), and/or high-frequency changes (due to sudden movements of the fNIRS sensors on the scalp). The signal may then be analysed by a general linear model (GLM) that fits a model of the fNIRS signal changes related to the working memory task to the actual data, separately for each level of working memory load. The parameters estimated in the GLM may then be analysed by the electronic device 1.
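The filtering and GLM steps can be sketched as follows. This is a minimal illustration, not the pipeline claimed above: the moving-average detrend stands in for the multi-stage frequency filtering, the single boxcar regressor stands in for the per-load-level task model, and all names and values are invented for the example.

```python
# Sketch: detrend an fNIRS channel and estimate a GLM beta for a
# boxcar task regressor by ordinary least squares (illustrative only).

def moving_average(signal, window):
    """Simple moving average, used here as a crude low-frequency estimate."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def detrend(signal, window=50):
    """Remove slow drifts (e.g. sensor slippage) by subtracting a moving average."""
    baseline = moving_average(signal, window)
    return [s - b for s, b in zip(signal, baseline)]

def glm_beta(signal, regressor):
    """Least-squares fit of signal = beta * regressor + intercept; returns beta."""
    n = len(signal)
    mx = sum(regressor) / n
    my = sum(signal) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(regressor, signal))
    var = sum((x - mx) ** 2 for x in regressor)
    return cov / var

# Toy usage: a boxcar regressor marks the task block; beta estimates the
# task-related signal change on top of a slow linear drift.
task = [0.0] * 20 + [1.0] * 20 + [0.0] * 20
raw = [0.5 * t + 0.01 * i for i, t in enumerate(task)]  # task effect + drift
beta = glm_beta(detrend(raw), task)
```

In a real analysis, one such beta would be estimated per working-memory-load level and channel, and the betas would then be compared by the electronic device 1.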
[0108] The electronic device 1 may further be connected to an optical sensor 3, in particular a digital camera 3, at least in the example of the vehicle 10. The electronic device 1 and the digital camera may be comprised by a vehicle 10. The digital camera 3 is configured such that it can record a visual driving scene of the vehicle 10. The digital camera is desirably oriented in the driving direction of the vehicle, i.e. such that it records in particular the road in front of the vehicle. It is also possible to use several optical sensors 3 (e.g. cameras), for example in order to cover the complete field of view of the driver.
[0109] The output of the sensor 5 and/or the optical sensor 3, in particular a recorded video stream, is transmitted to the electronic device 1. Desirably, the output is transmitted instantaneously, i.e. in real time or in quasi real time. Hence, the measured working memory load and/or perceptual load of the recorded driving scene can also be determined by the electronic device in real time or in quasi real time.
[0110] The optical sensor 3 alone or the combination of optical sensor 3 and electronic device 1 may also form a second sensor device according to the present disclosure, i.e. a control device as described in WO2017211395 (A1).
[0111] In the case of the example of the test system 10, the system may comprise a task generator 4 controllable by the electronic device 1, in particular instead of the optical sensor 3. The task generator 4 may in one example be a display indicating a predetermined task to be performed by the test person. Since in this case the task is predetermined, the perceptual load of the task, as it is perceivable by the test person, is known to the electronic device 1. For example, a motion perception task may be presented which comprises detecting the direction of motion that occurred for a short period among a field of randomly moving dots. The findings show that the motion direction perception threshold (the minimum proportion of dots moving in the same direction) is higher under high working memory load than under low working memory load. This indicates that higher working memory load (which takes the test person's mind off the visual perception task), detected as increased activity in lateral frontal brain regions (shown with the fNIRS sensors), impairs motion perception.
[0112] The system 30 may comprise additionally a server 20. The server 20 is used to train and eventually update the load model. For this purpose, the electronic device 1 may be connectable to the server. For example the electronic device 1 may be connected to the server 20 via a wireless connection. Alternatively or additionally the electronic device 1 may be connectable to the server 20 via a fixed connection, e.g. via a cable.
[0114] In more detail, a record of a visual driving scene is first provided in step S1. As described above, the visual driving scene is recorded by a sensor, in particular a digital camera. From the output of the sensor (e.g. a video stream), fixed-duration video snippets 101 (e.g. 2-second-long clips) are taken. The video snippets are then processed as follows.
[0115] In step S2 a set of scene features 102 (also referred to as a scene descriptor) is extracted from the current video snippet 101. As described in more detail in the following, the set of scene features may be expressed by a feature vector.
[0116] In step S3 the set of scene features 102 is passed through the load model 103, which may be a regression model learnt from crowdsourced data. As a result, a perceptual load value 104 indicating the perceptual load of the video snippet 101 is obtained.
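The S1-S3 loop can be sketched as follows. All names here (the snippet representation, `extract_features`, `LoadModel`) are illustrative assumptions; a real system would use the IDT/C3D features and the trained regression model described below.

```python
# Sketch of the S1-S3 pipeline: snippet -> scene features -> load value.

SNIPPET_SECONDS = 2.0  # fixed-duration snippets as in step S1

def extract_features(snippet):
    """S2 stand-in: a toy scene descriptor (mean intensity, mean frame diff)."""
    flat = [p for frame in snippet for p in frame]
    motion = 0.0
    for a, b in zip(snippet, snippet[1:]):
        motion += sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return [sum(flat) / len(flat), motion / max(1, len(snippet) - 1)]

class LoadModel:
    """S3 stand-in: a linear model mapping a feature vector to a load value."""
    def __init__(self, weights, bias=0.0):
        self.weights, self.bias = weights, bias

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias

# Toy usage: two 4-pixel "frames" form one snippet; weights are arbitrary.
snippet = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.4, 0.5]]
model = LoadModel(weights=[0.5, 2.0])
load_value = model.predict(extract_features(snippet))  # -> 0.35
```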
[0119] The determination of the perceptual load may also be regarded as an estimation, as it is not necessarily completely precise.
[0121] The goal of scene feature extraction is to describe the content of a video in a fixed-length numerical form. A set of scene features may also be called a feature vector. The perceptual load is determined from the visual information of the driving scene by extracting appearance and motion features of the visual driving scene. In order to extract this visual information, improved dense trajectory (IDT) features and 3D convolutional (C3D) features are desirably extracted from the video snippet, as described below. Such features, constituting a set of scene features, are then passed through the load model, which may be a regression model, in order to calculate a perceptual load value indicating the perceptual load of the video snippet.
[0122] Improved Dense Trajectories (IDT)
[0123] In improved dense trajectories, videos are represented as visual features extracted around trajectories of primitive interest points. Trajectories are the tracked (x,y) image location of “interest points” over time. Such “interest points” may be parts of an image which are salient or distinct, such as corners of objects. The interest points may be detected using the SURF (“Speeded Up Robust Features”) algorithm and may be tracked by median filtering in a dense optical flow field of the video.
[0125] Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF), and Motion Boundary Histogram (MBH) features in the x- and y-directions are extracted around each trajectory, in addition to the Trajectory features themselves (i.e. the normalized x,y locations of each trajectory).
[0126] A Bag of Words representation is desirably used to encode the features. In the Bag of Words representation, a 4000-word dictionary is learnt for each trajectory feature type (Trajectory, HOG, HOF, MBHx, MBHy). That is, the descriptors of each feature type are quantized into a fixed vocabulary of 4000 visual words, and a video is then encoded as a histogram of the frequency of each visual word. This results in a 20,000-dimensional feature vector (i.e. 5×4000-length feature vectors).
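The Bag of Words encoding can be sketched as follows, using an invented toy vocabulary of three 2-D visual words (a real system would learn 4000 words per feature channel, yielding the 20,000-dimensional vector described above).

```python
# Sketch: quantize descriptors to their nearest visual word and encode a
# video as a normalized word-frequency histogram (Bag of Words).

def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda i: dist2(descriptor, vocabulary[i]))

def bag_of_words(descriptors, vocabulary):
    """Normalized histogram of visual-word frequencies for one video."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

# Toy usage: 3-word vocabulary, 4 descriptors extracted from one snippet.
vocab = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
descs = [[0.1, 0.0], [0.9, 1.1], [5.2, 4.8], [4.9, 5.1]]
hist = bag_of_words(descs, vocab)  # -> [0.25, 0.25, 0.5]
```

Concatenating one such histogram per feature channel gives the fixed-length feature vector passed to the load model.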
[0127] Convolutional 3D (C3D) Features
[0128] Convolutional 3D (C3D) features are a type of “deep neural network” feature that is learnt automatically from labelled data. A hierarchy of video filters is learnt which captures local appearance and motion information. A C3D network for feature extraction must first be trained before it can be used. A pre-trained network can be used (i.e. one that has been trained on other data and has learnt to extract generic video descriptors). For example, the pre-trained model may be trained on a set of a million sports videos to classify sports. Such training yields generic motion/appearance features which can be used in any video regression/classification task. Alternatively or additionally, the labelled reference videos may be used for the training, in order to fine-tune a C3D network.
[0130] Training the Load Model
[0132] So-called “ground-truth” perceptual load values may be acquired through crowd-sourcing, where test persons, e.g. experienced drivers, watch and compare clips of driving footage in a pairwise-comparison regime; the comparisons are then converted to video ratings. Pairwise comparisons provide a reliable method of rating items (compared with persons assigning their own subjective load values, which would yield inconsistent labels). Desirably, a system is used in which experienced drivers label the relative perceptual load of videos and select which video from a pair is more demanding on attention in order to maintain safe driving. The collection of pairwise comparisons is desirably converted to ratings for each video using the TrueSkill algorithm.
[0133] Alternatively, a driver and a passenger may manually tag live streams with a load value (e.g. on a scale of 1 to 5) while driving for a long distance. During such a test, the load model might also be trained; accordingly, the live streams may be used as reference video scenes with which the load model is trained.
[0135] The TrueSkill model assumes that each video has an underlying true load value. The probability of one video being ranked as higher load than another is based on the difference in their load values. After each comparison between a pair of videos, the two load values are updated based on which video was labelled as having higher load and on their prior load values. All videos start off with equal load values, which are updated after each comparison. The videos are compared until their corresponding load values no longer change. The final result is a load value for each video. The TrueSkill algorithm is also described in Herbrich, R., Minka, T., and Graepel, T. (2006): “Trueskill: A Bayesian skill rating system”, Advances in Neural Information Processing Systems, pages 569-576, which disclosure is incorporated herein in its entirety.
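The update scheme can be sketched with a much-simplified, Elo-style stand-in for TrueSkill (TrueSkill itself additionally tracks a per-item uncertainty; this sketch keeps only a point rating, and all numbers are illustrative).

```python
# Sketch: Elo-style rating updates over pairwise "which clip is higher
# load" comparisons, as a simplified stand-in for TrueSkill.

def expected_win(r_winner, r_loser, scale=400.0):
    """Model probability that the first clip is judged higher load."""
    return 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / scale))

def update(ratings, winner, loser, k=32.0):
    """Move both ratings toward the observed comparison outcome."""
    p = expected_win(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - p)
    ratings[loser] -= k * (1.0 - p)

# Toy usage: all clips start with equal ratings; repeated judgements
# separate them into a per-clip load value.
ratings = {"clip_a": 1000.0, "clip_b": 1000.0, "clip_c": 1000.0}
comparisons = [("clip_a", "clip_b"), ("clip_a", "clip_c"), ("clip_b", "clip_c")]
for higher, lower in comparisons:
    update(ratings, higher, lower)
```

Iterating until the ratings stabilize yields the final load value per video, mirroring the convergence criterion described above.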
[0136] In the following the development of the load model being a regression model is described. Regression takes a fixed length feature vector (i.e. a set of scene features) and learns a mapping function to transform this to a single continuous output value (i.e. the labelled perceptual load of the reference video). The regression function is learnt from labelled training examples of input (i.e. the feature vector) and output (i.e. the labelled perceptual load values) pairs, and finds the function that best fits the training data.
[0137] Various types of regression models can be used, e.g. linear regression, kernel regression, support vector regression, ridge regression, lasso regression, random forest regression etc.
[0138] In the simplest case of linear regression, the input scene feature vector x, which is effectively a list of numbers {x_1, x_2, x_3, . . . , x_N}, is mapped to the output y (in our case the perceptual load value) through a linear function y = f(x), where the function is a weighted sum of the input numbers:
f(x) = w^T x + b, that is, f(x) = w_1*x_1 + w_2*x_2 + w_3*x_3 + . . . + w_N*x_N + b.
[0139] This is equivalent to fitting a line of best fit to the input data points, and will learn the parameters w (these are simply weights assigned to each feature/value/number in the feature vector, x) and a bias term b, which centers the output at a particular value.
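Fitting w and b can be sketched for the one-feature case with the closed-form least-squares solution (the training pairs below are invented for illustration; a real load model would be fit on the labelled reference-video feature vectors).

```python
# Sketch: ordinary least squares for y = w*x + b with a single scene
# feature, i.e. the line of best fit described above.

def fit_linear(xs, ys):
    """Closed-form least-squares estimates of slope w and bias b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx  # bias centers the output, as described above
    return w, b

# Toy training pairs: (scene feature value, labelled perceptual load).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
w, b = fit_linear(xs, ys)  # -> w = 2.0, b = 1.0
```

With N features, the same idea generalizes to solving for the full weight vector w, one weight per entry of the feature vector.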
[0140] In a better performing model, multi-channel non-linear kernel regression is used. This extends linear regression to cover complex non-linear relationships between input sets of scene-features and output load values through using a “kernel”. This is a transformation of the input feature vectors to a space where they can be better separated or mapped. The mapping function becomes:
f(x) = w^T φ(x) + b.
[0141] Then, regression is run in the combined kernel space. This is similar to fitting a line to 2D points, but in high dimensional space: a machine-learning algorithm finds the collection of weights, w, which minimizes the error in the perceptual load estimate on a ‘training-set’ (i.e. a subset of the whole dataset, in this case two thirds of the ˜2000 video-load value pairs). This optimal set of weights therefore defines the mapping that best transforms the set of scene features to a single value indicating the perceptual load.
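A minimal kernel regression can be sketched as kernel ridge regression with an RBF kernel (this is one common way to realize non-linear kernel regression; the toy data, the kernel width, and the regularization value are invented for the example).

```python
import math

# Sketch: kernel ridge regression with an RBF kernel, solving
# (K + lam*I) alpha = y by Gaussian elimination (pure-Python, toy scale).

def rbf(a, b, gamma=1.0):
    """RBF kernel: the phi(x) transformation is used implicitly."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def solve(A, y):
    """Gaussian elimination with partial pivoting for A @ alpha = y."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    alpha = [0.0] * n
    for r in range(n - 1, -1, -1):
        alpha[r] = (M[r][n] - sum(M[r][c] * alpha[c]
                                  for c in range(r + 1, n))) / M[r][r]
    return alpha

def fit_kernel_ridge(X, y, lam=1e-3):
    """Learn dual weights alpha on the training set (regularized by lam)."""
    K = [[rbf(a, b) + (lam if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    return solve(K, y)

def predict(X_train, alpha, x):
    """Prediction is a kernel-weighted sum over the training points."""
    return sum(a * rbf(xt, x) for a, xt in zip(alpha, X_train))

# Toy usage: a non-linear feature-to-load relation a plain line cannot fit.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 0.0, 1.0]
alpha = fit_kernel_ridge(X, y)
pred = predict(X, alpha, [1.0])
```

The dual weights alpha play the role of the learnt weights w in the kernel space; minimizing the error on the training set corresponds to the optimization over the training two-thirds of the dataset described above.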
[0142] In this way the load model comprising the regression function can be trained based on the training examples. Once the regression function is learnt, the same procedure may be run, when the electronic device 1 is used in the vehicle. Accordingly, in use of the electronic device 1, an input scene descriptor (i.e. a set of scene features) is extracted from a visual driving scene, and the regression function is applied on the input scene descriptor (i.e. the set of scene features), in order to calculate the output load value.
[0143] After learning the model, any video can be input and a perceptual load value will be output for every 2-second segment. A “sliding window” approach is used to provide a continuous output of the perceptual load value (i.e. a value can be output for every frame of the video). Of course, the segment may also be shorter or longer than 2 seconds.
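The sliding-window output can be sketched as follows: one value per frame, computed over the trailing window ending at that frame (the window length in frames and the per-frame scores below are illustrative assumptions).

```python
# Sketch: turn per-frame scores into a continuous per-frame load signal
# by averaging over a trailing sliding window.

def sliding_load(frame_scores, window):
    """Return one value per frame: the mean score of the trailing window."""
    out = []
    for i in range(len(frame_scores)):
        lo = max(0, i - window + 1)
        chunk = frame_scores[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Toy usage: six frames, a 3-frame window standing in for the 2-second
# segment; the scores stand in for per-segment model outputs.
scores = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
loads = sliding_load(scores, window=3)
```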
[0144] Throughout the description, including the claims, the term “comprising a” should be understood as being synonymous with “comprising at least one” unless otherwise stated. In addition, any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms “substantially” and/or “approximately” and/or “generally” should be understood to mean falling within such accepted tolerances.
[0145] Although the present disclosure herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present disclosure.
[0146] It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.