INTERACTIVE SYSTEMS AND METHODS
20220172710 · 2022-06-02
Assignee
Inventors
- Peter Alistair BRADY (Stowmarket, Suffolk, GB)
- Hayden ALLEN-VERCOE (Hemyock, Devon, GB)
- Sathish SANKARPANDI (Ipswich, Suffolk, GB)
- Ethan DICKSON (Ipswich, Suffolk, GB)
CPC classification
H04L51/02
ELECTRICITY
G16H50/20
PHYSICS
G06F3/167
PHYSICS
G10L2021/105
PHYSICS
G10L15/25
PHYSICS
G06N3/006
PHYSICS
H04L51/046
ELECTRICITY
International classification
G06V10/75
PHYSICS
G10L15/25
PHYSICS
Abstract
A method of producing an avatar video, the method comprising the steps of: providing a reference image of a person's face; providing a plurality of characteristic features representative of a facial model X0 of the person's face, the characteristic features defining a facial pose dependent on the person speaking; providing a target phrase to be rendered over a predetermined time period during the avatar video and providing a plurality of time intervals t within the predetermined time period; generating, for each of said time intervals t, speech features from the target phrase, to provide a sequence of speech features; and generating, using the plurality of characteristic features and sequence of speech features, a sequence of facial models Xt for each of said time intervals t.
Claims
1-24. (canceled)
25. A method of producing an avatar video, the method comprising the steps of: providing a reference image of a person's face; using the reference image to provide a plurality of characteristic features representative of an initial facial model X0 of the person's face, wherein the characteristic features of the initial facial model comprise at least one set of landmarks and at least one latent descriptor representing an abstract appearance feature, the characteristic features defining facial position and facial pose dependent on the person speaking; providing a target phrase to be rendered over a predetermined time period during the avatar video and providing a plurality of time intervals t within the predetermined time period; generating, for each of said time intervals t, speech features from the target phrase, to provide a sequence of speech features, the speech features representing abstract quantifiers of audio and linguistic information; and using a recursive model comprising a sequence-to-sequence encoder decoder method to generate, from the initial facial model X0 and the sequence of speech features, a sequence of expected facial models for each of said time intervals t, wherein physical spatio-temporal dynamics of a facial model at each of said time intervals t are generated by solving a system of ordinary differential equations, ODEs, an expected facial position being derived from a recursive transformation of the speech features and the facial position of the facial model at a current time interval of said time intervals, which is sampled and being combined with the characteristic features of the facial model Xt at the current time interval to obtain the characteristic features of a next facial model Xt+1 in the sequence of facial models; and combining and decoding the sequence of facial models Xt with the initial facial model X0 to generate a sequence of face images to produce the avatar video.
26. A method according to claim 25, wherein the target phrase is provided as text data and/or audio data.
27. A method according to claim 25, wherein at least one of said speech features comprises a phonetic label.
28. A method according to claim 25, wherein the speech features are extracted with a phonetic classifier module using a Deep Convolutional Network (DCN).
29. A method according to claim 25, wherein the at least one latent descriptor is extracted using a Deep Convolutional Network (DCN).
30. A method according to claim 25, wherein the recursive model is generated with a Long Short-Term Memory network.
31. A method according to claim 25, wherein generating the sequence of face images comprises using a frame generator to synthesize frames from the sequence of facial models Xt.
32. A method according to claim 31, wherein the frame generator comprises a discriminator module using at least one loss function for reducing differences between the reference image and each of the facial models Xt in said sequence of facial models Xt.
33. A method of producing an avatar video, the method comprising the steps of: providing a reference image of a person's face; providing a plurality of characteristic features representative of a facial model X0 of the person's face, the characteristic features defining a facial pose dependent on the person speaking; providing a target phrase to be rendered over a predetermined time period during the avatar video and providing a plurality of time intervals t within the predetermined time period; generating, for each of said time intervals t, speech features from the target phrase, to provide a sequence of speech features; and generating, using the plurality of characteristic features and sequence of speech features, a sequence of facial models Xt for each of said time intervals t, wherein the sequence of facial models Xt is generated using a recursive model.
34. A method according to claim 33, wherein the speech features are extracted with a phonetic classifier module using a Deep Convolutional Network (DCN).
35. A method according to claim 33, wherein the plurality of characteristic features comprises at least one Active Shape Model landmark, and at least one latent descriptor representing abstract appearance features.
36. A method according to claim 35, wherein the at least one latent descriptor is extracted using a Deep Convolutional Network (DCN).
37. A method according to claim 33, wherein the recursive model comprises a sequence-to-sequence encoder decoder method.
38. A method according to claim 33, wherein the recursive model is generated with a Long Short-Term Memory network.
39. A method according to claim 33, wherein generating the sequence of face images comprises using a frame generator to combine the reference image with the sequence of facial models Xt.
40. A method according to claim 39, wherein the frame generator comprises a discriminator module using at least one loss function for reducing differences between the reference image and each of the facial models Xt in said sequence of facial models Xt.
41. A system for producing an avatar video, the system comprising: an image processing module for receiving a reference image of a person's face and for extracting a plurality of characteristic features representative of an initial facial model X0 of the person's face, wherein the characteristic features of the initial facial model comprise at least one landmark and at least one latent descriptor representing an abstract appearance feature, the characteristic features defining facial position and facial pose dependent on the person speaking; a speech processing module for extracting a target phrase to be rendered over a predetermined time period during the avatar video and for providing a plurality of time intervals t within the predetermined time period; the speech processing module configured to generate, for each of said time intervals t, speech features from the target phrase, to provide a sequence of speech features, the speech features representing abstract quantifiers of audio and linguistic information; and an avatar rendering module configured to use a recursive model comprising a sequence-to-sequence encoder decoder method to generate, from the initial facial model X0 and the sequence of speech features, a sequence of expected facial models for each of said time intervals t, wherein physical spatio-temporal dynamics of a facial model at each of said time intervals t are generated by solving a system of ordinary differential equations, ODEs, an expected facial position being derived from a recursive transformation of the speech features and the facial position of the facial model at a current time interval of said time intervals, which is sampled and being combined with the characteristic features of the facial model Xt at the current time interval to obtain the characteristic features of a next facial model Xt+1 in the sequence of facial models; wherein the avatar rendering module comprises a frame generator configured to combine and decode the sequence of facial models Xt with the initial facial model X0 to generate a sequence of face images to produce the avatar video.
42. A system for producing an avatar video, the system comprising: an image processing module for receiving a reference image of a person's face and for extracting a plurality of characteristic features representative of a facial model X0 of the person's face, the characteristic features defining a facial pose dependent on the person speaking; a speech processing module for extracting a target phrase to be rendered over a predetermined time period during the avatar video and for providing a plurality of time intervals t within the predetermined time period; the speech processing module configured to generate, for each of said time intervals t, speech features from the target phrase, to provide a sequence of speech features; and an avatar rendering module for generating, using the plurality of characteristic features and sequence of speech features, a sequence of facial models Xt for each of said time intervals t, wherein the sequence of facial models Xt is generated using a recursive model.
43. An interactive system for providing an answer to a user, the system comprising: a database comprising an indexed question library and a plurality of responses, wherein the plurality of responses comprise at least one avatar video produced using a system according to claim 42; a processing module for providing a correlation between the indexed question library and the plurality of responses; input means for receiving a question from the user as user input; wherein the processing module is configured to search keyword information in the indexed question library based on the user input, and to provide at least one response to the user based on said correlation.
44. A healthcare information system comprising an interactive system according to claim 43.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
[0060] The disclosure will now be described with reference to and as illustrated by the accompanying drawings in which:
DETAILED DESCRIPTION
Interactive Systems and Methods
[0075] In this example, the search is an “elastic search” (https://en.wikipedia.org/wiki/Elasticsearch). Advantageously, an elastic search is distributed, providing a scalable, near real-time search. Each video is indexed and tagged with keyword tags relevant to the health topic it addresses. The search accuracy may be improved by including a function for determining synonyms of the keywords in addition to the assigned keywords themselves.
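The synonym-augmented keyword search described above can be sketched as follows. The tag vocabulary, synonym map and video entries are illustrative placeholders, not the actual index of the described system:

```python
# Sketch of keyword search with synonym expansion over an indexed video
# library. Tags and synonyms below are invented for illustration.

SYNONYMS = {
    "flu": {"influenza"},
    "high temperature": {"fever"},
}

def expand(keywords):
    """Expand each query keyword with its known synonyms."""
    expanded = set(keywords)
    for kw in keywords:
        expanded |= SYNONYMS.get(kw, set())
    return expanded

def search(videos, keywords):
    """Return videos whose tag set intersects the expanded keyword set."""
    terms = expand(keywords)
    return [v for v in videos if terms & set(v["tags"])]

videos = [
    {"id": 1, "tags": {"influenza", "fever"}},
    {"id": 2, "tags": {"diabetes"}},
]
hits = search(videos, ["flu"])   # matches video 1 via the synonym "influenza"
```

In a production deployment the synonym expansion would typically live in the search engine's analysis chain rather than in application code.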
[0077] In a preferred scenario, an avatar is presented to the user, prompting the user to ask their question(s). The user input 100 may be either spoken (via a microphone) or written. The system then converts the spoken or written sentences to high dimensional vector representations of the user input 100. This is done through neural architectures such as ‘word2vec’ (https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) or ‘glove’ (https://nlp.stanford.edu/pubs/glove.pdf), in which words having similar syntactic and semantic features are placed in proximity. The high dimensional representations of the user input 100 are used by the system to interrogate a symptoms database for example. A set of initial results is generated.
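The proximity property of such embedding spaces can be sketched with a cosine-similarity comparison. The three-dimensional vectors below are invented purely for illustration; real word2vec or GloVe embeddings are typically 100-300 dimensional and learned from large corpora:

```python
import numpy as np

# Toy "embeddings": semantically related words get similar vectors.
vectors = {
    "fever":       np.array([0.9, 0.1, 0.0]),
    "temperature": np.array([0.8, 0.2, 0.1]),
    "invoice":     np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_related = cosine(vectors["fever"], vectors["temperature"])
sim_unrelated = cosine(vectors["fever"], vectors["invoice"])
```

A nearest-neighbour lookup over such similarities is one simple way a symptoms database could be interrogated with the embedded user input.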
[0078] Next, an output in the form of an avatar video is fetched (or generated) based on the set of initial results. The output may include a question to the user to request further information if the AI has low confidence in the initial results. Accordingly, the system is interactive and iterative. That is, the system continues to extract useful information from successive user inputs and uses this to interrogate the database in order to generate further, secondary queries and smaller, consecutive results subsets from the initial results set, until a single result, or a small enough subset of results with high confidence, is arrived at. This may include a complete re-set, so as to generate a fresh set of initial results, if subsequent user responses render this necessary (e.g. if subsequent queries produce a null/empty subset).
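The iterative narrowing described above can be sketched as follows, with a simple set-membership test standing in for the real vector search, and hypothetical condition records as data:

```python
# Sketch of iterative result refinement: each user answer filters the
# current result set; an empty subset triggers a re-set to fresh results.

def refine(results, answer):
    """Keep only results consistent with the latest user answer."""
    return [r for r in results if answer in r["symptoms"]]

def interactive_session(initial_results, answers, confident_size=1):
    results = list(initial_results)
    for answer in answers:
        subset = refine(results, answer)
        if not subset:                 # null/empty subset: complete re-set
            results = list(initial_results)
        else:
            results = subset
        if len(results) <= confident_size:
            break                      # high-confidence result reached
    return results

conditions = [
    {"name": "cold", "symptoms": {"cough", "sneezing"}},
    {"name": "flu",  "symptoms": {"cough", "fever"}},
]
final = interactive_session(conditions, ["cough", "fever"])
```

The confidence threshold and the re-set policy are assumptions for the sketch; the disclosure leaves both to the AI engine.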
[0079] In an example, avatar image sequences are generated offline, in non-real-time, for a given text or audio speech target. This process requires storing a number of similar reference frames to be used to generate the output sequence. More frames provide greater temporal coherence and video quality at the expense of increased computation.
[0080] In an alternative, preferred example, avatar sequences are generated on the fly. On the fly generation aims to generate video in real-time from only a single reference image and a known sequence of speech labels, provided as encoded audio sequences or from the text databases of information. The system also incorporates an active learning schema which learns actively based on the history of user inputs and AI responses, improving the AI's confidence to answer a user query/input continuously over time.
[0081] Preferred, but non-essential system capabilities include voice recognition, avatar personalisation (including voice/dialect personalisation) and personalisation/results focusing taking into account a user's preference or medical history.
[0082] With reference to
[0083] Preferably, the output 370 is in the form of concise, relevant answers within an avatar video. With reference to
[0084] The AI algorithm improves reliability over existing techniques, whilst the video-realistic avatar enhances user experience. This is advantageous compared to using videos which must be shot using real persons and are therefore often lengthy and expensive to provide. Using avatars increases the scalability and flexibility of applications.
Production of Speech Driven, Audio-Visual Avatars
[0085] The present section describes systems and methods according to aspects of the invention, used to create a digital avatar, using audio-visual processing for facial synthesis. From these, a database of digital avatars may be built to be used in the examples of interactive systems and methods provided above, and as will be further described with reference to
[0086] Advantageously, an interactive user interface may therefore be provided to a specialised chatbot that can answer healthcare questions. It will be appreciated, however, that the described systems and methods can also be used in standalone audio-visual processing algorithms for facial synthesis. The methods make use of modern machine learning and digital signal processing techniques.
[0087] The purpose of this aspect of the invention is to create 3-D facial models of a target subject (e.g. a doctor or nurse which a user may already be familiar with) to produce a hyper-realistic speech driven avatar of that target subject. In preferred embodiments, given a target phrase recorded as spoken by the target subject and reference appearance (e.g. an image of the subject), the system will provide videos of the target subject speaking the target phrase.
[0088] With reference to
[0089] The modular design of the system 41 enables the system to be operable in several configurations (modes), for example for online and offline usage. In offline mode photorealism and synchronicity are prioritised whereas online mode aims to achieve light-functionality to support mobile devices and video-streaming. Advantageously, the system 41 may be provided as a service platform, e.g. in combination with a digital platform 270/AI engine 280 as outlined in
[0090] Each module of system 41 comprises a data pathway (data flow) and specialised processing.
[0091] The image processing module 60 is configured to extract a plurality of key descriptive parameters (descriptors) from the reference model of the target face (the ‘reference image’). The descriptive parameters may include characteristic features referred to as landmark points (landmarks) known from Active Shape Models (ASMs), as well as latent descriptors (vectors) representing abstract appearance features (such as colour, texture etc.). ASMs are statistical models of the shape of objects which iteratively deform to fit to an example of the object in a new image. The latent descriptors may be extracted using a pre-trained Deep Convolutional Network (DCN).
[0092] In alternative embodiments, where no reference appearance model is supplied (e.g. as a reference face image), pre-extracted parameters may be used instead, as available. Advantageously, subjective appearance features may thus be separated from general shape features, which are dependent on speech (changing whilst the target face is speaking).
[0093] Historically, the parameters used are the location of key-points such as mouth corners, nose edges, etc. In existing parametric models, such as ASMs, these are compressed with Principal Component Analysis (PCA) to reduce the dimensionality and standardize representations. The PCA-encoded features can then be clustered into distinct modes (i.e. most frequent/dense distributions). These modes of variation capture common expressions and poses. The advantages of this approach are efficiency and relatively low computational time. The disadvantages are that each model is subjective, requiring large amounts of very similar data for accurate reconstruction, and that rendering new images from point models requires a separate process.
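The PCA compression step described above can be sketched in numpy. The landmark data is randomly generated, and the choices of 68 landmarks and 10 retained modes are conventional illustrative values, not taken from the disclosure:

```python
import numpy as np

# Sketch of PCA compression of landmark models: each face is a flattened
# vector of 2-D landmark coordinates, and a small number of principal
# components capture the main modes of variation (expressions, poses).

rng = np.random.default_rng(0)
n_faces, n_landmarks = 50, 68
X = rng.normal(size=(n_faces, n_landmarks * 2))   # flattened (x, y) points

mean = X.mean(axis=0)
Xc = X - mean
# Principal axes via thin SVD of the centred data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 10                                            # retained modes of variation
codes = Xc @ Vt[:k].T                             # compressed representation
reconstruction = mean + codes @ Vt[:k]            # approximate landmarks back
```

Clustering the `codes` (e.g. with k-means) would then yield the distinct modes mentioned above.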
[0094] Active Appearance Models (AAMs) attempt to resolve this by parametrising texture maps of the image; however, this is a limiting factor. In contrast, the fully data-driven approach common in modern computer vision does not attempt to parameterise the subject model and instead focuses on producing images from the outset. This involves learning how pixels are typically distributed in an image. As such, the features are learned directly from the images and are more abstract—typically in the form of edges and gradients that describe low-level image data. A disadvantage is that these models are highly specific to the training task and may function unpredictably on new data. Further restrictions include a need to fix image resolution.
[0095] The speech processing model 50 receives an input target phrase. The input target phrase may be generated (e.g. by a chatbot backend) using Natural Language Processing. Alternatively, the input target phrase may be specified by a user.
[0096] This input target phrase 90 may be supplied as a text input and/or audio waveform for example. Where no audio recording is available the target phrase may be generated with Text-To-Speech (TTS) software. From the audio waveform, phoneme labels are preferably generated, with a phonetic classifier module 51, at pre-set time intervals—this advantageously provides a phoneme label for each video frame. A phoneme label (also referred to as a phonetic label) is a type of class label indicating fundamental sounds common in speech.
[0097] From the input target phrase, the speech processing model 50 extracts speech features and, optionally, phoneme labels. Speech features are defined as abstract quantifiers of audio information such as, but not limited to, short-time-frequency representations i.e. mel-frequency cepstral coefficients (MFCCs), per-frame local energy, delta coefficients, zero-cross rate etc.
[0098] An avatar rendering module 70 receives the extracted descriptive parameters from the image processing module 60 (which include landmarks) and the extracted speech features and phonetic labels from the speech processing module 50. The avatar rendering module 70 comprises a point model sequencer 71 which receives the descriptive parameters (point model) from the image processing module 60 and the extracted speech features and phonetic labels from the speech processing module 50.
[0099] The point model sequencer 71 preferably uses a recursive model (‘pose-point model’) to generate a sequence of landmarks giving the face position and pose at each time interval of the avatar video. A ‘pose’ refers to both the high-level positional information i.e. gaze direction, head alignment, as well as capturing specific facial features and expression. The recursive model is preferably based on Long Short-Term Memory networks (LSTMs), which are known as a special type of recurrent neural networks comprising internal contextual state cells that act as long-term or short-term memory cells. The output of the LSTM network is modulated by the state of these cells. This is an advantageous property when the prediction of the neural network is to depend on the historical context of inputs, rather than only on the very last input.
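The gating behaviour that gives LSTMs their long- and short-term memory can be sketched in numpy. Dimensions and random weights are illustrative; this is a generic LSTM cell, not the patented pose-point model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: x is the input, (h, c) the short/long-term state.
    W, U, b hold the stacked input, forget, cell and output gate weights."""
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input/forget/output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c + i * g                              # long-term memory cell
    h = o * np.tanh(c)                             # gated short-term output
    return h, c

rng = np.random.default_rng(1)
d_in, d_h = 8, 16                                  # e.g. speech features -> pose state
W = rng.normal(scale=0.1, size=(4 * d_h, d_in))
U = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(20, d_in)):              # a 20-frame input sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

Because `c` is carried forward and only selectively overwritten, the output after the loop depends on the whole history of inputs, which is the property the paragraph above highlights.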
[0100] The avatar rendering module 70 further comprises a frame generating model 72 (‘frame generator’) which receives the output of the point model sequencer 71, that is, the sequence of landmarks giving the face position and pose at each time interval of the avatar video—additionally we colour code high level semantic regions such as lips, eyes, hair etc. The frame generator renders these into full frames using a specialised style-transfer architecture (as will be described below with reference to
[0101] System 41 further comprises a post-processing and video sequencer module 80 which receives the generated frames from the frame generator 72 of the avatar rendering module 70. Following ‘light’ post-processing such as image and temporal smoothing, colour correction, etc, module 80 encodes these frames together with a target audio input into an avatar video. The target audio input provided to the module 80 may be supplied or generated. In an example, the ‘Text-To-Speech’ capability of the speech processing module 50 is used to supply the target audio input to the module 80.
[0102] Turning to
[0103] At step 630, a landmark detector DCN extracts landmark points (landmarks) from the image output at step 620, which represent key parameters. This provides the point model to be input to the point model sequencer 71 of the avatar rendering module 70.
[0104] Separately (in parallel to step 630), an appearance encoder network is used, at step 640, to encode the image appearance features as an appearance vector. The appearance vector is input to the frame generator module 72 of the avatar rendering module 70.
[0105] Turning to
[0106] At step 510, feature extraction is performed using a speech classification algorithm as shown in
[0107] At step 505, the audio input 90 is first re-sampled, for example by decimation or frequency-based interpolation, to a fixed sample rate of 16 kHz. Following this, the signal is passed through an anti-aliasing filter (e.g. with 8 kHz cut-off). Pre-emphasis is performed, for example with a simple high-pass filter, to amplify the higher frequencies better descriptive of speech. Finally, the signal is RMS-normalised and separated into short time frames synchronised to the video frame rate.
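The pre-processing chain of step 505 can be sketched as follows. Resampling and anti-alias filtering are omitted for brevity, and the 0.97 pre-emphasis coefficient is a conventional choice rather than a value from the disclosure:

```python
import numpy as np

def preprocess(signal, frame_len, hop):
    """Pre-emphasis, RMS normalisation and framing of an audio signal."""
    # Pre-emphasis: simple high-pass y[n] = x[n] - 0.97 * x[n-1]
    emphasised = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # RMS normalisation
    rms = np.sqrt(np.mean(emphasised ** 2))
    normalised = emphasised / (rms + 1e-8)
    # Split into short windows synchronised to the video frame rate
    n_frames = 1 + (len(normalised) - frame_len) // hop
    frames = np.stack([normalised[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return frames

sr = 16000                                               # 16 kHz sample rate
signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)    # 1 s test tone
frames = preprocess(signal, frame_len=640, hop=640)      # 640 samples = one 25 fps frame
```

With a 25 fps video frame rate, 640 samples at 16 kHz correspond exactly to one video frame, which is the synchronisation the paragraph describes.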
[0108] The feature extraction processing involves discrete Fourier transforms on these frames to obtain a spectrogram. The per-frame energy is extracted here. As the frequency is logarithmically scaled, higher frequencies are less impactful and as such can be grouped into energy bands. This is the inspiration behind the mel-cepstral spectrogram, wherein a filter bank is used to group frequencies into increasingly wider bands. This severely reduces dimensionality and increases robustness. The mel-frequencies are then passed through a discrete-cosine-transform (DCT-II) to provide the MFCCs. Post-processing can then be applied per-speaker to transform each feature to a normally distributed variable.
[0109] In this example, the speech classification algorithm is used to extract mel-frequency cepstral coefficient (MFCC) audio features and the time derivatives are linearly approximated with a 2nd order symmetric process. These features are then concatenated, at step 510, to give a local contextual window containing the speech features from time steps either side of the specific frame. This has the benefit of increasing the scope of each frame.
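A minimal numpy sketch of this MFCC pipeline follows: power spectrogram, triangular mel filter bank, log, DCT-II, and the 2nd-order symmetric delta approximation. The 26-filter bank and 13 coefficients are conventional illustrative choices, not values specified in the disclosure:

```python
import numpy as np

def mel(f):  return 2595 * np.log10(1 + f / 700)
def imel(m): return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters grouping FFT bins into increasingly wide mel bands."""
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        if mid > lo:
            fb[i, lo:mid] = np.linspace(0, 1, mid - lo, endpoint=False)
        if hi > mid:
            fb[i, mid:hi] = np.linspace(1, 0, hi - mid, endpoint=False)
    return fb

def mfcc(frames, sr, n_filters=26, n_coeffs=13):
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # power spectrogram
    fb = mel_filterbank(n_filters, frames.shape[1], sr)
    energies = np.log(spectrum @ fb.T + 1e-10)             # banded log energies
    n = np.arange(n_filters)
    # DCT-II matrix: one row per cepstral coefficient
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * n + 1) / (2 * n_filters)))
    return energies @ dct.T

def deltas(feats):
    """2nd-order symmetric linear approximation of the time derivative."""
    padded = np.pad(feats, ((2, 2), (0, 0)), mode="edge")
    return (padded[3:-1] - padded[1:-3] + 2 * (padded[4:] - padded[:-4])) / 10

rng = np.random.default_rng(2)
frames = rng.normal(size=(25, 400))        # 25 pre-processed audio frames
coeffs = mfcc(frames, sr=16000)
d = deltas(coeffs)
```

Concatenating `coeffs` and `d` (plus neighbouring frames) would give the local contextual window of step 510.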
[0110] At step 520, phonetic labels are generated with the phonetic classifier module 51. In an example, a 1D Convolutional Network is used to provide “softmax” classifications of the predicted phoneme. This uses an autoencoder to predict the probability distribution across the phonetic labels for a given set of speech features. In addition, Bayesian inference may be applied by modelling a prior distribution of likely phonemes from the text-annotation to improve performance. At step 530, the output of this network is a sequence of phoneme labels {P0, …, Pt, …, PN} for each video frame interval.
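The softmax classification step can be sketched as follows. The phoneme inventory and the logits (the network's raw output scores) are illustrative:

```python
import numpy as np

# Sketch of per-frame phoneme classification: raw network scores are
# converted to a probability distribution, and the arg-max gives the
# predicted phoneme label for the frame.

PHONEMES = ["sil", "ah", "b", "k", "s"]    # illustrative phoneme set

def softmax(logits):
    z = logits - logits.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([0.2, 2.5, 0.1, -1.0, 0.4])   # one frame's network output
probs = softmax(logits)
label = PHONEMES[int(np.argmax(probs))]
```

A Bayesian prior, as mentioned above, would multiply `probs` element-wise by per-phoneme prior probabilities (and renormalise) before the arg-max.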
[0111] Turning to
[0112] Turning to
[0113] Advantageously, a generalised face discriminator ensures realism. A face-discriminator takes single colour images and detects realism. Furthermore, a temporal coherence network may be used to score the neighbouring frames and pose errors. A temporal discriminator is a 2D convolutional encoder that takes a sequence of grayscale images stacked in the channel axis to score the relative temporal consistency. As such, this detects inconsistent movements between frames.
[0116] The speech recognition module 5000 transforms the audio input into a sequence of descriptors in a multi-stage sequence as exemplified in
[0117] The parametric model module 6000 is a temporal version of the physical models used in AAMs and similar. We estimate both a descriptive physical representation and the temporal dynamics as a function of speech. The process employed by the parametric model 6000 is outlined with reference to
[0118] The parametric model 6000 represents the physical dynamics of speech with a first order Ordinary Differential Equation (ODE). This allows the position of face vertices to change in response to speech. In the data flow an initial estimate is first extracted from a reference image—while not a necessary requirement, it is preferred that the initial image is frontally aligned, well-lit and in a neutral or resting pose. With the speech embeddings from the ASR network, the framewise derivatives for each vertex are estimated such that by adding these derivatives to the current model we arrive at the vertices positions at the next frame. This can be done auto-regressively for arbitrary length sequences at arbitrary frame rates to produce a temporal sequence of face poses and expressions.
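The auto-regressive integration described above can be sketched with a forward-Euler step. A random linear map stands in for the learned derivative network, and the vertex count, embedding size and frame rate are illustrative:

```python
import numpy as np

# Sketch of the first-order ODE dynamics: a derivative estimator maps the
# current vertex positions and the frame's speech embedding to per-vertex
# velocities, which are integrated auto-regressively frame by frame.

rng = np.random.default_rng(3)
n_vertices = 68
A = rng.normal(scale=0.01, size=(n_vertices * 2, n_vertices * 2))
B = rng.normal(scale=0.01, size=(n_vertices * 2, 32))

def derivative(x, s):
    """Stand-in for the learned dx/dt = f(x, speech) network."""
    return A @ x + B @ s

x = rng.normal(size=n_vertices * 2)          # initial estimate from the reference image
speech = rng.normal(size=(40, 32))           # 40 frames of speech embeddings
dt = 1.0 / 25                                # 25 fps frame interval

sequence = [x]
for s in speech:                             # Euler step: X[t+1] = X[t] + dt * f(X[t], s)
    x = x + dt * derivative(x, s)
    sequence.append(x)
sequence = np.stack(sequence)
```

Because each step only needs the previous state and the next embedding, the same loop runs for arbitrary sequence lengths and frame rates, as the paragraph notes.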
[0119] As these physical models do not contain texture maps or high-resolution detail, rendering is done separately, in the frame renderer module 7000 as exemplified in
[0120] It will be appreciated that systems 41, 4100 as described above may be used in stand-alone applications outside healthcare, for example to provide avatars for any virtual environments, video-communications applications, video games, TV productions and advanced man-made user interfaces.
AI Systems and Methods for Interactive Health Care Systems
[0121] The present section describes systems and methods according to aspects of the invention for providing an AI module to be used in the examples of interactive systems and methods provided above, and particularly, in combination with the avatar database.
[0122] The purpose is to create a system architecture and process that can accurately and quickly answer questions posed by the user in natural language. Advantageously, an interactive user interface may therefore be provided to a specialised chatbot that can accurately answer healthcare questions through a realistic avatar. The system may be referred to as an ‘interactive healthcare system’. It will be appreciated, however, that the described systems and methods can also be used in standalone applications outside healthcare. The systems and methods make use of modern machine learning techniques.
[0123] With reference to
[0124] The answer(s) fetched from the database 280 may be presented to the user in the form of an output 380 as an avatar video, a normal video or text, based on availability. Preferably, the output 380 is in the form of concise, relevant answers within a realistic avatar video. The output 380 may be presented in any form, for example on a computer screen, smartphone or tablet.
[0125] Turning to
[0126] The input 100 is then provided to a processing sub-module 281 of the AI module 280. The processing module 281 runs machine and/or deep learning algorithms. Before the input 100 is provided to the machine learning algorithm 281, the input is pre-processed with a pre-processing sub-module 282 (shown in the
[0127] With reference to
[0128] Once pre-processed, the input 100 is then provided to the machine learning algorithm of the processing module 280 for training and prediction. The machine learning algorithm used in this example is “Bi-LSTM”, which represents a combination of Long Short-Term Memory (LSTM) and Bi-directional Recurrent Neural Networks (RNNs). As the name suggests, bi-directional RNNs are trained on both the forward and backward pass of a sequence simultaneously. In comparison, the bi-directional LSTM is similar but also includes internal passing and forget gates, allowing features to pass through long sequences more easily. Bi-LSTMs are a specialised development of artificial neural networks for processing sequence and time-series data. It will be appreciated that the algorithm used will constantly evolve and that other suitable algorithms may be used.
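The bi-directional idea can be sketched by running a recurrent cell over the sequence in both directions and concatenating the hidden states per time step. A plain tanh RNN cell stands in for the LSTM here to keep the sketch short; dimensions and weights are illustrative:

```python
import numpy as np

# Sketch of bi-directional recurrent processing: the same sequence is read
# forwards and backwards, and the two hidden sequences are concatenated so
# each time step sees both past and future context.

rng = np.random.default_rng(4)
d_in, d_h = 6, 8
Wx = rng.normal(scale=0.3, size=(d_h, d_in))
Wh = rng.normal(scale=0.3, size=(d_h, d_h))

def run(seq):
    """Run a simple tanh RNN over a sequence, returning all hidden states."""
    h = np.zeros(d_h)
    out = []
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return np.stack(out)

seq = rng.normal(size=(10, d_in))            # e.g. 10 embedded words
fwd = run(seq)                               # forward pass
bwd = run(seq[::-1])[::-1]                   # backward pass, re-aligned in time
features = np.concatenate([fwd, bwd], axis=1)
```

In a real Bi-LSTM the forward and backward cells have separate learned weights and LSTM gating; this sketch only illustrates the concatenation scheme.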
[0129] A hierarchical set of Bi-LSTM algorithms forms the classification architecture of the processing module 280. The classification system is divided according to the number of categories to be answered. With reference to
[0130] With reference to
[0131] Once the answer is displayed as output 380, the user is requested to provide feedback 385. An exemplary process of providing user feedback is shown in
[0132] To improve the performance of the system, an active learning schema is implemented. An analysis is preferably carried out on the feedback data. For example, the feedback data is ‘yes’ in the case that the user is happy with the results obtained and ‘no’ otherwise. If the feedback data is ‘yes’ then the questions and answers are stored in a retraining database. The retraining database also stores failure cases along with the response for review and model validation. If the feedback is ‘no’, then this is flagged for manual check and then added to the retraining database for algorithm retraining.
Applications and Interpretation
[0133] The foregoing examples and descriptions of embodiments of the present invention as described herewith may be implemented for example in GP triage rooms. However, the foregoing examples and descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, modifications and variations will be apparent to practitioners skilled in the art. In particular, it is envisaged that the search and machine learning principles may be applied to topics outside health care, such as sex education, product marketing and customer support and so on.
[0134] Further, the AI algorithms and avatars may be located on a client computing device. It will be understood, however, that not all of the logic for implementing the AI algorithms and/or avatar needs to be located on the client computing device, and it can be based on one or more server computer systems with a user interface being provided locally on the client computing device. Similarly, logic for implementing the avatar can be stored locally on the client computing device, while the information learned by the system (AI part) can be stored partially or entirely on one or more servers. The specific manner in which the AI algorithms and avatars are respectively hosted is not essential to the disclosure.
[0135] Those skilled in the art will further appreciate that aspects of the invention may be implemented in computing environments with many types of computer system configurations including personal computers, desktop computers, laptops, hand-held devices, multi-processor systems or programmable consumer electronics, mobile telephones, tablets and the like.