SPEECH PROCESSING METHOD FOR IDENTIFYING DATA REPRESENTATIONS FOR USE IN MONITORING OR DIAGNOSIS OF A HEALTH CONDITION
20230371889 · 2023-11-23
Inventors
CPC classification
A61B5/4088
HUMAN NECESSITIES
G16H50/20
PHYSICS
A61B5/4803
HUMAN NECESSITIES
A61B5/7275
HUMAN NECESSITIES
G10L17/02
PHYSICS
International classification
A61B5/00
HUMAN NECESSITIES
G16H50/20
PHYSICS
G10L17/02
PHYSICS
Abstract
The invention relates to a computer-implemented method for identifying clinically meaningful representations of speech data for monitoring or diagnosis of a health condition, the method comprising: providing a main model comprising a trained neural network, trained to map an input representation encoding input speech data from a speaker to an output representation for use in providing a health condition prediction, the neural network comprising one or more internal network layers each comprising an internal representation which is passed to a subsequent network layer; inputting speech data from a speaker into the main model to form the internal representations of the input speech data; training a probe comprising a machine learning model, independently of the training of the main model, to map an internal representation of the input speech data from an internal network layer of the main model to an independently determined measure of a clinically relevant feature of the input speech data or the speaker, where a clinically relevant feature is a property of the input speech or speaker that is impacted by a health condition. By training a probe externally to the main model to map an internal representation to an independently determined measure of a clinically relevant feature, it is possible to identify associations within the internal representations that otherwise might not be found by the main model, and to build improved representations based on these associations.
Claims
1. A computer-implemented method for identifying speech data representations for monitoring or diagnosis of a health condition, the method comprising: providing a main model comprising a trained neural network, trained to map an input representation encoding input speech data to an output representation for use in providing a health condition prediction, the neural network comprising one or more internal network layers each comprising a representation of the speech data which is passed to a subsequent network layer of the neural network, where the representations of the internal network layers are referred to as internal representations of the trained neural network; inputting speech data from a speaker into the main model to form the internal representations of the input speech data; and training a probe comprising a machine learning model, independently of the training of the main model, to map an internal representation of the input speech data to a measure of a clinically relevant feature of the input speech data or the speaker, where a clinically relevant feature is a property of the input speech or speaker that is impacted by a health condition.
2. The computer-implemented method of claim 1 wherein training the probe model independently of the training of the main model comprises: fixing the main model after training and, in a separate training task, training the probe to map a fixed internal representation of the input speech data to the independently determined measure of a clinically relevant feature of the input speech data or the speaker.
3. The computer-implemented method of claim 1 wherein the main model is trained to map an input representation encoding input speech data to a health condition prediction.
4. The computer-implemented method of claim 1 wherein the measure of the clinically relevant feature of the input speech data or the speaker is determined independently of the main model.
5. The computer-implemented method of claim 1 comprising using the trained probe model to identify elements of the internal representation that: encode more information usable by the probe for predicting the clinically relevant feature relative to the remaining elements of the representation or other internal representations; and/or decouple from the remaining elements of the internal representation in predicting the clinically relevant feature.
6. The computer-implemented method of claim 5 wherein the elements of the internal representation are identified according to parameters of the machine learning model of the probe learnt during training, wherein the parameters preferably comprise one or more of weights, biases and activations learnt by the machine learning model of the probe.
7. The computer-implemented method of claim 5 wherein the identified elements are used to form speech data representations which are invariant to one or more of: speaker identity, speaker age, speaker gender.
8. The computer-implemented method of claim 5 wherein the main model has a plurality of internal network layers, the method comprising: training a probe for each of a plurality of the internal network layers to map the corresponding internal representation to the measure of the clinically relevant feature of the input speech data; and selecting one or more layers according to one or more of: (1) the accuracy of the prediction of the clinically relevant feature provided by the internal representation of the layer; (2) the degree to which certain elements of the layer decouple from remaining elements of the layer in making the prediction; (3) the size or complexity of the probe model required to provide a given prediction accuracy; (4) the amount of input speech data needed to train the probe to achieve a given prediction accuracy; and (5) the minimum amount of data per example to perform the task.
9. The computer-implemented method of claim 5 wherein the main model comprises a supervised, unsupervised, self-supervised or semi-supervised model for making a health condition prediction, the method further comprising: inputting the identified elements of the internal representation into a machine learning model to determine a prediction of the health condition based solely on the identified elements associated with the clinically relevant feature.
10. The computer-implemented method of claim 1 wherein the probe comprises a linear model, multi-layer perceptron, an attention-based model or a Bayesian neural network.
11. The computer-implemented method of claim 1 wherein the method comprises: fixing the main model once trained; and subsequently training the probe model to map an internal representation of an internal network layer of the fixed main model to the independently determined measure of a clinically relevant feature.
12. The computer-implemented method of claim 1 wherein training the probe comprises: performing a principal components analysis on the internal representation of an internal network layer to provide a disentangled internal representation; and training the machine learning model of the probe to map the disentangled internal representation to the independently determined measure of a clinically relevant feature.
13. The computer-implemented method of claim 1 wherein the clinically relevant feature comprises one or more of: an objective property of the input speech, preferably a phonological, prosodic, lexico-semantic or syntactic property; a property of the speaker, preferably the speaker's score on a neuropsychological test; or a clinician's rating of the speech or speaker.
14. The computer-implemented method of claim 1 wherein providing the trained main model comprises: pre-training the main model, preferably using an unsupervised learning task on an unlabelled training data set; and performing task specific training on the pre-trained main model using a second training data set with labels associated with a specific health monitoring or diagnosis task, to provide the trained main model.
15. The computer-implemented method of claim 1 wherein the main model is trained using a loss function configured so as to encourage the model to learn disentangled internal representations.
16. The computer-implemented method of claim 1 wherein the main model comprises a classifier or regression model trained to provide a health condition prediction based on the input representation of the input speech data, the method comprising: obtaining a measure of a plurality of clinically relevant features, each clinically relevant feature comprising a property of the speech or speaker which is impacted by the health condition predicted by the main model; and for each clinically relevant feature: applying a separate probe to each of a plurality of the internal network layers of the main model, and training all probes independently to map the corresponding internal representation to the measure of the clinically relevant feature; identifying one or more network layers by training a probe for each of a plurality of the internal network layers to map the corresponding internal representation to the measure of the clinically relevant feature of the input speech data, and selecting one or more layers according to one or more of: (1) the accuracy of the prediction of the clinically relevant feature provided by the internal representation of the layer; (2) the degree to which certain elements of the layer decouple from remaining elements of the layer in making the prediction; (3) the size or complexity of the probe model required to provide a given prediction accuracy; (4) the amount of input speech data needed to train the probe to achieve a given prediction accuracy; and (5) the minimum amount of data per example to perform the task; selecting elements of the corresponding internal representations of the selected network layers that encode more information usable by the probe for predicting the clinically relevant feature relative to the remaining elements of the representation or other internal representations, and/or that decouple from the remaining elements of the internal representation in predicting the clinically relevant feature; and combining the selected elements into one or more vectors.
17. The computer-implemented method of claim 16 further comprising: encoding input speech data into the one or more vectors; and inputting the vectors into the main model or another machine learning model to provide a health condition prediction.
18. The computer-implemented method of claim 1 wherein the health condition is related to one or more of a cognitive or neurodegenerative disease, motor disorder, affective disorder, neurobehavioral condition, head injury or stroke.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
Overview
[0063] The invention relates to a method for probing the internal network layers of a trained clinical predictive model, to obtain additional information on why the network makes a particular health condition prediction and to identify new data representations for encoding speech data, not found by the model, which contain further clinically relevant information usable as biomarkers for monitoring or diagnosis of a health condition.
[0065] The input representation preferably comprises a feature vector, i.e. a vector encoding the input speech data into a format usable by the main model. At each layer of the neural network the received representation undergoes transformation by the application of the weights and activations at each node of the layer such that each layer 10 outputs a representation R, which is a transformed representation of the previous layer, to the subsequent network layer. By training the model to make a health condition prediction, the model 100 learns to adjust the parameters applied to the representations, such as the weights and activations, at each layer so that the input representation is progressively transformed through a series of internal representations to finally reach an output representation encoding information within the input speech data that is associated with the particular health condition and can be used to make the prediction.
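The layer-by-layer transformation described above can be sketched in miniature. The following is an illustrative toy example, not the patent's model: a small fully connected network whose forward pass records the representation R that each layer passes onward, so the internals remain available for later probing (all names, weights and shapes are hypothetical).

```python
def relu(v):
    # Element-wise rectified linear activation
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    # One fully connected layer: out_j = sum_i W[j][i] * v_i + b_j
    return [sum(wji * vi for wji, vi in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def forward_with_internals(x, layers):
    """Apply each (W, b) layer in turn, recording the internal
    representation that each layer passes to the next."""
    internals = []
    rep = x
    for W, b in layers:
        rep = relu(dense(rep, W, b))
        internals.append(rep)
    return rep, internals

# Two tiny layers transforming a 3-element input representation: 3 -> 2 -> 2
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
]
out, internals = forward_with_internals([1.0, 2.0, 3.0], layers)
```

In a real model the final `out` would feed a prediction layer, while each entry of `internals` is a candidate representation for a probe to examine.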
[0066] In the example of
[0067] In prior art methods, generally the content of the representations of the internal network layers is unknown and the model is understood to be working effectively if it provides a reliable prediction based on labelled training data. However, it is often not certain why the model is making a certain prediction—which part of the complex information within the input speech data the model is using—and whether all of the rich information within the input speech is being utilised, particularly without any additional clinical understanding being provided to the model other than the target output.
[0069] The probe model 30 comprises a machine learning model, such as a simple classifier or regression model, which is trained in an adjacent task, separately from the main model, to map an internal representation Rn of an internal network layer 13 of the main model 100 to the clinically relevant feature of the input speech data 1 or the speaker. In some examples, as described below, a disentangling step may be performed on the internal representations, with the probe model trained on the disentangled representations. In other examples, the main model may be configured to promote disentangling of representations, for example by appropriately configuring the loss function during training of the main model.
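As a concrete, hedged illustration of this adjacent training task, the sketch below fits a least-squares linear probe to frozen representations by stochastic gradient descent. The data, learning rate, and the premise that element 0 carries the clinically relevant signal are all invented for illustration; the patent does not prescribe this particular probe.

```python
def train_linear_probe(reps, targets, lr=0.05, epochs=500):
    """Fit a linear probe pred = w.x + b to fixed (frozen) internal
    representations; only the probe's own parameters are updated."""
    w = [0.0] * len(reps[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(reps, targets):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y
            # Gradient step on squared error, for this one example
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Frozen representations: element 0 tracks the clinical measure exactly,
# element 1 is noise, so the probe should weight element 0 most heavily.
reps = [[0.0, 0.3], [1.0, 0.1], [2.0, 0.2], [3.0, 0.05]]
targets = [0.0, 1.0, 2.0, 3.0]  # independently determined measure
w, b = train_linear_probe(reps, targets)
```

After training, `w` plays the role of the learnt probe parameters from which the informative representation elements can later be read off.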
[0070] A separate probe model may be trained for each clinically relevant feature to which the internal representations are mapped. As described in more detail in the specific example below, the clinically relevant features are properties of the speech or speaker which are impacted by a health condition, and these may be grouped into “perceptual domains” which define groups of measures associated with a particular characteristic of the speech or speaker. Examples of domains include prosody, syntactic complexity and episodic memory.
[0071] In some examples, such as that illustrated in
[0072] By training the probe model 30 to predict the measures of the syntactic complexity from the internal representation R.sub.1, the probe model learns which elements of the internal representation R.sub.1 are the best predictors of the syntactic complexity vector. For example, the weights and activations that the probe P1 learns to apply to the representation R.sub.1 can be used to determine which elements 31 of the representation R.sub.1 are given most weight by the probe in determining a prediction of the syntactic complexity measures. It is these representation elements 31 that encode the most relevant information for predicting that particular clinically relevant feature.
[0073] The elements 31 identified by the probe P1 can be used to form a vector which encodes syntactic complexity information of the input speech. The main model may not rely on this syntactic complexity information to make its prediction but this information can now be fed into the main model to improve performance in making the health condition prediction.
[0074] This vector provides a new data representation for making a health condition prediction. For example, it can be used to encode input speech data to provide an Alzheimer's diagnosis solely on the basis of (in this illustrative example) syntactic complexity. By forming these new data representations for a number of important clinical domains, the method allows a clinician to understand the influence of the different clinical domains on the health condition prediction and so better diagnose a patient. In particular, the method provides a more complete diagnosis since it provides a measure of the contribution of the different domains to the overall Alzheimer's diagnosis. This gives more granular information on how a patient is affected by a particular health condition and so can be used to better diagnose patients, to build better predictive models and to devise better treatment plans focusing on the particular domains most affected, as will be described.
Main Model Structure and Training
[0075] The main model may be any neural network trained to map an input representation encoding speech data to an output representation for use in a health condition prediction. The speech data may include text and/or audio data of speech but preferably includes both the linguistic and acoustic content of a passage of speech. The input representation encodes linguistic, i.e. language features and/or acoustic speech information. Again, preferably the input representation encodes both linguistic and acoustic information to benefit from the full range of information available within the speech data.
[0076] In some examples the input representation may comprise selected features, extracted from the input speech. For example, features with known clinical rationale may be extracted from the input speech so as to impart additional clinical knowledge to the model. For example, given the noun rate is known to be an indicator for early Alzheimer's, the noun rate could be selected as an input feature within the input representation such that the main model does not have to learn this association during training.
[0077] In other preferable examples the main model may be a representation learning model, where features are not extracted manually but learnt in the process of training the model. An input representation, preferably comprising text and audio representations, is used to encode the raw speech data into a suitable format for processing and the model is trained to transform the input representation into an output representation which can be used by a prediction layer to provide a health condition prediction. By training the model end to end, the model learns to transform the input representation into an appropriate output representation for providing the health condition prediction.
[0078] Both feature based and representation learning models can be trained for use as the main model. Particularly advantageous model structures and training methods are described in the applicant's earlier European Patent Application number 20185364.2.
[0079] As described in the above mentioned patent application, the training may preferably take place in two stages. The first stage may comprise “pre-training” the model on large unlabelled data sets using unsupervised (or more specifically self-supervised) training, in which one or more parts of the input representation are masked or corrupted and the model is trained to predict the masked or corrupted representations, thereby learning internal representations which encode associations between the text and audio data usable to predict the masked audio or text representations. Because pre-training uses more widely available unlabelled speech data sets, it can be used to initialise the representations into a form which encodes general-purpose information from the speech data that is usable in a subsequent health condition prediction.
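A minimal sketch of how such self-supervised examples could be constructed from unlabelled data, assuming a simple per-element masking scheme (the patent does not fix the masking strategy; `MASK` and the function name are hypothetical):

```python
MASK = None  # sentinel standing in for a mask token

def make_masked_examples(sequence):
    """Build (masked_sequence, position, target) triples: the model is
    trained to predict the masked element from the surrounding context,
    so no labels are required."""
    examples = []
    for pos in range(len(sequence)):
        masked = list(sequence)
        target = masked[pos]
        masked[pos] = MASK
        examples.append((masked, pos, target))
    return examples

# A toy 3-element speech-feature sequence yields three training triples
examples = make_masked_examples([0.2, 0.7, 0.1])
```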
[0080] The second stage may comprise task-specific fine-tuning, in which the pre-trained model is fine-tuned using a smaller labelled data set for a particular health prediction task. Fine-tuning involves encoding the labelled speech data into the input representation, adding a prediction layer 15 and training the model to map the input representation to the target health condition prediction, such that the representations learnt by the model are further optimised for the particular health prediction task.
[0081] After training of the main model, the model, and its representations, are frozen and no further changes to the model take place. The probe models are then trained using the fixed internal representations of the main model.
[0082] When the two stage training method of the main model is used, the probes may be trained on the pre-trained or fine-tuned model, although the methods of the present invention are preferably applied to the fine-tuned model to gain further information on the internal structure of the model relevant to the health condition prediction task of the fine-tuning step.
[0083] This two-stage training strategy is advantageous because it utilises more widely available unlabelled data sets to train the model and learn representations which encode information on the context of linguistic and acoustic features of language. The representations formed during this process therefore encode a large amount of general information on speech and language which can be utilised when fine-tuning on the smaller labelled clinical data sets. However, the fact that labelled clinical data sets are limited means that there is likely to be a large amount of useful information in the pre-trained representations which is not utilised by the main model when learning to make a health condition prediction during fine-tuning. The method of the present invention can be utilised to find associations within the data representations which are not being utilised by the main model, to further improve its performance.
Probe Model Structure and Training
[0084] The probe model may comprise any type of machine learning model which can be trained to predict a measure of a clinically relevant feature of the input speech or speaker based on a speech data representation. The probe preferably comprises a linear model, multi-layer perceptron, an attention-based model or a Bayesian neural network and is preferably simple such that it does not internally learn to do the task in a sophisticated way.
[0085] The probe is trained to predict the clinically relevant feature of the input speech using an internal representation encoding the input speech within an internal layer of the main model, thereby learning associations within the internal representations which might not be learnt by the main model.
[0086] The probe model can be used to identify elements 31 of an internal representation R.sub.input which can be used to provide a prediction of the clinically relevant feature in a number of different ways. The elements can be selected based on those which provide the most accurate prediction or the elements can be selected based on those which require the simplest probe model structure or minimum amount of training data to provide a prediction of a given accuracy.
Specific Example of a Method According to the Present Invention
[0087] A specific example of the method according to the present invention is illustrated in
[0088] Step 1: Train the predictive model on the primary classification/regression task using a neural architecture and freeze the layers.
[0089] As described above, a main model, comprising a neural network, is trained on a primary health condition prediction task. Again, for the purpose of this illustrative example, the task is an Alzheimer's diagnosis classification task, although it could be any predictive task for monitoring or diagnosis of a health condition which potentially causes detectable changes in the speech of a patient.
[0090] The raw speech data is encoded into the input representation R.sub.input for processing. Preferably the input representation comprises audio representations encoding acoustic information of the raw speech data and linguistic representations encoding linguistic information of the input speech data. In certain preferable embodiments the input representations are combined audio-linguistic representation encoding the interrelation between the linguistic and acoustic information within the patient speech data. A method for forming such a combined audio-linguistic representation is described in European Patent Application number 20185364.2. In other examples the input representation might include solely audio, solely text or non-combined audio and text representations.
[0091] In this example, the model is trained on labelled speech data to predict the Alzheimer's diagnosis. Each subsequent layer learns a further transformed version of the input representation, with the final representation R.sub.output of the output layer usable by the classification layer 15 to provide the diagnosis.
[0092] After training the layers 10 of the model are fixed and no further changes take place in the further steps of the method.
[0093] Step 2: Define a set of feature domains associated with the health condition.
[0094] These “perceptual domains” are characteristics of the speech or speaker which are related to the health condition. They should be as clinically meaningful, separable and comprehensive as possible.
[0095] Each domain relates to a characteristic of the speech or speaker which is influenced by the health condition and can be measured or estimated in one or more ways. For example, for Alzheimer's disease the perceptual domains might include phonation, articulation, prosody, affect, memory and syntactic complexity. Each of these characteristics of the speech or speaker changes in a patient with Alzheimer's disease, and the associated information may or may not be learnt in the process of training the main model.
[0096] Step 3: For each perceptual domain, define one or more constituent features of the speech, within that domain, that can be measured or estimated.
[0097] The features may be objective measures of the input speech or they may be human-rated, possibly more subjective features. The objective measures, for example the noun rate, may be derived automatically from the input speech using automated speech recognition methods. Other features, such as the human-rated scores may need to be assessed independently so that the training data set includes these measures of the speaker or speech.
[0098] For example in the case of the syntactic complexity domain, the objective automated measures of the speech may include the noun rate, the ratio of dependent clauses to T-units, mean length of clauses, number of verb phrases per T-unit etc., all of which may be derived automatically from the input audio and/or text data. The human-rated measures of syntactic complexity may include a human rating of syntactic complexity of the input speech, which would need to be assessed independently.
[0099] In contrast, for the episodic memory domain, the measures are generally obtained by way of a neuropsychological test on the speaker, for example to provide a score for verbal episodic memory and a score for visual episodic memory.
[0100] Step 4: Apply one probe model to every layer in the trained main model for every feature in every perceptual domain and train all probes independently.
[0101] The probe model comprises a machine learning model but may take a number of different forms. Preferably it is a simple linear classifier or regression model, or an attention-based model. The probe models are simple models such that the probe cannot learn to do the task in a sophisticated way but instead simply learns the elements of the internal representation that can be used to predict the clinically relevant feature.
[0102] The probe models may be trained on the same speech data set used to train the primary prediction task of the main model or on a separate speech data set. The model training data is fed into the main model to get the internal representations of the training data and each probe is trained to predict the corresponding measure of the clinically relevant feature of the training data from the internal representations of the network layer to which it is applied.
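Step 4 amounts to a grid of independent probes, one per (layer, domain, feature) combination. The sketch below shows only the bookkeeping; `fit_probe` is a stand-in for any actual probe-training routine, and all dictionary keys and values are invented:

```python
def fit_probe(layer_reps, feature_values):
    # Placeholder for a real probe-training routine; here it just
    # records the shape of the data it was fitted on.
    return {"n_examples": len(layer_reps), "dim": len(layer_reps[0])}

def train_probe_grid(reps_per_layer, domain_features):
    """reps_per_layer: {layer: [internal representation per example]}
    domain_features: {domain: {feature: [measured value per example]}}
    Returns one independently trained probe per (layer, domain, feature)."""
    probes = {}
    for layer, reps in reps_per_layer.items():
        for domain, features in domain_features.items():
            for fname, values in features.items():
                probes[(layer, domain, fname)] = fit_probe(reps, values)
    return probes

# Two layers x one domain with two features -> four independent probes
reps_per_layer = {"layer_1": [[0.1, 0.2]], "layer_2": [[0.3, 0.4]]}
domain_features = {"syntax": {"noun_rate": [0.5], "clause_length": [7.0]}}
probes = train_probe_grid(reps_per_layer, domain_features)
```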
[0103] The illustrative example of
[0104] Step 5: For each perceptual domain, find the layer at which its features overall can be predicted the best and in the most disentangled way.
[0105] For each domain, a separate probe is trained to predict the constituent features of that domain based on the internal representations of each network layer. The one or more internal network layers may be selected based on one or more of: (1) the accuracy of the prediction of the clinical feature provided by the internal representation of the layer; (2) the disentanglement of the elements of the layer; (3) the “amount of effort” required by the probe to achieve the prediction, in particular (i) the size of the probing model, and/or (ii) the amount of data needed to achieve a required prediction accuracy.
[0106] The representations are disentangled where certain elements of the representation decouple from the remaining elements and contribute much more strongly to the prediction of the clinically relevant feature. In this situation the clinical information usable to provide the prediction is encoded in a selection of well-defined sub-elements of the representations.
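Assuming each trained probe reports a validation score, the layer-selection part of Step 5 reduces to a comparison such as the following (the layer names and scores are made up for illustration):

```python
def select_best_layer(probe_scores):
    """probe_scores: {layer_name: prediction accuracy of that layer's
    probe for a given domain}. Returns the best-predicting layer."""
    return max(probe_scores, key=probe_scores.get)

scores = {"layer_1": 0.61, "layer_7": 0.83, "layer_12": 0.74}
best = select_best_layer(scores)  # -> "layer_7"
```

A fuller implementation could combine this accuracy criterion with the disentanglement and probe-effort criteria listed above, e.g. as a weighted ranking.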
[0107] Step 6: Use the internal parameters learnt by each probe model to identify elements of the probed representation that are being used by the probe to predict the domain features.
[0108] During training of the probe model, the probe model adjusts various internal parameters in order to learn how to map the internal representation to the feature value. The internal parameters of the probe may include neuron weights, biases and activations. For example, the probe may be a simple neural network which learns the magnitude of the weight to apply to each element of the representation in order to provide the best prediction of the corresponding feature of the input speech. The learnt weights therefore indicate the elements of the representation which encode the most relevant information usable by the probe in making the prediction.
[0109] The probe weights can be thresholded to define the significance level at which the representation elements should be identified as being linked to the corresponding clinical domain probed by the probe model.
[0110] As illustrated in
[0111] The probe weights (and/or activations) are therefore thresholded to select only those representation elements which provide the most significant contribution to the prediction, as determined by the selected threshold. As shown in
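The thresholding described above might look like the following sketch, where the weight values and the threshold are illustrative assumptions rather than values prescribed by the method:

```python
def select_elements(probe_weights, threshold):
    """Return the indices of representation elements whose learnt probe
    weight magnitude exceeds the significance threshold."""
    return [i for i, w in enumerate(probe_weights) if abs(w) > threshold]

# Hypothetical learnt probe weights over a 5-element representation
weights = [0.02, 0.91, -0.74, 0.05, 0.33]
elements = select_elements(weights, threshold=0.3)  # -> [1, 2, 4]
```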
[0112] Therefore after training the probes, a set of representation elements or “features” of the input speech is identified for each domain. The representation elements for each domain may come from a single layer or may be selected from multiple layers, where individual elements across layers are found to best predict the domain features. In some examples, certain vector elements may be shared between domains.
[0113] Each set of representation elements corresponding to a particular domain may be extracted from the network and combined into a domain vector. As shown in
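Gathering the identified elements, possibly drawn from several layers, into a single domain vector can be sketched as follows (layer names, indices and values are hypothetical):

```python
def build_domain_vector(internal_reps, selected):
    """internal_reps: {layer_name: internal representation for one input}
    selected: list of (layer_name, element_index) pairs identified by the
    probes for one perceptual domain. Returns the combined domain vector."""
    return [internal_reps[layer][idx] for layer, idx in selected]

# Elements identified for a 'syntactic complexity' domain across two layers
reps = {"layer_3": [0.1, 0.9, 0.4], "layer_7": [0.8, 0.2]}
syntax_vector = build_domain_vector(reps, [("layer_3", 1), ("layer_7", 0)])
# -> [0.9, 0.8]
```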
[0114] These domain vectors 32, 42, 52 output from the method may be used in a number of ways. Importantly they provide information on the impact of that domain in the main model reaching the health condition prediction, in this case Alzheimer's, but they also provide data representations which can be used to encode input speech data for use in a new model, imparting greater clinical understanding into a predictive model and reducing the learning that the model must do, allowing for improved predictive performance with smaller data sets, as explained further below.
[0115] The following optional steps illustrate how the domain vectors can be used to perform a number of additional tasks and provide additional outputs.
[0116] Step 7: Perform a prediction on the main task using the perceptual domain vector.
[0117] As shown in
[0118] By inputting each of the domain vectors into a corresponding classifier it is possible to determine a component of the diagnosis corresponding to each domain. This set of scores ‘explains’ the overall Alzheimer's diagnosis, providing information on which aspects of the input speech are most indicative of an Alzheimer's diagnosis. This information is of significant value in better understanding a particular health condition, its symptoms and how it affects speech. This output can also be used to help inform the building of better, more accurate predictive models.
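A hedged sketch of this per-domain scoring: each domain vector feeds its own small classifier, and the resulting scores decompose the overall prediction. The mean-of-elements "classifiers" below are placeholders for real trained models, and the vectors are invented:

```python
def domain_scores(domain_vectors, classifiers):
    """Apply each domain's classifier to its domain vector, yielding one
    diagnostic component score per perceptual domain."""
    return {d: clf(domain_vectors[d]) for d, clf in classifiers.items()}

vectors = {"prosody": [0.2, 0.4], "memory": [0.9, 0.7]}
classifiers = {
    "prosody": lambda v: sum(v) / len(v),  # stand-in scoring functions
    "memory": lambda v: sum(v) / len(v),
}
scores = domain_scores(vectors, classifiers)
# prosody component ~0.3, memory component ~0.8
```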
[0119] As explained above, each domain vector also forms a newly identified data representation 32, 42, 52 for input speech that can be used as additional input to a diagnostic model.
[0120] Step 8: Form lower-dimensional representations of the domain vectors.
[0121] Alternatively or additionally, dimensionality reduction 35, 45, 55 may be performed on the domain vectors 32, 42, 52 to provide a reduced dimension domain vector 36, 46, 56. These can be used as the input to a classification or regression model, reducing the computational requirement in order to provide a diagnosis. This can also preserve more general information and help create disentangled, potentially de-identified representations of the input speech.
[0122] The output products of the method after performing the additional steps 7 and 8 are shown in
[0123] These products provide a diagnostic kit that can be utilised by a clinician to provide a more accurate and complete Alzheimer's diagnosis. In particular the contribution of the different domains to the overall diagnosis could inform the clinician as to how advanced the Alzheimer's disease is, and indicate the severity of different symptoms to understand the ways in which it is affecting the patient. This understanding, and the more complete picture of the effects of the disease on a particular patient, can inform the treatment and care plan.
[0124] The output products shown in
[0125] If enough perceptual domains are used, which together provide sufficient granularity, the vectors formed from the identified speech data representations can replace the general speech data representations used as the input representation R.sub.input. That is, patient speech to be tested can be encoded directly into the “combined domain vector” (formed from the representation elements identified by each probe) and this can be used as the input into a predictive model to provide a health condition prediction.
[0126] Using a vector formed by the domain probes in this way has a number of advantages. Firstly, it can provide a reduced dimension representation compared to a general speech representation, reducing the computational requirement for training. The vectors can thus provide more efficient data representations which encode just the relevant clinical domain data for making a particular diagnosis. This shares the advantages of feature-based methods, in which features are extracted and placed into a vector, but provides additional advantages: it utilises the work of the main model in pre-forming more compact data representations during training of the main task, and it allows for a wider range of clinically relevant features to be used, including features of the speaker and human-based ratings such as neuropsychological tests.
[0127] Importantly, the vectors prepared from the representation elements identified by the domain probes can be prepared such that they are de-identified from the original speaker. In particular, the vectors prepared using probes for predicting clinical features in this way can select representations which are invariant to nuisance variables such as speaker gender, age or identity. Therefore the method can provide speech data representations which are de-identified from the original speaker. De-identified representations are particularly desirable as they mean patient data can be anonymised prior to testing to meet patient data privacy regulations.
[0128] By encoding patient speech data in the de-identified vectors formed by the domain probes, patient data can be stored for analysis in anonymised form, unlike general speech data representations from which the speaker identity can be determined.
Additional Disentanglement Steps
[0129] In certain preferable examples of the method according to the present invention, additional steps may be provided as part of the probing process to improve disentanglement of the internal representations of the main model. In particular, ideally the probe will identify a small number of representation elements which decouple from the remaining elements to encode the majority of the information relevant to a particular domain. In this way, a compact domain vector may be formed of relatively few elements which encode the vast majority of the relevant information to predict the features of that domain. To further promote the learning of disentangled representations, one or more additional steps may be taken.
[0130] A first option is to improve disentanglement of the representations learned by the main model by adapting the model structure and/or training strategy. In particular the loss function used in training the main model may be adapted to promote the learning of disentangled representations. For example the model may be a beta-VAE model as described in Higgins, I. et al. “beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.” ICLR (2017). In this way, the trained main model will have sufficiently disentangled internal representations.
[0131] A second option is to carry out an additional intermediate step to perform disentanglement on the internal representations prior to application of the probe models, as illustrated in
[0132] After training of the main model, the layers and their constituent representations are fixed. Input speech data is input into the fixed main model to encode the speech data into the internal representations, R.sub.1 and R.sub.2. A principal component analysis (PCA) is then performed on the elements of each internal representation to form a corresponding disentangled representation R.sub.1*, R.sub.2*, formed of a smaller number of elements. This enhances disentanglement of the representations such that the information for a particular domain is encoded predominantly in a small number of representation elements in a transformed, reduced-dimension vector space. Performing PCA on the representation elements of the internal representations therefore promotes the formation of a reduced number of disentangled vector elements, which form the disentangled representations R.sub.1*, R.sub.2*.
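The PCA step above can be sketched in a few lines of numpy (an illustrative sketch only; the dimensions and data are hypothetical): the internal representation is centred, its principal directions are found by singular value decomposition, and the representation is projected onto the top components to form R.sub.1*.

```python
import numpy as np

def pca_disentangle(reps, n_components):
    """Project internal representations onto their top principal
    components to form a reduced, more disentangled representation R*."""
    centred = reps - reps.mean(axis=0)
    # SVD gives the principal directions in Vt; project onto the top ones.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

rng = np.random.default_rng(2)
r1 = rng.normal(size=(50, 32))      # internal representation R1 (toy data)
r1_star = pca_disentangle(r1, 8)    # disentangled representation R1*
```

The components are ordered by explained variance, so the leading elements of R.sub.1* carry the most information from the original representation.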
[0133] The method then continues as above from Step 4, with each probe model applied to predict the clinically relevant feature from the disentangled representations R.sub.1*, R.sub.2*. As before, the probe model learns which of the elements 31, 41, 51 of the disentangled representations R.sub.1*, R.sub.2* provide the most accurate prediction of the feature, and these are selected to form the domain vectors.
[0134] In examples incorporating a disentanglement step, such as PCA, the disentangling step is considered part of the probe. In other words the step “training a probe comprising a machine learning model to map an internal representation . . . ” comprises (1) performing disentanglement on the internal representation to provide a disentangled representation and (2) training the machine learning model of the probe to map the disentangled representation to the independently determined measure of a clinically relevant feature associated with a particular domain.
[0135] As before, the elements 31, 41, 51 of the disentangled representations R.sub.1*, R.sub.2* may be selected by the probe based on the weights and/or activations learnt by the probe model. As above, the internal network layers may be selected based on one or more of (1) the accuracy of the prediction of the clinical feature provided by the internal representation of the layer; (2) the disentanglement of the elements of the layer; (3) the “amount of effort” required by the probe to achieve the prediction, in particular (i) the size of the probing model, and/or (ii) the amount of data needed to achieve a given prediction quality.
Quantifying the Relevant Information Encoded within the Representations
[0136] Furthermore, and importantly, probing can provide a quantifiable measure of the success of predicting a particular clinically relevant feature. Therefore when the method is applied in a health condition prediction application, this quantifiable probing technique can provide a quantified measure of the internal representations' success in encoding the relevant speech or speaker property, which can be provided as an output to a user.
[0137] To quantify how well the trained representations encode the clinically relevant speech signals, the method may use the accuracy of the probe or, more preferably, it may employ information-theoretic probing with minimum description length (as described in “Information-Theoretic Probing with Minimum Description Length”, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 183-196, Nov. 16-20, 2020). This technique provides an objective measure of how well information is encoded in the representations for each of the clinically relevant features predicted. In particular, it gives a measure of either (i) the size of a probing model or (ii) the amount of data needed to achieve a particular prediction accuracy.
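An illustrative sketch (not part of the specification) of the online, prequential form of minimum description length probing: the probe is trained on a growing prefix of the data and the error it incurs on each next, unseen chunk is accumulated as a code length. A representation that encodes the feature well yields a shorter code. The portions, probe (least squares) and squared-error code length below are simplifying assumptions.

```python
import numpy as np

def prequential_mdl(reps, target, portions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """Online (prequential) coding estimate of description length:
    train the probe on a growing prefix of the data and accumulate the
    squared error it incurs on the next chunk before seeing it."""
    n = len(reps)
    X = np.hstack([reps, np.ones((n, 1))])
    codelength = 0.0
    start = int(portions[0] * n)
    for p in portions[1:]:
        end = int(p * n)
        w, *_ = np.linalg.lstsq(X[:start], target[:start], rcond=None)
        codelength += float(np.sum((X[start:end] @ w - target[start:end]) ** 2))
        start = end
    return codelength

rng = np.random.default_rng(3)
good = rng.normal(size=(200, 8))              # representation encoding the feature
feature = good[:, 0] + 0.05 * rng.normal(size=200)
bad = rng.normal(size=(200, 8))               # unrelated representation

mdl_good = prequential_mdl(good, feature)     # short code: feature well encoded
mdl_bad = prequential_mdl(bad, feature)       # long code: feature not encoded
```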
Specific Application for Audio Representations
[0138] One important application of the present invention is the application of the method to probe audio representations, in particular prosody representations, which are particularly strong representations for use in health condition prediction tasks.
[0139] Prosody refers to the non-linguistic content of speech. Prosody is often defined subtractively, for example as “the variation in speech signals that remains after accounting for variation due to phonetics, speaker identity, and channel effects (i.e. the recording environment)”. It can also be defined as the combination of the timbre of speech (the spectral information which characterises a particular voice), the rhythm, pitch and tempo. Tempo relates to the speed and duration of voiced segments, while rhythm relates to the stress and intonation.
[0140] There is a large range of diseases that impinge upon the correct functioning of these physiological systems, resulting in changes to both the choice of language and the non-linguistic components, for example hesitations, pitch, tempo and rhythm. For example, cognitive disorders such as Alzheimer's affect the brain and therefore impact speech through both the higher-level speech systems, such as memory, and the lower-level physiology, in terms of the brain's ability to control the vocal cords and articulatory system. Therefore there is a particular need for obtaining strong prosodic representations for use in speech analysis for health condition predictions. One significant issue is the difficulty in extracting prosodic representations which retain expressivity and encode all of the important non-linguistic information necessary for downstream speech analysis tasks, while being sufficiently de-identified from the speaker to protect user privacy and meet GDPR/HIPAA requirements. Much of the non-linguistic content of speech overlaps with signals in the speech which are characteristic of the speaker.
[0141] Therefore the probing methods of the present invention can be applied to prosody representations to determine the extent to which the identifying (timbral) information has been removed and only the required non-timbral prosody components, being those required for making strong health condition predictions, remain. In particular the main model may be a model for encoding speech in prosody representations, and the method of the invention may be applied by training a probe comprising a machine learning model, independently to the training of the main model, to map a prosodic representation of the input speech data to an independently determined measure of a clinically relevant feature of the input speech data or the speaker.
[0142]
Overview of Example Encoder Model Architecture
[0143] The prosody encoder model may be any model suitable for encoding the pre-processed sections of audio data into quantised audio representations. The prosody encoder preferably includes a machine learning model, trained to map sections of processed audio data to corresponding quantised audio representations of the sections of audio data.
[0144]
[0145] The input 810 to the model is sections of the pre-processed audio data. Preferably this comprises variable length, word-aligned audio, i.e. sections of the processed audio data which each include one spoken word. These sections of processed data are referred to as “audio words”.
[0146] The first stage of the model is the prosody encoder 820. This is a model, or series of models, configured to take one audio word as input and encode this single word as a corresponding quantised audio representation encoding the prosodic information of the audio word. Prosodic information is effectively encoded due to the pre-processing to remove speaker-identifying information from the raw audio input, in particular timbre, and due to various features of the model, described in more detail below.
[0147] The output of the prosody encoder stage 820 is therefore a sequence of quantised prosody representations 830, each encoding the prosodic information of one spoken word within the input speech and therefore together in sequence encoding the prosodic information of a length of audio data.
[0148] The prosody encoder 820 may have several possible different structures. As described below, in one example the prosody encoder comprises a first stage configured to encode each input audio word as a non-quantised audio representation and a second stage configured to quantise each non-quantised audio representation into one of a fixed number of quantised prosodic states (quantised prosody representations or prosody tokens). Further possible implementation details of the prosody encoder are set out below.
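The quantisation stage can be sketched as a nearest-neighbour lookup against a fixed codebook of prosodic states (an illustrative sketch only; the codebook size, dimensionality and data are hypothetical, and a learned quantiser would be trained rather than random).

```python
import numpy as np

def quantise(rep, codebook):
    """Map a non-quantised audio representation to the nearest entry in
    a fixed codebook of quantised prosodic states (prosody tokens)."""
    dists = np.linalg.norm(codebook - rep, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(6)
codebook = rng.normal(size=(64, 16))             # 64 hypothetical prosody states
rep = codebook[10] + 0.01 * rng.normal(size=16)  # representation near state 10
token = quantise(rep, codebook)                  # prosody token index
```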
[0149] The sequence of prosody tokens 830 is then fed into a contextualiser model 840 to encode the quantised prosody representations into contextualised prosody representations. The contextualisation model 840 is preferably a sequence-to-sequence machine learning model configured to encode contextual information of a particular prosody token 831 into a new representation. The model is configured to encode information about the relationships between a quantised prosody representation 831 and the surrounding quantised representations within the sequence 830, commonly referred to as “context”. The contextualisation model 840 is preferably an attention-based model, in particular a transformer encoder.
[0150] The output of the contextualisation model 840 is a sequence of contextualised prosody representations 850, each encoding the prosodic information of a particular audio word in the sequence and its relationship to the surrounding prosodic information in the sequence.
[0151] Both the tokenized prosody representations 830 and the contextualized prosody representations 850 can be used for downstream tasks, such as expressive text-to-speech systems, spoken language understanding and speech analysis for the monitoring and diagnosis of a health condition. Both sets of representations encode just the prosodic information of the speech and are substantially de-identified, so they may be used where anonymising of user data is required.
Overview of Model Training
[0152]
[0153] Firstly, the pre-processing is carried out on a training data set comprising raw audio speech data. The pre-processed raw audio 810 is fed into the prosody encoder 820, which produces one set of prosody tokens (P_i) 830 for each audio-word 810. In the illustrated example there are three tokens for each audio-word 810, but there may be one or more. At this stage the model is completely non-contextual: each representation has only ever seen the audio for its own audio-word and no information from the surrounding parts of the audio data. As described above, the model then comprises a contextualisation encoder 840, preferably a transformer, configured to encode the prosody tokens into contextualised representations 850.
[0154] The training process used is a form of self-supervised learning in which the model is trained to predict masked tokens from the surrounding context. This is a similar approach to that used in masked language models (see for example “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Devlin et al. arXiv:1810.04805), but in this case the model uses solely prosodic audio information and, instead of training the model to predict the masked token directly, a contrastive training approach is used in which the model is trained to identify the correct token from a number of different tokens.
[0155] In more detail, one or more tokens 830 output by the prosody encoder 820 are randomly masked 832, the model is given a number of possible tokens, for example 10, and the model is then trained to predict the correct one from the group of possible tokens (i.e. which token corresponds to the token that has been masked). The other nine tokens are masked states from other masked audio-words. One preferable feature of the training process is that the other tokens (the negatives) are selected from the same speaker. In this way the model is not encouraged to encode information that helps separate speakers, which further aids de-identification of the representations.
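The contrastive objective can be sketched as follows (an illustrative numpy sketch, not the patented implementation): a context vector is scored against the true masked token and a set of same-speaker negatives, and the training loss is the negative log-probability assigned to the true token. The dimensions, similarity measure and toy data are assumptions.

```python
import numpy as np

def contrastive_scores(context_vec, masked_token, negatives):
    """Score the true (masked) token against same-speaker negatives:
    training encourages the context vector to be most similar to the
    token that was actually masked."""
    candidates = np.vstack([masked_token[None, :], negatives])
    # Cosine similarity between the context prediction and each candidate.
    sims = candidates @ context_vec / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(context_vec))
    # Softmax over 1 positive + 9 negatives; index 0 is the true token.
    e = np.exp(sims - sims.max())
    return e / e.sum()

rng = np.random.default_rng(4)
true_token = rng.normal(size=16)
context = true_token + 0.1 * rng.normal(size=16)  # context predicts the token
same_speaker_negs = rng.normal(size=(9, 16))      # negatives from same speaker
probs = contrastive_scores(context, true_token, same_speaker_negs)
loss = -np.log(probs[0])  # contrastive (InfoNCE-style) training loss
```

Drawing the negatives from the same speaker, as the paragraph above describes, means that high scores cannot be achieved by encoding speaker identity, which supports de-identification.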
[0156] The network 800 is trained end to end so the prosody encoder 820 is trained together with the transformer encoder 840.
[0157] Preferably the model is configured to learn to always represent prosody as the same token at every timestep, so that the contextual prediction can be done with 100% accuracy. Once trained, input speech data can be fed into the model and either or both of the contextual representations (post-Transformer) and the pre-Transformer non-contextualized representations (or representations from any layer inside the Transformer) can be used for downstream speech processing tasks.
Application of the Probe Model
[0158] A probe model may then be applied as described above, with the probe trained, independently to the training of the encoder, to map a prosody representation to an independently determined measure of a clinically relevant feature of the input speech data or the speaker, where a clinically relevant feature is a property of the input speech or speaker that is impacted by a health condition.
[0159] By examining the success of the model in predicting a component of prosody, it can be determined to what extent the prosodic representations encode information in speech related to that component. Furthermore, and importantly, probing can provide a quantifiable measure of the success of predicting a particular measure of prosody. Therefore when the method is applied in a technical application, this quantifiable probing technique can provide a quantified measure of the prosodic representations' success in encoding the relevant prosodic property, which can be provided as an output to a user.
[0160] Of particular relevance is confirming that the prosodic representations encode each of the required components of prosody, other than the speaker identifying characteristics—timbre in particular. Therefore the method may further comprise training a probe model to predict audio features representative of the subcomponents of prosody: pitch, rhythm, tempo and timbre.
[0161] For pitch a probe model may be trained to predict the median pitch. For rhythm probe models may be trained to predict median word intensity and number of syllables. For tempo, probe models may be trained to predict articulation rate (syllables per second), speech rate, average syllable duration, and word duration (including pre-silence). For timbre, probe models may be trained to predict the median formants F1, F2, F3 (shifted).
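An illustrative sketch (not part of the specification) of training such feature probes: a linear regression probe is fitted per prosodic feature and its R² reports how much of that feature the representation explains. The toy data is constructed so that the representation encodes pitch and tempo but not a timbral formant, mirroring the de-identification goal; all names and dimensions are hypothetical.

```python
import numpy as np

def probe_r2(reps, feature):
    """Fit a linear regression probe and return R^2: how much of the
    prosodic feature the representation explains."""
    X = np.hstack([reps, np.ones((len(reps), 1))])
    w, *_ = np.linalg.lstsq(X, feature, rcond=None)
    resid = feature - X @ w
    return 1.0 - resid.var() / feature.var()

rng = np.random.default_rng(5)
reps = rng.normal(size=(300, 12))
# Hypothetical targets: the representation encodes pitch and tempo
# strongly, but (by design) carries little timbre information.
features = {
    "median_pitch": reps @ rng.normal(size=12) + 0.1 * rng.normal(size=300),
    "articulation_rate": reps @ rng.normal(size=12) + 0.1 * rng.normal(size=300),
    "median_F1": rng.normal(size=300),  # timbre: not encoded
}
r2 = {name: probe_r2(reps, f) for name, f in features.items()}
```

High R² for the non-timbral features with low R² for the formant would indicate, per the paragraph above, that identifying information has been removed while the required prosodic components remain.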
[0162] To quantify how well the trained representations encode the prosodic information, the method may use the accuracy of the probe or more preferably it may employ information-theoretic probing with minimum description length (as described in “Information-Theoretic Probing with Minimum Description Length”, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 183-196, Nov. 16-20, 2020). This technique provides an objective measure of how well information is encoded in the quantised audio representations for each of the audio features representative of each subcomponent of prosody.
[0163] The probe models may be applied to both the quantised prosodic representations output from the product quantiser and the contextualised prosodic representations output from the contextualisation model, to provide an output to a user to inform on the information that is being encoded. The probe models may also be applied to the components of the product quantizer, where the product quantizer forms part of the prosody encoder and is configured to quantise the non-quantised representations provided by an initial encoding layer into a number of prosody components, preferably three. The latter application has shown that a product quantizer has the ability to naturally disentangle the information into the three non-timbral components of prosody.
[0164] The probe models comprise a machine learning model, preferably a simple classifier or regression model, trained separately to the encoder models to map one or more audio representations provided by the model to a measure of prosody. The probe preferably comprises a linear model, multi-layer perceptron, an attention-based model or a Bayesian neural network and is preferably simple such that it does not internally learn to do the task in a sophisticated way.