AUTOMATED ASSESSMENT OF COGNITIVE AND SPEECH MOTOR IMPAIRMENT

20230172526 · 2023-06-08


    Abstract

    The application relates to devices and methods for assessing cognitive impairment and/or speech motor impairment in a subject. The method comprises analysing a voice recording from a word-reading test obtained from the subject by identifying a plurality of segments of the voice recording that correspond to single words or syllables and determining the number of correctly read words in the voice recording and/or the speech rate associated with the recording. Determining the number of correctly read words in the recording may comprise computing one or more Mel-frequency cepstral coefficients (MFCCs) for the segments, clustering the resulting vectors of values into n clusters, wherein each cluster has n possible labels, predicting a sequence of words in the voice recording using the labels associated with the clustered vectors of values, performing a sequence alignment between the predicted sequence of words and the sequence of words used in the word-reading test, selecting the labels that result in the best alignment, and counting the number of matches in the alignment. The devices and methods find use in the diagnosis and monitoring of diseases or disorders such as neurological disorders.

    Claims

    1. A method of assessing cognitive impairment or speech motor impairment in a subject, the method comprising: obtaining a voice recording from a word-reading test from the subject; and analysing the voice recording, or a portion thereof, by: identifying a plurality of segments of the voice recording that correspond to single words or syllables; and (a) determining the number of correctly read words in the voice recording, wherein the voice recording is from a word-reading test comprising reading a sequence of words drawn from a set of n words, and wherein the method comprises: computing one or more Mel-frequency cepstral coefficients (MFCCs) for the segments to obtain a plurality of vectors of values, each vector being associated with a segment; clustering the plurality of vectors of values into n clusters, wherein each cluster has n possible labels corresponding to each of the n words; for each of the n! permutations of labels, predicting a sequence of words in the voice recording using the labels associated with the clustered vectors of values, and performing a sequence alignment between the predicted sequence of words and the sequence of words used in the word-reading test; selecting the labels that result in the best alignment and counting the number of matches in the alignment, wherein the number of matches corresponds to the number of correctly read words in the voice recording; or (b) determining the speech rate associated with the voice recording by counting the number of segments identified in the voice recording; wherein identifying segments of the voice recording that correspond to single words or syllables comprises: obtaining a power Mel-spectrogram of the voice recording; computing the maximum intensity projection of the Mel spectrogram along the frequency axis; and defining a segment boundary as the time point where the maximum intensity projection of the Mel spectrogram along the frequency axis crosses a threshold.

    2. The method of claim 1, wherein the voice recording is from a word-reading test comprising reading a sequence of words drawn from a set of n words, and the method comprises: identifying a plurality of segments of the voice recording that correspond to single words or syllables by: obtaining a power Mel-spectrogram of the voice recording; computing the maximum intensity projection of the Mel spectrogram along the frequency axis; and defining a segment boundary as the time point where the maximum intensity projection of the Mel spectrogram along the frequency axis crosses a threshold; and determining the number of correctly read words in the voice recording by: computing one or more Mel-frequency cepstral coefficients (MFCCs) for the segments to obtain a plurality of vectors of values, each vector being associated with a segment; clustering the plurality of vectors of values into n clusters, wherein each cluster has n possible labels corresponding to each of the n words; for each of the n! permutations of labels, predicting a sequence of words in the voice recording using the labels associated with the clustered vectors of values and performing a sequence alignment between the predicted sequence of words and the sequence of words used in the word-reading test; selecting the labels that result in the best alignment and counting the number of matches in the alignment, wherein the number of matches corresponds to the number of correctly read words in the voice recording; wherein the method further comprises determining the speech rate associated with the voice recording by counting the number of segments identified in the voice recording.

    3. The method of claim 1, wherein identifying segments of the voice recording that correspond to single words or syllables further comprises normalising the power Mel-spectrogram of the voice recording, preferably against the frame that has the highest energy in the recording.

    4. The method of claim 1, wherein determining the speech rate associated with the voice recording comprises computing a cumulative sum of the number of identified segments in the voice recording over time and computing the slope of a linear regression model fitted to the cumulative sum data.

    5. The method of claim 1, wherein identifying segments of the voice recording that correspond to single words or syllables further comprises: performing onset detection for at least one of the segments by computing a spectral flux function over the Mel-spectrogram of the segment and defining a further boundary whenever an onset is detected within a segment, thereby forming two new segments.

    6. The method of claim 1, wherein identifying segments of the voice recording that correspond to single words or syllables further comprises excluding segments that represent erroneous detections by computing one or more Mel-frequency cepstral coefficients (MFCCs) for the segments to obtain a plurality of vectors of values, each vector being associated with a segment, and applying an outlier detection method to the plurality of vectors of values.

    7. The method of claim 1, wherein the words are colour words, and wherein the words are displayed in a single colour in the word-reading test.

    8. The method of claim 1, wherein computing one or more MFCCs to obtain a vector of values for a segment comprises: computing a set of i MFCCs for each frame of the segment and obtaining, for each of the i MFCCs, a set of j values for the segment by interpolation, preferably linear interpolation, to obtain a vector of i×j values for the segment.

    9. The method of claim 1, wherein clustering the plurality of vectors of values into n clusters is performed using k-means.

    10. The method of claim 1, wherein the sequence alignment step is performed using a local sequence alignment algorithm, preferably the Smith-Waterman algorithm, or wherein performing a sequence alignment comprises obtaining an alignment score and the best alignment is one that satisfies at least one predetermined criterion applying to the alignment score, preferably wherein the best alignment is the alignment with the highest alignment score.

    11. A method of assessing the severity of a disease, disorder or condition in a subject, the method comprising analysing a voice recording from a word-reading test from the subject, or a portion thereof, as described in claim 1, wherein the disease, disorder or condition is one that affects speech motor or cognitive abilities, wherein the method further comprises obtaining a voice recording from a word-reading test from the subject.

    12. The method of claim 11, wherein obtaining a voice recording comprises receiving a word recording from a computing device associated with the subject, wherein obtaining a voice recording further comprises causing a computing device associated with the subject to display a set of words and to record a voice recording.

    13. The method of claim 11, wherein assessing speech motor impairment or assessing the severity of a disease, disorder or condition in a subject comprises predicting a UHDRS dysarthria score for the subject by: defining a plurality of UHDRS dysarthria score classes corresponding to non-overlapping ranges of the UHDRS dysarthria scale; determining the speech rate associated with the voice recording from the subject; and classifying the subject as belonging to one of the plurality of UHDRS dysarthria score classes based on the determined value of the speech rate.

    14. The method of claim 11, wherein assessing cognitive impairment or assessing the severity of a disease, disorder or condition in a subject comprises predicting a UHDRS Stroop word score for the subject by: determining the correct word count associated with the voice recording from the subject; and scaling the correct word count.

    15. A system for assessing the severity of a disease, condition or disorder in a subject, the system comprising: at least one processor; and at least one non-transitory computer readable medium containing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising the operations described in claim 1.

    Description

    BRIEF DESCRIPTION OF THE FIGURES

    [0159] FIG. 1 shows an exemplary computing system in which embodiments of the present invention may be used.

    [0160] FIG. 2 is a flow chart illustrating a method of assessing cognitive impairment in a subject.

    [0161] FIG. 3 is a flow chart illustrating a method of assessing speech motor impairment in a subject.

    [0162] FIG. 4 illustrates schematically a method of assessing the severity of a disease, disorder or condition in a subject.

    [0163] FIGS. 5A and 5B show a screenshot of a mobile device word-reading application (A) and a workflow for remote assessment (B) using the recordings from the application in (A). (A) A smartphone-based Stroop word-reading test was developed as a custom Android application. Colour words were displayed in black on the screen according to a randomly generated sequence: 4 words per row and 60 words in total. (B) A flow diagram describing the key steps for remote monitoring of symptoms.

    [0164] FIGS. 6A and 6B illustrate a two-step approach for identifying word boundaries according to an exemplary embodiment. (A) Coarse word boundaries were identified on the relative energy measure. A Mel-frequency spectrogram of the input audio was constructed, and the maximum intensity projection of the Mel spectrogram along the frequency axis gave rise to the relative energy. (B) One coarsely segmented word (highlighted in grey) was divided into two putative words based on the onset strength.

    [0165] FIG. 7 illustrates an outlier removal approach according to an exemplary embodiment. All segmented words were parameterized using the first three MFCCs (Mel-frequency cepstral coefficients); inliers (putative words, n = 75), shown in grey, and outliers (non-speech sounds, n = 3), shown in black, are illustrated in a 3-D scatter plot.

    [0166] FIGS. 8A and 8B illustrate a clustering approach to identify words according to an exemplary embodiment. Putative words from one recording (where 3 different words were shown in the word-reading test) were grouped into three different clusters by applying K-means clustering. The visual appearance of words in the three distinct clusters is shown in the upper graphs (one word per row) and the corresponding cluster centers are shown in the lower graphs. In particular, (A) represents 3 word-clusters from one test spoken in English (words = 75) and (B) represents 3 word-clusters from another test spoken in German (words = 64).

    [0167] FIG. 9 illustrates a word sequence alignment approach according to an exemplary embodiment. In particular, the application of the Smith-Waterman algorithm on a 10-word sequence is shown. Alignment of the displayed sequence RRBGGRGBRR and the predicted sequence BRBGBGBRRB found the partially overlapping sequence and resulted in 5 correct words: matches (|), gaps (-), and mismatches (:).

    [0168] FIGS. 10A and 10B show the classification accuracy of the model-free word recognition algorithm according to an exemplary embodiment. The classification accuracy of each word is displayed as a normalized confusion matrix (row sum = 1). Rows represent true labels from manual annotations and columns represent predicted labels from the automated algorithm. The correct predictions are on the diagonal with a black background and the incorrect predictions have a grey background. (A) English words: /r/ for /red/ (n = 582), /g/ for /green/ (n = 581), and /b/ for /blue/ (n = 553). (B) German words: /r/ for /rot/ (n = 460), /g/ for /grün/ (n = 459), and /b/ for /blau/ (n = 429).

    [0169] FIGS. 11A and 11B show the comparison between the clinical UHDRS-Stroop word score and automated assessment measures according to an exemplary embodiment. Scatter plots of the clinical UHDRS-Stroop word score against the automated assessment of the number of correct words (A) and speech rate (B) are shown. A linear relationship between the variables was determined through regression. The resulting regression line (black line) and a 95% confidence interval (grey shaded area) were plotted. Pearson’s correlation coefficient r and the p-value significance level are shown on both graphs.

    [0170] FIGS. 12A and 12B show boxplots of speech rate and number of correct words obtained by automated assessment according to an exemplary embodiment, demonstrating that these measures were significantly reduced in the subgroup of HD patients with dysarthria. Comparison between groups (normal speech n = 30, dysarthric speech n = 16): * p < 0.05; ** p < 0.01; *** p < 0.001. (A) Speech rate was significantly reduced in the subgroup of HD patients with dysarthria (words/sec; 1.8 ± 0.3 vs 1.5 ± 0.3; p < 0.01; Cohen's d = 1.086). (B) The number of correct words was significantly reduced in the subgroup of HD patients with dysarthria (66.8 ± 15.9 vs 48.7 ± 16.1; p < 0.001; Cohen's d = 1.110).

    [0171] FIGS. 13A and 13B show the distribution of the number of correctly read words (A) and the number of single word/syllable segments (B) identified in sets of recordings in English, French, Italian and Spanish. The data show that the number of correctly read words identified according to the method described herein is robust to variations in the length of the words (FIG. 13A), even though multiple syllables in single words are identified as separate entities (FIG. 13B).

    [0172] FIGS. 14A and 14B show the results of matched Stroop word reading (A, consistent condition) and Stroop colour-word reading (B, interference condition) tests from a healthy individual, analysed as described herein. Each subfigure shows the set of words displayed in each test (top panel), the normalised signal amplitude for the respective recording (middle panel), with overlaid segment identification and word prediction (illustrated as the colour of each segment), and the Mel spectrogram and accompanying scale (bottom panel) for the signal shown in the middle panel. The data show that the segment identification and correct word counting processes perform equally well for both the consistent condition and the interference condition.

    [0173] Where the figures laid out herein illustrate embodiments of the present invention, these should not be construed as limiting to the scope of the invention. Where appropriate, like reference numerals will be used in different figures to relate to the same structural features of the illustrated embodiments.

    DETAILED DESCRIPTION

    [0174] Specific embodiments of the invention will be described below with reference to the Figures.

    [0175] FIG. 1 shows an exemplary computing system in which embodiments of the present invention may be used.

    [0176] A user (not shown) is provided with a first computing device - typically a mobile computing device such as a mobile phone 1 or tablet. Alternatively, the computing device 1 may be fixed, such as e.g. a PC. The computing device 1 has at least one processor 101 and at least one memory 102 together providing at least one execution environment. Typically, a mobile device has firmware and applications run in at least one regular execution environment (REE) with an operating system such as iOS, Android or Windows. The computing device 1 may also be equipped with means 103 to communicate with other elements of computing infrastructure, for example via the public internet 3. These may comprise a wireless telecommunications apparatus for communication with a wireless telecommunications network and local wireless communication apparatus to communicate with the public internet 3 using e.g. Wi-Fi technology.

    [0177] The computing device 1 comprises a user interface 104 which typically includes a display. The display 104 may be a touch screen. Other types of user interfaces may be provided, such as e.g. a speaker, keyboard, one or more buttons (not shown), etc. Further, the computing device 1 may be equipped with sound capture means, such as a microphone 105.

    [0178] A second computing device 2 is also shown in FIG. 1. The second computing device 2 may for example form part of an analysis provider computing system. The second computing device 2 typically comprises one or more processors 201 (e.g. servers), a plurality of switches (not shown), and one or more databases 202, and is not described further here as the details of the second computing device 2 used are not necessary for understanding how embodiments of the invention function and may be implemented. The first computing device 1 can be connected to the analysis provider computing device 2 by a network connection, such as via the public internet 3.

    [0179] FIG. 2 is a flow chart illustrating a method of assessing cognitive impairment in a subject. The method comprises obtaining 210 a voice recording from a word-reading test from the subject. The voice recording is from a word-reading test comprising reading a sequence of words drawn from a (closed) set of n words.

    [0180] At step 220, a plurality of segments of the voice recording that correspond to single words or syllables are identified. Step 220 may be performed as described below in relation to FIG. 3 (step 320).

    [0181] At steps 230-270, the number of correctly read words in the voice recording is determined. The number of correctly read words in the voice recording is indicative of the level of cognitive impairment of the subject.

    [0182] In particular, at step 230, one or more Mel-frequency cepstral coefficients (MFCCs) are computed for each of the segments identified at step 220. As a result, a plurality of vectors of values is obtained, each vector being associated with a segment. In the embodiment shown in FIG. 2, optional steps of normalising 232 the MFCCs across segments in the recording and compressing 234 each of the plurality of vectors to a common size are shown. In particular, a set of i MFCCs (e.g. 12 MFCCs: MFCCs 2 to 13) is computed for each frame of the segment and a set of j values (e.g. 12 values) is obtained for the segment by compressing the signal formed by each of the i MFCCs across the frames in the segment, to obtain a vector of i×j values (e.g. 144 values) for the segment.
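    The compression of steps 232-234 can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the application's actual code: the MFCC matrices below are synthetic stand-ins for real audio features, and `segment_feature_vector` is an invented helper name. It shows only how variable-length per-segment MFCC tracks are brought to a common i×j size by linear interpolation.

```python
import numpy as np

def segment_feature_vector(mfcc_matrix, j=12):
    """Compress an (i, num_frames) MFCC matrix into a flat vector of i*j
    values by linearly interpolating each coefficient track to j points."""
    i, num_frames = mfcc_matrix.shape
    old_x = np.linspace(0.0, 1.0, num_frames)
    new_x = np.linspace(0.0, 1.0, j)
    # Resample each of the i coefficient tracks to a common length j.
    compressed = np.vstack([np.interp(new_x, old_x, row) for row in mfcc_matrix])
    return compressed.ravel()  # vector of i*j values

# Segments of different lengths (40 and 25 frames) map to equal-size vectors.
rng = np.random.default_rng(0)
v1 = segment_feature_vector(rng.normal(size=(12, 40)))
v2 = segment_feature_vector(rng.normal(size=(12, 25)))
```

With i = 12 and j = 12 this yields the 144-value vectors referred to above, regardless of segment duration.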

    [0183] At step 240, the plurality of vectors of values are clustered into n clusters (e.g. using k-means), where n is the expected number of different words in the word-reading test. No particular label (i.e. word identity) is initially associated with any cluster. Instead, it is assumed that segments that correspond to the same word (in the case of monosyllabic words) or to the same syllable of the same word (in the case of disyllabic words) will be captured by MFCCs that cluster together. In the case of disyllabic words, one of the syllables in a word may be dominant in the clustering, and it is assumed that segments corresponding to the same dominant syllable will be captured by MFCCs that cluster together. Non-dominant syllables may effectively act as noise in the clustering. Following these assumptions, each cluster should primarily group values corresponding to segments that contain one of the n words, and one of the n! possible permutations of the n labels for these clusters corresponds to the (unknown) true labels.
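    The clustering of step 240 can be illustrated with a self-contained numpy sketch: a plain Lloyd's k-means with farthest-point initialisation. In practice a library k-means implementation would be used; the 144-dimensional Gaussian blobs below are synthetic stand-ins for per-segment MFCC vectors, and all names are invented for illustration.

```python
import numpy as np

def kmeans(X, n, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(n - 1):
        # Next center: the point farthest from all centers chosen so far.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# Three well-separated synthetic "word" clusters of 144-D feature vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 144)) for c in (0.0, 5.0, 10.0)])
labels, centers = kmeans(X, 3)
```

The cluster indices returned here are arbitrary; which index corresponds to which word is resolved only by the permutation/alignment steps below.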

    [0184] At step 250, a sequence of words in the voice recording is predicted for each of the n! possible permutations of the n labels. For example, for a given assignment of the n labels, each identified segment is associated with a cluster and the corresponding label is predicted as the word captured in that segment. Some identified segments may not be associated with a cluster, for example because the MFCCs for the segment are not predicted to belong to a particular cluster with a high enough confidence. In such cases, no word may be predicted for the segment. This may be the case e.g. for segments that correspond to erroneous detections of syllables/words, or segments that correspond to a non-emphasized syllable of a multi-syllable word.

    [0185] At step 260, a sequence alignment is performed (e.g. using the Smith-Waterman algorithm) between each of the predicted sequences of words and the sequence of words used in the word reading test. The sequence of words used in the word reading test may be retrieved from memory, or may be received (for example, together with the voice recording) by the processor implementing the steps of the method.

    [0186] At step 270, the labels that result in the best alignment (for example, the labels that result in the highest alignment score) are selected and assumed to be the true labels for the clusters, and the number of matches in the alignment is taken to correspond to the number of correctly read words in the voice recording.
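    Steps 250-270 can be illustrated end to end with a small sketch: a toy Smith-Waterman local aligner that also counts matches along the traceback, and a loop over all n! label permutations that keeps the best-scoring assignment. The function names, scoring parameters (match +2, mismatch and gap -1) and the cluster-id sequence are illustrative assumptions rather than the application's actual implementation; the displayed sequence is the one from FIG. 9.

```python
import itertools
import numpy as np

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Toy Smith-Waterman: returns (best local alignment score, number of
    exact matches along the traceback of that alignment)."""
    H = np.zeros((len(a) + 1, len(b) + 1))
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(0.0, H[i - 1, j - 1] + s, H[i - 1, j] + gap, H[i, j - 1] + gap)
    # Trace back from the best cell, counting exact matches.
    i, j = np.unravel_index(int(H.argmax()), H.shape)
    matches = 0
    while i > 0 and j > 0 and H[i, j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i, j] == H[i - 1, j - 1] + s:
            matches += int(a[i - 1] == b[j - 1])
            i, j = i - 1, j - 1
        elif H[i, j] == H[i - 1, j] + gap:
            i -= 1
        else:
            j -= 1
    return float(H.max()), matches

def best_labelling(cluster_ids, displayed, words):
    """Try every assignment of words to cluster ids (n! permutations) and
    keep the one whose predicted sequence aligns best with the display."""
    best_score, best_matches, best_map = -1.0, 0, None
    for perm in itertools.permutations(words):
        predicted = [perm[c] for c in cluster_ids]
        score, matches = smith_waterman(predicted, list(displayed))
        if score > best_score:
            best_score, best_matches, best_map = score, matches, perm
    return best_score, best_matches, best_map

# Cluster ids from step 240 stand in for spoken words; under the mapping
# 0 -> R, 1 -> B, 2 -> G the reading below matches the displayed sequence.
displayed = "RRBGGRGBRR"
cluster_ids = [0, 0, 1, 2, 2, 0, 2, 1, 0, 0]
score, n_correct, mapping = best_labelling(cluster_ids, displayed, "RGB")
```

Only the correct label permutation can reproduce the displayed sequence exactly, so the best alignment simultaneously recovers the cluster-to-word mapping and the number of correctly read words.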

    [0187] FIG. 3 is a flow chart illustrating a method of assessing speech motor impairment in a subject. The method comprises obtaining 310 a voice recording from a word-reading test from the subject. The voice recording may be from a word-reading test comprising reading a sequence of words drawn from a (closed) set of n words. Any other reading test may be used. In particular, there is no requirement for the words to have any particular meaning or logical connection.

    [0188] At step 320, a plurality of segments of the voice recording that correspond to single words or syllables are identified. It is particularly advantageous for the words used in the reading test to be monosyllabic as in such cases each segment may be assumed to correspond to a single word, and the timing of segments can therefore be directly related to speech rate. Where disyllabic words (or other multi-syllabic words) are used, it may be advantageous for all words to have the same number of syllables as this may simplify the calculation and/or interpretation of the speech rate.

    [0189] At step 330, the speech rate associated with the voice recording is determined at least in part by counting the number of segments identified in the voice recording. The speech rate in the voice recording is indicative of the level of speech motor impairment of the subject. Optionally, determining the speech rate at step 330 may comprise computing 331 a cumulative sum of the number of identified segments in the voice recording over time, and determining 332 the slope of a linear regression model fitted to the cumulative sum data.
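    The optional slope-based speech rate of steps 331-332 can be sketched as follows (numpy only; the segment onset times are synthetic and `speech_rate` is an invented helper name):

```python
import numpy as np

def speech_rate(segment_times):
    """Slope (segments/second) of a line fitted to the cumulative count of
    identified segments over time (steps 331-332)."""
    t = np.asarray(segment_times, dtype=float)
    cumulative = np.arange(1, len(t) + 1, dtype=float)  # running segment count
    slope, _ = np.polyfit(t, cumulative, 1)
    return slope

# One segment every 0.5 s gives a rate of about 2 segments (words) per second.
rate = speech_rate([0.5 * k for k in range(1, 21)])
```

Fitting a line rather than simply dividing the count by the recording duration makes the estimate less sensitive to pauses at the start or end of the recording.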

    [0190] In particular, at step 322, a power Mel-spectrogram of the voice recording is obtained. This is typically achieved by defining frames along the voice recording (where a frame can correspond to the signal in a sliding window of fixed width applied along the time axis) and computing a power spectrum on a Mel scale for each frame (typically by obtaining a spectrogram for each frame then mapping the spectrogram to a Mel scale using overlapping triangular filters along a range of frequencies assumed to correspond to the human hearing range). This process results in a matrix of values of power per Mel unit per time bin (where a time bin corresponds to one of the positions of the sliding window). Optionally, the power Mel-spectrogram may be normalised 323, for example by dividing the values for each frame by the highest energy value observed in the recording. At step 324, the maximum intensity projection of the Mel spectrogram along the frequency axis is obtained. Segment boundaries are identified 326 as time points where the maximum intensity projection of the Mel spectrogram along the frequency axis crosses a threshold. In particular, a set of two consecutive boundaries such that the maximum intensity projection of the Mel spectrogram crosses the threshold from a lower to a higher value at the first boundary, and from a higher to a lower value at the second boundary, may be considered to define a segment that corresponds to a single word or syllable. The threshold used at step 326 may optionally be dynamically determined at step 325 (where “dynamically determined” refers to the threshold being determined for a particular voice recording, depending on features of the particular voice recording, rather than being predetermined independently of the particular recording). For example, the threshold may be determined as a weighted average of the relative energy values assumed to correspond to signal and the relative energy values assumed to correspond to background noise.
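    The boundary-finding of steps 324-326 can be sketched with a minimal numpy example, assuming the maximum intensity projection has already been computed; the fixed threshold of 0.5 stands in for the dynamically determined one of step 325, and the helper name and toy curve are invented for illustration.

```python
import numpy as np

def find_segments(max_projection, threshold):
    """Return (start, end) frame-index pairs where the curve is above the
    threshold: a rising crossing opens a segment, a falling crossing closes it."""
    above = max_projection > threshold
    edges = np.diff(above.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    return list(zip(starts, ends))

# Toy "relative energy" curve (maximum intensity projection) with two bursts.
curve = np.array([0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.6, 0.1])
segments = find_segments(curve, threshold=0.5)  # two word/syllable segments
```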

    [0191] Optionally, the segments may be “refined” by analysing separate segments identified in step 326 and determining whether further (internal) boundaries can be found. This may be performed by performing 327 onset detection for at least one of the segments by computing a spectral flux function over the Mel-spectrogram for the segment and 328 defining a further (internal) boundary whenever an onset is detected within a segment, thereby forming two new segments. Performing 327 onset detection may comprise computing 327a a spectral flux function or onset strength function, normalising 327b the onset strength function for the segment to a value between 0 and 1, smoothing 327c the (normalised) onset strength function and applying 327d a threshold to the spectral flux function or a function derived therefrom, wherein an onset is detected where the function increases above the threshold.
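    The spectral-flux onset detection of steps 327a-327d can be sketched as follows (numpy only; the smoothing of step 327c is omitted for brevity, the 0.3 threshold is an illustrative assumption, and the toy spectrogram is synthetic):

```python
import numpy as np

def onset_strength(mel_spec):
    """Spectral flux: positive frame-to-frame spectral change, summed over
    frequency and normalised to [0, 1] (steps 327a-327b)."""
    diff = np.diff(mel_spec, axis=1)
    flux = np.maximum(diff, 0.0).sum(axis=0)
    peak = flux.max()
    return flux / peak if peak > 0 else flux

def detect_onsets(flux, threshold=0.3):
    """Frames where the flux rises above the threshold (step 327d)."""
    above = flux > threshold
    return list(np.where(np.diff(above.astype(int)) == 1)[0] + 1)

# Toy Mel spectrogram (mel bands x frames) with an abrupt energy rise at frame 5.
mel = np.full((4, 10), 0.1)
mel[:, 5:] = 1.0
flux = onset_strength(mel)
onsets = detect_onsets(flux)  # each detected onset would add an internal boundary
```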

    [0192] An optional erroneous detection removal step 329 is shown in FIG. 3. In the embodiment shown, this comprises computing 329a one or more Mel-frequency cepstral coefficients (MFCCs) for the segments (preferably the first 3 MFCCs, as these are expected to capture features that distinguish between noise and true utterances) to obtain a plurality of vectors of values, each vector being associated with a segment, and excluding 329b all segments whose vector of values is above a predetermined distance from the remaining vectors of values. This approach assumes that the majority of segments are correct detections (i.e. correspond to true utterances), and that segments that do not contain true utterances will have different MFCC features from correct detections. Other outlier detection methods may be applied to exclude some of the plurality of vectors of values assumed to be associated with erroneous detections.
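    The distance-based exclusion of step 329 can be sketched as follows. The median-distance rule below is one simple instance of "a predetermined distance from the remaining vectors of values"; the threshold factor and the synthetic 3-MFCC vectors are illustrative assumptions.

```python
import numpy as np

def remove_outliers(vectors, factor=3.0):
    """Keep vectors whose distance to the per-coordinate median is within
    factor x the median distance; drop the rest as erroneous detections."""
    X = np.asarray(vectors, dtype=float)
    center = np.median(X, axis=0)            # robust cohort center
    dist = np.linalg.norm(X - center, axis=1)
    keep = dist <= factor * np.median(dist)  # "predetermined distance" rule
    return X[keep], keep

# Toy data: 20 speech-like 3-MFCC vectors near the origin, one distant noise burst.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(scale=0.5, size=(20, 3)), [[10.0, 10.0, 10.0]]])
inliers, keep = remove_outliers(X)
```

Using medians rather than means keeps the center and scale estimates robust, consistent with the assumption that most segments are correct detections.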

    [0193] The segments identified in step 320 may be used to determine the number of correctly read words in a word reading test as described in relation to FIG. 2 (steps 230-270).

    [0194] FIG. 4 illustrates schematically a method of monitoring a disease, disorder or condition in a subject. The disease, disorder or condition is one that affects speech motor and/or cognitive abilities.

    [0195] The method comprises obtaining 410 a voice recording from a word-reading test from the subject. In the illustrated embodiment, obtaining a voice recording comprises causing 310a a computing device associated with the subject (e.g. computing device 1) to display a set of words (e.g. on display 104) and causing 310b the computing device 1 to record a voice recording (e.g. through microphone 105). Optionally, obtaining a voice recording may further comprise causing 310c the computing device to emit a reference tone. Obtaining 310 a voice recording from a word-reading test from the subject may instead or in addition comprise receiving a voice recording from a computing device associated with the subject (e.g. computing device 1).

    [0196] The method further comprises identifying 420 a plurality of segments of the voice recording that correspond to single words or syllables. This is preferably performed as explained in relation to FIG. 3. The method optionally further comprises determining 430 the speech rate associated with the voice recording, at least in part by counting the number of segments identified in the voice recording. The method further comprises determining 470 the number of correctly read words in the voice recording, as explained in relation to FIG. 2 (steps 230-270). The number of correctly read words in the voice recording is indicative of the level of cognitive impairment of the subject, and the speech rate is indicative of the level of speech motor impairment of the subject. The method may further comprise comparing 480 the speech rate and correct word count obtained at steps 430 and 470 with previously obtained values for the same subject, or with one or more reference values. The comparison with previously obtained values for the same subject may be used to monitor a disease, disorder or condition in a subject who has been diagnosed as having the disease, disorder or condition. The comparison with one or more reference values may be used to diagnose the subject as having the disease, disorder or condition. For example, the reference values may correspond to a diseased population and/or a healthy population. The monitoring of a disease, disorder or condition in a subject may be used to automatically assess a course of treatment, for example as part of a clinical trial.

    [0197] Any of the steps of identifying 420 a plurality of segments of the voice recording that correspond to single words or syllables, determining 430 the speech rate associated with the voice recording, and determining 470 the number of correctly read words in the voice recording may be performed by the user computing device 1, or by the analysis provider computer 2.

    EXAMPLES

    Example 1: Development of an Automated Smartphone-based Stroop Word-reading Test for the Remote Monitoring of Disease Symptoms in Huntington’s Disease

    [0198] In this example, the inventors developed an automated smartphone-based Stroop word-reading test (SWR) and tested the feasibility of remote monitoring of disease symptoms in HD. In the smartphone-based SWR test, colour words were displayed in black on the screen according to a randomly generated sequence (4 words per row, 60 words in total). Speech data were recorded with the built-in microphone and uploaded via WiFi to the cloud. The inventors then developed a novel, language-independent approach to segment and classify individual words from the speech signal. Finally, by comparing the displayed word sequence with the predicted word sequence, they were able to reliably estimate the number of correct words using the Smith-Waterman algorithm, commonly used for genomic sequence alignment.

    Methods

    [0199] Subjects and related clinical assessments: Forty-six patients were recruited from three sites in Canada, Germany and the United Kingdom as part of the HD OLE (open-label extension) study (NCT03342053). All patients underwent an extensive neurological and neuropsychological examination at the baseline visit. The Unified Huntington’s Disease Rating Scale (UHDRS) was used to quantify disease severity. In particular, the Stroop word-reading test (SCWT1-Word Raw Score) is part of the UHDRS cognitive assessment and dysarthria (UHDRS-dysarthria score) is part of the UHDRS motor assessment. The language spoken locally at each site was used (i.e. English in Canada and the United Kingdom, n = 27; German in Germany, n = 19).

    [0200] Smartphone app and self-administered speech recordings: A smartphone-based Stroop word-reading test was developed as a custom Android application (Galaxy S7; Samsung, Seoul, South Korea), shown in FIG. 5A, and used as part of the smart-device test suite designed by Roche for remote monitoring of disease symptoms in HD. At the baseline visit, patients received a smartphone and completed a test in a teaching session. The speech tests were then performed remotely at home on a weekly basis. Speech signals were acquired at 44.1 kHz with 16-bit resolution and downsampled to 16 kHz for analysis. Data was securely transferred via WiFi to Roche, where it was processed and analysed. The data presented here comprise the first self-administered home tests (n = 46) only. A total of 60 colour words (4 words per row) were displayed in black according to a randomly generated sequence, which was stored explicitly as metadata (FIG. 5A). Patients read the words after a brief reference tone (1.1 kHz, 50 ms) for a given 45-second period and restarted from the beginning once the end was reached. All recordings analysed here had a low ambient noise level (-56.7 ± 7.4 dB, n = 46) and a good signal-to-noise ratio (44.5 ± 7.8 dB, n = 46).

    [0201] Language-independent approach for analysing the Stroop word-reading test: In consideration of potential usage in multi-language and various diseased-population settings, the algorithm was designed without any pre-trained models. Words were segmented directly from the speech signal in the absence of any contextual cues. At the classification stage, word labels were chosen so as to maximize the partial overlap between the displayed and predicted sequences. The fully-automated approach for the Stroop word-reading test can be divided into four parts, illustrated as a flow diagram in FIG. 5B. Briefly, the inventors first introduced a two-step approach to obtain a highly sensitive segmentation of individual words. The inventors then deployed an outlier removal step to filter out erroneous detections mainly caused by imprecise articulation, respiration and non-speech sounds. They then represented each putative word by 144 (12 × 12) Mel-frequency cepstral coefficient (MFCC) features and performed a three-class K-means clustering. Finally, the inventors adopted the Smith-Waterman algorithm, a local sequence alignment method, to estimate the number of correct words. Each of these steps is explained in further detail below.

    [0202] Identifying word boundaries: In this particular example, each colour word used consisted of a single syllable, i.e. /red/, /green/, /blue/ in English and /rot/, /grün/, /blau/ in German. The word segmentation therefore becomes a general syllable detection problem. According to phonology, the nucleus of a syllable, also called the peak, is the central part of a syllable (most commonly a vowel), whereas consonants form the boundaries in between [9]. A number of automatic syllable detection methods have been described for connected speech [10-12]. For example, syllabic nuclei have been identified mainly based upon either the wide-band energy envelope [10] or the sub-band energy envelope [11]. However, for fast speech, the transition between different syllables is difficult to identify by the energy envelope alone. Considering the fast tempo and syllable repetition in the word-reading task, there is still a need for more sensitive identification of syllable nuclei.

    [0203] The newly developed two-step approach was motivated by how syllable boundaries are labelled by hand: visual inspection of the intensity and spectral flux of a spectrogram. Briefly, a power Mel-spectrogram was first computed with a sliding window size of 15 ms, a step size of 10 ms and 138 triangular filters spanning the range of 25.5 Hz to 8 kHz, and normalized against the strongest frame energy in the 45 s period. The maximal energy of each speech frame was then derived to represent intensity, which is equivalent to a maximum intensity projection of the Mel-spectrogram along the frequency axis. In this way, the loudest frame has a relative energy value of 0 dB and all others have values below it. For example, as shown in FIG. 6A, all syllabic nuclei have a relative energy over -50 dB. Coarse word boundaries were identified by thresholding on this relative energy measure.
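    The coarse segmentation step described above can be sketched as follows, assuming the Mel-spectrogram is already available in decibels with the loudest frame normalised to 0 dB. The function name, hop size and toy spectrogram are illustrative, not taken from the original implementation:

```python
import numpy as np

def coarse_segments(mel_db, threshold_db=-50.0, hop_s=0.010):
    """Coarse word boundaries from a dB Mel-spectrogram (freq x frames)
    normalised so the loudest frame is 0 dB: a segment spans the frames
    where the maximum intensity projection stays above the threshold."""
    mip = mel_db.max(axis=0)               # maximum intensity projection
    voiced = mip > threshold_db
    edges = np.diff(voiced.astype(int))
    starts = np.where(edges == 1)[0] + 1   # rising threshold crossings
    ends = np.where(edges == -1)[0] + 1    # falling threshold crossings
    if voiced[0]:
        starts = np.r_[0, starts]
    if voiced[-1]:
        ends = np.r_[ends, len(voiced)]
    # convert frame indices to seconds using the 10 ms step size
    return [(s * hop_s, e * hop_s) for s, e in zip(starts, ends)]

# toy spectrogram: 2 mel bands x 10 frames containing two loud bursts
mel_db = np.full((2, 10), -80.0)
mel_db[:, 2:4] = -10.0
mel_db[:, 6:9] = -5.0
segments = coarse_segments(mel_db)  # two segments, in seconds
```

    In the real pipeline the input would be the 138-band power Mel-spectrogram described above; the toy 2-band array merely exercises the boundary logic.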

    [0204] Subsequently, the spectral flux of the Mel-spectrogram was calculated to identify the precise boundary of each word. This is equivalent to a vertical edge detection on the Mel-spectrogram. The onset strength was computed with the superflux method developed by Böck and Widmer [13] and normalized to a value between 0 and 1. If the onset strength exceeds a threshold (e.g. 0.2), the segment is divided into sub-segments. In the example shown in FIG. 6B, one coarsely segmented word (highlighted in grey) was divided into two putative words based on the onset strength.
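    As an illustration of this refinement step, the sketch below splits a coarse segment wherever a simplified spectral-flux measure (half-wave-rectified frame-to-frame increase, summed over frequency and scaled to [0, 1]) exceeds the threshold. This pared-down flux is a stand-in for the superflux onset strength used in the study; the function name and toy data are hypothetical:

```python
import numpy as np

def refine_segment(mel_db, start, end, thresh=0.2):
    """Split a coarse segment [start, end) (frame indices) at frames where
    a simplified onset strength exceeds `thresh`. The onset strength here
    is the half-wave-rectified frame-to-frame increase of the dB
    Mel-spectrogram, summed over frequency and scaled to [0, 1]; the
    study itself uses librosa's superflux onset strength."""
    seg = mel_db[:, start:end]
    flux = np.maximum(np.diff(seg, axis=1), 0.0).sum(axis=0)
    if flux.max() > 0:
        flux = flux / flux.max()
    # diff[i] compares frames i and i+1, so a cut falls at frame i+1
    cuts = [start + i + 1 for i in np.where(flux > thresh)[0]]
    bounds = [start] + cuts + [end]
    return list(zip(bounds[:-1], bounds[1:]))

# toy coarse segment: loud-loud-quiet-loud-loud; the re-onset after the
# quiet frame splits one coarse segment into two putative words
mel = np.array([[-10.0, -10.0, -80.0, -10.0, -10.0],
                [-10.0, -10.0, -80.0, -10.0, -10.0]])
sub_segments = refine_segment(mel, 0, 5)
```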

    [0205] All of the calculations were performed in Python, using the Librosa library (https://librosa.github.io/librosa/, McFee et al. [21]) or the python_speech_features library (https://github.com/jameslyons/python_speech_features, James Lyons et al. [22]). For the computation of the onset strength, the function librosa.onset.onset_strength was used with parameters lag = 2 (time lag for computing differences) and max_size = 3 (size of the local max filter). In the example shown in FIGS. 6A-B, 68 coarse segments were identified in the first step, and a further 10 were identified in the refinement step.

    [0206] In order to remove erroneous detections mainly caused by imprecise articulation, respiration and non-speech sounds, an outlier removal step was implemented. Observations shorter than 100 ms and with a mean relative energy below -40 dB were first removed. Mel-frequency cepstral coefficients (MFCCs) are commonly used as features in speech recognition systems [14, 15]. Here, a matrix of 13 MFCCs was computed with a sliding window size of 25 ms and a step size of 10 ms for each putative word. Audible noises are expected to differ from true words by the first three MFCCs [16]. The words were therefore parameterized using the means of the first three MFCCs, and outlier detection was performed based on the Mahalanobis distance. A cut-off value of 2 standard deviations was used to identify outliers. The inliers (putative words, shown in grey) and outliers (non-speech sounds, shown in black) are illustrated in a 3-D scatter plot in FIG. 7.
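    The Mahalanobis-distance outlier screen can be sketched as below. The function name and the synthetic three-dimensional features are illustrative; in the study, the three features per putative word are the means of its first three MFCCs:

```python
import numpy as np

def mahalanobis_distances(feats):
    """Mahalanobis distance of each row from the sample mean.
    feats: (n_words, 3) array of per-word means of the first three MFCCs."""
    mu = feats.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(feats, rowvar=False))
    delta = feats - mu
    # d_i = sqrt( delta_i^T  Sigma^-1  delta_i )
    return np.sqrt(np.einsum('ij,jk,ik->i', delta, cov_inv, delta))

# synthetic example: 20 word-like points plus one far-away non-speech sound
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, size=(20, 3)),
                   [[50.0, 50.0, 50.0]]])
d = mahalanobis_distances(feats)
is_outlier = d > 2.0  # cut-off of 2 standard deviations, as in the study
```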

    [0207] K-means clustering: K-means is an unsupervised clustering algorithm which divides observations into k clusters [17]. The inventors assumed that the words pronounced by a subject in a given recording have a similar spectral representation within a word-cluster, and a different pattern between word-clusters. In this way, the words can be divided into 3 clusters, equal to the number of unique colour words. However, the duration of the words may vary from one to another (mean duration between 0.23 and 0.35 s). The steps to generate an equal-sized feature representation for each word are as follows: starting from the previously computed 13-MFCC matrix, the first MFCC (related to power) was removed from the matrix. The remaining 12-MFCC matrix, with a variable number of frames, was treated as an image and resized to a fixed-size image (12 × 12 pixels, reduced to 40%-60% of its width) by linear interpolation along the time axis. As a result, each word was transformed into a total of 144 MFCC values (12 × 12 = 144) regardless of its duration. By applying K-means clustering, the putative words from one recording were classified into three different clusters. FIG. 8 illustrates the visual appearance of the words in three distinctive clusters (upper graphs, one word per row) and the corresponding cluster centres (lower graphs); in particular, FIG. 8A represents 3 word-clusters extracted from one test in English (words = 75) and FIG. 8B represents 3 word-clusters extracted from one test in German (words = 64).
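    These two steps, resizing each word's 12-MFCC matrix to a fixed 12 × 12 image and clustering the flattened 144-dimensional vectors into three groups, might be sketched as follows. The minimal Lloyd's-iteration K-means (initialised from the first k samples for determinism) stands in for a production implementation such as scikit-learn's KMeans; the function names and toy data are illustrative:

```python
import numpy as np

def to_fixed_image(mfcc, n_frames=12):
    """Resize a (12, T) MFCC matrix to (12, n_frames) by linear
    interpolation along the time axis, then flatten to 144 features."""
    t = mfcc.shape[1]
    old = np.linspace(0.0, 1.0, t)
    new = np.linspace(0.0, 1.0, n_frames)
    resized = np.vstack([np.interp(new, old, row) for row in mfcc])
    return resized.ravel()

def kmeans(X, k=3, iters=50):
    """Minimal Lloyd's iteration; centres start at the first k samples."""
    centres = X[:k].astype(float)  # astype copies, so X is not mutated
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centres) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels

# toy demo: three well-separated "word" feature groups cluster correctly
X = np.array([[0.0] * 144, [100.0] * 144, [200.0] * 144,
              [0.1] * 144, [100.1] * 144, [200.1] * 144])
labels = kmeans(X, 3)
```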

    [0208] Word sequence alignment: Speech recognition refers to understanding the content of speech. In principle, it is possible to use deep learning models (e.g. Mozilla’s free speech recognition project DeepSpeech) or hidden Markov models (e.g. Carnegie Mellon University’s Sphinx toolkit) to perform speech recognition. However, such pre-trained models are built on healthy populations and are language-dependent, and might not be very accurate when applied to patients with speech impairments. In this study, the inventors introduced a novel end-to-end solution to infer the speech content. They converted the word recognition task into a genomic sequence alignment problem. The closed set of colour words is like the letters of the DNA code. Reading errors and system errors introduced during the segmentation and clustering steps are like mutations, deletions or insertions occurring in the DNA sequence of a gene. Instead of performing isolated word recognition, the objective was to maximize the overlapping sequence between the displayed and predicted sequences, so that the entire speech content is leveraged as a whole.

    [0209] The Smith-Waterman algorithm performs local sequence alignment, that is, some characters may be left out of the alignment, which makes it appropriate for partially overlapping sequences [18]. The algorithm compares segments of all possible lengths and optimizes the similarity measure based on a scoring metric, e.g. a gap cost of 2 and a match score of 3. In this study, the number of segmented words defines the search space in the displayed sequence. In a three-class scenario, there are 6 (3! = 6) possible permutations of word labels. For each permutation, it is possible to generate a predicted sequence, align it with the displayed sequence, and trace back the segment that has the highest similarity score. The inventors made the assumption that subjects read the words as displayed most of the time. Therefore, the length of the aligned segment becomes the measure to maximize. In other words, the optimal choice of a label for a given cluster is the one that maximizes the overlapping sequences. Consequently, each word can be classified according to its cluster label. Moreover, the number of exact matches found in the partially overlapping sequences provides a good estimate of the number of correct words. FIG. 9 takes the alignment of the displayed sequence RRBGGRGBRRG and the predicted sequence BRBGBGBRRB as an example and returns 5 correct words out of 10 read words.
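    A compact sketch of this label-selection scheme, trying all 3! label permutations and keeping the one whose Smith-Waterman local alignment against the displayed sequence scores highest, is shown below. The scoring uses the match score of 3 and gap cost of 2 mentioned above; the mismatch penalty of 3 and all function names are assumptions made for illustration:

```python
from itertools import permutations

import numpy as np

def sw_matches(a, b, match=3, mismatch=-3, gap=-2):
    """Smith-Waterman local alignment: returns (best score, number of
    exact matches on the traceback path of the best local alignment)."""
    n, m = len(a), len(b)
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1, j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i, j] = max(0.0, diag, H[i - 1, j] + gap, H[i, j - 1] + gap)
    i, j = np.unravel_index(np.argmax(H), H.shape)
    score, matches = H[i, j], 0
    while H[i, j] > 0:  # trace back until the local score drops to zero
        diag = H[i - 1, j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i, j] == diag:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif H[i, j] == H[i - 1, j] + gap:
            i -= 1
        else:
            j -= 1
    return score, matches

def best_labelling(displayed, clusters, words=('R', 'G', 'B')):
    """Try every permutation of cluster labels; keep the labelling whose
    predicted sequence aligns best with the displayed sequence."""
    best = (-1.0, None, 0)
    for perm in permutations(words):
        predicted = [perm[c] for c in clusters]
        score, matches = sw_matches(list(displayed), predicted)
        if score > best[0]:
            best = (score, perm, matches)
    return best  # (alignment score, chosen labels, number of correct words)

# toy example: displayed words and cluster ids under an unknown labelling
displayed = "RGBRGBRGBRGB"
clusters = [2, 0, 1] * 4  # cluster 2 is 'R', 0 is 'G', 1 is 'B'
score, perm, matches = best_labelling(displayed, clusters)
```

    In this toy case the permutation that maps cluster 2 to 'R', 0 to 'G' and 1 to 'B' reproduces the displayed sequence exactly, so every read word counts as a match.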

    [0210] Manual-level ground truth: Manual annotations of all segmented words (1938 words from 27 recordings in English, 1452 words from 19 recordings in German) were performed blindly via audio playback. Manual labelling was performed after the algorithm was designed and was not used for parameter tuning. The beginning/end time of each word was obtained by the proposed two-step approach. Words were labelled with their respective text accordingly, with /r/ for /red/ and /rot/, /g/ for /green/ and /grün/, and /b/ for /blue/ and /blau/. Words that were difficult to annotate for some reason (e.g. imprecise syllable separations, respirations, other words etc.) were labelled as /n/, as a “garbage” class.

    [0211] Outcome measures: Based on the word segmentation and classification results, two complementary test-level outcome measures were designed: the number of correct words, for quantifying processing speed as part of the cognitive measures, and the speech rate, for quantifying speech motor performance. In particular, the speech rate was defined as the number of words per second and computed as the slope of the regression line on the cumulative sum of segmented words over time.
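    The speech-rate definition above can be written in a few lines, given the onset times of the segmented words (the function name and toy onset times are illustrative):

```python
import numpy as np

def speech_rate(word_onsets):
    """Speech rate in words/s: slope of the least-squares regression line
    fitted to the cumulative word count against time."""
    counts = np.arange(1, len(word_onsets) + 1)  # cumulative word count
    slope, _intercept = np.polyfit(word_onsets, counts, 1)
    return slope

# toy example: one word every 0.5 s gives a rate of 2 words/s
onsets = np.arange(0.0, 30.0, 0.5)
rate = speech_rate(onsets)
```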

    [0212] Statistical analyses: The Shapiro-Wilk test was used to test for a normal distribution. Pearson correlation was applied to examine significant relationships. The criteria used to evaluate the Pearson correlation coefficient were: fair (values of 0.25-0.5), moderate to good (values of 0.5-0.75) and excellent (values of 0.75 and above). ANOVA and unpaired t-tests for independent samples were performed for comparisons between groups. Effect sizes were measured with Cohen’s d, with d = 0.2 indicating a small, d = 0.5 a medium and d = 0.8 a large effect.
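    As a reminder of the effect-size measure used here, Cohen's d with a pooled standard deviation can be computed as follows (a standard textbook formula; the helper name and toy data are illustrative):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: difference of means divided by the pooled
    (Bessel-corrected) standard deviation of the two samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) \
        / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# toy example: means differ by 2 and the pooled SD is 1, so d = -2
d_toy = cohens_d([1.0, 2.0, 3.0], [3.0, 4.0, 5.0])
```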

    Results

    [0213] Evaluation of word classification performance: To estimate the classification accuracy of the proposed model-free word recognition algorithm, manual annotations and labels obtained by the proposed automated algorithm were compared. The overall classification accuracy was high, with an average score of 0.83 in English and 0.85 in German. The normalized confusion matrices in FIG. 10 show the performance of the model-free word classifier at word level. The high classification accuracy suggests that the proposed word recognizer can learn all the components of a speech recognizer, including the pronunciation, acoustic and language content, directly from a 45-second speech recording. It leverages an unsupervised classifier and a dynamic local sequence alignment strategy to tag each word. This means that, during deployment, there is no need to carry around a language model, making the approach very practical for applications in multi-language and various diseased-population settings.

    [0214] Clinical validation of two complementary outcome measures: The number of correct words determined by the fully-automated approach was compared with the standard clinical UHDRS-Stroop word score. In general, in terms of the number of correct words, the smartphone and clinical measures are highly correlated (Pearson’s correlation coefficient r = 0.81, p < 0.001), as shown in FIG. 11A.

    [0215] The measures were further validated in the HD patient subgroups with speech impairments. Dysarthria, a corresponding clinical measure, appears as one item explicitly in the UHDRS motor evaluation section. It ranges between 0 and 4, from 1 (unclear speech) to 4 (anarthria, the inability to articulate speech at all). In the HD OLE study, there was only one patient with a dysarthria score above 1. Therefore, patients were grouped into two levels: normal speech (dysarthria score = 0, n = 30) and dysarthric speech (dysarthria score > 0, n = 16). Comparison between the normal speech subgroup and the dysarthric speech subgroup showed that the speech rate (words/sec; 1.8 ± 0.3 vs 1.5 ± 0.3; p < 0.01; Cohen’s d = 1.086) and the number of correct words (66.8 ± 15.9 vs 48.7 ± 16.1; p < 0.001; Cohen’s d = 1.110) were both significantly reduced in dysarthric patients, as shown in FIG. 12. As the speech rate was directly derived from the timing of word boundaries regardless of the actual labels, neither clustering nor alignment affects this measure. In addition, the computation of the speech rate using the gradient of the regression line benefits from the entire recording. Therefore, the inventors believe that the speech rate provides a robust and sensitive measure of speech motor impairment, especially in patients with dysarthria.

    [0216] A strong correlation was observed between speech rate and the number of correct words (Pearson’s correlation coefficient r = 0.83, p < 0.001, in FIG. 11B) which indicates that cognitive decline may exist side by side with speech motor impairments. Hence, extracting two complementary measures from one test is useful in understanding the interrelationships.

    [0217] Evaluation of performance in further languages: the results obtained in this study were further expanded upon in a study including HD patients speaking 10 different languages. In particular, the methods described in this example were applied to this multi-lingual cohort using the following words: ‘English’: [‘RED’, ‘GREEN’, ‘BLUE’], ‘German’: [‘ROT’, ‘GRÜN’, ‘BLAU’], ‘Spanish’: [‘ROJO’, ‘VERDE’, ‘AZUL’], ‘French’: [‘ROUGE’, ‘VERT’, ‘BLEU’], ‘Danish’: [‘RØD’, ‘GRØN’, ‘BLÅ’], ‘Polish’: [‘CZERWONY’, ‘ZIELONY’, ‘NIEBIESKI’], ‘Russian’: custom-character ‘Japanese’: custom-character ‘Italian’: [‘ROSSO’, ‘VERDE’, ‘BLU’], ‘Dutch’: [‘ROOD’, ‘GROEN’, ‘BLAUW’]. Of note, for some of these languages all of the words used were monosyllabic (e.g. English, German), whereas for others the words comprised multiple syllables. FIG. 13A shows the distribution of the number of correctly read words determined from sets of recordings in English, French, Italian and Spanish, and FIG. 13B shows the distribution of the number of segments identified (directly prior to clustering, i.e. after refinement and outlier removal) in each of these languages. The data show that the number of correctly read words identified according to the method described above is robust to variations in the length of the words (FIG. 13A), even though multiple syllables in single words are identified as separate entities (FIG. 13B).

    Conclusion

    [0218] In this example, the present inventors developed and showed the clinical applicability of an automated (smartphone-based) Stroop word-reading test that can be self-administered remotely from the patient’s home. The fully-automated approach enables offline analysis of speech data and allows cognitive function and speech motor function to be assessed in patients with HD. The approach is language-independent, using an unsupervised classifier and a dynamic local sequence alignment strategy to tag each word with respect to language content. Words were classified with a high overall accuracy of 0.83 in English-speaking and 0.85 in German-speaking patients, without any pre-trained models. Two complementary outcome measures were clinically validated in 46 patients of the HD OLE study, one for assessing cognitive capability and one for evaluating speech motor impairment. The number of correct words showed excellent correlation with the clinical score measured as part of the UHDRS cognitive test. A reduction in speech rate as well as a worse cognitive score were pronounced in the subgroup of HD patients with dysarthric speech symptoms. In summary, the approach described herein lays the groundwork for self-assessment of disease symptoms using smartphone-based speech tests in large populations. This may ultimately bring great benefit for patients, by improving quality of life, and for clinical trials seeking effective treatments.

    Example 2: Automated Web-based Stroop Word-reading Test for the Remote Monitoring of Disease Symptoms in Heart Failure Patients

    [0219] In this example, the inventors implemented the automated Stroop word-reading test (SWR) described above in the context of remote monitoring of disease symptoms in heart failure patients. The same setup as in Example 1 was used, except that the solution was deployed through a web-based application, and that recordings of 40 words (and of variable length in time) were used instead of 45-second recordings, because many patients did not have the physical strength to perform long tests. Two recordings (i.e. 80 words in total) were combined and used for each patient, in order to ensure that the clustering step is performed using enough words to achieve excellent accuracy. The segment identification steps were performed separately for the two recordings, as was the alignment step. However, the clustering step was performed using the data from both recordings. The number of correct words was normalised to take the test duration into account, and the resulting normalised counts were used as the outcome measure.
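    Since the exact normalisation formula is not given here, one plausible sketch is to scale the correct-word count to a common reference duration. The function name, the 45 s reference and the numbers are assumptions made for illustration only:

```python
def normalised_correct_words(n_correct, duration_s, reference_s=45.0):
    """Scale a correct-word count to a common reference duration so that
    tests of different lengths can be compared. Hypothetical helper; the
    study states only that counts were normalised for test duration."""
    return n_correct * reference_s / duration_s

# e.g. 40 correct words in a 30 s test scale to 60 words per 45 s
count_normalised = normalised_correct_words(40, 30.0)
```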

    [0220] Further, in addition to the consistent condition (word count), the interference part of the Stroop word-reading test was assessed using the approach of Example 1, except that the words were displayed in inconsistent colours, as described in Example 3. The inventors found that the outcome measures discussed in Example 1 (speech rate and the number of correct words, for both the consistent part and the interference part of the word-reading test) were indicative of the patient’s shortness of breath and tiredness. These could in turn be used as an indication of the patient’s heart function.

    Example 3: Automated Stroop Word-reading Test - Interference Condition

    [0221] In this example, the inventors tested whether the approach outlined in Example 1 could be used to automatically perform the interference part of the Stroop word-reading test. A cohort of healthy volunteers underwent both a Stroop word-reading test as described in relation to Example 1, and a Stroop colour-word reading test. In particular, the inventors tested the performance of the method by analysing recordings for a Stroop word-reading test and a Stroop colour-word reading test using the same sequence of words, the words being displayed in black for the former and in inconsistent colours for the latter (see FIGS. 14A and 14B). The results of applying the methods described in Example 1 to the two voice recordings obtained from an individual performing those matched tests are shown in FIGS. 14A and 14B. In these figures, segments are highlighted in the middle panel of each figure as coloured sections of signal, and the word predictions are indicated by the colour of the segments. The data show that the segment identification and correct-word counting processes perform equally well for both the consistent condition and the interference condition. Indeed, there is no discrepancy in cluster assignment between the word-reading and interference tests, despite the presence of incorrect words read by the individual in the interference tests. Further, as can also be seen in FIG. 14B, the inventors found that the predicted numbers of correctly read words obtained using the presently described automated assessment method correlated highly with the ground truth obtained by manual annotation of the voice recordings.

    REFERENCES

    [0222] 1. Roos, R.A., Huntington’s disease: a clinical review. Orphanet J Rare Dis, 2010. 5: p. 40.

    [0223] 2. Unified Huntington’s Disease Rating Scale: reliability and consistency. Huntington Study Group. Mov Disord, 1996. 11(2): p. 136-42.

    [0224] 3. Stroop, J.R., Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 1935. 18(6): p. 643-662.

    [0225] 4. Snowden, J., et al., Longitudinal evaluation of cognitive disorder in Huntington’s disease. J Int Neuropsychol Soc, 2001. 7(1): p. 33-44.

    [0226] 5. Tabrizi, S.J., et al., Biological and clinical changes in premanifest and early stage Huntington’s disease in the TRACK-HD study: the 12-month longitudinal analysis. Lancet Neurol, 2011. 10(1): p. 31-42.

    [0227] 6. Stout, J.C., et al., Evaluation of longitudinal 12 and 24 month cognitive outcomes in premanifest and early Huntington’s disease. J Neurol Neurosurg Psychiatry, 2012. 83(7): p. 687-94.

    [0228] 7. Tabrizi, S.J., et al., Potential endpoints for clinical trials in premanifest and early Huntington’s disease in the TRACK-HD study: analysis of 24 month observational data. Lancet Neurol, 2012. 11(1): p. 42-53.

    [0229] 8. Toh, E.A., et al., Comparison of cognitive and UHDRS measures in monitoring disease progression in Huntington’s disease: a 12-month longitudinal study. Transl Neurodegener, 2014. 3: p. 15.

    [0230] 9. Kenneth, D.J., Temporal constraints and characterising syllable structuring. Phonetic Interpretation: Papers in Laboratory Phonology VI., 2003: p. 253-268.

    [0231] 10. Xie, Z.M. and P. Niyogi, Robust Acoustic-Based Syllable Detection. Interspeech 2006 and 9th International Conference on Spoken Language Processing, Vols 1-5, 2006: p. 1571-1574.

    [0232] 11. Wang, D. and S.S. Narayanan, Robust speech rate estimation for spontaneous speech. IEEE Transactions on Audio, Speech, and Language Processing, 2007. 15(8): p. 2190-2201.

    [0233] 12. Rusz, J., et al., Quantitative assessment of motor speech abnormalities in idiopathic rapid eye movement sleep behaviour disorder. Sleep Med, 2016. 19: p. 141-7.

    [0234] 13. Böck, S. and G. Widmer, Maximum filter vibrato suppression for onset detection. 16th International Conference on Digital Audio Effects, Maynooth, Ireland, 2013.

    [0235] 14. Davis, S.B. and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980. 28(4): p. 357-366.

    [0236] 15. Huang, X., A. Acero, and H. Hon, Spoken Language Processing: A guide to theory, algorithm, and system development. Prentice Hall, 2001.

    [0237] 16. Rusz, J., et al., Automatic Evaluation of Speech Rhythm Instability and Acceleration in Dysarthrias Associated with Basal Ganglia Dysfunction. Front Bioeng Biotechnol, 2015. 3: p. 104.

    [0238] 17. Lloyd, S.P., Least-squares quantization in PCM. IEEE Transactions on Information Theory, 1982. 28(2): p. 129-137.

    [0239] 18. Smith, T.F. and M.S. Waterman, Identification of common molecular subsequences. J Mol Biol, 1981. 147(1): p. 195-7.

    [0240] 19. Hlavnicka, J., et al., Automated analysis of connected speech reveals early biomarkers of Parkinson’s disease in patients with rapid eye movement sleep behaviour disorder. Sci Rep, 2017. 7(1): p. 12.

    [0241] 20. Skodda, S., et al., Impaired motor speech performance in Huntington’s disease. J Neural Transm (Vienna), 2014. 121(4): p. 399-407.

    [0242] 21. McFee, B. et al., librosa: Audio and music signal analysis in Python. Proc. of the 14th Python in Science Conf. (SciPy 2015).

    [0243] 22. James Lyons et al. (2020, January 14). jameslyons/python_speech_features: release v0.6.1 (Version 0.6.1). Zenodo. http://doi.org/10.5281/zenodo.3607820

    [0244] All documents mentioned in this specification are incorporated herein by reference in their entirety.

    [0245] The term “computer system” includes the hardware, software and data storage devices for embodying a system or carrying out a method according to the above described embodiments. For example, a computer system may comprise a central processing unit (CPU), input means, output means and data storage, which may be embodied as one or more connected computing devices. Preferably the computer system has a display or comprises a computing device that has a display to provide a visual output display (for example in the design of the business process). The data storage may comprise RAM, disk drives or other computer readable media. The computer system may include a plurality of computing devices connected by a network and able to communicate with each other over that network.

    [0246] The methods of the above embodiments may be provided as computer programs or as computer program products or computer readable media carrying a computer program which is arranged, when run on a computer, to perform the method(s) described above.

    [0247] The term “computer readable media” includes, without limitation, any non-transitory medium or media which can be read and accessed directly by a computer or computer system. The media can include, but are not limited to, magnetic storage media such as floppy discs, hard disc storage media and magnetic tape; optical storage media such as optical discs or CD-ROMs; electrical storage media such as memory, including RAM, ROM and flash memory; and hybrids and combinations of the above such as magnetic/optical storage media.

    [0248] Unless context dictates otherwise, the descriptions and definitions of the features set out above are not limited to any particular aspect or embodiment of the invention and apply equally to all aspects and embodiments which are described.

    [0249] “and/or” where used herein is to be taken as specific disclosure of each of the two specified features or components with or without the other. For example “A and/or B” is to be taken as specific disclosure of each of (i) A, (ii) B and (iii) A and B, just as if each is set out individually herein.

    [0250] It must be noted that, as used in the specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Ranges may be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by the use of the antecedent “about,” it will be understood that the particular value forms another embodiment. The term “about” in relation to a numerical value is optional and means for example +/- 10%.

    [0251] Throughout this specification, including the claims which follow, unless the context requires otherwise, the word “comprise” and “include”, and variations such as “comprises”, “comprising”, and “including” will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.

    [0252] Other aspects and embodiments of the invention provide the aspects and embodiments described above with the term “comprising” replaced by the term “consisting of” or “consisting essentially of”, unless the context dictates otherwise.

    [0253] The features disclosed in the foregoing description, or in the following claims, or in the accompanying drawings, expressed in their specific forms or in terms of a means for performing the disclosed function, or a method or process for obtaining the disclosed results, as appropriate, may, separately, or in any combination of such features, be utilised for realising the invention in diverse forms thereof.

    [0254] While the invention has been described in conjunction with the exemplary embodiments described above, many equivalent modifications and variations will be apparent to those skilled in the art when given this disclosure. Accordingly, the exemplary embodiments of the invention set forth above are considered to be illustrative and not limiting. Various changes to the described embodiments may be made without departing from the spirit and scope of the invention.

    [0255] For the avoidance of any doubt, any theoretical explanations provided herein are provided for the purposes of improving the understanding of a reader. The inventors do not wish to be bound by any of these theoretical explanations.

    [0256] Any section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described.