Automatic Detection Of Neurocognitive Impairment Based On A Speech Sample

Abstract

The invention is a method for automatic detection of neurocognitive impairment, comprising, generating, in a segmentation and labelling step (11), a labelled segment series (26) from a speech sample (22) using a speech recognition unit (24); and generating from the labelled segment series (26), in an acoustic parameter calculation step (12), acoustic parameters (30) characterizing the speech sample (22).

The method is characterised by determining, in a probability analysis step (14), in a particular temporal division of the speech sample (22), respective probability values (38) corresponding to silent pauses, filled pauses and any types of pauses for respective temporal intervals thereof; calculating, in an additional parameter calculating step (15), a histogram by generating an additional histogram data set (42) from the determined probability values (38) by dividing a probability domain into subdomains and aggregating durations of the temporal intervals corresponding to the probability values falling into the respective subdomains; and generating, in an evaluation step (13), decision information (34) by feeding the acoustic parameters (30) and the additional histogram data set (42) into an evaluation unit (32), the evaluation unit (32) using a machine learning algorithm.

The invention is furthermore a data processing system, a computer program product and a computer-readable storage medium for carrying out the method.

Claims

1. A method for automatic detection of neurocognitive impairment, comprising, generating, in a segmentation and labelling step (11), a labelled segment series (26) from a speech sample (22) using a speech recognition unit (24); and generating from the labelled segment series (26), in an acoustic parameter calculation step (12), acoustic parameters (30) characterizing the speech sample (22); characterised by determining, in a probability analysis step (14), in a particular temporal division of the speech sample (22), respective probability values (38) corresponding to silent pauses, filled pauses and any types of pauses for respective temporal intervals thereof; calculating, in an additional parameter calculating step (15), a histogram by generating an additional histogram data set (42) from the determined probability values (38) by dividing a probability domain into subdomains and aggregating durations of the temporal intervals corresponding to the probability values falling into the respective subdomains; and generating, in an evaluation step (13), decision information (34) by feeding the acoustic parameters (30) and the additional histogram data set (42) into an evaluation unit (32), the evaluation unit (32) using a machine learning algorithm.

2. The method according to claim 1, characterised by applying phoneme labels, silent pause labels and filled pause labels in the segmentation and labelling step (11).

3. The method according to claim 2, characterised by applying separate respective phoneme labels for each phoneme in the segmentation and labelling step (11).

4. The method according to claim 1, characterised by dividing, for the probability analysis step (14), the speech sample (22) into temporal intervals of identical lengths.

5. The method according to claim 4, characterised by dividing the speech sample (22) into overlapping temporal intervals.

6. The method according to claim 5, characterised by applying temporal intervals having a length of 10-50 ms and overlapping with each other to an extent of 20-50%.

7. The method according to claim 4, characterised by determining, in the segmentation and labelling step (11), elements and labels of the labelled segment series (26) based on a highest probability value occurring in the corresponding temporal intervals.

8. The method according to claim 1, characterised by dividing, in the additional parameter calculating step (15), the entire probability domain into subdomains of equal size, preferably into at least ten subdomains, more preferably into twenty or fifty subdomains, for calculating the histogram.

9. The method according to claim 1, characterised by calculating, in the additional parameter calculating step (15), a cumulative histogram, by generating the elements of the additional histogram data set (42) from intervals having a probability that is higher than a lower limit of the respective probability subdomain, or from intervals having a probability that is lower than an upper limit of the respective probability subdomain.

10. The method according to claim 1, characterised by applying, in the evaluation step (13), a statistical method, preferably a two-sample statistical t-test, for generating the decision information (34).

11. The method according to claim 1, characterised by determining a decision limit and an error limit of the decision limit as parts of the decision information (34).

12. The method according to claim 11, characterised by determining the decision limit using the machine learning algorithm of the evaluation unit (32), preferably taking into account a bias of the machine learning algorithm, more preferably taking into account an amount of the training data corresponding to a given decision group of the machine learning algorithm, and/or according to a predetermined sensitivity or specificity.

13. The method according to claim 1, characterised by using, as an acoustic parameter (30), at least one parameter selected from the group consisting of a total duration of the speech sample (22), a speech rate, an articulation rate, a number of silent pauses, a number of filled pauses, a number of all pauses, a total length of silent pauses, a total length of filled pauses, a total length of all pauses, an average length of silent pauses, an average length of filled pauses, an average length of all pauses, a ratio of silent pauses in the speech sample (22), a ratio of filled pauses in the speech sample (22), a ratio of all pauses in the speech sample (22), a value obtained by dividing the number of silent pauses by the total duration of the speech sample (22), a value obtained by dividing the number of filled pauses by the total duration of the speech sample (22), and a value obtained by dividing the number of all pauses by the total duration of the speech sample (22).

14. The method according to claim 1, characterised by applying, in the evaluation step (13), an evaluation unit (32) using a “Naive Bayes” (NB), linear “Support Vector Machine” (SVM) or “Random Forest”-type machine learning algorithm.

15. The method according to claim 1, characterised by providing the speech sample (22) in a speech sample generation step (10).

16. Data processing system characterised by comprising, for performing the steps according to claim 1, a speech recognition unit (24) adapted for generating a labelled segment series (26) from a speech sample (22), a parameter extraction unit (28) adapted for extracting acoustic parameters (30) from the labelled segment series (26) and connected to an output of the speech recognition unit (24), an additional parameter extraction unit (40) connected to the output of the speech recognition unit (24) and adapted for generating an additional histogram data set (42), and an evaluation unit (32) connected to an output of the parameter extraction unit (28) and an output of the additional parameter extraction unit (40), and adapted for performing the evaluation of the acoustic parameters (30) and the additional histogram data set (42).

17. The data processing system according to claim 16, characterised by further comprising a sound recording unit (20) connected to an input of the speech recognition unit (24), and/or a display unit (36) connected to an output of the evaluation unit (32), and/or a database, the database being interconnected with the sound recording unit (20), with the speech recognition unit (24), with the parameter extraction unit (28), with the additional parameter extraction unit (40), with the evaluation unit (32) and/or with the display unit (36).

18. Computer program product, characterised in that it comprises instructions which, when executed by a computer, cause the computer to carry out the steps of the method of claim 1.

19. Computer-readable storage medium, characterised in that it comprises instructions which, when executed by a computer, cause the computer to carry out the steps of the method of claim 1.

Description

BRIEF DESCRIPTION OF THE DRAWINGS

[0027] Preferred embodiments of the invention are described below by way of example with reference to the following drawings, where

[0028] FIG. 1 is a flow diagram of a preferred embodiment of the method according to the invention,

[0029] FIG. 2 is the block diagram of a preferred embodiment of the system performing the method according to the invention,

[0030] FIG. 3 is a diagram showing the probabilities estimated for silent pauses, filled pauses and for any pauses combined, over a given temporal interval of a speech sample,

[0031] FIG. 4 shows the estimated probability values for filled pauses of the speech sample according to FIG. 3, showing probability limit values as horizontal lines,

[0032] FIG. 5 shows a histogram and a cumulative histogram calculated from the values of FIG. 4,

[0033] FIG. 6 shows, as a function of the probability limits indicated in FIG. 4, the significance values calculated for the probability values of FIG. 3 from the histograms according to FIG. 5,

[0034] FIG. 7 shows the results of MMSE, ADAS-Cog and clock drawing tests in pharmacological tests related to the therapy of Alzheimer's disease, and

[0035] FIGS. 8A-8F show the results of tests (indicating the values of certain acoustic parameters) related to the therapy of FIG. 7 that were carried out by applying the method according to the invention.

MODES FOR CARRYING OUT THE INVENTION

[0036] FIG. 1 shows a flow diagram of a preferred embodiment of the method according to the invention. In the course of the method, a speech sample is generated in a speech sample generation step 10, preferably using a microphone.

[0037] In a segmentation and labelling step 11, the speech sample is processed, generating a labelled segment series from the speech sample by using a speech recognition unit. Any type of speech recognition system (even a prior art one) can be preferably applied as a speech recognition unit. The features of the speech recognition unit, as well as the requirements therefor, are described in more detail in relation to FIG. 2.

[0038] The labelled segment series includes phoneme labels, silent pause labels, and filled pause labels as different label types, and also contains the initial and final time instances for each label. Preferably, a separate phoneme label is assigned to each phoneme, but the invention also works properly with a simplified solution wherein a common phoneme label is applied for all phonemes, i.e. wherein different phonemes are not treated separately, and only the pauses (without a filler, i.e. silent pauses) of speech are taken into account. Silent pauses are temporal intervals of the speech sample wherein no speech can be heard, i.e. they contain silence. Filled pauses are temporal intervals of the speech sample that are not silent but in which no speech is produced either. These pauses, which are typically filled with some speech sound or filler, reflect the hesitation of the speaker/user. For example, filled pauses can be verbal hesitations such as “hmm” or “uhm”, or the use of other “hesitation sounds” or fillers (in Hungarian, “ööö”-ing, or in the case of English speakers, saying “um”, “uh”, or “er”). Preferably, the labels are the phoneme labels corresponding to the speech sounds of the language spoken in the given speech sample, complemented with the silent pause and filled pause labels.

[0039] In an acoustic parameter calculation step 12, acoustic parameters characteristic of the speech sample are generated from the labelled segment series, which acoustic parameters are, in an evaluation step 13, fed into an evaluation unit that applies a machine learning algorithm for producing decision information. Based on the decision information, it can be decided if the person providing the speech sample is potentially affected by neurocognitive impairment. The evaluation unit adapted for providing the decision information is also described in more detail in relation to FIG. 2.

[0040] In the course of processing the speech sample, in a probability analysis step 14, respective probability values are determined corresponding to silent pauses, filled pauses and any types of pauses (i.e. either silent or filled pauses) for respective temporal intervals of a particular temporal division of the speech sample.

[0041] The probability values obtained in the probability analysis step 14 are preferably used for generating the labelled segment series in the segmentation and labelling step 11, in the course of which the labels of the labelled segment series are determined based on the highest-probability phonemes (including the silent and filled pauses), the segment boundaries being given by the outer boundaries of adjacent temporal intervals that can be labelled with the same label.

[0042] In an additional parameter calculating step 15, the probability values determined in the probability analysis step 14 are used for calculating a histogram by generating an additional histogram data set from the determined probability values through dividing the probability domain into subdomains and aggregating the durations of the temporal intervals corresponding to the probability values falling into the respective subdomains. In the evaluation step 13, decision information is generated by feeding the acoustic parameters and the additional histogram data set into the evaluation unit that applies a machine learning algorithm. The combined evaluation, in the evaluation step 13, of the acoustic parameters derived from the speech sample and the additional histogram data set significantly increases the accuracy of the decision information.

[0043] In FIG. 2, a preferred embodiment of the data processing system adapted for carrying out the steps of the method according to the invention is illustrated. For performing the steps of the method, the data processing system comprises a speech recognition unit 24 adapted for generating a labelled segment series 26 from a speech sample 22, and a parameter extraction unit 28 that is connected to the output of the speech recognition unit 24 and is adapted for extracting acoustic parameters 30 from the labelled segment series 26. The data processing system further comprises an additional parameter extraction unit 40 that is connected to the output of the speech recognition unit 24 and is adapted for generating an additional histogram data set 42, and an evaluation unit 32 that is connected to the outputs of the parameter extraction unit 28 and the additional parameter extraction unit 40 and is adapted for performing the evaluation of the acoustic parameters 30 and the additional histogram data set 42.

[0044] The data processing system preferably comprises a sound recording unit 20 connected to the input of the speech recognition unit 24, and/or a display unit 36 connected to the output of the evaluation unit 32, and/or a database, the database being interconnected with the sound recording unit 20, with the speech recognition unit 24, with the parameter extraction unit 28, with the additional parameter extraction unit 40, with the evaluation unit 32 and/or with the display unit 36.

[0045] In the speech sample generation step 10 of the method (as shown in FIG. 1), a speech sample 22 of human speech is recorded, preferably applying the sound recording unit 20, followed by passing on the recorded speech sample 22 to the speech recognition unit 24 and/or the database (the database is not shown in the figure). The sound recording unit 20 is preferably a telephone or a mobile phone, more preferably a smartphone or a tablet, while the sound recording unit 20 can also be implemented as a microphone or a voice recorder. The sound recording unit 20 is preferably also adapted for conditioning and/or amplifying the speech sample 22.

[0046] The recorded speech sample 22 is processed by the speech recognition unit 24, wherein the speech sample 22 is retrieved either from the sound recording unit 20 or from the database. The speech recognition unit 24 can be implemented by applying a speech recognition system that is known from the prior art and includes commercially available hardware, software, or a combination thereof. Speech recognition systems are typically adapted for performing statistical pattern recognition, which involves, in a training phase, the estimation of the distribution of data belonging to various classes (for example, speech sounds) using a large amount of training data, and, in a testing phase, the classification of a new data point of unknown class, usually based on the so-called Bayes decision rule, i.e. on the highest probability values. A requirement for the speech recognition system applied in the method according to the invention is that it should generate an output consisting of phonemes (as opposed to words), also indicating the start and end point of each phoneme. A labelled segment series 26, which is segmented by applying the initial and final time instances of the phonemes and is labelled to indicate the phoneme included in the given segment of the speech sample 22, is generated at the output of the speech recognition unit 24 (segmentation and labelling step 11). In addition to the phonemes corresponding to the language of the speech sample 22, the speech recognition unit 24 is further adapted to apply labels to the silent pauses and the filled pauses. Thus, the labelled segment series 26 also contains segment information related to the silent and filled pauses.

[0047] Preferably, the speech recognition system that is known in the field as HTK and is publicly available free of charge (http://htk.eng.cam.ac.uk/) can be applied as the speech recognition unit 24.

[0048] The labelled segment series 26 is passed on by the speech recognition unit 24 to the parameter extraction unit 28 and/or to the database.

[0049] In addition to the above, in a probability analysis step 14, the speech recognition unit 24 determines, in a particular temporal division of the speech sample 22, respective probability values 38 corresponding to silent pauses, filled pauses and all types of pauses for pre-defined temporal intervals of the sample. For determining the probability values 38, the speech sample 22 is divided by the speech recognition unit 24 in a predetermined manner into temporal intervals, preferably temporal intervals of identical length, more preferably, overlapping temporal intervals. In a particularly preferable realization, the speech sample 22 is divided by the speech recognition unit 24 into temporal intervals having a length of 10-50 ms and an overlap of 20-50%. For example, temporal intervals having a length of 25 ms and an overlap of 10 ms are applied.
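The temporal division described above can be sketched as follows. This is a minimal illustration, assuming the 25 ms length and 10 ms overlap given as the example; the function name `frame_intervals` and the handling of the trailing remainder are assumptions, not part of the described method.

```python
def frame_intervals(total_ms, length_ms=25.0, overlap_ms=10.0):
    """Return (start_ms, end_ms) pairs for fixed-length, overlapping intervals."""
    if not 0 <= overlap_ms < length_ms:
        raise ValueError("overlap must be non-negative and shorter than the interval")
    hop_ms = length_ms - overlap_ms  # distance between consecutive interval starts
    intervals = []
    start = 0.0
    # Assumption: a trailing remainder shorter than one full interval is dropped.
    while start + length_ms <= total_ms:
        intervals.append((start, start + length_ms))
        start += hop_ms
    return intervals


frames = frame_intervals(100.0)   # a 100 ms excerpt
overlap_fraction = 10.0 / 25.0    # 40 %, inside the 20-50 % range named above
```

With a 10 ms overlap on 25 ms intervals, consecutive intervals start 15 ms apart.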

[0050] For each temporal interval, the probability values of all the phonemes, and also of silent and filled pauses, are determined by the acoustic model of the speech recognition unit 24. FIG. 3 shows, as a function of time, the respective probability values 38 determined for a particular temporal interval of the speech sample 22 for silent pauses, filled pauses and aggregately for any types of pauses.

[0051] In a preferred embodiment, the probability values 38 obtained in the probability analysis step 14 are used for determining labels and segment boundaries in the segmentation and labelling step 11 (indicated with a dashed arrow in FIG. 1). In this case, the highest-probability phonemes (complemented by the silent and filled pauses) corresponding to each temporal interval will become the labels included in the labelled segment series 26, while the segment boundaries can be determined from the boundaries of the temporal intervals. In another preferred embodiment, the labels and segment boundaries of the labelled segment series 26 can also be obtained applying an alternative, known method.
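One possible reading of this labelling embodiment is sketched below: each temporal interval receives its highest-probability label, and runs of adjacent intervals with the same label are merged into one segment. The label names `"sil"` and `"fp"`, the function name and the probability values are hypothetical; an actual speech recognition unit may determine boundaries differently.

```python
def label_segments(frame_probs, hop_ms=15.0, length_ms=25.0):
    """frame_probs: one dict per temporal interval, mapping label -> probability.

    Returns a list of (label, start_ms, end_ms) segments.
    """
    segments = []
    for i, probs in enumerate(frame_probs):
        label = max(probs, key=probs.get)  # highest-probability label
        start = i * hop_ms
        end = start + length_ms
        if segments and segments[-1][0] == label:
            # Same label as the previous interval: extend the open segment.
            segments[-1] = (label, segments[-1][1], end)
        else:
            segments.append((label, start, end))
    return segments


# Invented probabilities for five consecutive intervals.
frame_probs = [
    {"a": 0.7, "sil": 0.2, "fp": 0.1},
    {"a": 0.6, "sil": 0.3, "fp": 0.1},
    {"a": 0.1, "sil": 0.8, "fp": 0.1},
    {"a": 0.1, "sil": 0.2, "fp": 0.7},
    {"a": 0.2, "sil": 0.1, "fp": 0.7},
]
segments = label_segments(frame_probs)
```

Because the sketch uses the outer boundaries of the overlapping intervals, neighbouring segments overlap slightly at their edges; a practical implementation would have to decide how to resolve that.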

[0052] The parameter extraction unit 28 generates acoustic parameters 30 characteristic of the speech sample 22 from the labelled segment series 26, which labelled segment series 26 is received from the speech recognition unit 24 or retrieved from the database. The generated acoustic parameters 30 are passed on by the parameter extraction unit 28 to the database or to the evaluation unit 32.

[0053] A characteristic feature of the acoustic parameters 30 extracted by the parameter extraction unit 28 is that they can be computed from the lengths of the particular segments of the labelled segment series 26, wherein the value of at least one of the acoustic parameters 30 is significantly different in the case of healthy subjects and patients presumably exhibiting neurocognitive impairment. The acoustic parameters 30 therefore contain information applicable for distinguishing healthy subjects from patients presumably exhibiting neurocognitive impairment. A comparison of acoustic parameters 30 extracted from speech samples 22 from patients exhibiting neurocognitive impairment and from speech samples from a control group was performed via a statistical method. A two-sample statistical t-test was preferably applied as a statistical method. The significance values of the two-sample statistical t-test calculated for the acoustic parameters 30 of patients exhibiting neurocognitive impairment and of the control group are summarized in Table 1. The definitions of the acoustic parameters 30 included in Table 1 are summarized in Table 2. In Table 1, the acoustic parameters 30 for which the significance values are lower than 0.05, i.e. for which a significant difference can be detected between patients exhibiting neurocognitive impairment and the control group, are shown in bold type.

TABLE 1 Acoustic parameters 30 and associated significance levels

Acoustic parameter                                       Significance value
total duration of speech sample (ms)                     0.0005
speech rate (1/s)                                        0.1346
articulation rate (1/s)                                  0.1073
number of silent pauses                                  0.0018
number of filled pauses                                  0.0011
number of all pauses                                     0.0008
total length of silent pauses (ms)                       0.0037
total length of filled pauses (ms)                       0.0011
total length of all pauses                               0.0014
silent pauses/duration of speech sample (%)              0.3850
filled pauses/duration of speech sample (%)              0.0398
all pauses/duration of speech sample (%)                 0.2294
number of silent pauses/duration of speech sample (%)    0.1607
number of filled pauses/duration of speech sample (%)    0.1160
number of all pauses/duration of speech sample (%)       0.3861
average length of silent pauses (ms)                     0.1247
average length of filled pauses (ms)                     0.1308
average length of all pauses (ms)                        0.0913

TABLE 2 Definition of acoustic parameters 30

Parameter                                Definition
articulation rate                        number of articulated speech sounds per second during the sample excluding the duration of pauses (1/s)
speech rate                              number of speech sounds per second during the sample including the duration of pauses (1/s)
number of pauses                         number of all pauses during the total duration of the speech sample
duration of pauses                       combined duration of all pauses during the speech sample (ms)
pauses/full duration of speech sample    combined duration of all pauses/total duration of speech sample (%)
pause ratio                              the ratio of the number of all pauses to the number of all segments detected over the total duration of the speech sample (%)
average length of pauses                 average length of pauses calculated for all pauses (ms)
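A minimal sketch of how parameters of this kind could be derived from a labelled segment series is given below. It follows the definitions of Table 2; the label names `"sil"` and `"fp"`, the dictionary keys and the example segments are hypothetical, and the segments are assumed to be time-ordered and non-overlapping.

```python
SILENT, FILLED = "sil", "fp"  # hypothetical label names, not taken from the source


def acoustic_parameters(segments):
    """segments: time-ordered, non-overlapping (label, start_ms, end_ms) tuples."""
    total_ms = segments[-1][2] - segments[0][1]
    silent = [end - start for label, start, end in segments if label == SILENT]
    filled = [end - start for label, start, end in segments if label == FILLED]
    pauses = silent + filled
    n_phonemes = len(segments) - len(pauses)
    speech_ms = total_ms - sum(pauses)  # duration excluding all pauses

    def average(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "total_duration_ms": total_ms,
        "speech_rate_per_s": 1000.0 * n_phonemes / total_ms,
        "articulation_rate_per_s": 1000.0 * n_phonemes / speech_ms if speech_ms else 0.0,
        "n_silent_pauses": len(silent),
        "n_filled_pauses": len(filled),
        "n_all_pauses": len(pauses),
        "silent_total_ms": sum(silent),
        "filled_total_ms": sum(filled),
        "all_pauses_total_ms": sum(pauses),
        "all_pauses_duration_pct": 100.0 * sum(pauses) / total_ms,
        "pause_ratio_pct": 100.0 * len(pauses) / len(segments),
        "avg_pause_ms": average(pauses),
    }


# Invented one-second sample: three phonemes, one silent and one filled pause.
example = [("h", 0, 100), ("e", 100, 200), ("sil", 200, 600),
           ("l", 600, 700), ("fp", 700, 1000)]
params = acoustic_parameters(example)
```

All values depend only on segment lengths and counts, which matches the characteristic feature stated for the acoustic parameters 30.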

[0054] As can be seen from Table 1, the number of silent pauses, the number of filled pauses and the aggregate number of pauses (any pauses), as well as the combined durations of the pauses, show a significant difference between the group with neurocognitive impairment and the control group. Another acoustic parameter 30 that also shows a significant difference is the ratio of the length of filled pauses to the total duration of the speech sample 22, so a further examination of filled pauses seems particularly preferable. Because filled pauses can be easily confused with certain speech sounds (in Hungarian, particularly the sounds ‘ö’ /ø/, /øː/, ‘m’ and ‘n’, and in English, particularly the sounds ‘ah’ or ‘uh’ /ʌ/, ‘er’ /ɜː/ and ‘um’ /ʌm/), it is also preferable to specify, for these phonemes as well, certain parameters similar to the parameters of filled pauses (i.e. number of occurrences, combined length and average length of occurrences, the deviation of lengths), and to add these to the existing set of acoustic parameters 30.
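The group comparison behind significance values of this kind can be illustrated with a two-sample t statistic. The text does not state which variant of the t-test was used, so the unequal-variance (Welch) form below is an assumption, and the pause counts are invented. The sketch stops at the statistic and its degrees of freedom; converting them into a significance value additionally requires the t distribution.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    na, nb = len(group_a), len(group_b)
    # Squared standard errors of the two group means.
    va, vb = variance(group_a) / na, variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Invented numbers of filled pauses per speech sample, for illustration only.
impaired = [12, 15, 14, 10, 13]
control = [7, 9, 6, 8, 10]
t_stat, dof = welch_t(impaired, control)
```

A large absolute t value corresponds to a small significance value, i.e. to a parameter that separates the two groups well.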

[0055] In the additional parameter calculating step 15 according to the invention, additional parameters characteristic of the speech sample 22 are calculated, in the course of which calculation an additional histogram data set 42 is generated from the probability values 38. In the preferred embodiment according to FIG. 2, the additional histogram data set 42 is generated by the additional parameter extraction unit 40. The additional parameter extraction unit 40 is preferably a device having a computational capacity, for example a computer, a tablet computer, or a mobile phone. In a preferred embodiment, the additional parameter extraction unit 40 and the parameter extraction unit 28 can be implemented in a single device, wherein the computational operations of the acoustic parameter calculation step 12 and the additional parameter calculating step 15 are executed sequentially or in parallel. The steps of generating the additional histogram data set 42 are described in detail in relation to FIGS. 3-6.

[0056] The acoustic parameters 30 and the additional histogram data set 42 utilized for generating decision information 34 are retrieved by the evaluation unit 32 either from the database or from the parameter extraction unit 28 and the additional parameter extraction unit 40. Based on the decision information 34 it can be decided whether the person producing the speech sample 22 under examination is healthy or presumably suffers from neurocognitive impairment. The decision information 34 preferably also contains a decision limit and the corresponding error margins, which allows for more sophisticated decision-making. The decision information 34 is preferably displayed by a display unit 36 that is preferably implemented as a device having a screen, e.g. a smartphone or tablet.

[0057] The evaluation unit 32 used for generating the decision information 34 preferably applies a trained machine learning algorithm, and more preferably applies a “Naive Bayes” (NB), linear “Support Vector Machine” (SVM) or “Random Forest” (RF)-type machine learning algorithm. Due to the low number of speech samples 22 gathered from known patients with neurocognitive impairment, the evaluation unit 32 is preferably trained with a machine learning algorithm that can be reliably applied even with a small amount of training data (typically, fewer than 100 samples). For such a small amount of training data, it is preferable to apply the SVM and RF algorithms. The machine learning algorithms have to be trained using speech samples 22 from patients exhibiting neurocognitive impairment and from a healthy control group. The effectiveness of machine learning algorithms can usually be increased when training data having a multitude of well-chosen characteristics is available. We have recognised that the additional histogram data set 42 is capable of describing the speech sample 22 in a manner that increases the accuracy and effectiveness of the decisions made by the machine learning algorithm applied in the evaluation unit 32. The speech sample 22 is described by the additional histogram data set 42 with parameters that are independent of the length of the speech sample 22 and are defined based on probability values. This approach differs from the earlier one applying the acoustic parameters 30, and yields additional training information in relation to the pauses contained in the speech sample 22. By training the machine learning algorithm with both the acoustic parameters 30 and the additional histogram data set 42, the machine learning algorithm operates more effectively and makes higher-quality decisions.
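The idea of training on both feature sets can be sketched with a small Gaussian Naive Bayes classifier written in plain Python. Naive Bayes is used here only because it fits in a few lines; the text itself prefers SVM and RF for small training sets. The function names, the variance floor `var_floor` and all feature values are assumptions made for illustration.

```python
import math
from statistics import mean, pvariance


def feature_vector(acoustic, histogram):
    """Concatenate acoustic parameters and histogram data into one vector."""
    return list(acoustic) + list(histogram)


def fit_gaussian_nb(samples, labels, var_floor=1e-6):
    """Estimate a prior and per-feature Gaussian parameters for each class."""
    model = {}
    for cls in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == cls]
        columns = list(zip(*rows))
        model[cls] = {
            "prior": len(rows) / len(samples),
            "means": [mean(c) for c in columns],
            # The floor keeps a constant feature from producing zero variance.
            "vars": [pvariance(c) + var_floor for c in columns],
        }
    return model


def predict_proba(model, x):
    """Return class -> posterior probability; the values sum to 1."""
    log_post = {}
    for cls, p in model.items():
        lp = math.log(p["prior"])
        for v, m, var in zip(x, p["means"], p["vars"]):
            lp += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_post[cls] = lp
    top = max(log_post.values())  # subtract the maximum for numerical stability
    weights = {c: math.exp(lp - top) for c, lp in log_post.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Invented training data: one acoustic parameter plus two histogram values.
X = [feature_vector([14], [0.30, 0.10]), feature_vector([16], [0.35, 0.12]),
     feature_vector([15], [0.28, 0.09]), feature_vector([6], [0.10, 0.02]),
     feature_vector([8], [0.12, 0.03]), feature_vector([7], [0.15, 0.02])]
y = ["impaired"] * 3 + ["control"] * 3
model = fit_gaussian_nb(X, y)
probs = predict_proba(model, feature_vector([15], [0.31, 0.10]))
```

The classifier sees a single concatenated vector, so the histogram values act as additional features alongside the acoustic parameters.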

[0058] After training the machine learning algorithm, the evaluation unit 32 uses the trained algorithm for generating the decision information 34. To increase the reliability of the method, the machine learning algorithm can be trained further from time to time applying the new data that are stored in the database in the course of the method.

[0059] By generating the decision information 34, the evaluation unit 32 determines whether the acoustic parameters 30 of the speech sample 22 under examination and the additional histogram data set 42 are closer to the corresponding parameters of the speech samples 22 of the group with neurocognitive impairment, or to those of the speech samples 22 corresponding to the control group. The trained machine learning algorithm applied by the evaluation unit 32 preferably assigns a respective probability value (a value between 0 and 1) to the events of the subject belonging to one or the other possible decision group; the sum of the probability values is 1. The trained machine learning algorithm of the evaluation unit 32 preferably also determines the decision limits (together with their error limits) corresponding to the decision information 34, so the decision information 34 preferably also contains the decision limits thus determined and the error limits thereof. The decision limit is a probability limit above which the subject producing the speech sample 22 will be considered to belong to a specific group, for example the group of subjects with neurocognitive impairment, while under the limit the subject is considered to belong to the other group, for example the group of healthy subjects. In the following, the term “decision limit” is taken to mean the decision limit determined for the group with neurocognitive impairment. For example, with a decision limit of 0.5, if the trained machine learning algorithm of the evaluation unit 32 determines a probability value that is lower than 0.5 for belonging to the group with neurocognitive impairment, then the subject producing the given speech sample 22 will be classified as part of the control group, i.e. will be classified as healthy. If this probability value is 0.5 or higher, then the subject will be classified as part of the group having neurocognitive impairment.
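The decision rule of this paragraph amounts to a single comparison. The sketch below assumes that the input is the probability of belonging to the group with neurocognitive impairment; the function name and the group names are illustrative.

```python
def classify(p_impaired, decision_limit=0.5):
    """Assign a group from the probability of neurocognitive impairment."""
    if not 0.0 <= p_impaired <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # A probability equal to the limit counts as impaired, as stated in the text.
    return "impaired" if p_impaired >= decision_limit else "control"


at_limit = classify(0.5)
below_limit = classify(0.49)
stricter = classify(0.55, decision_limit=0.6)
```

Raising the limit, as in the last call, moves borderline subjects into the control group.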

[0060] If the trained machine learning algorithm is biased towards one of the groups, for example the training data applied for training the algorithm includes a higher number of samples belonging to one group, then the decision limit can be preferably set higher or lower than 0.5, depending on the direction of the bias of the trained machine learning algorithm. For example, if there is a bias towards the control group (the training data contain a higher number of samples from the control group), the decision limit is preferably set to a value that is lower than 0.5, while in another example, wherein the training data contain a higher number of samples from the group with neurocognitive impairment, the decision limit preferably should be set to a value that is higher than 0.5.

[0061] Other considerations can also be preferably taken into account for determining the decision limit, for example the probability of false decisions, i.e. the sensitivity and specificity of the decision. If false positives are to be avoided, i.e. if the aim is to avoid classifying the speech sample 22 of a healthy subject as part of the group of subjects with neurocognitive impairment, then the decision limit is expediently set higher than 0.5, for example to a value of 0.6. If, on the contrary, the aim is to identify as many subjects suffering from neurocognitive impairment as possible, even at the cost of falsely classifying some healthy subjects as unhealthy, then it is expedient to set the decision limit to a value lower than 0.5, for example to a value of 0.4.

[0062] The decision limit is preferably determined based on the expected sensitivity and specificity of the decision, preferably applying a separate set of test data that are different from the training data of the machine learning algorithm.
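A possible way of selecting the decision limit from separate test data is sketched below. This is a minimal illustration under stated assumptions, not the procedure of the invention: the candidate grid with a step of 0.05, the selection criterion (highest sensitivity among limits that reach a required specificity) and all function names are hypothetical.

```python
def sensitivity_specificity(probs, labels, limit):
    """Sensitivity and specificity at a given decision limit.

    probs  -- probability values for the impaired group (test data)
    labels -- True for impaired subjects, False for control subjects
    """
    tp = sum(1 for p, y in zip(probs, labels) if y and p >= limit)
    fn = sum(1 for p, y in zip(probs, labels) if y and p < limit)
    tn = sum(1 for p, y in zip(probs, labels) if not y and p < limit)
    fp = sum(1 for p, y in zip(probs, labels) if not y and p >= limit)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def pick_decision_limit(probs, labels, min_specificity, candidates=None):
    """Return (limit, sensitivity, specificity) or None if no limit qualifies.

    Assumed criterion: among the candidate limits whose specificity is at
    least min_specificity, keep the first one with the highest sensitivity.
    """
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best = None
    for limit in candidates:
        sens, spec = sensitivity_specificity(probs, labels, limit)
        if spec >= min_specificity and (best is None or sens > best[1]):
            best = (limit, sens, spec)
    return best
```

Requiring a high specificity pushes the selected limit upwards (fewer false positives), whereas a criterion favouring sensitivity would push it downwards, in accordance with the considerations set out above.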

[0063] In FIGS. 3-6, an example related to a preferred implementation of the additional parameter calculating step 15 is presented. In the example, the speech sample 22 was divided into temporal intervals having a length of 25 ms, overlapping with each other over a duration of 10 ms, followed by determining, in the probability analysis step 14, the respective probability values 38 of silent pauses, filled pauses and all pauses for each temporal interval. These probability values 38 are shown in FIG. 3 for a temporal interval of the examined speech sample 22, in particular, the interval located between 9 and 16 seconds from the start of the speech sample 22. The probability of all pauses (any pauses) is obtained as the sum of the probability of a silent pause and the probability of a filled pause.
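The temporal division and the summation of the pause probabilities can be illustrated by the following sketch. It assumes that the 10 ms overlap means that consecutive 25 ms intervals start 15 ms apart; the function names and the millisecond units are hypothetical choices made for the illustration.

```python
def frame_starts(total_ms: int, frame_ms: int = 25, overlap_ms: int = 10):
    """Start times (ms) of overlapping temporal intervals (sketch).

    Assumption: neighbouring intervals share overlap_ms, so the step
    between interval starts is frame_ms - overlap_ms.
    """
    hop = frame_ms - overlap_ms
    if hop <= 0:
        raise ValueError("overlap must be shorter than the interval length")
    if total_ms < frame_ms:
        return []  # sample too short for a single full interval
    return list(range(0, total_ms - frame_ms + 1, hop))


def any_pause_probability(p_silent, p_filled):
    """Per-interval probability of any pause: silent plus filled."""
    return [s + f for s, f in zip(p_silent, p_filled)]
```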

[0064] FIG. 4 shows the probability values relating to filled pauses as included in FIG. 3. In the figure, the probability domain is divided uniformly into twenty subdomains, so domain boundaries follow each other with a step size of 0.05. The values corresponding to the boundaries of probability domains are indicated in the figure by horizontal lines at the locations where the probability curve surpasses the values corresponding to the probability domain boundaries.

[0065] In FIG. 5, two examples for an additional histogram data set 42 calculated from FIG. 4 are presented. In the first example, the additional histogram data set 42 is generated from the temporal intervals falling between the probability domain boundaries shown in FIG. 4; preferably, the additional histogram data set 42 gives the ratio of the duration of the temporal intervals having a probability between adjacent probability domain boundaries to the total duration of the speech sample 22, i.e. in the additional parameter calculating step 15 a histogram is generated from the probability values 38 relating to filled pauses. By way of example, the value of the additional histogram data set 42 corresponding to the probability domain boundary value of 0.4 gives the ratio, to the total duration of the speech sample 22, of those temporal intervals during which the probability values 38 relating to filled pauses are higher than or equal to 0.4 but do not reach the next probability domain boundary value of 0.45.
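The histogram of the first example can be sketched as follows. This is a minimal illustration, assuming that each temporal interval contributes a known duration and that the total duration of the speech sample is supplied separately; the function and parameter names are hypothetical.

```python
def pause_histogram(probs, durations, total_duration, n_bins=20):
    """Duration-weighted histogram of pause probabilities (sketch).

    probs          -- probability value of each temporal interval
    durations      -- duration attributed to each temporal interval
    total_duration -- total duration of the speech sample (same unit)
    n_bins         -- number of uniform probability subdomains

    Element i is the share of the total duration during which the
    probability lies in [i / n_bins, (i + 1) / n_bins).
    """
    hist = [0.0] * n_bins
    for p, d in zip(probs, durations):
        # A probability of exactly 1.0 is placed into the last subdomain.
        # Values lying exactly on a boundary may be affected by
        # floating-point rounding.
        idx = min(int(p * n_bins), n_bins - 1)
        hist[idx] += d
    return [h / total_duration for h in hist]
```

With twenty subdomains, the element with index 8 corresponds to the boundary value 0.4 and collects the intervals whose probability is at least 0.4 but below 0.45.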

[0066] In a second example, the additional histogram data set 42 is generated from temporal intervals having a probability that is higher than the probability domain boundary values shown in FIG. 4; preferably, the additional histogram data set 42 gives the ratio of temporal intervals having a probability higher than the corresponding probability domain boundaries to the full length of the speech sample 22, i.e. in the additional parameter calculating step 15 a cumulative histogram is generated from the probability values 38 relating to filled pauses. By way of example, the value of the additional histogram data set 42 corresponding to the probability domain boundary value of 0.4 gives the ratio, to the total duration of the speech sample 22, of those temporal intervals during which the probability values 38 relating to filled pauses are higher than or equal to 0.4. The cumulative histogram according to the second example can also be generated from the data series of the histogram calculated according to the first example, by determining the aggregate values of the quantities falling into the classes of the histogram corresponding to the respective probability domain boundaries for each class that is greater than the given class.

[0067] The additional histogram data set 42 can also be generated such that (not shown in FIG. 5) it is calculated from temporal intervals with a probability that is lower than the probability domain boundaries shown in FIG. 4, i.e. preferably the additional histogram data set 42 gives the ratio of temporal intervals having a probability lower than the corresponding probability domain boundaries to the full length of the speech sample 22. A cumulative histogram is also calculated in this case, however, the aggregation is performed in an order from smallest to largest probability values, i.e. an aggregate value is determined of quantities falling into the classes of the histogram of the first example for all classes that are not greater than the given class.
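Both cumulative variants can be derived from the histogram of the first example, as sketched below. The indexing convention is an assumption made for the illustration: element k refers to the boundary at the lower edge of class k, so that the "higher" variant aggregates class k and all greater classes, and the "lower" variant aggregates all classes below class k. The function names are hypothetical.

```python
def cumulative_from_above(hist):
    """Share of duration with probability at or above each boundary."""
    out = [0.0] * len(hist)
    running = 0.0
    # Aggregate from the largest probability class downwards.
    for i in range(len(hist) - 1, -1, -1):
        running += hist[i]
        out[i] = running
    return out


def cumulative_from_below(hist):
    """Share of duration with probability below each boundary."""
    out = []
    running = 0.0
    # Aggregate from the smallest probability class upwards.
    for h in hist:
        out.append(running)
        running += h
    return out
```

Under this convention the two values belonging to the same boundary add up to the sum of the whole histogram.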

[0068] As with the acoustic parameters 30, the additional histogram data set 42 is characteristic of the speech sample 22 and of the subject giving the speech sample 22 as far as the presence of neurocognitive impairment is concerned. Significance values are determined for each element of the additional histogram data set 42. The significance values can be determined by applying an arbitrary statistical method, preferably a two-sample statistical t-test. The additional histogram data set 42, however, usually contains a larger number of data than the acoustic parameters 30. For example, if twenty probability domains are applied, the additional histogram data set 42 has 57 elements, and with fifty probability domains it has 147 elements, because, in addition to the filled pauses, the histogram and/or cumulative histogram shown in FIG. 5 has to be generated also for silent pauses, and, aggregately, for any pauses.
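The test statistic of a two-sample t-test for one element of the additional histogram data set can be sketched as follows. The description does not state which variant of the test is used; the sketch assumes the classical pooled-variance form, and the function name is hypothetical. The significance value (p-value) would then be read from the t-distribution with the returned degrees of freedom, for example with a statistics library, which is not shown here.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(group_a, group_b):
    """Pooled-variance two-sample t statistic (illustrative sketch).

    group_a, group_b -- values of one histogram element in the two
                        groups (e.g. impaired and control subjects);
                        each group needs at least two values and the
                        pooled variance must be non-zero.
    Returns (t, degrees_of_freedom).
    """
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(group_a)
              + (n2 - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df
```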

[0069] The significance values calculated by a two-sample statistical t-test from the elements of the additional histogram data set 42 generated by calculating a cumulative histogram are preferably visualized in a non-tabulated format, i.e. in a diagram or graph.

[0070] In FIG. 6, the significance values corresponding to an additional histogram data set 42 generated by calculating a cumulative histogram are shown as a function of the probability domain boundaries, with the values corresponding to the different types of pause shown as separate curves. The additional histogram data set 42 was generated for the filled pauses, silent pauses and any pauses of the speech sample 22 illustrated in FIG. 3 by calculating a cumulative histogram with twenty probability domains.

[0071] It can be seen in FIG. 6 that in the case of silent pauses (dashed-line curve) the features falling between probability domain boundaries of approximately 0.75 and 0.95 show a significant deviation (p<0.05) between the speech samples 22 of the control group and those of the patients with neurocognitive impairment. The reason is that silent pauses can be easily identified by the speech recognition unit 24, so the probability values 38 corresponding to real silent pauses are high, often exceeding the probability value limit of 0.8.

[0072] On the contrary, in the case of filled pauses (dotted-line curve) a significant deviation (p<0.05) between the speech samples 22 of the control group and the patients with neurocognitive impairment is obtained at most up to a probability domain boundary of 0.15. This is because it is much harder to identify filled pauses, as they can be easily confused with the phonemes representing hesitation (for example, ‘ö’, ‘m’ and ‘n’). Because of that, a relatively high probability value 38 that is still lower than 0.5 is often obtained for filled pauses. In such cases there is a danger that filled pauses are not labelled appropriately as filled pauses in the labelled segment series 26, but the given temporal intervals are instead labelled with a phoneme corresponding to a speech sound, and thus the features corresponding to filled pauses will be determined incorrectly during the calculation of acoustic parameters 30. In contrast to that, these filled pauses appear in the additional histogram data set 42.

[0073] The method according to the invention can be preferably applied also in pharmacological tests, because it is adapted to detect the occurrence of neurocognitive impairments in a more sensitive manner compared to known solutions, and therefore it can also be applied for measuring the patient's progress and the efficacy of therapy.

[0074] In the following, an exemplary application of the method according to the invention for monitoring the therapy of Alzheimer's disease is described. The currently feasible aim of the therapy of Alzheimer's disease is to slow the progress of the disease. One of the most frequently applied commercially available drugs is donepezil, an acetylcholinesterase inhibitor. Therefore, the method according to the invention was applied for examining the efficacy of donepezil therapy of patients with early-stage Alzheimer's disease. The diagnosis of Alzheimer's disease was made using the criteria laid down in DSM-5 (American Psychiatric Association, 2013). The average age of the patients selected in the sample (n=10) was 75 years and their average education was 11.5 years; the sex ratio was 70% females to 30% males, which reflects the general rate of occurrence of Alzheimer's disease. The cognitive tests were carried out twice during the study, at the onset of the therapy and following 3 months of taking donepezil. Of the standard psychometric tests, the MMSE, ADAS-Cog and clock drawing tests (all referred to in the introduction) were performed. The test results are summarized in FIG. 7. In the self-controlled study, performance in MMSE was not different after taking the acetylcholinesterase inhibitor for 3 months, and similarly, the effectiveness of the therapy could not be detected with either the ADAS-Cog or the clock drawing test.

[0075] In FIGS. 8A-8F, changes of the parameter values calculated with the method according to the invention are illustrated. The 3-month donepezil therapy significantly increased the duration of the speech samples 22 (FIG. 8A), and the speech rate (FIG. 8B), i.e. on average, the patients spoke more and more quickly during the test. As a result of the applied therapy, the number of silent pauses has decreased, and, although the number of filled pauses has slightly increased (FIG. 8C), the combined duration of silent pauses and filled pauses (FIG. 8D) and the ratio thereof to the total duration of the speech sample 22 (FIG. 8F) have both decreased. Therefore, the small increase in the number of filled pauses indicated in FIG. 8C was only the result of the increased speech duration. According to FIG. 8E, the number of any pauses during the total duration of the speech sample 22 (the combined number of silent and filled pauses) has significantly decreased as a result of the applied therapy.

[0076] In summary, it can be seen that, contrary to known methods, the method according to the invention is able to detect a significant difference that results from the 3-month donepezil therapy. Based on that, it can also be concluded that the sensitivity of the method according to the invention significantly surpasses the sensitivity of regularly applied test methods.

[0077] The computer program product according to the invention comprises instructions that, when executed by a computer, cause the computer to carry out the steps of the method according to claim 1.

[0078] The computer-readable storage medium according to the invention comprises instructions that, when executed by a computer, cause the computer to carry out the steps of the method according to claim 1.

[0079] The mode of industrial applicability of the invention follows from the essential features of the technical solution according to the foregoing description. It is apparent from the above description that the invention has fulfilled the objectives set before it in an exceedingly advantageous manner compared to the state of the art. The invention is, of course, not limited to the preferred embodiments described in detail above, but further variants, modifications and developments are possible within the scope of protection determined by the claims.
