SPEECH ANALYSIS ALGORITHMIC SYSTEM AND METHOD FOR OBJECTIVE EVALUATION AND/OR DISEASE DETECTION
20210193173 · 2021-06-24
Inventors
- Visar BERISHA (Tempe, AZ, US)
- Ming Tu (Tempe, AZ, US)
- Alan Wisler (Plano, TX, US)
- Julie LISS (Scottsdale, AZ, US)
CPC classification
G10L15/02
PHYSICS
A61B5/7264
HUMAN NECESSITIES
A61B5/4848
HUMAN NECESSITIES
A61B5/7455
HUMAN NECESSITIES
G10L15/22
PHYSICS
A61B5/4803
HUMAN NECESSITIES
G10L2015/025
PHYSICS
Abstract
Systems and methods use patient speech samples as inputs, use subjective multi-point ratings by speech-language pathologists of multiple perceptual dimensions of patient speech samples as further inputs, and extract laboratory-implemented features from the patient speech samples. A predictive software model learns the relationship between speech acoustics and the subjective ratings of such speech obtained from speech-language pathologists, and is configured to apply this information to evaluate new speech samples. Outputs may include objective evaluation of the plurality of perceptual dimensions for new speech samples and/or evaluation of disease onset, disease progression, or disease treatment efficacy for a condition involving dysarthria as a symptom, utilizing the new speech samples.
Claims
1.-10. (canceled)
11. A system for evaluating speech, the system comprising processor circuitry configured to: receive a signal comprising speech from an individual or representing speech from the individual; extract a plurality of features useful for predicting a rating scale-based evaluation of a neurological condition affecting speech or language based on the signal; and generate, using a predictive model, the evaluation of the neurological condition, wherein the predictive model is configured to generate the evaluation based on the plurality of features extracted from the signal.
12. The system of claim 11, wherein the processor circuitry is further configured to determine, based on said evaluation, at least one of disease onset, disease progression, or disease treatment efficacy for the neurological condition affecting speech or language.
13. The system of claim 12, wherein the neurological condition comprises dysarthria as a symptom.
14. The system of claim 11, wherein the system comprises a speech therapeutic device comprising audio input circuitry and stimulus circuitry, wherein the speech therapeutic device is configured to receive the signal comprising speech from the individual and provide a stimulus to the individual based on the evaluation of the rating scale.
15. The system of claim 14, wherein the speech therapeutic device comprises a behind-the-ear device, an ear-mold device, a headset, a headband, a smartphone, or a combination thereof.
16. The system of claim 11, wherein the predictive model is calibrated based on a plurality of expert ratings corresponding to a plurality of patient speech samples.
17. The system of claim 11, wherein the evaluation comprises a plurality of perceptual dimensions comprising two or more of nasality, prosody, articulatory precision, vocal quality, or severity.
18. The system of claim 11, wherein the plurality of features comprises one or more of envelope modulation spectrum, long-term average spectrum, spatio-temporal features, or dysphonia features.
19. The system of claim 11, wherein the plurality of features useful for predicting the rating scale has no more than 50 features.
20. The system of claim 11, wherein the evaluation comprises a multi-point evaluation.
21. The system of claim 11, wherein the processor circuitry is further configured to prompt the individual to read displayed text prior to, or concurrently with, receipt of the signal comprising speech from the individual or representing speech from the individual.
22. The system of claim 11, wherein the plurality of features are selected from laboratory-implemented features using cross-validation and sparsity-based feature selection.
23. The system of claim 11, further comprising a first graphical user interface configured to allow the individual to provide the signal comprising speech from the individual or representing speech from the individual.
24. The system of claim 23, further comprising a second graphical user interface configured to permit a speech-language pathologist or clinician to administer or review the speech.
25. A computer-implemented method for evaluating speech in a system involving processor circuitry, the method comprising: receiving, by the processor circuitry, a signal comprising speech from an individual or representing speech from the individual; extracting, by the processor circuitry, a plurality of features useful for predicting a rating scale-based evaluation of a neurological condition affecting speech or language from the signal; and generating, by the processor circuitry, the rating scale-based evaluation of the neurological condition using a predictive model, wherein the predictive model is configured to generate the evaluation based on the plurality of features extracted from the signal.
26. The method of claim 25, further comprising determining, based on said evaluation, at least one of disease onset, disease progression, or disease treatment efficacy for the neurological condition affecting speech or language.
27. The method of claim 25, further comprising using a speech therapeutic device to receive the signal comprising speech from the individual and provide a stimulus to the individual based on the evaluation of the rating scale.
28. The method of claim 25, wherein the predictive model is calibrated based on a plurality of expert ratings corresponding to a plurality of patient speech samples.
29. The method of claim 25, wherein the evaluation of the rating scale comprises a multi-point evaluation.
30. The method of claim 25, further comprising prompting the individual to read displayed text prior to, or concurrently with, receipt of the signal comprising speech from the individual or representing speech from the individual.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0059] The embodiments set forth herein represent the necessary information to enable those skilled in the art to practice the embodiments and illustrate the best mode of practicing the embodiments. Upon reading the description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. It should be understood that these concepts and applications fall within the scope of the disclosure and the accompanying claims.
[0060] In certain aspects, the present disclosure relates to a method for evaluating speech, a system for evaluating speech, a non-transitory computer readable medium storing software instructions, and a computer program including instructions for causing a processor to carry out a method.
[0061] In certain embodiments, a data matrix may be generated, said data matrix incorporating processed speech samples and speech-language pathologist ratings corresponding to the speech samples. Processing of the speech samples includes extraction of a plurality of laboratory-implemented features (e.g., an envelope modulation spectrum, a long-term average spectrum, spatio-temporal features, and dysphonia features). The speech-language pathologist ratings include subjective multi-point ratings of commonly assessed perceptual dimensions (e.g., two, three, four, or all five of nasality, prosody, articulatory precision, vocal quality, and severity). A subset of the plurality of laboratory-implemented features that is relevant for predicting a plurality of perceptual dimensions, and that simplifies computation by reducing multi-collinearity, is selected. The subset includes a unique set of laboratory-implemented features per dimension, and data therein may be centered and reduced to a manageable number of features (e.g., no greater than about 50, about 40, about 30, or about 25 features per perceptual dimension). The resulting feature set may be employed as an input to a predictive software model (e.g., an objective evaluation linear model) that predicts objective ratings from the down-selected and centered feature set representative of speech acoustics. The predictive software model captures the relationship between speech acoustics and subjective ratings. 
Cross-validation (or more preferably a combination of cross-validation and sparsity based-feature selection) may be used to generate and/or update (e.g., calibrate) a predictive software model that is configured to receive at least one additional patient speech sample and perform at least one of (a) generating an objective evaluation of the plurality of perceptual dimensions utilizing the at least one additional patient speech sample or (b) evaluating at least one of disease onset, disease progression, or disease treatment efficacy for a condition involving dysarthria as a symptom, utilizing the at least one additional patient speech sample. In certain embodiments, the objective evaluation of the plurality of perceptual dimensions includes a multi-point evaluation spanning all five dimensions outlined above.
[0062] The subject matter described herein may be implemented in hardware, software, firmware, or any combination thereof. As such, the terms “function” or “module” may be used herein to refer to hardware, software, and/or firmware for implementing the feature being described.
[0063] In one exemplary implementation, the subject matter described herein may be implemented using a computer readable medium having stored thereon executable instructions that, when executed by the processor of a computer, direct the computer to perform steps. Exemplary computer readable media suitable for implementing the subject matter described herein include disk memory devices (e.g., a compact disc (CD) or a digital video disc (DVD)), chip memory devices (e.g., a USB drive or memory card), programmable logic devices, application specific integrated circuits, network storage devices, and other non-transitory storage media. In one implementation, the computer readable medium may include a memory accessible by a processor of a computer or other like device. The memory may include instructions executable by the processor for implementing any of the methods described herein. In addition, a computer readable medium that implements the subject matter described herein may be located on a single device or computing platform, or may be distributed across multiple physical devices and/or computing platforms. An exemplary processor (also referred to as a processor circuit or processor circuitry) may comprise microprocessor(s), Central Processing Unit(s) (CPU(s)), Application Specific Integrated Circuit(s) (ASIC(s)), Field Programmable Gate Array(s) (FPGA(s)), or the like.
[0064] An initial step in building a predictive software model or decision engine is formation of a data matrix. For all speech samples in a database, a series of laboratory-implemented features are extracted. These laboratory-implemented features include two or more (or more preferably all of) the envelope modulation spectrum, the long-term average spectrum, spatio-temporal features, and dysphonia features. Such features are described hereinafter.
[0065] The envelope modulation spectrum (EMS) is a representation of slow-amplitude modulations in a signal and the distribution of energy in amplitude fluctuations across designated frequencies, collapsed over time. EMS has been shown to be a useful indicator of atypical rhythm patterns in pathological speech.
[0066] Each speech segment in a preexisting pathological speech database, x(t), is filtered into 7 octave bands with center frequencies of 125, 250, 500, 1000, 2000, 4000, and 8000 Hz. h.sub.i(t) denotes the filter associated with the i.sup.th octave. The filtered signal, x.sub.i(t), is then denoted by:
x.sub.i(t)=h.sub.i(t)*x(t)
The envelope in the i.sup.th octave, denoted by e.sub.i(t), is extracted by:
e.sub.i(t)=h.sub.LPF(t)*|H(x.sub.i(t))|
where H(.) is the Hilbert transform and h.sub.LPF(t) is the impulse response of a 20 Hz low-pass filter.
[0067] Once the amplitude envelope of the signal is obtained, the low-frequency variation in the amplitude levels of the signal can be examined. Fourier analysis quantifies the temporal regularities of the signal. Six EMS metrics are then computed from the resulting envelope spectrum for each of the 7 octave bands, x.sub.i(t), and the full signal, x(t): 1) Peak frequency, 2) Peak amplitude, 3) Energy in the spectrum from 3-6 Hz, 4) Energy in the spectrum from 0-4 Hz, 5) Energy in the spectrum from 4-10 Hz, and 6) Energy ratio between 0-4 Hz band and 4-10 Hz band.
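As a rough numerical sketch of the EMS computation above, the envelope extraction and the six metrics might look as follows in Python; FFT-based brick-wall filtering stands in for the octave filterbank and the 20 Hz low-pass filter, whose exact designs are not specified here:

```python
import numpy as np

def analytic_envelope(x):
    """Magnitude of the analytic signal (Hilbert envelope), via FFT."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(X * h))

def lowpass(x, fs, cutoff_hz=20.0):
    """Brick-wall FFT low-pass (stands in for the 20 Hz filter h_LPF)."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(X, len(x))

def ems_metrics(x, fs):
    """Six EMS metrics from the envelope spectrum of a (band-limited) signal."""
    env = lowpass(analytic_envelope(x), fs)
    env = env - env.mean()                      # drop the DC component
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
    band = lambda lo, hi: spec[(freqs >= lo) & (freqs <= hi)].sum()
    lowband = (freqs > 0) & (freqs <= 10.0)     # modulation region of interest
    peak_idx = np.argmax(spec[lowband])
    e04, e36, e410 = band(0.0, 4.0), band(3.0, 6.0), band(4.0, 10.0)
    return {
        "peak_freq": freqs[lowband][peak_idx],
        "peak_amp": spec[lowband][peak_idx],
        "energy_3_6": e36,
        "energy_0_4": e04,
        "energy_4_10": e410,
        "ratio_0_4_to_4_10": e04 / (e410 + 1e-12),
    }
```

Applying `ems_metrics` to each filtered octave band x.sub.i(t) and to the full signal x(t) yields the 8 × 6 EMS feature set described above.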
[0068] The long-term average spectrum (LTAS) captures atypical average spectral information in the signal. Nasality, breathiness, and atypical loudness variation, which are common causes of intelligibility deficits in dysarthric speech, present as atypical distributions of energy across the spectrum; LTAS measures these cues in each octave. For each of the 7 octave bands, x.sub.i(t), and the full signal, x(t), the following are extracted: 1) average normalized root mean square (RMS) energy, 2) RMS energy standard deviation, 3) RMS energy range, and 4) pairwise variability of RMS energy between ensuing 20 ms frames.
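The per-band RMS statistics above can be sketched with simple 20 ms framing; the normalization convention (dividing by the mean frame RMS) is an assumption for illustration:

```python
import numpy as np

def ltas_rms_stats(x, fs, frame_ms=20):
    """RMS-energy statistics over consecutive 20 ms frames: mean of the
    normalized energies, standard deviation, range, and pairwise
    variability between ensuing frames."""
    hop = int(fs * frame_ms / 1000)
    n_frames = len(x) // hop
    frames = x[: n_frames * hop].reshape(n_frames, hop)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    norm = rms / (rms.mean() + 1e-12)           # normalization assumed
    return {
        "rms_mean": norm.mean(),
        "rms_std": norm.std(),
        "rms_range": norm.max() - norm.min(),
        "rms_pairwise_var": np.abs(np.diff(norm)).mean(),
    }
```

As with EMS, these four statistics would be computed per octave band and for the full signal.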
[0069] The spatio-temporal features capture the evolution of vocal tract shape and dynamics in different time scales via auto- and cross-correlation analysis of formant tracks and mel-frequency cepstral coefficients (MFCC).
[0070] The dysphonia features capture atypical vocal quality through the analysis of pitch changes and pitch amplitude changes over time.
[0071] The data matrix generated by processing the speech samples and extracting the laboratory-implemented features results in high-dimensional data. Regression in high-dimensional space is notoriously difficult: the number of exemplars required grows exponentially with the intrinsic dimension of the data. Thus, a processor-implemented routine is constructed and implemented to select only a relevant subset of these features, through a combination of cross-validation and sparsity-based feature selection (e.g., involving lasso or ℓ.sub.1-regularized regression). Restated, subsets of acoustic metrics that map to perceptual ratings are identified. The selection criterion aims to (1) identify a subset of laboratory-implemented features that are relevant for predicting each of the five perceptual dimensions (nasality, prosody, articulatory precision, vocal quality, and severity) and (2) reduce the multi-collinearity problem, thereby enabling practical computation. This subset selection results in a unique set of features per perceptual dimension. Following this down-selection, principal components analysis may be used to center the data and further reduce the feature set to a manageable number (e.g., no greater than about 50, about 40, about 30, or about 25) for each dimension. This new centered feature set may advantageously be used as an input to the predictive software model, to permit objective evaluation of the plurality of perceptual dimensions (nasality, prosody, articulatory precision, vocal quality, and severity) from an additional patient speech sample. Automated acoustic measures disclosed herein are specifically designed to address challenges of dysarthric speech analysis.
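Assuming the sparsity-based step has already selected a column subset for a given perceptual dimension, the PCA centering and reduction described above might be sketched as a plain SVD projection (the cap of 25 components is one of the example values given):

```python
import numpy as np

def pca_reduce(features, k=25):
    """Center a (samples x features) matrix and project it onto its
    top-k principal components, capping the per-dimension feature count."""
    centered = features - features.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    k = min(k, Vt.shape[0])
    return centered @ Vt[:k].T
```

The resulting columns are ordered by explained variance, so truncation keeps the most informative directions of the down-selected feature set.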
[0072] For each perceptual dimension, the predictive software model (e.g., an objective evaluation linear model) predicts an objective rating (optionally expressed on a multi-point scale, such as a 7-point scale) from the down-selected and centered speech acoustics. In certain embodiments, cross-validation is used to train the predictive software model. Cross-validation involves partitioning the data matrix into complementary subsets, learning the parameters of the decision engine on one subset (training speakers), and validating on the remaining subset (testing speakers). The error on the (held out) test data set is used to assess the predictive power of the predictive software model. A framework for generating a predictive software model utilizing cross-validation and sparsity-based feature selection (e.g., lasso or ℓ.sub.1-regularized regression) follows.
[0073] In general, a sparse statistical model is one in which only a relatively small number of parameters (or predictors) play an important role.
[0074] A leading example of a method that employs sparsity is linear regression, in which N observations of an outcome variable y.sub.i and p associated predictor variables (or features) x.sub.i=(x.sub.i1, . . . , x.sub.ip).sup.T are observed. The goal is to predict an outcome from the predictors—both for actual prediction of future data and also to discover which predictors play an important role. A linear regression model assumes that:
y.sub.i=β.sub.0+Σ.sub.j=1.sup.px.sub.ijβ.sub.j+e.sub.i
where β.sub.0 and β=(β.sub.1, β.sub.2, . . . β.sub.p) are unknown parameters and e.sub.i is an error term. The method of least-squares provides estimates of the parameters by minimization of the least-squares objective function:
min.sub.(β.sub.0.sub.,β) Σ.sub.i=1.sup.N(y.sub.i−β.sub.0−Σ.sub.j=1.sup.px.sub.ijβ.sub.j).sup.2
[0075] One limitation with the least-squares method is that interpretation of the final model is challenging if p is large. If p>N, then the least-squares estimates are not unique. In such a situation, an infinite set of solutions will make the objective function equal to zero, and these solutions tend to overfit the data as well.
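The p>N degeneracy is easy to see numerically: with more predictors than observations, even pure noise can be fit exactly. A toy illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 10                                  # more predictors than observations
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

# Minimum-norm least-squares solution: the residual is (numerically) zero,
# i.e., random noise is fit exactly -- the hallmark of overfitting.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta
```

Any vector in the null space of X can be added to `beta` without changing the fit, which is why the least-squares estimate is not unique in this regime.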
[0076] In view of the limitations of the least-squares method, there is a need to constrain, or regularize, the estimation process. Such need is addressed by “lasso” or “ℓ.sub.1-regularized” regression, in which parameters are estimated by solving the problem:
min.sub.(β.sub.0.sub.,β) Σ.sub.i=1.sup.N(y.sub.i−β.sub.0−Σ.sub.j=1.sup.px.sub.ijβ.sub.j).sup.2 subject to Σ.sub.j=1.sup.p|β.sub.j|≤t
where t is a user-specified parameter. The parameter t can be considered a budget on the total ℓ.sub.1 norm of the parameter vector, and the lasso finds the best fit within this budget. If the budget t is small enough, the lasso yields sparse solution vectors, with only some coordinates nonzero. The bound t in the lasso criterion is a kind of budget: it limits the sum of the absolute values of the parameter estimates and thereby controls the complexity of the model. In particular, larger values of t free up more parameters and allow the model to adapt more closely to the training data. Conversely, smaller values of t restrict the parameters more, leading to sparser, more interpretable models that fit the data less closely. Among the ℓ.sub.q norms, the ℓ.sub.1-norm represents the smallest value of q that yields a convex problem. Convexity simplifies the computation and allows for scalable algorithms that can handle problems with a multitude of parameters.
[0077] The advantages of sparsity are therefore interpretation of the fitted model and computational convenience. But in recent years, a third advantage has emerged from mathematical analysis of this area, with such advantage being termed the “bet on sparsity” principle, namely: Use a procedure that does well in sparse problems, since no procedure does well in dense problems.
[0078] The lasso estimator for linear regression is a method that combines the least-squares loss with an ℓ.sub.1-constraint (or bound) on the sum of the absolute values of the coefficients. Relative to the least-squares solution, this constraint has the effect of shrinking the coefficients, and even setting some to zero. In this way, it provides an automatic method for performing model selection in linear regression. Moreover, unlike some other criteria for model selection, the resulting optimization problem is convex, and can be solved efficiently for large problems.
[0079] Given a collection of N predictor-response pairs {(x.sub.i,y.sub.i)}.sub.i=1.sup.N, the lasso finds the solution ({circumflex over (β)}.sub.0, {circumflex over (β)}) to the optimization problem:
min.sub.(β.sub.0.sub.,β) {(1/2N)Σ.sub.i=1.sup.N(y.sub.i−β.sub.0−Σ.sub.j=1.sup.px.sub.ijβ.sub.j).sup.2} subject to Σ.sub.j=1.sup.p|β.sub.j|≤t
[0080] The preceding (“subject to . . . ”) constraint can be written more compactly as the ℓ.sub.1-norm constraint ∥β∥.sub.1≤t. Furthermore, the lasso optimization problem is often represented using matrix-vector notation. If y=(y.sub.1, . . . , y.sub.N) denotes the N-vector of responses and X is an N×p matrix with x.sub.i ∈ R.sup.p in its i.sup.th row, then the lasso optimization problem can be re-expressed as:
min.sub.(β.sub.0.sub.,β) {(1/2N)∥y−β.sub.01−Xβ∥.sub.2.sup.2} subject to ∥β∥.sub.1≤t
where 1 is the vector of N ones, and ∥.Math.∥.sub.2 denotes the usual Euclidean norm on vectors.
[0081] The predictors X may be standardized so that each column is centered according to:
(1/N)Σ.sub.i=1.sup.Nx.sub.ij=0
and has unit variance:
(1/N)Σ.sub.i=1.sup.Nx.sub.ij.sup.2=1
[0082] Without standardization, the lasso solutions would depend on the units (e.g., pounds vs. kilograms, or meters vs. feet) used to measure the predictors; standardization would not be necessary only if all features were measured in the same units. For convenience, the outcome values y.sub.i may be centered (such that the intercept term β.sub.0 can be omitted in the lasso optimization), with such centering meaning that:
(1/N)Σ.sub.i=1.sup.Ny.sub.i=0
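The standardization and centering conventions above can be sketched as:

```python
import numpy as np

def standardize(X, y):
    """Center y, and center/scale the columns of X so that
    (1/N) * sum_i x_ij = 0 and (1/N) * sum_i x_ij**2 = 1."""
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt((Xc ** 2).mean(axis=0))
    return Xs, y - y.mean()
```

After this step the intercept drops out of the lasso optimization, as noted above.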
[0083] It is often convenient to rewrite the lasso problem in the so-called Lagrangian form:
min.sub.β {(1/2N)Σ.sub.i=1.sup.N(y.sub.i−Σ.sub.j=1.sup.px.sub.ijβ.sub.j).sup.2+λΣ.sub.j=1.sup.p|β.sub.j|}
for some λ≥0. By Lagrangian duality, there is a one-to-one correspondence between the constrained problem and the Lagrangian form. That is, for each value of t in the constraint ∥β∥.sub.1≤t, there is a corresponding value of λ that yields the same solution from the Lagrangian form.
[0084] In order to estimate the best value for t, artificial training and test sets can be created by splitting the given dataset at random, and performance can be estimated on the test data, using cross-validation. One group may be fixed as the test set, and the remaining groups may be designated as the training set. The lasso may be applied to the training data for a range of different values of t, and each fitted model may be used to predict the responses in the test set, recording the mean-squared prediction errors for each value of t. This process is repeated a total number of times equal to the number of groups of data. In this way, a number of different estimates of the prediction error are obtained over a range of values of t.
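The fold mechanics described above can be sketched as follows; for brevity, a closed-form ridge fit stands in for the lasso solve (an assumption made here only to keep the sketch short), since the cross-validation loop over the tuning parameter is identical in either case:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (stand-in for the lasso fit)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_select(X, y, lams, k=5, seed=0):
    """K-fold cross-validation: each group is held out once as the test
    set, the model is fit on the remaining groups, and the tuning value
    with the lowest mean held-out MSE is returned."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    mean_errs = []
    for lam in lams:
        errs = []
        for f in folds:
            train = np.setdiff1d(idx, f)
            beta = ridge_fit(X[train], y[train], lam)
            errs.append(((y[f] - X[f] @ beta) ** 2).mean())
        mean_errs.append(float(np.mean(errs)))
    return lams[int(np.argmin(mean_errs))], mean_errs
```

Each tuning value is scored on every held-out group, yielding the per-value prediction-error estimates described in the paragraph above.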
[0085] The lasso problem is a quadratic program with a convex constraint. Many sophisticated quadratic programming methods exist for solving the lasso. One simple and effective computational algorithm that may be employed utilizes the criterion in Lagrangian form, namely:
min.sub.β {(1/2N)Σ.sub.i=1.sup.N(y.sub.i−Σ.sub.j=1.sup.px.sub.ijβ.sub.j).sup.2+λΣ.sub.j=1.sup.p|β.sub.j|}
[0086] It may be assumed that y.sub.i and the features x.sub.ij are standardized so that:
(1/N)Σ.sub.i=1.sup.Ny.sub.i=0, (1/N)Σ.sub.i=1.sup.Nx.sub.ij=0, and (1/N)Σ.sub.i=1.sup.Nx.sub.ij.sup.2=1
and the intercept term β.sub.0 can be omitted. The Lagrangian form is especially useful for numerical computation of the solution by a simple procedure known as coordinate descent. A simple coordinate-wise scheme for solving the lasso problem involves repeatedly cycling through the predictors in a fixed (but arbitrary) order (e.g., j=1, 2, . . . p), wherein at the j.sup.th step, the coefficient β.sub.j is updated by minimizing the objective function in this coordinate while holding all other coefficients {{circumflex over (β)}.sub.k, k≠j} fixed at their current values.
[0087] If the Lagrangian form objective is rewritten, for the j.sup.th coordinate, as:
(1/2N)Σ.sub.i=1.sup.N(y.sub.i−Σ.sub.k≠jx.sub.ikβ.sub.k−x.sub.ijβ.sub.j).sup.2+λΣ.sub.k≠j|β.sub.k|+λ|β.sub.j|
then the solution for each β.sub.j can be expressed in terms of the partial residual
r.sub.i.sup.(j)=y.sub.i−Σ.sub.k≠jx.sub.ik{circumflex over (β)}.sub.k,
which removes, from the outcome, the current fit from all but the j.sup.th predictor. In terms of this partial residual, the j.sup.th coefficient is updated as:
{circumflex over (β)}.sub.j=S.sub.λ((1/N)Σ.sub.i=1.sup.Nx.sub.ijr.sub.i.sup.(j))
(In the preceding equation, S.sub.λ represents a soft-thresholding operation S.sub.λ(x) that translates its argument x toward zero by the amount λ and sets it to zero if |x|≤λ.)
Equivalently, the update can be written as:
{circumflex over (β)}.sub.j←S.sub.λ({circumflex over (β)}.sub.j+(1/N)Σ.sub.i=1.sup.Nx.sub.ijr.sub.i)
where the full residuals are:
r.sub.i=y.sub.i−Σ.sub.j=1.sup.px.sub.ij{circumflex over (β)}.sub.j.
[0088] The numerical computation algorithm operates by applying this soft-thresholding update repeatedly in a cyclical manner, updating the coordinates of {circumflex over (β)} (and therefore the residual vectors) along the way. Such algorithm corresponds to the method of cyclical coordinate descent, which minimizes the convex objective along each coordinate at a time. Under relatively mild conditions, such coordinate-wise minimization schemes applied to a convex function converge to a global optimum.
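A minimal sketch of this cyclical soft-thresholding scheme, assuming standardized predictors and a centered response as described in the preceding paragraphs:

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lam(z): move z toward zero by lam; zero it out if |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Cyclical coordinate descent for
    (1/2N)||y - X beta||^2 + lam * ||beta||_1,
    assuming columns of X have mean 0 and mean square 1, and y is centered."""
    N, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                                  # full residual y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            # partial-residual inner product; the beta[j] term adds back
            # the j-th predictor's current contribution to the fit
            rho = beta[j] + X[:, j] @ r / N
            new = soft_threshold(rho, lam)
            r += X[:, j] * (beta[j] - new)        # keep the residual current
            beta[j] = new
    return beta
```

Because each column has unit mean square, the coordinate update reduces to the soft-threshold of the partial-residual inner product, exactly as in the update equations above.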
[0089] In other embodiments, a method of pathwise coordinate descent may be used to compute a lasso solution not only for a single fixed value of λ, but rather for an entire path of solutions over a range of possible λ values. Such a method may begin with a value of λ just large enough that the only optimal solution is the all-zeroes vector, and then repeatedly decrease λ by a small amount, running coordinate descent until convergence at each value.
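A sketch of the pathwise variant follows; the geometric λ grid and the stopping point at one hundredth of the starting value are illustrative choices, not from the disclosure:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_path(X, y, n_lams=20, n_sweeps=100):
    """Pathwise coordinate descent: begin at lam_max, where the all-zero
    vector is optimal, then decrease lam geometrically, warm-starting each
    solve from the previous solution. Assumes standardized X, centered y."""
    N, p = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / N         # all-zero is optimal here
    lams = lam_max * np.logspace(0, -2, n_lams)   # geometric grid (assumed)
    beta = np.zeros(p)
    path = []
    for lam in lams:
        for _ in range(n_sweeps):
            for j in range(p):
                # recompute the full residual for clarity (not speed)
                rho = beta[j] + X[:, j] @ (y - X @ beta) / N
                beta[j] = soft_threshold(rho, lam)
        path.append(beta.copy())
    return lams, np.array(path)
```

Warm-starting makes each successive solve cheap, since the solution changes only slightly between neighboring λ values.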
[0090] In certain embodiments, one or more routines or algorithms of the predictive software model may be implemented in R programming language, which is an open source programming language and software environment. R is a GNU package that is supported by the R Foundation for Statistical Computing (Vienna, Austria). If desired, other programming languages or software environments may be employed.
[0092] As noted previously, existing objective measures in speech and language clinics focus on measuring aspects of speech signals that are not interpretable in clinical settings. Examples of such objective measures include instruments that measure pitch, formants, energy, and other similar metrics.
[0093] In contrast to these existing objective measures in speech and language clinics, embodiments according to the present disclosure are useful for bridging the subjective-objective divide by blending the face validity of perceptual assessment with the reliability of objective measures. Advances in signal processing and machine-learning in conjunction with the present disclosure are leveraged to model expert perceptual judgments, and to facilitate predictive software modeling of perceptual ratings of speech. Comparisons of outcomes between laboratory data and those collected in clinical settings inform the theories that support the model with real-world data. Technical capabilities will advance with the refinement of the speech algorithms to optimize their performance. Technology that affords stable objective measures of speech that map to expert perceptual ratings is anticipated to have high clinical impact. In particular, systems and methods disclosed herein may offer a platform to sensitively assess treatment efficacy, disease onset, disease progression, and the like with unbiased perception-calibrated metrics.
[0094] While acoustic analysis of disordered speech is commonplace in research, technology has yet to be developed that adds clinical value. The approach disclosed herein is novel in several ways.
[0095] In certain embodiments, signal processing capabilities and machine learning algorithms may be leveraged to model (weighted) perceptions of experts (e.g., speech-language pathologists) in the generation and use of a predictive software model. Thus, the output of the predictive software model is immediately clinically transparent, and does not require any norms or references for comparison.
[0096] In certain embodiments, predictive software models disclosed herein are “learners,” meaning that the algorithms become more refined with each iteration.
[0097] In certain embodiments, systems and methods disclosed herein may be integrated in a telehealth platform. Such integration would be transformative, expanding the videoconference capabilities of current remote methods to provide analytical capabilities.
[0098] The predictive software model may be generated or updated (e.g., calibrated), according to step 44, using ℓ.sub.1-regularized regression, or more specifically a combination of cross-validation and sparsity-based feature selection. Following generation or updating of the predictive software model, an additional patient speech sample may be obtained for processing with the predictive software model. According to step 46, a patient may be prompted (e.g., by a visual display device) to read text, optionally in conjunction with the provision to the patient of user-perceptible (e.g., tactile, visible, auditory, or the like) feedback while the patient reads the displayed text, to alert the patient to attainment of one or more conditions indicative of a speech problem. Upon generation of the additional speech sample, such sample may be received (e.g., electronically received) by a speech evaluation system incorporating the predictive software model according to step 48. Operation of the predictive software model on the additional speech sample may result in one or more of (a) generating an objective evaluation of the plurality of perceptual dimensions utilizing the additional patient speech sample, according to step 54; or (b) evaluating disease and/or treatment state (e.g., at least one of disease onset, disease progression, or disease treatment efficacy) for a condition involving dysarthria as a symptom, according to step 50. With respect to performance of either or both of steps 50, 54, a clinician may be notified of the result of the evaluation and an electronic patient record may be stored or updated according to steps 52, 66. Moreover, following performance of step 54, results of the objective evaluation of the plurality of perceptual dimensions may be supplied to the predictive software model to enable the model to be updated, by returning to step 44.
[0103] The audio input circuitry 108 may include at least one microphone. In certain embodiments, the audio input circuitry 108 may include a bone conduction microphone, a near field air conduction microphone array, or a combination thereof. The audio input circuitry 108 may be configured to provide an input signal 122 that is indicative of the speech 116 provided by the patient 62 to the processing circuitry 110. The input signal 122 may be formatted as a digital signal, an analog signal, or a combination thereof. In certain embodiments, the audio input circuitry 108 may provide the input signal 122 to the processing circuitry 110 over a personal area network (PAN). The PAN may comprise Universal Serial Bus (USB), IEEE 1394 (FireWire), Infrared Data Association (IrDA), Bluetooth, ultra-wideband (UWB), Wi-Fi Direct, or a combination thereof. The audio input circuitry 108 may further comprise at least one analog-to-digital converter (ADC) to provide the input signal 122 in digital format.
[0104] The processing circuitry 110 may include a communication interface (not shown) coupled with the network 104 and a processor (e.g., an electrically operated processor (not shown) configured to execute a pre-defined and/or a user-defined machine readable instruction set, such as may be embodied in computer software) configured to receive the input signal 122. The communication interface may include circuitry for coupling to the PAN, a local area network (LAN), a wide area network (WAN), or a combination thereof. The processing circuitry 110 is configured to communicate with the server 106 via the network 104. In certain embodiments, the processing circuitry 110 may include an ADC to convert the input signal 122 to digital form. In other embodiments, the processing circuitry 110 may be configured to receive the input signal 122 from the PAN via the communication interface. The processing circuitry 110 may further comprise level detect circuitry, adaptive filter circuitry, voice recognition circuitry, or a combination thereof. The processing circuitry 110 may be further configured to process the input signal 122 and to provide an alert signal 124 to the stimulus circuitry 114.
[0105] The processor may be further configured to generate a record indicative of the alert signal 124. The record may comprise a rule identifier and an audio segment indicative of the speech 116 provided by the patient 62. In certain embodiments, the audio segment may have a total time duration of at least one second before the alert signal 124 and at least one second after the alert signal 124. Other time intervals may be used. For example, in other embodiments, the audio segment may have a total time duration of at least three seconds, at least five seconds, or at least ten seconds before the alert signal 124 and at least three seconds, at least five seconds, or at least ten seconds after the alert signal 124. In other embodiments, at least one reconfigurable rule may comprise a pre-alert time duration and a post-alert time duration, wherein the audio segment may have a total time duration of at least the pre-alert time duration before the alert signal 124 and at least the post-alert time duration after the alert signal 124. In certain embodiments, the foregoing audio segments may be used as patient speech samples according to speech evaluation systems and methods disclosed herein. By identifying conditions indicative of speech errors in speech samples, samples exhibiting indications of dysarthria may be identified (e.g., flagged) and preferentially stored, aggregated, and/or used by a speech evaluation system.
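The pre-/post-alert capture described above could be sketched with a rolling frame buffer; the class interface and the frame-count (rather than seconds-based) durations are illustrative assumptions, not from the disclosure:

```python
from collections import deque

class AlertRecorder:
    """Keep a rolling buffer of the most recent pre_frames audio frames.
    When an alert fires, snapshot that buffer and keep capturing until
    post_frames further frames have arrived, then emit the full segment."""

    def __init__(self, pre_frames, post_frames):
        self.pre = deque(maxlen=pre_frames)
        self.post_frames = post_frames
        self.segment = None
        self.remaining = 0

    def alert(self):
        # snapshot the pre-alert context and start post-alert capture
        self.segment = list(self.pre)
        self.remaining = self.post_frames

    def feed(self, frame):
        """Feed one frame; returns the completed segment once post-alert
        capture finishes, else None."""
        done = None
        if self.remaining > 0:
            self.segment.append(frame)
            self.remaining -= 1
            if self.remaining == 0:
                done, self.segment = self.segment, None
        self.pre.append(frame)
        return done
```

For example, with three pre-alert frames and two post-alert frames, an alert raised after frame 10 of a stream yields the segment of frames 8 through 12, mirroring the pre-/post-alert durations in the paragraph above.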
[0106] A record corresponding to a speech sample may optionally include a location identifier, a time stamp, or a combination thereof indicative of the alert signal 124. The location identifier may comprise a Global Positioning System (GPS) coordinate, a street address, a contact name, a point of interest, or a combination thereof. In certain embodiments, a contact name may be derived from the GPS coordinate and a contact list associated with the patient 62. The point of interest may be derived from the GPS coordinate and a database including a plurality of points of interest. In certain embodiments, the location identifier may be a filtered location for maintaining the privacy of the patient 62. For example, the filtered location may be “user's home”, “contact's home”, “vehicle in transit”, “restaurant”, or “user's work”. In certain embodiments, the at least one reconfigurable rule may comprise a location type, wherein the location identifier is formatted according to the location type.
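A record of this kind, including the privacy-preserving filtered location, might be structured as follows. This is an illustrative sketch only: the class, field names, and the mapping inside `filter_location` are hypothetical, since a deployed system would derive the label from GPS coordinates, a contact list, or a points-of-interest database as described above:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class SpeechSampleRecord:
    """Hypothetical record for a speech sample tied to an alert signal."""
    rule_identifier: str
    audio_segment: bytes
    time_stamp: Optional[datetime] = None
    location_identifier: Optional[str] = None  # e.g., a filtered location

def filter_location(raw_label: str) -> str:
    """Map a location to a privacy-preserving filtered label.
    The allowed set below mirrors the examples in the text; anything
    else collapses to a generic label to protect patient privacy."""
    allowed = {"user's home", "contact's home", "vehicle in transit",
               "restaurant", "user's work"}
    return raw_label if raw_label in allowed else "other"

record = SpeechSampleRecord(
    rule_identifier="rule-17",
    audio_segment=b"",  # placeholder for the captured audio bytes
    time_stamp=datetime.now(timezone.utc),
    location_identifier=filter_location("restaurant"),
)
```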
[0107] The processing circuitry 110 is configured to communicate with the memory 112 for storage and retrieval of information, such as subroutines and data utilized in predictive software models—including (but not limited to) patient speech samples, subjective expert ratings corresponding to patient speech samples, and subsets of laboratory-implemented features. The memory 112 may be a non-volatile memory, a volatile memory, or a combination thereof. The memory 112 may be wired to the processing circuitry 110 using an address/data bus. In certain embodiments, the memory 112 may be a portable memory coupled with the processor via the PAN.
[0108] The processing circuitry 110 may be further configured to transmit one or more records via the network 104 to the server 106. In certain embodiments, the processor may be further configured to append a device identifier, a user identifier, or a combination thereof to the record. A device identifier may be unique to the speech therapeutic device 72, and a user identifier may be unique to the patient 62. The device identifier and the user identifier may be useful to a speech-language pathologist or other speech therapeutic professional, wherein the patient 62 may be a patient of the pathologist or other professional.
[0109] The stimulus circuitry 114 is configured to receive the alert signal 124 and may comprise a vibrating element, a speaker, a visual indicator, or a combination thereof. In certain embodiments, the alert signal 124 may encompass a plurality of alert signals including a vibrating element signal, a speaker signal, a visual indicator signal, or a combination thereof. In certain embodiments, a speaker signal may include an audio signal, wherein the processing circuitry 110 may provide the audio signal as voice instructions for the patient 62.
[0110] The network 104 may comprise a PAN, a LAN, a WAN, or a combination thereof. The PAN may comprise Universal Serial Bus (USB), IEEE 1394 (FireWire), Infrared Data Association (IrDA), Bluetooth, ultra-wideband (UWB), Wi-Fi Direct, or a combination thereof. The LAN may include Ethernet, 802.11 WLAN, or a combination thereof. The network 104 may also include the Internet. The server 106 may comprise a personal computer (PC), a local server connected to the LAN, a remote server connected to the WAN, or a combination thereof. In certain embodiments, the server 106 may be a software-based virtualized server running on a plurality of servers.
[0111] As used herein, the term “audio sample” may refer to a single discrete number associated with an amplitude at a given time. Certain embodiments may utilize a typical audio sampling rate of 8 kHz or 44.1 kHz. As used herein, the term “audio signal frame” may refer to a number of consecutive audio signal samples. In certain embodiments, a typical length of time associated with an audio signal frame may be in a range of from 20 ms to 50 ms. For an audio signal frame of 20 ms at an 8 kHz sampling rate, and for an audio clip of one second, there are 1 s/20 ms=50 frames, and each frame contains 8000/50=160 samples.
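The frame arithmetic above can be verified directly, using the 8 kHz sampling rate and 20 ms frame length from the example:

```python
# Frame/sample arithmetic for the example in the text:
# 8 kHz sampling rate, 20 ms frames, 1 s audio clip.
sampling_rate_hz = 8000
frame_length_s = 0.020
clip_length_s = 1.0

# 1 s / 20 ms = 50 frames per clip
frames_per_clip = round(clip_length_s / frame_length_s)
# 8000 samples/s * 0.020 s = 160 samples per frame
samples_per_frame = round(sampling_rate_hz * frame_length_s)

print(frames_per_clip, samples_per_frame)  # 50 160
```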
[0115] The foregoing graphical user interface screens may be prepared using MATLAB (MathWorks, Natick, Mass.) or another suitable software.
[0123] Upon reading the foregoing description in light of the accompanying drawing figures, those skilled in the art will understand the concepts of the disclosure and will recognize applications of these concepts not particularly addressed herein. Those skilled in the art will recognize improvements and modifications to the preferred embodiments of the present disclosure. All such improvements and modifications are considered within the scope of the concepts disclosed herein.