System and method for extracting and using prosody features
09754580 · 2017-09-05
Assignee
Inventors
Cpc classification
G10L15/22
PHYSICS
G10L15/02
PHYSICS
G10L15/12
PHYSICS
International classification
G10L25/00
PHYSICS
G10L15/02
PHYSICS
G10L15/06
PHYSICS
Abstract
A system for carrying out voice pattern recognition and a method for achieving same. The system includes an arrangement for acquiring an input voice, a signal processing library for extracting acoustic and prosodic features of the acquired voice, a database for storing a recognition dictionary, at least one instance of a prosody detector for carrying out a prosody detection process on extracted respective prosodic features, communicating with an end user application for applying control thereto.
Claims
1. A method for applying voice pattern recognition, implementable on an input voice, said method comprising the steps of: acquiring said input voice; extracting prosodic features from said input voice at least once; and carrying out a voice pattern classification process using dynamic time warping by integrating pattern matching with said extracted prosodic features to improve recognition performance using a Sakoe-Chuba search space, and reducing thereby the size of the search space on the basis of a detected pattern and a predetermined respective database entry in order to produce an output of said voice pattern classification process.
2. The method of claim 1, in which a 3rd party speech pattern recognition engine is used for providing a recognized pattern as an input to an end user application.
3. An automated assistant for speech disabled people, operating on a computing device, said assistant comprising: an input device for receiving user input voice wherein the input device comprises at least a speech input device for acquiring voice of said people; a signal library for extracting acoustic and prosodic features of said input voice; at least one prosody detector for extracting respective prosodic features; a database for storing a recognition dictionary based on predetermined mapping between voice features extracted from voice recording from said people and a reference; a voice pattern classifier in which one of said at least one prosody detector is integrated in dynamic time warping by integrating pattern matching with said extracted prosodic features to improve recognition performance using a Sakoe-Chuba search space, and reducing thereby the size of the search space for rendering an output; and an output device, for rendering said output.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) The principles and operation of the system and method according to the present invention may be better understood with reference to the drawings and the following description, it being understood that these drawings are given for illustrative purposes only and are not meant to be limiting, wherein:
DESCRIPTION OF EMBODIMENTS OF THE INVENTION
(28) Embodiments of the present invention are described below. In the interest of clarity, not all features/components of an actual implementation are necessarily described.
(29)
(30) The system shown in
(31) 1. The learning stage: Speech is recorded (by microphone 20) and analyzed by the VCPE 40; the resulting association between the spoken information and its actual meaning is stored in the database 50. This process can be performed several times for the same utterance in order to obtain a more reliable database, and/or for different utterances in order to obtain a larger database.
2. The recognition stage: The speech is recorded (by microphone 20) and analyzed by the VCPE 40. The spoken words (or their alternative representation) are compared with similar content found in the database 50, in an attempt to find a match or a similarity. The closest database item is selected, and the meaning associated with that entry is output to the end user application 55 over connection 35.
(32) Some preferred embodiments of the present invention are described next.
(33) As shown in
(34) The signal processing library contains the signal processing functions used by both the prosody detection module over connection 228 and the voice pattern recognition module over connection 230, e.g. cepstral coefficient extraction, pitch extraction, background noise estimation. A detailed illustration of the signal processing library is shown in
(35) The prosody detection module uses frame based signal processing functions, whereas the voice pattern classifier module mainly uses pattern based signal processing functions.
(36) The prosody detection module handles the aspects of the prosody processing, data management, classification, and adaptive learning.
(37) As shown in
(38) The voice to motion module 308 is responsible for transforming changes in voice features to motion on the screen controlled by the system controller over connection 310.
(39) A configurator module 312 concentrates all configurable parameters of the prosody detection module, such as the sampling rate and which features are used. It is controlled by the system controller module over a connection 316 whenever access to a configurable global parameter is needed, such as on a change or on an acquisition. An adaptive module 324, which may be local or remote, handles the adaptive learning process, that is, the adaptation of the learning stage in response to data input. It is controlled and enabled by the system controller over a connection 326, which is also used for receiving data and user input from the resource manager module 328 in order to create an adaptive reference for better matching. A classifiers module 330 is responsible for dynamic classification of voice features (for example, rising pitch or falling pitch) and is activated by the system controller over a connection 332.
(40) The resource manager module handles memory allocation and data management for database allocation, acquisition and deletion, and is activated by the system controller module over a connection 334.
(41) High level control and communication commands between the user and the end user application are transferred over a connection 222 to an application programming interface (API) 336, which is activated by the system controller module over a connection 340 when a detection result is to be sent to the end user application.
(42) The voice pattern classifier module mainly handles signal processing, resource management, pattern classification and adaptive learning.
(43) As shown in
(44) The configurator module concentrates all the configurable parameters in the voice pattern classifier module, such as sampling rate and which features are used, and is used when access to a configurable global parameter is needed, such as on a change or on an acquisition.
(45) The adaptive learning module, which may be local or remote, is activated when the adaptive learning option is enabled, for the adaptive learning process (such as the adaptation of the learning stage in response to data input).
(46) The pattern matching classifiers module handles the classifier sub modules, and contains various sub modules capable of distinguishing which pattern is similar to another, and with which similarity score, such as DTW.
(47) The resource manager module handles memory allocation and data management for database allocation, acquisition and deletion, and is activated by the system controller module over a connection 436.
(48) The V/C/P classifier module is based on the article entitled: Speech segmentation without speech recognition described earlier.
(49) High level control and communication commands between the user and the end user application are transferred over a connection 220 to an application programming interface (API) 438, which is activated by the system controller module over a connection 440 when a detection result is to be sent to end user application.
(50) A detailed illustration of the signal processing library 208 shown in
(51) The length of the speech is extracted by the duration extraction module 504. The volume is extracted by the frame based volume extraction module 506 per frame or by the pattern based volume extraction module 508 per pattern. The rate at which the audio signal crosses the zero value (ZCR: zero crossing rate) is extracted by a frame based ZCR extraction module 514 per frame or by the pattern based ZCR extraction module 516 per pattern. Similarly, the rate of speech (ROS) in an utterance is extracted by a frame based ROS extraction module 518 per frame or by the pattern based ROS extraction module 520 per pattern. Further, the extraction and tracking of formants (peaks in the audio spectrum) are performed by a frame based formant tracking module 522 per frame or by the pattern based formant tracking module 524 per pattern.
(52) The Mel-frequency cepstral coefficients (MFCC) of the speech are extracted by the pattern based cepstral coefficient extraction module 528 per pattern. The tone of the speech is extracted by the frame based pitch extraction module 530 per frame or by the pattern based pitch extraction module 532 per pattern. A frame based background noise estimation module 534 and a pattern based background noise estimation module 536 estimate and model the background noise, per frame or per pattern, for use by other modules such as the VAD (Voice Activity Detection) and pitch extraction modules. The detection of speech within a noisy environment is handled by a frame based VAD module 538 and a pattern based VAD module 540.
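The distinction between per-frame and per-pattern extraction described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the non-overlapping framing, the sign-change definition of a zero crossing and the RMS-based volume in dB are all assumptions made for illustration.

```python
import math


def split_frames(samples, frame_len):
    """Split samples into consecutive non-overlapping frames (a short tail is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed ZCR definition)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def volume_db(frame, eps=1e-12):
    """RMS level of a non-empty frame in dB relative to a full scale of 1.0."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, eps))


def frame_based(samples, frame_len, feature):
    """Frame based extraction: one feature value per frame."""
    return [feature(frame) for frame in split_frames(samples, frame_len)]


def pattern_based(samples, feature):
    """Pattern based extraction: one feature value for the whole utterance."""
    return feature(samples)
```

For example, `frame_based(samples, 512, zero_crossing_rate)` yields one ZCR value per 512-sample frame, while `pattern_based(samples, zero_crossing_rate)` yields a single value for the whole pattern.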
(53) The system may operate in two distinct stages, namely the learning stage and the operation stage. In the learning stage, a person's voice is recorded and analyzed, and the extracted features are stored in a database. In the operation stage, the prosodic and acoustic voice features are extracted and analyzed, and in certain system modes defined later, pattern classification is also performed. Each time the system captures new voice usage, it can analyze the additional data, in online and offline processes, to create an acoustic model that better identifies the user's voice.
(54)
(55) The voice information is passed by the system controller module 302 to the pitch extraction block 530. The pitch is extracted using the pitch extraction module 530, and the extracted results are stored in the database 210 via connections 330 and 218, in parallel to updating all the relevant modules. For improved statistics, this process must run at least twice for obtaining minimum and maximum values, and preferably be repeated more times for better accuracy.
(56) The process is managed by the system controller module, which commands the various modules and receives feedback from them. First, the system controller module updates the configurator module 312 with the appropriate values received with this command, such as sampling rate, bytes per sample, and little/big endian. Then the database 210 is queried via connection 218. The resource manager module 328 creates variables, files or other forms of containers for later encapsulation of the pitch values and their corresponding user identification parameters. Afterwards, the voice I/O module 300 is commanded to enable a microphone input, in accordance with the parameters in the configurator module. Voice samples may then be passed via the voice I/O module to the low level features module 500 for low level feature extraction, such as performing an FFT, and the output spectrum is delivered to the statistics module 501. The statistics module updates the statistical variables (such as a spectral average) with the new spectrum values, and sends the updated variables to the VAD module 538, which in turn decides at a decision point 602 whether the spectrum represents noise or speech, based on the statistics module variables. Upon detection of a noise-only signal (and not speech), the system reverts to receiving voice samples and analyzing them. In the case of speech detection, the spectrum values are sent to the pitch extraction module, which performs pitch extraction and stores the resulting data in memory. These steps are repeated until a predefined number of noise frames is detected, indicating the end of speech and not merely a pause. The minimum/maximum pitch values are then sent by the system controller module over connection 330 to the resource manager module, which stores the pitch values in the predefined container in the database 210 over connection 218, while notifying the system controller module that the current pitch learning process is finished. In response, the system controller module sends a finished successfully status update via connections 334, 222 to the end user application 55.
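The control flow of this pitch learning pass (skip leading noise, collect pitch while speech is detected, stop after a predefined run of noise frames, then keep the minimum and maximum) can be sketched as follows. This is a hedged illustration only: `is_speech` and `estimate_pitch` are hypothetical callables standing in for the VAD module 538 and the pitch extraction module 530, and `end_noise_frames` is an assumed name for the predefined number of noise frames.

```python
def learn_pitch_range(frames, is_speech, estimate_pitch, end_noise_frames=3):
    """Return (min_pitch, max_pitch) for the first utterance in `frames`, or None.

    frames           -- iterable of frames (any representation the callables accept)
    is_speech        -- hypothetical VAD decision: frame -> bool
    estimate_pitch   -- hypothetical pitch extractor: frame -> float
    end_noise_frames -- consecutive noise frames that mark the end of speech
    """
    pitches = []
    noise_run = 0
    for frame in frames:
        if is_speech(frame):
            noise_run = 0                      # a short pause does not end the utterance
            pitches.append(estimate_pitch(frame))
        elif pitches:                          # noise after speech has started
            noise_run += 1
            if noise_run >= end_noise_frames:  # long enough to be the end of speech
                break
        # noise before any speech: keep waiting for voice samples
    if not pitches:
        return None
    return min(pitches), max(pitches)
```

Running this at least twice and merging the returned ranges would correspond to the repeated learning runs mentioned above.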
(57) A schema of the pattern classifier module during the learning stage shown in
(58) A dictionary is first built in order to perform voice pattern classification, or more specifically voice pattern matching.
(59) In the learning stage, the end user application 55 sends over connection 220 a voice pattern classification learning command, via the API 438, to be processed in block 402. The API transforms the command into a language that the system can recognize and execute. For example, if the end user application is Java based and the system is native based, the API functions as a translator between the two. The voice input over connection 226 is transmitted to the voice I/O module 400. A process of voice activity detection is performed, followed by pattern based feature extraction and V/C/P classification. The resulting segmentation and the cepstral coefficients are stored in a database 210, and all relevant modules are updated accordingly. This process can run several times, thus expanding the database.
(60) The process is managed by the system controller module 402, which commands the various modules and receives feedback from them.
1. The voice pattern classification learning command reaches the system controller module 402, which in turn processes it and sends the following commands:
a. Update the configurator 406 with the appropriate values received with this command (examples: sampling rate, bytes per sample, little/big endian, etc.).
b. Check for space in the database via the resource manager 410 for storing the volume, ZCR, pitch, V/C/P results and cepstral coefficient values (the learning data).
c. Check via the resource manager module whether there is sufficient memory for reasonable audio sampling and processing for voice learning; if so, allocate memory (in our case a buffer for samples, and a buffer for cepstral coefficient values).
d. Create via the resource manager module the appropriate matrix, file or other form of container that will later contain the learning data values and their corresponding user identification parameters.
e. Command the voice I/O module 400 to enable microphone input, in accordance with the parameters in the configurator module.
2. Voice samples may then be passed via the voice I/O module to the low level features module 500 for low level feature extraction, such as performing an FFT, and the output spectrum is delivered to the statistics module 501.
3. The statistics module updates the statistical variables (such as a spectral average) with the new spectrum values, and sends the updated variables to the VAD module 538 through the low level features module, which in turn decides at a decision point 702 whether the spectrum represents noise or speech, based on the statistics module variables.
4. Upon detection of a noise-only signal (and not speech), the system reverts to receiving voice samples and analyzing them (steps 2-4).
5. In the case of speech detection, the whole utterance (from the start of the speech to the end of the speech) is stored in memory and then stored in the database.
6. The system controller calls, through the low level features module, the pattern based signal processing library 710 to perform pattern based pitch extraction 532, pattern based ZCR extraction 516, pattern based volume extraction 512 and pattern based cepstral coefficient extraction 528. Then V/C/P classification is performed using the V/C/P classifier module 420.
7. The resulting segmentation and cepstral coefficients are sent by the system controller module 402 over connection 436 to the resource manager module, which stores them in the database 210 over connection 216, while notifying the system controller module that the voice pattern classification learning is finished. In response, the system controller module sends a finished successfully status update via connections 220, 440 to the end user application 55.
8. At this stage, an adaptive learning process may apply: all relevant features extracted in stage 7 are sent through connection 426 to the adaptive learning module 424 for the adaptive learning process.
(61) Robust Multi-Domain Speech Processing
(62) Common approaches to speech feature extraction for automatic speech classification/recognition are based on a short time spectral representation of the speech waveform, where the spectrum is computed from frames of speech samples having a constant and predefined length. This approach is good enough for general speech processing. However, it performs worse for speech disabled people, whose speech is often muffled and spectrally blurred. On the other hand, valuable information exists in the time domain and in the prosodic features of speech. We propose to exploit the non-spectral domain information to enrich the speech features for automatic speech classification/recognition applications for speech disabled populations. We call this interweaving of spectral and non-spectral domains robust multi-domain signal processing.
(63) Robust multi-domain signal processing enables us to reduce an intrinsic blurring of speech features, which is caused by the fixed framing of speech samples into frames. This fixed framing often produces mixed (muffled) frames from contiguous speech events of different types. The artifact occurs each time a particular frame includes speech samples from before and after a transition from one type of speech event to another. For example, the speech start uncertainty may be up to 512 samples when working with a sampling rate of 16 kHz and a frame length of 32 ms. The vowel to consonant transition uncertainty may be up to 512 samples as well. We resolve the speech event transition uncertainty by adaptively increasing the sampling rate (or alternatively decreasing the frame length) around speech event transition candidates. This approach drastically reduces the speech frame impurity, which improves the discriminative power of the speech features. This procedure is unnecessary for feature extraction from standard speakers' speech, because their voice is quite discriminative and most speech processing schemes were invented and optimized for standard speech. For the speech transition candidate detection we apply features from the frequency and time domains. For example, we augment the spectral features with high-resolution aperiodicity and energy measurements (and other prosodic features, computed in the time domain) as input features of an adaptive transition event classifier. We use the prosodic features because each word has a unique prosodic temporal structure, and even words with minimal spectral differences between them (came-game, for example, which differ only by the voiced/unvoiced character of the first consonants) have different vowel and syllable durations due to voicing pattern and co-articulation. This multi-domain approach enables more precise speech event transition detection with a low computational burden. Having non-mixed (pure) frames, aligned with speech event transitions, we prune the voice pattern classifier search space by applying anchors at salient nodes of the dynamic time warping trellis. In this way we may adaptively control the dynamic pattern matching search bandwidth, reducing the time warping elasticity in forbidden regions (e.g., plosives, micro pauses, etc.) and enabling this elasticity in the desired regions (e.g., vowels).
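The 512-sample figure follows directly from the frame length: 16,000 samples per second × 0.032 s = 512 samples. The Python sketch below reproduces that arithmetic and illustrates one possible way to shorten frames around transition candidates. The function names, the `radius` parameter and the rule for deciding when a frame is "near" a candidate are assumptions for illustration, not details taken from the description.

```python
def frame_length_samples(sample_rate_hz, frame_ms):
    """Number of samples in one frame, i.e. the worst-case boundary uncertainty."""
    return int(round(sample_rate_hz * frame_ms / 1000.0))


def adaptive_frame_bounds(n_samples, coarse_len, fine_len, candidates, radius):
    """Return (start, end) frame bounds, using short frames near transition candidates.

    candidates -- sample indices of suspected speech event transitions (assumed input)
    radius     -- how close (in samples) a frame start must be to a candidate
                  for the short frame length to be used (assumed rule)
    """
    bounds = []
    start = 0
    while start < n_samples:
        near = any(abs(c - start) <= radius or start <= c < start + coarse_len
                   for c in candidates)
        length = fine_len if near else coarse_len
        end = min(start + length, n_samples)
        bounds.append((start, end))
        start = end
    return bounds
```

With a candidate at sample 700, coarse frames of 512 samples and fine frames of 128 samples, the candidate ends up inside a 128-sample frame instead of a 512-sample one, so far fewer samples from the neighbouring speech event are mixed into that frame.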
(64) Voice Activity Detection
(65) The voice activity detection (VAD) (see
(66) We improve noise/speech misalignment by examining noise-to-speech transitions at a higher resolution. The goal of the higher resolution pass is to find a more precise speech start, and to realign the framing accordingly. A correct speech start increases the homogeneity of the speech frames, and the more homogeneous the frames that the pattern matching receives, the better the recognition performance that may be achieved. Frame homogeneity is important for voice activated applications for the speech disabled population, because their speech is intrinsically more muffled and harder to discriminate. The higher resolution pass runs with smaller frames (up to 10) and with instantaneous quantities, while some metrics are computed directly in the time domain, e.g., zero-crossing rate and high resolution pitch tracking.
(67) The high resolution computations are performed within a certain low resolution frame, which is identified by the VAD in the case of noise-to-speech segmentation. When the high resolution speech start is detected, the speech start is moved from the low resolution frame to the high resolution decision, and the framing starts from the high resolution start. The same is done for the frame where the speech ends. In this way, we achieve a more homogeneous speech start frame, including far fewer samples from the adjacent noisy frame. These more homogeneous frames run up to the end of the current homogeneous and stationary segment, which is identified by a speech transition to another speech event, for example, a transition from vowel to plosive, a transition from consonant to vowel, etc. The transition detection is explained in the Speech Events Transition Estimations section, which follows.
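A minimal Python sketch of the high resolution pass is given below, assuming that the low resolution frame containing the noise-to-speech transition is already known. Short-time energy against a fixed threshold is used here as a stand-in decision rule; the description also mentions zero-crossing rate and high resolution pitch tracking, which are omitted. All names and the threshold are hypothetical.

```python
def short_time_energy(frame):
    """Mean squared amplitude of a non-empty frame."""
    return sum(x * x for x in frame) / len(frame)


def refine_speech_start(samples, frame_start, frame_len, sub_len, threshold):
    """Search one low resolution frame with short sub-frames for the speech start.

    Returns the start index of the first sub-frame whose energy exceeds
    `threshold`, or `frame_start` if no sub-frame does.
    """
    frame_end = min(frame_start + frame_len, len(samples))
    for start in range(frame_start, frame_end - sub_len + 1, sub_len):
        if short_time_energy(samples[start:start + sub_len]) > threshold:
            return start
    return frame_start


def realigned_frame_starts(speech_start, n_samples, frame_len):
    """Frame starts for framing that begins at the refined speech start."""
    return list(range(speech_start, n_samples - frame_len + 1, frame_len))
```

The refined start replaces the low resolution frame boundary, so the first speech frame contains at most `sub_len - 1` samples of the preceding noise rather than up to a full frame.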
(68) Speech Events Transition Estimations
(69) After detecting the precise speech start and realigning the speech frames accordingly, the speech features are computed in the long frames while, in parallel, we look for the speech event transitions. For this we apply techniques of segmentation without speech recognition, which incorporate the prosodic features, high resolution energy, super resolution pitch estimation (as described in the article Speech Segmentation without Speech Recognition by Dong Wang, described earlier) and zero-crossing estimation. The speech event transitions are exploited in the dynamic pattern matching as the anchor points for pruning the search space.
(70) Dynamic Time Warping with Adaptive Search Band Limits
(71) We apply band search constraints in the search space of dynamic time warping pattern matching. Generally, due to the assumption that the duration fluctuation is usually small in speech, the adjustment window condition requires that |i - j| <= R (for some natural number R, and i, j as they appear in
(72) Each block in the block diagonal search constraints approach corresponds to a particular speech event (e.g., a voiced region), where voiced regions are more robust than other classes. As indicated in
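The adjustment window condition |i - j| <= R can be illustrated with a small dynamic time warping routine in Python. This is a generic textbook-style sketch rather than the patented classifier. To hint at adaptive band limits, it accepts one radius per row of the first sequence; that per-row representation, the function name and the scalar features are assumptions.

```python
import math


def dtw_banded(a, b, radii, dist=lambda x, y: abs(x - y)):
    """DTW cost between sequences a and b, restricted to |i - j| <= radii[i - 1].

    radii -- one band radius per element of `a`; a constant band of width R
             is expressed as [R] * len(a). Returns math.inf when the end cell
             cannot be reached inside the band.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        radius = radii[i - 1]
        lo = max(1, i - radius)
        hi = min(m, i + radius)
        for j in range(lo, hi + 1):          # cells outside the band stay at infinity
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            if best_prev < math.inf:
                cost[i][j] = dist(a[i - 1], b[j - 1]) + best_prev
    return cost[n][m]
```

A small radius in rows that correspond to plosives or micro pauses and a larger radius in vowel rows would reduce the warping elasticity where it is unwanted while keeping it where it is useful, and each row only visits the cells inside its band.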
(73) Multilayer Recognition Algorithm
(74) In order to better differentiate between the recorded voice tags of the user with speech disabilities, an innovative multi-layer algorithm was developed.
(75) The algorithm performs several identification passes, until the speech is uniquely identified, each time using another orthogonal voice intonation feature.
(76) Broad phonetic classes also have unique characteristics that make it possible to distinguish between them. The vowels may be characterized as periodic signals, having most of their energy in the low frequency bands. The plosives may be characterized by a distinct onset between them, with a short energy concentration at the higher frequencies. The fricatives may be characterized by a low energy in the lower frequencies and a noise-like spectrum spread at the high frequencies.
(77) In the prior art approaches, speech feature vectors are of the same type, e.g. MFCC, possibly augmented by a zero-crossing rate and energy. The innovation in the algorithm is to augment the common speech feature set with orthogonal features: some are low level, like pitch and voicing, some are high level, like broad phoneme classes, and some are broad temporal pattern features.
(78) The implementation of this approach uses a number of DTW pattern matchers in series, where each DTW processor runs on its predefined feature set domain. An example of such serial execution is a) DTW on MFCC, energy and zero crossing rate, b) DTW on intonation pattern, c) DTW on broad phonetic classes, d) DTW on formant trajectories and e) DTW on temporal pattern features. In addition to the DTW pattern matchers in series, we rescore the N-best results according to speaking rate and duration variability. Each DTW in this serial chain reduces the number of possible hypotheses to be checked at the subsequent DTW stage. In other words, each stage generates a number of candidates (hypotheses), which are input to the DTW stage that follows it.
(79) The work flow of the proposed method is to first get the samples of the audio signal and extract features. The reference tags are stored as feature vectors. When an incoming utterance is received, again its features are extracted and a series of cascading DTWs are applied, using different features at each pass. The cascading DTW passes a diluted candidate list of reference tags until the best candidates remain.
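The cascading passes can be sketched in Python as repeated ranking and pruning of the candidate tags, one feature domain per pass. The sketch is illustrative only: features are simplified to one scalar per frame, the DTW is unconstrained, and the names `cascade_match`, `feature_order` and `keep_per_stage` are hypothetical. The N-best rescoring by speaking rate and duration is not shown.

```python
import math


def dtw(a, b):
    """Plain DTW cost between two sequences of scalar features."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def cascade_match(query, references, feature_order, keep_per_stage):
    """Prune reference tags stage by stage, using a different feature at each pass.

    query          -- dict: feature name -> sequence for the incoming utterance
    references     -- dict: tag -> dict of feature name -> stored sequence
    feature_order  -- feature names, one per DTW stage
    keep_per_stage -- number of candidates kept after each stage
    """
    candidates = list(references)
    for feature, keep in zip(feature_order, keep_per_stage):
        ranked = sorted(candidates,
                        key=lambda tag: dtw(query[feature], references[tag][feature]))
        candidates = ranked[:keep]           # diluted list passed to the next stage
    return candidates
```

A call such as `cascade_match(query, refs, ["mfcc", "pitch"], [2, 1])` first keeps the two tags closest in the (simplified) spectral feature and then picks between them by intonation.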
(80)
(81) Database 210
(82) The database is built of two different databases: one, the FB data, is used by the prosody detection module; the second, the PB data, is used by the voice pattern classifier.
(83)
(84) This database is represented in a hierarchical left-to-right structure. The different blocks are described below:
(85) User A 1100: this block represents a specific person. Multiple users may be supported, and because most work modes are speaker dependent, every speaker must be represented.
Tag A 1101: the textual representation of an utterance (for example, the tag for the utterance I want to eat may be food or I want to eat).
Statistical Data 1104: this block contains the template after the adaptive learning, as described in the article Cross Words Reference Template for DTW based Speech Recognition Systems described earlier, and the current features representing the reference tag.
Entry Date and Time 1102: the date and time at which the feature extraction was performed.
Cepstral Coefficients Matrix 1106: a cepstral coefficients matrix representing the utterance.
ROS vector 1108: a vector representing the rate of speech [words per minute].
Formant Tracking vector 1110: a vector representing the spectral location of formants along the utterance [kHz].
Duration value 1112: a value representing the duration of the utterance [ms].
Energy vector 1114: a vector representing the energy (volume) of the utterance [dB].
Background Noise vector 1116: a vector representing the background noise along the utterance [%].
Pitch vector 1118: a vector representing the pitch (tone) of the speech along the utterance [kHz].
ZCR vector 1120: a vector representing the zero crossing rate of the utterance [crossings per n ms].
This database is represented in a hierarchical left-to-right structure. The different blocks are described below:
User A 1200: this block represents a specific person. Multiple users may be supported, and because most work modes are speaker dependent, every speaker must be represented.
Entry Date and Time 1202: the date and time at which the feature extraction was performed.
Statistical Data 1204: this block contains the current (most recently updated) features representing the user's learning stage.
ROS min & max 1208: min & max values for the rate of speech [words per minute].
Formant Tracking Edges 1210: min & max values representing the spectral location of formants along the utterance [kHz].
Duration min & max 1212: min & max values for the duration of the utterance [ms].
Volume min & max 1214: min & max values representing the volume of the utterance [dB].
Background Noise level 1216: a value representing the background noise along the utterance [%].
Pitch min & max 1218: min & max values representing the pitch (tone) of the speech along the utterance [kHz].
ZCR min & max 1220: min & max values representing the zero crossing rate of the utterance [crossings per n ms].
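The two hierarchies above can be pictured as simple record types. The Python sketch below is only one possible in-memory layout, given as an assumption: the class and field names are hypothetical, the Statistical Data template is omitted, and the mapping of the per-tag vectors to the voice pattern classifier data and of the min/max ranges to the prosody detection data is inferred from the block numbering (1100s and 1200s).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class PatternEntry:
    """Features stored for one tag of one user (blocks 1101-1120)."""
    tag: str                                   # e.g. "food"
    entry_time: datetime                       # when the features were extracted
    cepstral_matrix: List[List[float]]         # one row of coefficients per frame
    duration_ms: float = 0.0
    ros_wpm: List[float] = field(default_factory=list)
    formants_khz: List[float] = field(default_factory=list)
    energy_db: List[float] = field(default_factory=list)
    noise_pct: List[float] = field(default_factory=list)
    pitch_khz: List[float] = field(default_factory=list)
    zcr: List[float] = field(default_factory=list)


@dataclass
class ProsodyProfile:
    """Per-user min/max ranges from the learning stage (blocks 1202-1220)."""
    entry_time: datetime
    ros_range_wpm: Tuple[float, float]
    formant_range_khz: Tuple[float, float]
    duration_range_ms: Tuple[float, float]
    volume_range_db: Tuple[float, float]
    noise_level_pct: float
    pitch_range_khz: Tuple[float, float]
    zcr_range: Tuple[float, float]


@dataclass
class UserRecord:
    """Top-level block: one speaker, with pattern entries keyed by tag."""
    user_id: str
    patterns: Dict[str, PatternEntry] = field(default_factory=dict)
    prosody: Optional[ProsodyProfile] = None

    def add_pattern(self, entry: PatternEntry) -> None:
        self.patterns[entry.tag] = entry
```

Keeping one `UserRecord` per speaker reflects the speaker dependent work modes: a lookup always starts from the user and then descends to a tag or to the prosody ranges.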
(86) Operational Modes
(87) At least four operational modes may be defined for the system, and may be used following a learning mode, namely:
Prosody
Prosody+3rd Party Speech Recognition
Voice Pattern Classification
Prosody+Voice Pattern Classification
(88) Prosody Mode
(89) The Prosody operational mode may be used after prosody learning, and involves basic feature extraction, the results of which may be sent synchronously (for example, sending the ZCR every 10 ms) or asynchronously (on call).
(90) This mode is appropriate to any application which may utilize non-verbal forms of control, such as voice controlled applications.
(91) The block diagram for this mode is shown in
(92) An example is shown in
(93) The block diagram for this example is shown in
(94) On initial start of the application (or on user request) a configuration stage takes place, in which the 3rd party application (sing training) interacts with the user in order to receive configuration parameters such as the desired voice feature/features needed for prosody detection (e.g. pitch). In this application, there is no need for a learning stage.
(95) The recognition stage is as follows:
(96) 1. The sing training application plays the user specific note/s.
2. A user (1310) uses his voice to reach that exact note/s.
3. The acoustic waves created are transferred to the Mic. 20.
4. The voice signal input 25 is transferred to the prosody detection module 206.
5. In the prosody detection module, the voice feature extraction is performed using the signal processing library 208, with respect to the values in database 210, and the voice feature is passed on to the end user application 55 (sing training).
6. The sing training application calculates how close the voice is to the original note/s and feedback is sent to the user.
7. Steps 1-6 repeat until the user exits, or the training is complete. A pause may occur when the user changes the configuration.
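Step 6 does not state how closeness to the note is measured. One common choice, used here purely as an assumption, is the interval in cents between the sung pitch and the target note (1200 cents per octave). The function names, the 50-cent tolerance and the use of Hz rather than kHz are hypothetical.

```python
import math


def cents_off(sung_hz, target_hz):
    """Signed interval from the target note to the sung pitch, in cents."""
    return 1200.0 * math.log2(sung_hz / target_hz)


def feedback(sung_hz, target_hz, tolerance_cents=50.0):
    """Coarse feedback for the singer, based on an assumed tolerance."""
    offset = cents_off(sung_hz, target_hz)
    if abs(offset) <= tolerance_cents:
        return "on pitch"
    return "too high" if offset > 0 else "too low"
```

A logarithmic measure is used because equal musical intervals correspond to equal frequency ratios, not equal frequency differences.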
(97) An additional example is shown in
(98) On initial start of the application (or on user request) a configuration stage takes place, in which the 3rd party application (Voice Feature Magnitude game) interacts with the user in order to receive configuration parameters, for example:
(99) The desired voice feature/features (needed for the prosody detection).
Sensitivity level (optional, based on the game requirements).
Game level (optional, based on the game requirements).
Effects level (optional, based on the game requirements).
Then, a learning stage may take place.
Once the learning stage is over, the recognition stage is as follows: 1. A user utters 2. The acoustic waves created are transferred to the Mic. 20. 3. The voice signal input 25 is transferred to the prosody detection module 206 4. In the prosody detection module, the voice feature extraction is performed using the signal processing library 208, with respect to the values in database 210 and the voice feature is passed on to the end user application 55voice feature magnitude game. 5. The end user applicationvoice feature magnitude game reacts proportionally to the values passed (222) by the prosody detection module 206, and changes the screen images based on the predefined ranges in which the voice features are captured. 6. Steps 1-5 repeat until user exits. A pause may occur when the user changes configuration/repeats the learning stage
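The proportional reaction in step 5 can be sketched in Python as a normalization of the incoming feature value against the minimum and maximum learned for the user, followed by a lookup of the predefined range it falls into. The function names, the clamping behaviour and the equal-width ranges are assumptions.

```python
def normalized_level(value, learned_min, learned_max):
    """Map a feature value to 0.0-1.0 using the user's learned min and max."""
    if learned_max <= learned_min:
        return 0.0                       # degenerate range: nothing to scale against
    ratio = (value - learned_min) / (learned_max - learned_min)
    return min(1.0, max(0.0, ratio))     # clamp values outside the learned range


def range_index(level, n_ranges):
    """Index of the equal-width range (0 .. n_ranges - 1) that `level` falls into."""
    return min(n_ranges - 1, int(level * n_ranges))
```

For instance, with a learned volume range of 40 to 80 and four screen states, a value of 60 maps to level 0.5 and selects the third state (index 2).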
(100) Prosody+3rd Party Speech Recognition Mode
(101) The Prosody+3rd Party Speech Recognition operational state may be used after prosody learning, and involves combining speech recognition and Prosody Detection.
(102) In many applications the user needs to enter text, for example in word processors, edit boxes, instant messaging and text messaging. In recent years, speech recognition has revolutionized the way the user inputs data: no longer tied to a keyboard or touch screen, he can use his own voice, and a Speech To Text engine translates it into textual information. However, the colors of the voice are lost in this translation: the intonation and intention, conveyed by volume change, pitch change, duration and intonation patterns (angry, sad, surprised, etc.).
(103) The block diagram for this mode is shown in
(104) An example is shown in
(105) In
(106) In
(107) There is an option to integrate this application to social networks, such as Facebook, twitter, as shown in
(108) On initial start of the application (or on user request) a configuration stage takes place, in which the 3rd party application (speech and intonation recognition application) interacts with the user in order to receive configuration parameters such as:
The desired voice feature/features needed for prosody detection (e.g. duration, energy, pitch, ROS).
The graphic elements to be controlled by it, defined by the 3rd party application (font size, letter size, letter color, textbox type, distance between letters, keyboard based icons and marks, graphic icons).
An example of the mapping of the voice features to appropriate effects is shown in
This mapping is stored in database 210.
Then a learning stage may take place.
The recognition stage is as follows:
1. A user utters.
2. The acoustic waves created are transferred to the Mic. 20.
3. The voice signal input 25 is transferred to the Prosody Detection module 206.
4. The prosody detection module 206 samples the signal and sends (222) the samples to the 3rd party speech recognition engine 224, which transforms the samples into a textual representation of the utterance.
5. In parallel, the prosody detection module performs the voice feature extraction with respect to the values in the database, and a decision of triggering/not triggering a certain graphic element is sent (222) to the end user application (speech and intonation recognition application).
6. The speech and intonation recognition application reacts by displaying/not displaying the preferred graphic elements.
7. Steps 1-6 repeat until the user exits. A pause may occur when the user changes the configuration or repeats the learning stage.
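The trigger decision in step 5 can be illustrated with a small rule table. This is only a sketch under stated assumptions: the `Rule` structure, the threshold comparison and the example effects are hypothetical, and in the described system the calibrated values would come from database 210 while the text would come from the 3rd party speech recognition engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    feature: str      # e.g. "energy", "pitch", "duration", "ros"
    threshold: float  # calibrated value (assumed to be read from the database)
    effect: str       # graphic element defined by the 3rd party application


def triggered_effects(features: dict, rules: list) -> list:
    """Return the effects whose feature value reaches the stored threshold."""
    return [r.effect for r in rules if features.get(r.feature, 0.0) >= r.threshold]


def styled_text(text: str, features: dict, rules: list) -> dict:
    """Combine the recognized text with the triggered graphic elements."""
    return {"text": text, "effects": triggered_effects(features, rules)}
```

With rules for a large font on high energy and red letters on high pitch, a loud but low-pitched "hello" would be displayed in a large font only.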
(109) An additional example is shown in
(110) An essential part of the application is the database. It needs to include intonation patterns of various well-spoken phrases that can later be analyzed and compared. These intonation patterns (or intonation models) must represent the ideal way to utter phrases in terms of intonation, and may be created using statistical models, articulate speakers or other means.
In this example, the user utters "Why". The system analyzes the intonation of the utterance, such as the pitch level as shown in the figure, or any other extracted feature such as volume, vowel accuracy level, etc.
A use case for this application: people who study a new language can practice the intonation of that language.
On initial start of the application (or on user request) a configuration stage takes place, in which the end user application (intonation correction application) interacts with the user in order to receive configuration parameters, for example:
The desired voice feature/features to be extracted (needed for the prosody detection). Possible features:
Duration
Volume
Pitch
ROS
There is an option for choosing several inputs, and there is no learning stage in this implementation.
The recognition stage is as follows:
1. A user uses his voice.
2. The acoustic waves created are transferred to the Mic. 20.
3. The voice signal input 25 is transferred to the prosody detection module 206.
4. The prosody detection module 206 samples the signal and sends (226) the samples to the 3rd party speech recognition engine 224, which transforms the samples into a textual representation of the utterance.
5. In parallel, the prosody detection module performs the prosodic feature extraction, resulting in an intonation pattern. This intonation pattern, together with the reference intonation pattern associated with the text produced by the 3rd party speech recognition application, is sent (222) to the end user application (intonation correction application).
6. Steps 1-5 repeat until the user exits. A pause may occur when the user changes the configuration.
The end user application displays the pattern extracted from the user's voice alongside the corresponding reference pattern, together with possible analysis parameters (for example: difference, min and max difference, etc.).
In addition, the application can contain statistical information for the user to consider, for example:
The general improvement/degradation of his sessions
Current similarity scores between the phrases the user expressed and the models
History of past similarity scores
Other means of statistical analysis (variance, gradient, etc.)
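The analysis parameters mentioned above (difference, min and max difference) can be sketched as follows. This is an illustrative assumption rather than the described implementation: both contours are taken to be lists of pitch values, and they are brought to a common length by linear interpolation before a pointwise comparison. The names `resample` and `compare_contours` are hypothetical.

```python
def resample(contour, n):
    """Linearly interpolate a contour (list of floats) to n points."""
    if len(contour) < 2 or n < 2:
        raise ValueError("need at least two points")
    step = (len(contour) - 1) / (n - 1)
    out = []
    for i in range(n):
        pos = i * step
        lo = min(int(pos), len(contour) - 1)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def compare_contours(user, reference, n=50):
    """Pointwise absolute differences between two length-normalized contours."""
    u = resample(user, n)
    r = resample(reference, n)
    diffs = [abs(a - b) for a, b in zip(u, r)]
    return {
        "mean_diff": sum(diffs) / n,
        "min_diff": min(diffs),
        "max_diff": max(diffs),
    }
```

A lower `mean_diff` indicates that the user's intonation is closer to the reference model; storing this value per session would give the history of similarity scores listed above.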
(111) Voice Pattern Classification Mode
(112) The voice pattern classification mode (VPCM) is an approach for enhanced pattern matching, achieved by integrating the matching with prosody features or, more precisely, by using prosody features in an integrated pattern classification and prosody detection engine. This puts constraints on the recognizer search space and thereby improves speech or voice recognition. The VPCM is based on the combination of the following procedures described in the earlier sections: VAD Activity Speech events transmission estimation Dynamic Time Warping with Adaptive Search Band Limits Multilayer recognition algorithm
The learning stage for the VPCM is described above.
The recognition stage is very similar to the learning stage, except that nothing is stored in the database and that an adaptive learning process may be performed. This process, described in the article "Cross words Reference Template for DTW-based Speech Recognition Systems" as noted above, is implemented in module 424.
In addition, the classification itself is performed, resulting in match results.
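The classification step can be illustrated with dynamic time warping restricted to a band around the diagonal, which is what reduces the search space. The sketch below is a simplification: it uses a fixed-width band of the kind commonly called a Sakoe-Chiba band, whereas the description refers to adaptive band limits derived from prosodic features, and it works on one-dimensional feature sequences. The function names and the template dictionary are assumptions for illustration.

```python
import math


def dtw_distance(a, b, band):
    """DTW cost between sequences a and b, restricted to cells with |i - j| <= band."""
    n, m = len(a), len(b)
    band = max(band, abs(n - m))  # the band must at least reach the final cell
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are evaluated; the rest stay infinite.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(features, templates, band=3):
    """Return the label of the stored template with the lowest banded DTW cost."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label], band))
```

A narrower band means fewer cells are evaluated per comparison, at the price of rejecting alignments that need more time stretching than the band allows.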
The block diagram for this mode is shown in
(113) An example is shown in
The block diagram for this example is shown in
On initial start of the application (or on user request) a configuration stage takes place, in which the end user application (voice control application) interacts with the user in order to receive configuration parameters, such as:
Sensitivity level (optional, based on the application requirements)
Screen lock (optional, based on the application requirements)
Feedback type (optional, based on the application requirements)
Security level (optional, based on the application requirements)
Selecting which commands are included (optional, based on the application requirements)
Then a learning stage will take place (for patterns only). The learning stage in this example is exactly as described earlier for the voice pattern classifier module.
Database will store the patterns.
The recognition stage is as follows:
1. A user uses his voice.
2. The acoustic waves created are transferred to the Mic. 20.
3. The voice signal input 25 is transferred to the voice pattern classifier module 204.
4. Using these samples, feature extraction and pattern classification are performed in the voice pattern classifier module with respect to the patterns stored in database 210. The classification result is then sent to the end user application (voice control application).
5. The end user application accepts the detection results (220) and performs the desired action.
6. Steps 1-5 may repeat until the user exits. A pause may occur when the user changes the configuration or repeats the learning stage.
In this application, the concept of situations or domains is used. Situations are general subjects associated to a set of words, for example:
Indoor: "turn on a game", "alarm clock", "show me a funny video"
Car mode: "answer call", "redial", "dismiss call"
The same command may appear in several situations; this is a normal state and it needs to be supported.
Situations may be used to divide a large set of words into several subsets, so that when the application is in recognition mode and the user says a word, the application compares it only against the words of the current situation. For example, when the user has chosen the situation "Car mode", the application will be able to detect only the words "answer call", "redial" and "dismiss call". If the user says "alarm clock", the application will not be able to detect this word because it is not inside the current situation.
The use of situations enables the user to have a large dictionary, with a small probability for mistakes (there is a higher error rate in finding the right word out of 100 words than out of 10).
Situations may also be associated to location and time. This means that if a user is at a certain place, the system may automatically choose a specific situation, and the same for time margins. For example, if the user is usually indoors between hours 9-12 AM, the system may automatically load the indoor situation based on the current time. If the user is on the road, the system may load the car mode situation based on the physical whereabouts of the user given by the GPS module in the device.
This means that the user will have the ability to configure the application to allow:
Manual situation selection
Automatic time based situation selection
Automatic location based situation selection
Automatic time and location based situation selection
When defining a new situation, the user needs to enter the situation name (text), may define an icon representing the situation, and may also associate a time margin and/or location with it.
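The situation mechanism can be sketched as a vocabulary filter plus a selection rule. The data structure and function names below are hypothetical, only time based selection is shown, and location based selection (for example from the GPS module) would follow the same pattern with a different predicate.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass
class Situation:
    name: str
    words: set
    start: Optional[time] = None  # optional time margin
    end: Optional[time] = None


def pick_by_time(situations, now):
    """Return the first situation whose time margin contains `now`, else None."""
    for s in situations:
        if s.start is not None and s.end is not None and s.start <= now < s.end:
            return s
    return None


def active_vocabulary(dictionary, situation):
    """Keep only the dictionary entries that belong to the current situation."""
    return {word: tpl for word, tpl in dictionary.items() if word in situation.words}
```

Only the templates returned by `active_vocabulary` would be handed to the pattern classifier, so a word outside the current situation can never be matched.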
(114) An additional example is shown in
(115) The basic description of this module is an application that reveals a partial rhyme, and expects the user to complete it. Once the rhyme is completed, the system detects whether or not it is really a rhyme. A feedback containing the detection result is given to the user.
On initial start of the application (or on user request) a configuration stage takes place, in which the end user application (rhyming game application) interacts with the user in order to receive configuration parameters, such as:
Sensitivity level (optional, based on the application requirements)
Screen lock (optional, based on the application requirements)
Feedback type (optional, based on the application requirements)
Security level (optional, based on the application requirements)
Selecting which commands are included (optional, based on the application requirements)
Then a learning stage will take place (voice features only; not mandatory).
There is an option to have predefined patterns and voice feature calibration results inside the application, representing the ideal speech and basic voice feature limits, assuming the speech is standard.
The learning stage in this example is exactly as described earlier for this mode. The database stores the calibration values and the patterns.
The recognition stage is as follows:
1. A user uses his voice.
2. The acoustic waves created are transferred to the Mic. 20.
3. The voice signal input 25 is transferred to the voice pattern classifier module 204.
4. The pattern classification is performed in the voice pattern classifier module 204 with respect to the patterns stored in database 210. The classification result is then sent to the end user application (rhyming game).
5. The end user application accepts (220) the detection results and triggers a certain interaction in the rhyming game, for example good/bad rhyme.
6. Steps 1-5 may repeat until the user exits. A pause may occur when the user changes the configuration or repeats the learning stage.
(116) An additional example is shown in
(117) In
(118) On initial start of the application (or on user request) a configuration stage takes place, in which the end user application (voice to voice application) interacts with the user in order to receive configuration parameters, such as:
(119) Translation voice
Translation language
System language
Feedback type (voice, text or both)
Then a learning stage will take place (patterns only, mandatory).
The recognition stage is as follows:
1. A user uses his voice.
2. The acoustic waves created are transferred to the Mic. 20.
3. The voice signal input 25 is transferred to the voice pattern classifier module 204.
4. The pattern classification is performed in the voice pattern classifier module 204 with respect to the patterns stored in database 210. The classification result is then sent to the end user application (voice to voice application).
5. The end user application accepts (220) the detection results and sends feedback to the user indicating whether or not his utterance matches a certain pattern with a specific tag in the system.
6. Steps 1-5 may repeat until the user exits. A pause may occur when the user changes the configuration or repeats the learning stage.
In this application, the concept of situations is used.
Situations are general subjects associated to a set of words. For example, the words "play", "run", "ball game" and "go home" may be associated to an "outdoor" situation, while the words "rest", "eat", "bathroom" and "go outside" may be associated to an "indoor" situation.
Situations may be used to divide a large set of words into several subsets, so that when the application is in recognition mode and the user says a word, the application compares it only against the words of the current situation. For example, when the user has chosen the situation "outdoor", the application will be able to detect only the words "play", "run", "ball game" and "go home". If the user says "rest", the application will not be able to detect this word because it is not inside the current situation.
Obviously, there is a possibility for having the same word in several different situations, and the application also needs to support this option. The use of situations enables the user to have a large dictionary, with a small probability for mistakes (there is a higher error rate in finding the right word out of 100 words than out of 10).
For example, if the user is usually playing between hours 9-12 AM, the system may automatically load the Play situation based on the current time; and/or if the user is outdoors, the system may load the outdoor situation based on the physical whereabouts of the user given by the GPS module in the device.
This means that the user will have the ability to configure the application to allow:
Manual situation selection
Automatic time based situation selection
Automatic location based situation selection
Automatic time and location based situation selection
When defining a new situation, the user needs to enter the situation name (text), and may also define an icon representing the situation and he also may associate the time margin and/or location for this situation, as shown in
When the user builds the dictionary, the application needs to record the utterances and associate tags with them. In addition there is an option to add graphic representations (icons) to words and situations and to associate location and time margins with situations.
Once the user has defined a dictionary, the system may work in the recognition stage. The recognition stage may work in several sub modes:
(1) Grid mode: a mode in which a small set of words is used to control and select from a bigger set. In this mode one set is for word selection and another set is for navigation.
(2) Scanning mode: a mode in which the user controls the word navigation by using a very small word set. In this mode the application performs scanning iterations, in which certain elements (icons/words/icons+words) are emphasized, and every X seconds the emphasis passes to the next set of elements. In this way the user is able to use only one keyword to stop the scan. There is an option to pass between the subsets of elements.
(3) Direct selection mode: in this mode the words are divided into situations, and if the user utters a specific word, it has to be associated with the current situation, otherwise there will be no detection.
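The scanning mode can be illustrated with a small timing calculation. This is a sketch under stated assumptions: the scan is taken to start at time zero, cycle through the element sets in order, and wrap around after the last set; the function name and the example groups are hypothetical.

```python
def scan_selection(groups, interval_s, keyword_time_s):
    """Return the set of elements emphasized when the stop keyword is detected.

    groups: ordered list of element sets shown during the scan.
    interval_s: the X seconds after which the emphasis moves to the next set.
    keyword_time_s: seconds since the scan started when the keyword was heard.
    """
    if not groups or interval_s <= 0 or keyword_time_s < 0:
        raise ValueError("invalid scan parameters")
    index = int(keyword_time_s // interval_s) % len(groups)
    return groups[index]
```

With three groups and a two second interval, a keyword detected 3.1 seconds into the scan selects the second group.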
More details on the sub modes are as follows: Grid mode, an example as shown in
Integration possibilities are presented next: 1. Smart home, where the user's voice may trigger actions, as shown in
(120) Prosody+Voice Pattern Classification Mode
(121) This working mode combines the prosody mode and the voice pattern classification mode. In this mode there are 2 outputs (220, 222) for the end user application 55, as shown in
(122) The learning stages are:
(123) Prosody learning for the prosody detection module, as described earlier
Voice pattern learning for the voice pattern classifier module, as described earlier
The block diagram for this mode is shown in
(124) An example is shown in
(125) Pitch is the Y axis, intensity is the X axis.
(126) "Thick" paints with a broad line, "Thin" paints a thin line, and "Blue" or "Red" changes the color to blue or red.
(127) In the figure,
(128) 2200: an axis representing the motion
(129) 2202: a pattern to functionality representation
(130) 2204: a painting made by the predefined axis and patterns.
(131) This application is defined by a 3rd party company, thus the interaction and configuration will be defined by it.
(132) The basic description of this module is an application that enables drawing on the screen by the use of voice features. This module enables the user to define the set of axes with which to control the motion on the screen, as explained below:
(133) On initial start of the application (or on user request) a configuration stage takes place, in which the end user application (voice paint application) interacts with the user in order to receive configuration parameters, such as:
(134) The desired voice features and patterns to control the screen movement (needed for the prosody detection and voice pattern classification)
The association between the pattern/voice feature and the movement/drawing on the screen
An example for an axis set is shown in
In the figure,
2210: an axis set representing the motion
2212: a pattern to functionality representation
The figure shows an example of an axis set in which the vowels (or patterns) "ee", "u" and "a" function as the main axes, and the pitch and intensity set the direction and speed of the movement. In addition, the voice "Beep" holds/releases the brush, thus allowing easy painting.
Then, a learning stage may take place (mandatory for patterns, optional for voice features).
The recognition stage is as follows:
1. A user uses his voice.
2. The acoustic waves created are transferred to the Mic. 20.
3. The voice signal input 25 is transferred to the voice pattern classifier module 204 and the prosody detection module 206.
4. In the prosody detection module, the voice feature extraction is performed with respect to the values in database 210, and the predefined voice features are passed on to the end user application (voice paint application).
5. In parallel, the pattern classification is performed in the voice pattern classifier module with respect to the patterns stored in the database. The classification result is then sent to the end user application (voice paint application).
6. The end user application reacts proportionally to the passed values 220, 222 and controls the movement and action of the figure in the game.
7. Stages 1-6 repeat until the user exits. A pause may occur when the user changes the configuration or repeats the learning stage.
(135) An additional example is shown in
(136) In the figure:
(137) 2300: a ninja that jumps over an obstacle.
2302: an obstacle, a shuriken.
This application is defined by a 3rd party company, thus the interaction and configuration will be defined by it.
On initial start of the application (or on user request) a configuration stage takes place, in which the end user application (intonation and pattern controlled game application) interacts with the user in order to receive configuration parameters, such as:
The desired voice features and patterns to control the screen movement (needed for the prosody detection and voice pattern classifier), such as:
For voice features: 1. Pitch 2. Intensity 3. Duration 4. ROS 5. Formants
For patterns: 1. Shoot 2. Bomb 3. Jump 4. Chee 5. Choo
A combination of voice features may control one or several movements/actions.
The association between the pattern/voice feature and the movement/actions on the screen, such as:
1. "Shoot" triggers the figure to shoot, with intensity as the voice feature determining the magnitude of the shot.
2. "Jump" triggers a jump, with pitch determining the height.
Then a learning stage may take place (mandatory for patterns, optional for voice features). The database stores the calibration values and the patterns.
The recognition stage is as follows:
1. A user uses his voice.
2. The acoustic waves created are transferred to the Mic. 20.
3. The voice signal input 25 is transferred to the voice pattern classifier module 204 and the prosody detection module 206.
4. In the prosody detection module, the voice feature extraction is performed with respect to the values in database 210, and the predefined voice features are passed on to the end user application (intonation and pattern controlled game).
5. In parallel, the pattern classification is performed in the voice pattern classifier module with respect to the patterns stored in the database. The classification result is then sent to the end user application (intonation and pattern controlled game).
6. The end user application reacts proportionally to the passed values 220, 222 and controls the movement and action of the figure in the game.
7. Stages 1-6 repeat until the user exits. A pause may occur when the user changes the configuration or repeats the learning stage.
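Step 6 combines the two outputs: the classified pattern selects the action and a prosodic feature sets its magnitude. The sketch below assumes the Shoot/intensity and Jump/pitch associations given as examples in the configuration, and assumes that calibration supplies a low and a high limit per feature. The function name, the limits and the normalization to the range 0 to 1 are illustrative assumptions.

```python
# Assumed association between a recognized pattern and the feature scaling it.
BINDINGS = {"shoot": ("shoot", "intensity"), "jump": ("jump", "pitch")}


def game_action(pattern, features, calibration):
    """Map a recognized pattern and the extracted features to (action, magnitude).

    calibration: feature name -> (low, high) limits, assumed to come from the
    learning stage. The magnitude is the feature value scaled into [0, 1].
    Returns None when the pattern has no associated action.
    """
    if pattern not in BINDINGS:
        return None
    action, feature = BINDINGS[pattern]
    low, high = calibration[feature]
    scaled = (features[feature] - low) / (high - low)
    return action, min(1.0, max(0.0, scaled))
```

For a pitch range calibrated between 100 Hz and 300 Hz, saying "jump" at 200 Hz would produce a jump at half of the maximum height.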
(138) An additional example is shown in
Babies may cry due to tiredness, hygiene problems, gas build up, belly ache. The classification first needs to identify a baby cry, and then based on the voice features, analyze the possible cause for this cry.
In the figure:
2310: a baby crying
2312: a microphone
On initial start of the application (or on user request) a configuration stage takes place, in which the end user application (Baby Voice Analyzer) interacts with the user in order to receive configuration parameters, such as:
Sensitivity level (optional, based on the application requirements)
Screen lock (optional, based on the application requirements)
Feedback type (optional, based on the application requirements)
Then a learning stage may take place (mandatory for patterns and optional for voice features).
The recognition stage is as follows:
1. A user uses his voice.
2. The acoustic waves created are transferred to the Mic. 20.
3. The voice signal input 25 is transferred to the voice pattern classifier module 204 and the prosody detection module 206.
4. In the prosody detection module, the voice feature extraction is performed with respect to the values in database 210, and the predefined voice features are passed on to the end user application (baby voice analyzer).
5. In parallel, the pattern classification is performed in the voice pattern classifier module with respect to the patterns stored in the database. The classification result is then sent to the end user application (baby voice analyzer).
6. The end user application reacts proportionally to the passed values 220, 222 and determines whether or not there has been a cry, and the pattern classification determines the cause.
7. Stages 1-6 repeat until the user exits. A pause may occur when the user changes the configuration or repeats the learning stage.
(139) A further example is illustrated in
(140) Upon starting the application (or on user request) a configuration stage takes place, in which the end user application (educational application for toddlers) interacts with the user in order to receive parameters for certain configurations, such as:
(141) Sensitivity level (optional, based on the application requirements)
Graphic details level (optional, based on the application requirements)
Online or offline mode (optional, based on the application requirements)
Music level (optional, based on the application requirements)
Effects level (optional, based on the application requirements)
Then a learning stage may take place (optional).
There is an option for predefined patterns and voice feature calibration results inside the application, representing the ideal speech and basic voice feature limits, assuming the speech is standard.
The recognition stage is as follows:
1. A toddler utters.
2. The acoustic waves created are transferred to the Mic. 20.
3. The voice signal input 25 is transferred to the voice pattern classifier module 204 and the prosody detection module 206.
4. In the prosody detection module, the voice feature extraction is performed with respect to the values in database 210, and the predefined voice features are passed on to the end user application (educational application for toddlers).
5. In parallel, the pattern classification is performed by the voice pattern classifier module with respect to the patterns stored in the database. The classification result is then sent to the end user application (educational application for toddlers).
6. The end user application reacts proportionally to the passed values 220, 222 and triggers a certain interaction in the educational frame.
7. Stages 1-6 are repeated until the user exits. A pause may occur when the user changes the configuration or repeats the learning stage.
(142) It should be understood that the above description is merely exemplary and that there are various embodiments of the present invention that may be devised, mutatis mutandis, and that the features described in the above-described embodiments, and those not described herein, may be used separately or in any suitable combination; and the invention can be devised in accordance with embodiments not necessarily described above.