G10L15/12

Automatic speech recognition system addressing perceptual-based adversarial audio attacks
20200395035 · 2020-12-17

A computer-implemented method for creating a combined audio signal in a speech recognition system includes: sampling an audio input signal to generate a time-domain sampled input signal; converting the time-domain sampled input signal to a frequency-domain input signal; generating perceptual weights in response to frequency components of critical bands of the frequency-domain input signal; creating a time-domain adversary signal in response to the perceptual weights; and combining the time-domain adversary signal with the audio input signal to create the combined audio signal, wherein speech processing of the combined audio signal outputs a different result from speech processing of the audio input signal.
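The order of the steps in this abstract (sample, transform, weight per band, build a time-domain perturbation, mix) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the equal-width bands stand in for true critical bands, the weights are simply each band's share of signal energy, and the perturbation is shaped random noise rather than a signal optimized against a recognizer, so the sketch does not by itself guarantee a changed recognition result. The names `perceptual_weights`, `make_adversary`, and `combine` are hypothetical.

```python
import cmath
import math
import random

def dft(x):
    # Naive discrete Fourier transform (time domain -> frequency domain).
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    # Inverse transform; the real part is kept because the input was real.
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def band_of(k, n, num_bands):
    # Fold negative frequencies onto positive ones, then pick an equal-width
    # band (ASSUMPTION: a stand-in for perceptual critical bands).
    f = min(k, n - k)
    return f * num_bands // (n // 2 + 1)

def perceptual_weights(spectrum, num_bands=4):
    # ASSUMPTION: weight = share of signal energy in the band, so louder
    # bands tolerate more perturbation.
    n = len(spectrum)
    energy = [0.0] * num_bands
    for k, c in enumerate(spectrum):
        energy[band_of(k, n, num_bands)] += abs(c) ** 2
    total = sum(energy) or 1.0
    return [e / total for e in energy]

def make_adversary(signal, epsilon=0.05, num_bands=4, seed=0):
    # Shape random noise by the band weights and return it in the time
    # domain, scaled so its peak magnitude is epsilon.
    rng = random.Random(seed)
    n = len(signal)
    weights = perceptual_weights(dft(signal), num_bands)
    noise = dft([rng.uniform(-1.0, 1.0) for _ in range(n)])
    shaped = [c * weights[band_of(k, n, num_bands)] for k, c in enumerate(noise)]
    adversary = idft(shaped)
    peak = max(abs(v) for v in adversary) or 1.0
    return [epsilon * v / peak for v in adversary]

def combine(signal, adversary):
    # Sample-wise sum of the input and the perturbation.
    return [s + a for s, a in zip(signal, adversary)]

# Usage: a 32-sample tone whose energy sits in the lowest band.
tone = [math.sin(2 * math.pi * 2 * t / 32) for t in range(32)]
mixed = combine(tone, make_adversary(tone))
```

Because the noise is real and the weights are symmetric in frequency, the shaped spectrum transforms back to a real signal, which is why taking the real part in `idft` loses nothing.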

Diagnostic techniques based on speech-sample alignment
20200294531 · 2020-09-17

Reference-sample feature vectors that quantify acoustic features of different respective portions of at least one reference speech sample, which was produced by a subject at a first time while a physiological state of the subject was known, are obtained. At least one test speech sample that was produced by the subject at a second time, while the physiological state of the subject was unknown, is received. Test-sample feature vectors that quantify the acoustic features of different respective portions of the test speech sample are computed. The test-sample feature vectors are mapped to respective ones of the reference-sample feature vectors, under predefined constraints, such that a total distance between the test-sample feature vectors and the respective ones of the reference-sample feature vectors is minimized. In response to the mapping, an output indicating the physiological state of the subject at the second time is generated. Other embodiments are also described.
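Mapping test vectors onto reference vectors so that the total distance is minimized under ordering constraints is what dynamic time warping (DTW) computes. The sketch below assumes that the "predefined constraints" are the usual DTW ones (monotonic, contiguous steps, with the endpoints aligned) and that distance is Euclidean; the abstract fixes neither. `dtw_align` is a hypothetical name.

```python
import math

def euclidean(a, b):
    # Distance between two feature vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_align(test, ref):
    """Return (total_distance, path) for the cheapest monotonic alignment.

    path is a list of (test_index, ref_index) pairs.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(test[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # test frame repeats a ref frame
                                 cost[i][j - 1],      # ref frame repeats a test frame
                                 cost[i - 1][j - 1])  # both advance
    # Walk back from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

# Usage: the test sample dwells on the middle frame for one extra step.
reference = [[0.0], [1.0], [2.0]]
test_sample = [[0.0], [1.0], [1.0], [2.0]]
total, mapping = dtw_align(test_sample, reference)
```

How the resulting distance or mapping is turned into an indication of physiological state is not specified in the abstract and is left out of the sketch.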

Apparatus, method for voice recognition, and non-transitory computer-readable storage medium
10733986 · 2020-08-04

An apparatus for voice recognition transforms voice information into a phoneme sequence expressed by characters of individual phonemes corresponding to feature parameters of the voice information, determines, based on a first likelihood and a second likelihood, whether or not collation succeeds, executes a matching operation that includes associating, based on a collation result, individual phonemes of the phoneme sequence of the voice information at a time of a failure of collation and individual phonemes of a phoneme sequence of previous voice information with each other, and executes a determination operation that includes determining, based on a result of the association, whether or not the phoneme sequence of the voice information is based on repetitive vocalization.
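One way to picture the association step is a phoneme-by-phoneme alignment between the utterance that failed collation and the previous utterance. The sketch below uses plain edit distance and a fixed ratio threshold; both are assumptions chosen for illustration, since the abstract bases collation on two likelihoods and does not name an alignment method. `edit_distance`, `is_repetition`, and `max_ratio` are hypothetical.

```python
def edit_distance(current, previous):
    # Classic dynamic-programming edit distance over phoneme symbols.
    n, m = len(current), len(previous)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if current[i - 1] == previous[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # phoneme dropped
                             dist[i][j - 1] + 1,        # phoneme added
                             dist[i - 1][j - 1] + substitution)
    return dist[n][m]

def is_repetition(current, previous, max_ratio=0.34):
    # ASSUMPTION: treat the new utterance as a repeat of the previous one
    # when few edits are needed relative to the longer sequence.
    longest = max(len(current), len(previous))
    if longest == 0:
        return False
    return edit_distance(current, previous) / longest <= max_ratio
```

For example, `is_repetition("k a m e r a".split(), "k a m a r a".split())` needs a single substitution across six phonemes and is treated as a repeat under the assumed threshold.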

Unsupervised keyword spotting and word discovery for fraud analytics
20200243077 · 2020-07-30

Embodiments described herein provide for a computer that detects one or more keywords of interest using acoustic features, to detect or query commonalities across multiple fraud calls. Embodiments described herein may implement unsupervised keyword spotting (UKWS) or unsupervised word discovery (UWD) in order to identify commonalities across a set of calls, where both UKWS and UWD employ Gaussian Mixture Models (GMM) and one or more dynamic time-warping algorithms. A user may indicate a training exemplar or occurrence of call-specific information, referred to herein as a named entity, such as a person's name, an account number, account balance, or order number. The computer may perform a redaction process that computationally nullifies the import of the named entity in the modeling processes described herein.

Speech privacy system and/or associated method
10726855 · 2020-07-28

Certain example embodiments relate to speech privacy systems and/or associated methods. The techniques described herein disrupt the intelligibility of the perceived speech by, for example, superimposing onto an original speech signal a masking replica of the original speech signal in which portions of it are smeared by a time delay and/or amplitude adjustment, with the time delays and/or amplitude adjustments oscillating over time. In certain example embodiments, smearing of the original signal may be generated in frequency ranges corresponding to formants, consonant sounds, phonemes, and/or other related or non-related information-carrying building blocks of speech. Additionally, or in the alternative, annoying reverberations particular to a room or area in low frequency ranges may be cut out of the replica signal, without increasing or substantially increasing perceived loudness.
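The core idea of superimposing a replica whose delay and gain oscillate over time can be sketched as follows. The sketch is broadband and sample-based: it does not restrict smearing to formant or consonant frequency ranges, and it omits the low-frequency reverberation cut. The sinusoidal modulation and the parameters `max_delay_s`, `mod_hz`, and `depth` are assumptions made for illustration.

```python
import math

def masking_replica(signal, sample_rate, max_delay_s=0.02, mod_hz=2.0, depth=0.5):
    # Build a copy of the signal whose delay and gain oscillate over time.
    max_delay = max_delay_s * sample_rate
    replica = []
    for n in range(len(signal)):
        phase = 2 * math.pi * mod_hz * n / sample_rate
        # Delay swings between 0 and max_delay samples.
        delay = int(round(0.5 * max_delay * (1 + math.sin(phase))))
        # Gain swings between (1 - depth) and 1.
        gain = 1.0 - depth * 0.5 * (1 + math.cos(phase))
        src = n - delay
        replica.append(gain * signal[src] if src >= 0 else 0.0)
    return replica

def add_mask(signal, sample_rate, **kwargs):
    # Superimpose the smeared replica onto the original signal.
    replica = masking_replica(signal, sample_rate, **kwargs)
    return [s + r for s, r in zip(signal, replica)]
```

With `max_delay_s=0.0` and `depth=0.0` the replica equals the original, so the output is simply the input doubled; the masking effect comes entirely from the oscillating delay and gain.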

Leveraging natural language processing

A system, computer program product, and method are provided to automate a natural language processing system to facilitate an artificial intelligence platform defining a relationship between dialogue and post dialogue activity. Dialogue is detected and analyzed, including identification of key words and phrases within the dialogue. Post dialogue actions, including physical actuation of a hardware device and an associated temporal proximity of the action and the dialogue, are monitored. The hardware device receives an instruction from a processing unit that relates to the analyzed dialogue and the hardware device changes states and/or actuates another hardware device. The system constructs a hypothesis, i.e., a relationship from the identified key phrase drawn from the analyzed dialogue and the monitored post dialogue action. A dialogue tree containing identified terms and associated post dialogue actions is dynamically modified with one or more new identified terms and the associated post dialogue actions.

Speech recognition by selecting and refining hot words

Speech recognition is performed by receiving a speech signal that includes spoken phones. A dynamic time warping procedure is applied to the received speech signal to generate a time-warped signal. The time-warped signal is compared to a plurality of stored reference patterns to identify a set of stored reference patterns that are most similar to the time-warped signal. A candidate hot word is selected from a list using the identified set of stored reference patterns. The selection of the candidate hot word is then refined.
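The compare-and-shortlist stage can be sketched with a DTW distance against each stored pattern. Several simplifications are assumed: features are one scalar per frame, warping and comparison are folded into a single distance computation instead of first producing a separate time-warped signal, and the refinement step is omitted because the abstract does not describe it. `shortlist` and `pick_candidate` are hypothetical names.

```python
def dtw_distance(a, b):
    # DTW cost between two scalar sequences, keeping one row at a time.
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]

def shortlist(signal, references, k=2):
    # Names of the k stored patterns closest to the signal, nearest first.
    scored = sorted((dtw_distance(signal, pattern), word)
                    for word, pattern in references.items())
    return [word for _, word in scored[:k]]

def pick_candidate(shortlisted, hot_words):
    # First shortlisted pattern that is also on the hot-word list.
    for word in shortlisted:
        if word in hot_words:
            return word
    return None

# Usage with made-up patterns: the input is "hello" spoken more slowly.
references = {"hello": [1, 3, 4, 3, 1], "help": [1, 3, 4, 1], "stop": [5, 5, 0, 0]}
spoken = [1, 1, 3, 4, 4, 3, 1]
nearest = shortlist(spoken, references)
candidate = pick_candidate(nearest, {"help", "stop"})
```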

Classification of teaching based upon sound amplitude

A system is provided to determine a teaching technique based upon sound amplitude, comprising: a processor; and a memory device holding an instruction set executable on the processor to cause the system to perform operations comprising: sampling the amplitude of sound at a sampling rate to produce sound samples; assigning a respective sound amplitude and a respective amplitude variation to each sound sample; and classifying the sound samples based upon the assigned sound amplitudes and amplitude variations.
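The three operations (sample, assign amplitude and variation, classify) can be sketched as a simple threshold rule. The class labels, the thresholds `loud` and `varied`, and the choice of sample-to-sample difference as the variation measure are all assumptions made for illustration; the abstract does not name the teaching techniques or the classifier.

```python
def classify_samples(amplitudes, loud=0.5, varied=0.2):
    # Label each sample from its amplitude and its change since the
    # previous sample (ASSUMPTION: hypothetical labels and thresholds).
    labels = []
    previous = amplitudes[0] if amplitudes else 0.0
    for amplitude in amplitudes:
        variation = abs(amplitude - previous)
        if amplitude < loud:
            labels.append("silent work")
        elif variation < varied:
            labels.append("lecture")      # loud and steady
        else:
            labels.append("discussion")   # loud and changing
        previous = amplitude
    return labels
```

For example, `classify_samples([0.1, 0.1, 0.8, 0.8, 0.2, 0.9])` labels the quiet samples as silent work, the steady loud sample as lecture, and the loud samples that follow a jump as discussion.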