G10L15/187

Variable-speed phonetic pronunciation machine

A machine causes a touch-sensitive screen to present a graphical user interface that depicts a slider control aligned with a word that includes a first alphabetic letter and a second alphabetic letter. A first zone of the slider control corresponds to the first alphabetic letter, and a second zone of the slider control corresponds to the second alphabetic letter. The machine detects a touch-and-drag input that begins within the first zone and enters the second zone. In response to the touch-and-drag input beginning within the first zone, the machine presents a first phoneme that corresponds to the first alphabetic letter, and the presenting of the first phoneme may include audio playback of the first phoneme. In response to the touch-and-drag input entering the second zone, the machine presents a second phoneme that corresponds to the second alphabetic letter, which may include audio playback of the second phoneme.
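
The zone logic described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Zone`, `build_zones`, `zone_at`, and `phonemes_for_drag` are hypothetical, the slider is assumed to be split into equal-width zones (one per letter), and the phoneme labels are placeholder strings rather than audio.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    letter: str
    phoneme: str  # placeholder label; a real system would reference audio
    start: float  # left edge of the zone on the slider (inclusive)
    end: float    # right edge of the zone on the slider (exclusive)


def build_zones(word, phonemes, slider_width):
    """Assumption: the slider is divided evenly, one zone per letter."""
    width = slider_width / len(word)
    return [Zone(ch, ph, i * width, (i + 1) * width)
            for i, (ch, ph) in enumerate(zip(word, phonemes))]


def zone_at(zones, x):
    """Return the zone containing slider position x, or None if outside."""
    for zone in zones:
        if zone.start <= x < zone.end:
            return zone
    return None


def phonemes_for_drag(zones, xs):
    """List the phonemes presented as a drag visits the positions in xs.

    A phoneme is presented when the drag begins in a zone and again each
    time the drag enters a different zone.
    """
    presented = []
    current = None
    for x in xs:
        zone = zone_at(zones, x)
        if zone is not None and zone is not current:
            presented.append(zone.phoneme)
            current = zone
    return presented
```

Because a phoneme is emitted only when the drag crosses into a new zone, the pace of the output follows the pace of the user's finger, which is one plausible reading of "variable-speed" in the title.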

Data generation apparatus and data generation method that generate recognition text from speech data

According to one embodiment, the data generation apparatus includes a speech synthesis unit, a speech recognition unit, a matching processing unit, and a dataset generation unit. The speech synthesis unit generates speech data from an original text. The speech recognition unit generates a recognition text by speech recognition from the speech data. The matching processing unit performs matching between the original text and the recognition text. Based on the matching result, the dataset generation unit generates a dataset in which the original text is associated with the speech data from which a recognition text satisfying a certain condition on its matching degree relative to the original text was generated.
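
The four units can be pictured as a small filtering loop. The sketch below is an assumption-laden illustration: `synthesize` and `recognize` are caller-supplied stand-ins for the speech synthesis and speech recognition units, the matching degree is approximated with the standard library's `difflib.SequenceMatcher`, and the "certain condition" is taken to be a simple threshold. None of these choices is stated in the abstract.

```python
from difflib import SequenceMatcher


def matching_degree(original, recognized):
    """Similarity in [0, 1]; a stand-in for the matching processing unit."""
    return SequenceMatcher(None, original, recognized).ratio()


def generate_dataset(texts, synthesize, recognize, threshold=0.9):
    """Keep (speech, original text) pairs whose recognition text matches well.

    synthesize: callable text -> speech data (hypothetical TTS stand-in)
    recognize:  callable speech data -> text (hypothetical ASR stand-in)
    threshold:  assumed form of the 'certain condition' on matching degree
    """
    dataset = []
    for text in texts:
        speech = synthesize(text)
        recognized = recognize(speech)
        if matching_degree(text, recognized) >= threshold:
            dataset.append((speech, text))
    return dataset
```

Note that the dataset pairs the speech data with the original text, not with the recognition text; the recognition text serves only as a quality check on the synthesized speech.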

Recognition or synthesis of human-uttered harmonic sounds
11545143 · 2023-01-03

Within each harmonic spectrum of a sequence of spectra derived from analysis of a waveform representing human speech, two or more fundamental or harmonic components are identified whose frequencies are separated by integer multiples of a fundamental acoustic frequency. The highest harmonic frequency that is also greater than 410 Hz is a primary cap frequency, which is used to select a primary phonetic note that corresponds to a subset of phonetic chords from a set of phonetic chords for which acoustic spectral data is available. The spectral data can also include frequencies for primary band, secondary band (or secondary note), basal band, or reduced basal band acoustic components, which can be used to select a phonetic chord from the subset of phonetic chords corresponding to the selected primary note.
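
The two numeric conditions in this abstract (integer-multiple spacing and the 410 Hz threshold) can be illustrated with a short sketch. The function names and the tolerance `tol` are hypothetical, and the sketch assumes the component frequencies have already been extracted from a spectrum; how the patent actually detects components or handles noise is not described here.

```python
def is_harmonic_set(freqs, f0, tol=0.05):
    """Check that components are spaced by integer multiples of f0.

    Only adjacent spacings are tested; if each is an integer multiple of
    f0, every pairwise spacing is as well. `tol` is an assumed tolerance,
    expressed as a fraction of f0.
    """
    ordered = sorted(freqs)
    for low, high in zip(ordered, ordered[1:]):
        ratio = (high - low) / f0
        if round(ratio) < 1 or abs(ratio - round(ratio)) > tol:
            return False
    return True


def primary_cap_frequency(freqs, threshold_hz=410.0):
    """Return the highest component above threshold_hz, or None.

    Assumed reading of the abstract: the cap is the highest identified
    harmonic, and it qualifies only if it exceeds 410 Hz.
    """
    above = [f for f in freqs if f > threshold_hz]
    return max(above) if above else None
```

For components at 120, 240, 480, and 600 Hz with a 120 Hz fundamental, the spacing check passes and the primary cap frequency is 600 Hz; a spectrum whose components all lie at or below 410 Hz yields no primary cap under this reading.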

TRACKING ARTICULATORY AND PROSODIC DEVELOPMENT IN CHILDREN

Systems, devices, and methods for tracking articulatory and prosodic development in children are disclosed. Human speech in a given language can be divided into phonemes, each of which is a sound or group of sounds perceived by speakers of the language to have a common linguistic function (e.g., consonant sounds, vowel sounds). In an exemplary aspect, a normative model can be generated for production characteristics of each phoneme in a given language using a database of normative speech samples. One or more speech samples of a human subject can be analyzed to identify the phonemes used by the human subject and measured against the normative model. Based on this analysis, a normed score is generated for the articulation accuracy, duration, rhythm, volume, and/or other production characteristics of each phoneme in the speech sample of the human subject.
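
One simple way to picture a normed score is a z-score against per-phoneme statistics. The sketch below assumes exactly that: the normative model is reduced to a mean and sample standard deviation for one production characteristic (for example, duration in milliseconds), and the function names are hypothetical. The abstract does not say which statistical model is actually used.

```python
from statistics import mean, stdev


def build_normative_model(samples):
    """Summarize normative measurements per phoneme.

    samples: dict mapping phoneme -> list of measurements for one
    characteristic. Each list needs at least two values for stdev.
    """
    return {ph: (mean(vals), stdev(vals)) for ph, vals in samples.items()}


def normed_scores(model, subject):
    """Score a subject's measurements as z-scores against the model.

    subject: dict mapping phoneme -> the subject's measured value.
    Phonemes with zero normative spread are skipped to avoid dividing
    by zero.
    """
    scores = {}
    for ph, value in subject.items():
        mu, sigma = model[ph]
        if sigma == 0:
            continue
        scores[ph] = (value - mu) / sigma
    return scores
```

With normative durations of 90, 100, and 110 ms for a phoneme, a subject who produces it in 120 ms scores two standard deviations above the norm.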

Method and system for processing speech signal

Embodiments of the present disclosure provide methods and systems for processing a speech signal. The method can include: processing the speech signal to generate a plurality of speech frames; generating a first number of acoustic features based on the plurality of speech frames using a frame shift at a given frequency; and generating a second number of posteriori probability vectors based on the first number of acoustic features using an acoustic model, wherein each of the posteriori probability vectors comprises probabilities of the acoustic features corresponding to a plurality of modeling units, respectively.
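
The three steps of the method can be sketched with toy components. Everything below is illustrative: the feature is a single log-energy value per frame, the "acoustic model" is a softmax over hand-picked weights, and the choice to pool `group` consecutive features into one posterior vector (so that the second number is smaller than the first) is an assumption, since the abstract does not say how the two numbers relate.

```python
import math


def split_frames(signal, frame_len, frame_shift):
    """Cut the signal into overlapping frames, advancing by frame_shift."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_shift)]


def log_energy(frame):
    """Toy acoustic feature: log of the frame energy."""
    return math.log(sum(x * x for x in frame) + 1e-10)


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def posteriors(features, group, unit_weights):
    """Toy acoustic model: one posterior vector per `group` features.

    Each vector holds one probability per modeling unit; unit_weights
    (hypothetical) has one weight per unit.
    """
    vectors = []
    for i in range(0, len(features) - group + 1, group):
        pooled = sum(features[i:i + group]) / group
        vectors.append(softmax([w * pooled for w in unit_weights]))
    return vectors
```

A 100-sample signal with 20-sample frames and a 10-sample shift yields 9 frames and therefore 9 features; pooling them in groups of 3 yields 3 posterior vectors, each with one probability per modeling unit.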
