Patent classifications
G10L25/93
Method and Apparatus for Obtaining Information from the Web
An intelligent conversation system augmenting a conversation between two or more individuals uses a speech to text block configured to convert voices of the conversation into text; a determination circuit configured to determine topics from the text of the conversation, where search parameters determined by the determination circuit from the topics are sent to the Internet and search results corresponding to the search parameters are received from the Internet; and a memory configured to store the search results received from the Internet. The speech to text block is configured to convert the search results to speech. An earphone is configured to transmit the speech to one of the two or more individuals. The speech is used by one of the individuals to augment the conversation.
Method and apparatus for decoding speech/audio bitstream
A method and an apparatus for decoding a speech/audio bitstream are disclosed, where the method for decoding a speech/audio bitstream includes determining whether a current frame is a normal decoding frame or a redundancy decoding frame, obtaining a decoded parameter of the current frame by means of parsing when the current frame is a normal decoding frame or a redundancy decoding frame, performing post-processing on the decoded parameter of the current frame to obtain a post-processed decoded parameter of the current frame, and using the post-processed decoded parameter of the current frame to reconstruct a speech/audio signal.
AUDIO PROCESSING APPARATUS
An audio processing apparatus includes a preprocessor which extracts a voice-band signal from a first electric signal, and outputs a first output signal containing the voice-band signal; a first controller which generates a first amplification coefficient for multiplying with the first output signal to compress a dynamic range of an intensity of the first output signal, and generates a first modified amplification coefficient by smoothing the first amplification coefficient with a first time constant; and a first multiplier which multiplies the first modified amplification coefficient and the first output signal. The first time constant is a first rise time constant when the intensity increases, and is a first decay time constant when the intensity decreases. The first rise time constant is not less than a temporal resolution of hearing of a hearing-impaired person, and is less than a duration time of sound which induces recruitment in the hearing-impaired person.
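The gain smoothing described in this abstract can be sketched as a one-pole smoother that switches between a rise and a decay time constant depending on whether the intensity is increasing or decreasing. This is a minimal illustration only: the power-law gain in `compression_gain`, the function names, and the 5 ms / 50 ms time constants are assumptions made for the example and are not taken from the patent, which bounds the rise time constant by properties of impaired hearing without giving figures in the abstract.

```python
import math


def compression_gain(level, threshold, ratio):
    """Hypothetical static compression law: unity gain below the threshold,
    power-law gain reduction above it (assumed, not from the abstract)."""
    if level <= threshold:
        return 1.0
    return (level / threshold) ** (1.0 / ratio - 1.0)


def smooth_gain(raw_gains, levels, fs, rise_tc, decay_tc):
    """Smooth the raw amplification coefficients with a one-pole filter.

    The rise time constant is used while the intensity increases and the
    decay time constant while it decreases. When the intensity is unchanged,
    the most recently selected constant stays in effect.
    """
    a_rise = math.exp(-1.0 / (rise_tc * fs))
    a_decay = math.exp(-1.0 / (decay_tc * fs))
    a = a_rise
    g = raw_gains[0]
    prev = levels[0]
    smoothed = []
    for raw, level in zip(raw_gains, levels):
        if level > prev:
            a = a_rise
        elif level < prev:
            a = a_decay
        g = a * g + (1.0 - a) * raw
        smoothed.append(g)
        prev = level
    return smoothed


# Illustrative input: intensity steps from quiet to loud.
levels = [0.5] * 5 + [4.0] * 5
raw = [compression_gain(l, threshold=1.0, ratio=2.0) for l in levels]
modified = smooth_gain(raw, levels, fs=1000, rise_tc=0.005, decay_tc=0.05)
```

The multiplier stage then corresponds to multiplying each sample of the first output signal by the matching entry of `modified`.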
SUPERIMPOSING HIGH-FREQUENCY COPIES OF EMITTED SOUNDS
An audio emitter configured to emit a sound creates a high-frequency copy of the sound to be emitted. The high-frequency copy of the sound is superimposed over the sound, resulting in a composite signal. The composite signal is emitted by the emitter. The high-frequency copy is at a frequency inaudible to humans, enabling a receiver to identify the emitter and/or the sound.
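One way to picture the composite signal is to shift a scaled copy of the sound up in frequency by multiplying it with a high-frequency carrier and then adding the result to the original. The sketch below assumes this amplitude-modulation approach; the abstract does not say how the copy is produced, and the function name, `carrier_hz`, and `copy_gain` are hypothetical.

```python
import math


def superimpose_high_frequency_copy(sound, fs, carrier_hz, copy_gain=0.1):
    """Return the sound plus a frequency-shifted copy of itself.

    Multiplying by a cosine carrier moves the spectrum of the sound to
    around carrier_hz (assumed technique, not stated in the abstract).
    """
    if carrier_hz >= fs / 2:
        raise ValueError("carrier must lie below the Nyquist frequency")
    composite = []
    for n, s in enumerate(sound):
        carrier = math.cos(2.0 * math.pi * carrier_hz * n / fs)
        composite.append(s + copy_gain * s * carrier)
    return composite


# Illustrative use: a 440 Hz tone sampled at 48 kHz with a 21 kHz carrier,
# which places the copy near the upper limit of typical human hearing.
fs = 48000
tone = [math.sin(2.0 * math.pi * 440 * n / fs) for n in range(480)]
composite = superimpose_high_frequency_copy(tone, fs, carrier_hz=21000)
```

A receiver could then band-pass the region around the carrier to recover the copy, but that stage is outside this sketch.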
Method and system for generating advanced feature discrimination vectors for use in speech recognition
A method of renormalizing high-resolution oscillator peaks, extracted from windowed samples of an audio signal, is disclosed. Feature vectors are generated for which variations in both fundamental frequency and time duration of speech are substantially mitigated. The feature vectors may be aligned within a common coordinate space, free of those variations in frequency and time duration that occur between speakers, and even over speech by a single speaker, to facilitate a simple and accurate determination of matches between the advanced feature discrimination vectors (AFDVs) generated from a sample of the audio signal and corpus AFDVs generated for known speech at the phoneme and sub-phoneme level. The renormalized feature vectors can be combined with traditional feature vectors such as MFCCs, or they can be used exclusively to identify voiced, semi-voiced and unvoiced sounds.
METHOD AND APPARATUS FOR PROCESSING LIVE STREAM AUDIO, AND ELECTRONIC DEVICE AND STORAGE MEDIUM
A method for processing live stream audio, and an electronic device and a storage medium are provided. The method is applied to a live streamer end, and includes: acquiring a first audio signal formed by mixing a guest audio signal with a background audio signal of the live streamer end; obtaining a second audio signal by performing echo cancellation on the guest audio signal in the first audio signal according to the guest audio signal; detecting a voice activity state of a guest end according to the guest audio signal, the first audio signal and the second audio signal; obtaining a third audio signal by performing echo cancellation on the first audio signal in a mixed audio signal according to the voice activity state and the first audio signal; synthesizing and pushing the second audio signal and the third audio signal to the guest end.
UTTERANCE SECTION DETECTION DEVICE, UTTERANCE SECTION DETECTION METHOD, AND PROGRAM
An utterance section detection device is provided which is capable of detecting an utterance section with high accuracy on the basis of whether or not an end of a speech section is an end of utterance. The utterance section detection device includes a speech/non-speech determination unit configured to perform speech/non-speech determination, which is determination as to whether a certain frame of an acoustic signal is speech or non-speech; an utterance end determination unit configured to perform utterance end determination, which is determination as to whether or not an end of a speech section is an end of utterance, for each speech section, a speech section being a section determined as speech as a result of the speech/non-speech determination; a non-speech section duration threshold determination unit configured to determine a threshold regarding a duration of a non-speech section on the basis of a result of the utterance end determination; and an utterance section detection unit configured to detect an utterance section by comparing a duration of a non-speech section following the speech section with the corresponding threshold.
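The detection logic in this abstract can be sketched as follows, assuming the per-frame speech/non-speech flags and the per-section utterance end decisions are already available from upstream classifiers. The function names and the two-level threshold (`short_thr` when a section is judged to end an utterance, `long_thr` otherwise) are assumptions for illustration; the abstract only says that the threshold depends on the utterance end determination.

```python
def speech_sections(flags):
    """Return (start, end) frame index pairs, end exclusive, for each run
    of consecutive speech frames."""
    sections = []
    start = None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(flags)))
    return sections


def detect_utterances(flags, is_utterance_end, short_thr, long_thr):
    """Merge speech sections into utterance sections.

    For each speech section, the following non-speech duration (in frames)
    is compared with a threshold chosen from the utterance end decision for
    that section. The final speech section always closes an utterance.
    """
    sections = speech_sections(flags)
    utterances = []
    utt_start = None
    for k, (start, end) in enumerate(sections):
        if utt_start is None:
            utt_start = start
        is_last = k + 1 == len(sections)
        next_start = len(flags) if is_last else sections[k + 1][0]
        gap = next_start - end
        threshold = short_thr if is_utterance_end[k] else long_thr
        if is_last or gap >= threshold:
            utterances.append((utt_start, end))
            utt_start = None
    return utterances


# Illustrative input: three speech sections; only the first is judged to be
# a mid-utterance pause, so it is merged with the second.
flags = [True, True, False, False, True, True,
         False, False, False, False, False, True, False]
result = detect_utterances(flags, [False, True, True], short_thr=2, long_thr=5)
```

With these inputs the two-frame pause after the first section is below `long_thr`, so the first two sections form one utterance section.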