G10L21/057

DEVICE AND METHOD OF CONTROLLING AUDIO TIME STRETCHING FOR DETERMINING COMPRESSION RATE BASED ON CLUSTER

A device for controlling audio time stretching includes a silence interval unit configured to detect a silence interval of an audio signal, a cluster unit configured to classify at least one of the frames of the audio outside the detected silence interval into a plurality of clusters, and a script unit configured to set a compression rate for each of the clusters and generate a speed script including information about the clusters and their set compression rates. At least one of the clusters has a compression rate different from that of another cluster.
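The three units described above can be sketched roughly as follows. This is a minimal illustration, not the claimed implementation: the abstract does not say which features or clustering method are used, so the sketch assumes per-frame energy as the only feature, an energy threshold for silence detection, and a small 1-D k-means for clustering. All names (`Segment`, `frame_energies`, `kmeans_1d`, `build_speed_script`) and the reading of "compression rate" as a playback speed factor are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_frame: int
    end_frame: int  # exclusive
    label: str      # "silence" or "cluster<k>"
    rate: float     # assumed meaning: speed factor applied to this run of frames


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means; returns one cluster index per value."""
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly over the value range.
    centroids = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return assign


def build_speed_script(samples, frame_len, silence_threshold, cluster_rates, silence_rate):
    """Detect silence, cluster the remaining frames, and emit a per-segment script."""
    energies = frame_energies(samples, frame_len)
    # Silence interval unit (assumed): frames below an energy threshold.
    voiced = [i for i, e in enumerate(energies) if e >= silence_threshold]
    labels = ["silence"] * len(energies)
    # Cluster unit (assumed): cluster only the non-silent frames.
    if voiced:
        assign = kmeans_1d([energies[i] for i in voiced], len(cluster_rates))
        for i, a in zip(voiced, assign):
            labels[i] = f"cluster{a}"
    # Script unit (assumed): one rate per cluster, merged over consecutive frames.
    rate_of = {"silence": silence_rate}
    rate_of.update({f"cluster{j}": r for j, r in enumerate(cluster_rates)})
    script = []
    for i, label in enumerate(labels):
        if script and script[-1].label == label:
            script[-1].end_frame = i + 1
        else:
            script.append(Segment(i, i + 1, label, rate_of[label]))
    return script
```

Calling `build_speed_script(samples, 4, 1e-3, [1.0, 1.5], 4.0)` would give each of the two clusters its own rate and a separate rate for silence. A time-stretching engine would then read the resulting segments in order. How rates are actually chosen per cluster is not stated in the abstract.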

VOICE TRANSFORMATION FOR THROAT MICROPHONES

Systems and methods are provided for transforming audio signals captured by a throat microphone into signals emulating speech recorded with a conventional air-conduction microphone. Throat microphones employ vibration sensors positioned on the neck to capture audio, making them suitable for high-noise environments. However, throat microphone signals lack high-frequency components, reducing intelligibility and degrading automatic speech recognition performance. The techniques provided herein apply signal-processing operations and a lightweight neural network to reconstruct missing spectral details. The input signal is converted to log-Mel spectra and modeled as a smooth average spectrum (SAS) plus a residual component. A neural network predicts a conventional-microphone SAS. A vocoder synthesizes an enhanced audio signal after combining the predicted SAS with the residual component. The approach improves speech intelligibility and ASR accuracy while maintaining low computational complexity, enabling real-time, on-device processing in noisy environments and supporting hands-free communication for applications such as collaborative robotics and augmented reality.
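The decomposition and recombination steps described above can be sketched as follows. The abstract does not define how the SAS is computed, so this sketch assumes it is a moving average over time within each Mel band. The neural network is replaced by a caller-supplied function `predict_sas`, and the vocoder stage is omitted. All function names and the `half_width` parameter are hypothetical.

```python
def smooth_average_spectrum(log_mel, half_width):
    """Moving average over time for each Mel band (an assumed definition of SAS)."""
    n_frames = len(log_mel)
    n_bands = len(log_mel[0])
    sas = []
    for t in range(n_frames):
        lo, hi = max(0, t - half_width), min(n_frames, t + half_width + 1)
        sas.append([
            sum(log_mel[u][b] for u in range(lo, hi)) / (hi - lo)
            for b in range(n_bands)
        ])
    return sas


def decompose(log_mel, half_width):
    """Split a log-Mel spectrogram into SAS plus residual, so log_mel = sas + residual."""
    sas = smooth_average_spectrum(log_mel, half_width)
    residual = [
        [x - s for x, s in zip(row, sas_row)]
        for row, sas_row in zip(log_mel, sas)
    ]
    return sas, residual


def enhance(log_mel, predict_sas, half_width=2):
    """Replace the throat-microphone SAS with a predicted one and keep the residual."""
    sas, residual = decompose(log_mel, half_width)
    predicted = predict_sas(sas)  # stand-in for the neural network
    return [
        [p + r for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(predicted, residual)
    ]
```

Under this reading the network only has to model the slowly varying spectral envelope, while the fine detail passes through unchanged in the residual. That split is one plausible way to keep the model lightweight, but the abstract does not confirm the exact formulation.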
