REFERENCE PICTURE RESAMPLING FOR VIDEO CODING
In some embodiments, a video decoder decodes frames of a video from a video bitstream. The decoder further performs inter prediction to decode a current frame of the video by using the decoded frames as reference frames. Performing the inter prediction includes performing reference picture resampling by upsampling a reference frame for the current frame using one or more filters selected from a set of 32 6-tap interpolation filters. The same set of interpolation filters is also used for interpolating chroma components during motion compensation. The decoded frames and the decoded current frame are output for display.
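As a rough illustration of phase-selected interpolation filtering, the Python sketch below upsamples one row of reference samples with a 32-phase, 6-tap filter bank. The windowed-sinc coefficients, the 64-unit fixed-point scale, and the edge padding are all assumptions made for the sketch; a real codec defines its filter tables normatively, and they differ from what is computed here.

```python
import numpy as np

NUM_PHASES, NUM_TAPS = 32, 6  # 32 fractional phases, 6 taps per filter

def make_filter_bank():
    # Illustrative windowed-sinc design; NOT the codec's normative table.
    bank = np.zeros((NUM_PHASES, NUM_TAPS))
    offsets = np.arange(-(NUM_TAPS // 2 - 1), NUM_TAPS // 2 + 1)  # -2..3
    for p in range(NUM_PHASES):
        frac = p / NUM_PHASES
        h = np.sinc(offsets - frac) * np.hamming(NUM_TAPS)
        bank[p] = np.round(64 * h / h.sum())  # roughly sums to 64 (6-bit)
    return bank.astype(int)

def upsample_row(row, scale, bank):
    """Resample one row of reference samples by `scale` (> 1 upsamples)."""
    padded = np.pad(row, NUM_TAPS, mode='edge')  # assumed edge extension
    out = np.empty(int(len(row) * scale))
    for i in range(len(out)):
        src = i / scale                      # position in the reference row
        base = int(np.floor(src))
        phase = int((src - base) * NUM_PHASES)
        # Taps cover reference samples base-2 .. base+3 (shifted by padding).
        window = padded[base + 4 : base + 4 + NUM_TAPS]
        out[i] = window @ bank[phase] / 64   # fixed-point renormalization
    return out

row = np.arange(16, dtype=float)
print(upsample_row(row, 1.5, make_filter_bank())[:8])
```

Selecting one filter per output sample by its fractional phase is what lets the same small filter bank serve both reference picture upsampling and fractional-pel chroma motion compensation, as the abstract notes.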
Neural temporal beamformer for noise reduction in single-channel audio signals
This disclosure provides methods, devices, and systems for audio signal processing. The present implementations relate more specifically to multi-frame beamforming with neural network supervision. In some aspects, a speech enhancement system may include a linear filter, a deep neural network (DNN), a voice activity detector (VAD), and an interframe correlation (IFC) calculator. The DNN infers a probability of speech (p_DNN) in the current frame of a single-channel audio signal based on a neural network model. The VAD determines whether speech is present or absent in the current frame based on p_DNN. The IFC calculator may estimate an IFC vector based on the output of the DNN (such as p_DNN) and the output of the VAD (such as an indication of whether speech is present in the current frame). The linear filter uses the IFC vector to suppress noise in the current frame.
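To make the data flow concrete, here is a hedged Python sketch of the described pipeline: the DNN produces p_DNN, the VAD thresholds it, the IFC calculator updates an IFC vector over a stack of recent frames, and a linear filter steered by that vector produces the output frame. The energy-based stand-in for the DNN, the threshold value, the recursive IFC estimator, and the MVDR-style filter with an identity noise covariance are all illustrative assumptions; the abstract leaves these components open.

```python
import numpy as np

FRAME_LEN = 160      # assumed 10 ms frames at 16 kHz
N = 4                # number of consecutive frames in the multi-frame stack
VAD_THRESHOLD = 0.5  # hypothetical decision threshold on p_DNN

def dnn_speech_probability(frame):
    """Placeholder for the trained DNN; a crude energy heuristic here."""
    energy = np.mean(frame ** 2)
    return float(energy / (energy + 1e-4))

def enhance(frames):
    """Sketch of the pipeline: DNN -> VAD -> IFC -> linear filter.
    Time-domain frames keep the sketch self-contained; a real system
    would typically operate on STFT bins."""
    gamma = np.zeros(N); gamma[0] = 1.0      # IFC vector, identity init
    history = np.zeros((N, FRAME_LEN))
    out = []
    for frame in frames:
        history = np.roll(history, 1, axis=0)
        history[0] = frame
        p_dnn = dnn_speech_probability(frame)   # DNN inference
        speech_present = p_dnn > VAD_THRESHOLD  # VAD decision
        if speech_present:
            # Recursive IFC update weighted by p_DNN: one plausible use
            # of the DNN and VAD outputs, assumed for this sketch.
            corr = history @ frame / (frame @ frame + 1e-8)
            gamma = (1 - 0.1 * p_dnn) * gamma + 0.1 * p_dnn * corr
        # Linear filter steered by the IFC vector (MVDR-style solution
        # with identity noise covariance, purely illustrative).
        w = gamma / (gamma @ gamma + 1e-8)
        out.append(w @ history)                 # filtered current frame
    return np.array(out)

rng = np.random.default_rng(0)
print(enhance(rng.standard_normal((50, FRAME_LEN))).shape)  # (50, 160)
```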
Joint segmenting and automatic speech recognition
A joint segmenting and automatic speech recognition (ASR) model includes an encoder and a decoder. The encoder is configured to receive a sequence of acoustic frames characterizing one or more utterances, and to generate, at each output step, a higher-order feature representation for a corresponding acoustic frame. The decoder is configured to receive the higher-order feature representation and to generate, at each output step, a probability distribution over possible speech recognition hypotheses and an indication of whether the corresponding output step corresponds to the end of a speech segment. The joint segmenting and ASR model is trained on a set of training samples, each training sample including audio data characterizing a spoken utterance and a corresponding transcription of the spoken utterance, the transcription having an end-of-speech-segment ground-truth token inserted automatically based on a set of heuristic-based rules and exceptions applied to the training sample.
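The training-data preparation step lends itself to a short sketch. The Python below inserts a hypothetical <eos> ground-truth token into a transcription using toy heuristics (terminal punctuation as the rule, a small abbreviation list as the exceptions); the actual rules and exceptions are not specified in the abstract, so both the token name and the rules here are assumptions.

```python
import re

EOS = "<eos>"  # hypothetical end-of-speech-segment token

# Illustrative heuristics: terminal punctuation ends a segment, unless
# the word is a known abbreviation (the "exceptions" of the abstract).
SEGMENT_END = re.compile(r'[.?!]')
EXCEPTIONS = ("mr.", "mrs.", "dr.", "etc.")

def insert_eos_tokens(transcript: str) -> str:
    """Insert <eos> after tokens that heuristically end a speech segment."""
    out = []
    for word in transcript.split():
        out.append(word)
        if SEGMENT_END.search(word) and word.lower() not in EXCEPTIONS:
            out.append(EOS)
    return " ".join(out)

print(insert_eos_tokens("yes. see dr. smith tomorrow."))
# -> "yes. <eos> see dr. smith tomorrow. <eos>"
```

Training on transcripts augmented this way is what lets the decoder learn to emit the end-of-segment indication jointly with its recognition hypotheses.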
Text time annotation method and apparatus, electronic device, and readable storage medium
In an embodiment, a method includes receiving, by an electronic device, an annotation request, wherein the annotation request is used to request to annotate a playing start-end time period of each text unit in a text corresponding to a target audio, and wherein a text unit is at least one of a character, a word, or an expression. The method further includes obtaining, by the electronic device, the playing start-end time period of each text unit in the target audio based on a fundamental frequency of the target audio; obtaining, by the electronic device, an annotation file by annotating, in the text, the playing start-end time period of each text unit based on the playing start-end time period of each text unit in the target audio; and outputting, by the electronic device, the annotation file.
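As a minimal sketch of the fundamental-frequency-based timing step, the Python below groups voiced frames (nonzero F0) into start-end periods, pairs them one-to-one with text units, and writes a JSON annotation file. The 10 ms hop, the zero-means-unvoiced convention, the one-run-per-unit pairing, and the JSON format are all assumptions for this sketch; real alignment of units to audio is more involved.

```python
import json

FRAME_HOP_S = 0.010  # assumed 10 ms analysis hop

def voiced_runs(f0):
    """Group consecutive voiced frames (f0 > 0) into (start, end) times."""
    runs, start = [], None
    for i, v in enumerate(f0):
        if v > 0 and start is None:
            start = i
        elif v <= 0 and start is not None:
            runs.append((start * FRAME_HOP_S, i * FRAME_HOP_S))
            start = None
    if start is not None:
        runs.append((start * FRAME_HOP_S, len(f0) * FRAME_HOP_S))
    return runs

def annotate(text_units, f0):
    """Pair each text unit with one voiced run, in order (simplifying
    assumption: a unit may really span several runs)."""
    return [{"unit": u, "start": round(s, 3), "end": round(e, 3)}
            for u, (s, e) in zip(text_units, voiced_runs(f0))]

# Hypothetical F0 track: zero means unvoiced.
f0 = [0, 0, 210, 215, 220, 0, 0, 180, 175, 0]
with open("annotation.json", "w") as f:
    json.dump(annotate(["hello", "world"], f0), f, indent=2)
```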