System and method for audio-visual speech recognition

Abstract

Disclosed herein is a method of performing speech recognition using audio and visual information, where the visual information provides data related to a person's face. Image preprocessing identifies regions of interest in the visual information, which are then combined with the audio data before being processed by a speech recognition engine.

Claims

1. A method of performing speech recognition at a distance comprising: obtaining audio information from a plurality of microphones positioned at a distance from a speaker; obtaining visual information from an image capture device; pre-processing the visual information, wherein pre-processing comprises: using a recurrent deep neural network model to identify relevant image features by classifying a pixel in a first frame of the visual information using features from a pixel in the same location from a prior frame of the visual information; identifying a region of interest in the visual information; and using the region of interest, aligning the visual information with the acoustic information and classifying individual frames in the visual information into context-dependent phonetic states; combining the audio information and visual information within a single deep neural network classifier; and performing a speech recognition process on the combined audio information and visual information, wherein the speech recognition process comprises: generating observation probabilities for context-dependent phonetic states using a joint audio-visual observation model, and conducting a search in a standard speech recognition engine using the observation probabilities.

2. The method of claim 1, wherein pre-processing further comprises: determining whether a speaker is present in the visual information.

3. The method of claim 1, wherein the standard speech recognition engine is a WFST-based speech recognition engine.

4. The method of claim 1, wherein pre-processing further comprises: defining the region of interest by generating a probability distribution of classes for each pixel in the visual information.

5. The method of claim 1, further comprising: identifying a location of the speaker relative to the image capture device.

6. The method of claim 1, wherein pre-processing further comprises: scaling the region of interest.

7. The method of claim 1, wherein combining the audio information and visual information comprises: combining the audio information and the visual information in an output layer of the single deep neural network classifier.

8. The method of claim 1, wherein combining the audio information and visual information comprises: combining the audio information and the visual information in a second hidden layer of the single deep neural network classifier.

Description

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

(1) FIG. 1 is a flowchart depicting feature detection.

(2) FIG. 2 is an example of visual speech processing.

(3) FIG. 3 is a diagram depicting two different processes for combining audio and visual information.

(4) FIG. 4 shows the network structure for a recurrent neural network, combining audio and visual features.

DETAILED DESCRIPTION

(5) According to embodiments of the present invention is a method of performing speech recognition at a distance. In one embodiment, the method provides for robust distance speech recognition that leverages multiple microphones as well as visual information from an image sensor. The method utilizes a joint model for audio-visual speech recognition based on Deep Neural Networks (DNNs), in which the visual information informs both the beam-forming process and the speech recognition model to realize accurate and robust speech recognition even at a distance.

(6) DNNs have been shown to obtain state-of-the-art performance across many image and speech processing tasks. However, there has been little exploration of how best to: (1) effectively model temporal changes within these models; and (2) combine information from different modalities, such as audio and visual information, within a single DNN structure.

(7) According to embodiments of the present invention, two main steps of the method comprise image preprocessing and audio-visual feature combination for speech recognition. In the first step, image preprocessing is performed to provide context about the information provided in the image. For example, in an image containing a person's face, image preprocessing can: (1) determine the location of the person (and the person's mouth) relative to the image capture/microphone system, and (2) extract the most relevant features from the image to help inform the speech recognition process. Image preprocessing is typically performed using hand-crafted filters. However, the method of the present invention uses DNNs for image processing, learning the most relevant image features for the speech recognition task directly from the data collected.

(8) As such, image preprocessing is based on recurrent DNN filters. An overview of this approach is shown in FIG. 1. In this approach, each pixel in the captured image data (i.e. the input image) is classified as belonging to one or more classes of interest, for example, head, eye, or mouth of the person. A single DNN classifier (approx. 64×64 in size) is applied to each pixel in the input image and the output of the classifier is used to generate a probability distribution of classes for each pixel in the image. A region-of-interest (ROI) for a specific class label can then be defined, maximizing the probability of that class being within the ROI.
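The ROI selection described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the implementation of the disclosed method: the class ordering, the fixed ROI size, the choice of summed class probability inside a window as the quantity being maximized, and the names `softmax` and `best_roi` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical class ordering; the text names head, eye and mouth only as examples.
CLASSES = ("head", "eye", "mouth")

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def best_roi(class_prob, roi_h, roi_w):
    """Return (top, left) of the roi_h x roi_w window with the largest summed probability.

    class_prob is an (H, W) map of per-pixel probabilities for one class.
    """
    h, w = class_prob.shape
    # Integral image with a zero row and column prepended, so each window sum is O(1).
    integral = np.zeros((h + 1, w + 1))
    integral[1:, 1:] = class_prob.cumsum(axis=0).cumsum(axis=1)
    sums = (integral[roi_h:, roi_w:] - integral[:-roi_h, roi_w:]
            - integral[roi_h:, :-roi_w] + integral[:-roi_h, :-roi_w])
    top, left = np.unravel_index(np.argmax(sums), sums.shape)
    return int(top), int(left)

# Usage: stand-in classifier outputs (logits) for a 20 x 30 image, with the
# "mouth" class scoring high in a 4 x 8 patch.
logits = np.zeros((20, 30, len(CLASSES)))
logits[12:16, 10:18, CLASSES.index("mouth")] = 5.0
probs = softmax(logits)                      # per-pixel class distribution
mouth_roi = best_roi(probs[..., CLASSES.index("mouth")], roi_h=4, roi_w=8)
```

In this sketch the classifier itself is replaced by synthetic logits; only the step from per-pixel class distributions to an ROI is shown.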

(9) To improve the consistency across neighboring frames in an image stream, rather than just using knowledge from the current frame for pixel-level classification, in one embodiment a recurrent DNN model is utilized, where information from the previous frame (t_{i-1}) is used when classifying the same pixel location in frame (t_i). By introducing the recurrent model, the robustness of the system improves significantly due to the image tracking capabilities that are introduced.
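A minimal sketch of such a recurrence is given below in Python with NumPy. It assumes a simple Elman-style update in which the hidden features computed for a pixel in the previous frame feed into the classification of the same pixel in the current frame. The tanh nonlinearity, the layer sizes, the random weights, and the names `recurrent_pixel_step` and `classify_stream` are assumptions for illustration; the text does not specify the form of the recurrent model.

```python
import numpy as np

def recurrent_pixel_step(x_t, h_prev, w_in, w_rec, w_out):
    """One time step of a per-pixel recurrent classifier.

    x_t:    (H, W, F) input features for the current frame
    h_prev: (H, W, D) hidden features of the same pixel locations in the prior frame
    Returns (class_probs, h_t) with shapes (H, W, C) and (H, W, D).
    """
    h_t = np.tanh(x_t @ w_in + h_prev @ w_rec)
    logits = h_t @ w_out
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs, h_t

def classify_stream(frames, w_in, w_rec, w_out):
    """Classify every pixel of every frame, carrying hidden state across frames."""
    height, width, _ = frames[0].shape
    h = np.zeros((height, width, w_rec.shape[0]))           # no history before frame 0
    outputs = []
    for x_t in frames:
        probs, h = recurrent_pixel_step(x_t, h, w_in, w_rec, w_out)
        outputs.append(probs)
    return outputs

# Usage with random stand-in weights: 3 input features (e.g. YUV), 8 hidden units, 3 classes.
rng = np.random.default_rng(0)
w_in = rng.normal(size=(3, 8))
w_rec = rng.normal(size=(8, 8))
w_out = rng.normal(size=(8, 3))
frames = [rng.random((4, 5, 3)) for _ in range(3)]
stream_probs = classify_stream(frames, w_in, w_rec, w_out)
```

Because the hidden state is indexed by pixel location, the same frame can be classified differently depending on what preceded it, which is the property the paragraph attributes to the recurrent model.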

(10) The approach is able to determine whether a person is present in the image data and to provide the position of the person relative to the image capture device. Further, the method extracts a region of interest around the person's mouth that can subsequently be used for audio-visual speech recognition. As a person having skill in the art will appreciate, the effectiveness of this approach depends in part on image resolution and DNN model structures, which can vary depending on the application.

(11) Once an ROI around the mouth of the person is detected, the region is scaled to an appropriate size and combined with similar mouth ROIs in neighboring frames. For example, FIG. 2 shows a window of 5 frames (t_{i-2}, t_{i-1}, t_i, t_{i+1}, t_{i+2}) with the ROI isolated. By performing alignment with the acoustic data, a DNN classifier for the image stream can be trained to classify individual frames into context-dependent phonetic states as used by the acoustic model.
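The scaling and windowing steps can be sketched as below in Python with NumPy. The nearest-neighbour resize, the target ROI size, the clamping of indices at the ends of the stream, and the names `resize_nearest` and `stack_window` are assumptions chosen to keep the example short; the text only states that the ROI is scaled and combined with the ROIs of neighboring frames.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) image to (out_h, out_w, C)."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h     # source row for each output row
    cols = np.arange(out_w) * w // out_w     # source column for each output column
    return img[rows[:, None], cols]

def stack_window(rois, center, context=2):
    """Concatenate the ROIs of frames center-context .. center+context into one vector.

    Indices that fall before the first or after the last frame are clamped
    (an assumption; other padding schemes are possible).
    """
    idx = np.clip(np.arange(center - context, center + context + 1), 0, len(rois) - 1)
    return np.concatenate([rois[i].ravel() for i in idx])

# Usage: mouth crops of varying size, scaled to a common 16 x 24 ROI, then a
# 5-frame window around frame 3 is flattened into one visual feature vector.
rng = np.random.default_rng(0)
crops = [rng.random((20 + t, 30 + t, 3)) for t in range(6)]
rois = [resize_nearest(c, 16, 24) for c in crops]
visual_features = stack_window(rois, center=3, context=2)
```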

(12) Once the image preprocessing process is complete, the method can utilize one of several methods to combine audio and visual information within a single DNN classifier. Given acoustic features from one or more microphones and visual features (YUV pixel values) for the ROI of the mouth over a specific time window, the classifier will be trained to generate the observation probabilities for the speech recognition engine. During training, acoustic frames will be automatically aligned and labeled with a specific context-dependent phonetic state. During the speech recognition process, audio and image frames will be captured, feature extraction will be performed, and then a joint audio-visual observation model will be applied to generate the observation probabilities for the context-dependent phonetic states (i.e. HMM state likelihoods) used within the acoustic model. A search is then conducted as in a standard audio-only speech recognition engine. Examples of combining the audio and visual information can include early combination and late combination. Further, independent or joint training can be utilized.
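To illustrate how per-frame observation probabilities feed a search, the sketch below runs a plain Viterbi search over a small HMM in Python with NumPy. This is a toy stand-in, not the search of a full speech recognition engine: the two-state model, the transition and initial probabilities, the observation values, and the name `viterbi` are all assumptions for illustration.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely state sequence given per-frame observation log-probabilities.

    log_obs:   (T, S) log observation probabilities, e.g. from a joint audio-visual model
    log_trans: (S, S) log transition probabilities, row = from-state, column = to-state
    log_init:  (S,)   log initial-state probabilities
    """
    n_frames, n_states = log_obs.shape
    score = log_init + log_obs[0]
    backptr = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans    # cand[i, j]: best path ending in i, then i -> j
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]
    # Trace back from the best final state.
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Usage: hypothetical observation probabilities for 3 frames and 2 states.
obs = np.array([[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])   # states tend to persist
init = np.array([0.5, 0.5])
best_path = viterbi(np.log(obs), np.log(trans), np.log(init))
```

In this example the middle frame slightly favors state 1 on its own, but the transition model makes staying in state 0 the better overall path, which shows why the observation probabilities are passed to a search rather than decoded frame by frame.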

(13) An example of two different network structures is shown in FIG. 3. In the first model (Late Combination), information from the audio and visual streams is not shared until the output layer of the model. This structure differs significantly from the Early Combination model, in which information across the audio and visual streams is shared in the second hidden layer of the DNN. Performance of the speech recognition will be impacted by the manner in which the information is combined. In addition to the model structures shown in FIG. 3, alternative embodiments use recurrent neural networks as shown in FIG. 4. Other factors that affect performance of the system include the number and size of hidden layers used within the model, the size of any audio- or image-specific sub-networks within the model, and the number of recurrent connections.
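The difference between the two structures can be sketched as forward passes in Python with NumPy. The sketch assumes two hidden layers per stream in the late model, ReLU activations, a softmax output over phonetic states, and arbitrary layer sizes; none of these are stated in the text, and the names `late_combination`, `early_combination`, and `make_params` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def late_combination(audio, visual, p):
    """Streams stay separate through both hidden layers and meet at the output layer."""
    a = relu(relu(audio @ p["a1"]) @ p["a2"])
    v = relu(relu(visual @ p["v1"]) @ p["v2"])
    return softmax(np.concatenate([a, v], axis=-1) @ p["out"])

def early_combination(audio, visual, p):
    """Streams have separate first hidden layers and share the second hidden layer."""
    a = relu(audio @ p["a1"])
    v = relu(visual @ p["v1"])
    shared = relu(np.concatenate([a, v], axis=-1) @ p["shared2"])
    return softmax(shared @ p["out"])

def make_params(rng, n_audio, n_visual, hidden, n_states, early):
    """Random stand-in weights; a trained model would supply these."""
    p = {"a1": rng.normal(size=(n_audio, hidden)) * 0.1,
         "v1": rng.normal(size=(n_visual, hidden)) * 0.1}
    if early:
        p["shared2"] = rng.normal(size=(2 * hidden, hidden)) * 0.1
        p["out"] = rng.normal(size=(hidden, n_states)) * 0.1
    else:
        p["a2"] = rng.normal(size=(hidden, hidden)) * 0.1
        p["v2"] = rng.normal(size=(hidden, hidden)) * 0.1
        p["out"] = rng.normal(size=(2 * hidden, n_states)) * 0.1
    return p

# Usage: a batch of 4 frames, 40 acoustic features, 120 visual features, 10 phonetic states.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 40))
visual = rng.normal(size=(4, 120))
late_probs = late_combination(audio, visual, make_params(rng, 40, 120, 32, 10, early=False))
early_probs = early_combination(audio, visual, make_params(rng, 40, 120, 32, 10, early=True))
```

Both functions return one probability distribution over phonetic states per frame; they differ only in the layer at which the audio and visual activations are concatenated.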

(14) Leveraging DNN methods for both the image preprocessing and audio-visual speech recognition components enables use of a consistent architecture throughout the system and integration into a WFST-based speech recognition engine.

(15) While the disclosure has been described in detail and with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the embodiments. Thus, it is intended that the present disclosure cover the modifications and variations of this disclosure provided they come within the scope of the appended claims and their equivalents.

CPC classification

G10L15/02, G06T2207/10004, G06T2207/10016, G06T2207/20076, G06T2207/30201, G06F18/24, G06T7/11, G10L15/25, G10L15/16, G06T2207/10024, G06T2207/20084, G06F18/253, G06T7/73, G06V10/806, G06T2207/20081, G06V40/168

International classification

G10L15/00, G06K9/62, G10L15/26, G10L15/16, G06K9/00, G10L15/02, G06T7/11, G10L15/25