Systems and Methods for Assisted Translation and Lip Matching for Voice Dubbing
20230039248 · 2023-02-09
Inventors
CPC classification
G10L15/02 (PHYSICS)
G10L15/25 (PHYSICS)
G10L13/02 (PHYSICS)
G10L21/0356 (PHYSICS)
G10L2015/025 (PHYSICS)
G10L2021/105 (PHYSICS)
International classification
G10L13/02 (PHYSICS)
G10L15/02 (PHYSICS)
Abstract
Systems and methods for generating candidate translations for use in creating synthetic or human-acted voice dubbings, aiding human translators in generating translations that match the corresponding video, automatically grading how well a candidate translation matches the corresponding video, suggesting modifications to the speed and/or timing of the translated text to improve the grading of a candidate translation, and suggesting modifications to the voice dubbing and/or video to improve the grading of a candidate translation. In that regard, the present technology may be used to fully automate the process of generating lip-matched translations and associated voice dubbings, or as an aid for human-in-the-loop processes that may reduce or eliminate the time and effort required from translators, adapters, voice actors, and/or audio editors to generate voice dubbings.
Claims
1. A computer-implemented method comprising: generating, using one or more processors of a processing system, a synthesized audio clip based on a sequence of text using a text-to-speech synthesizer, the synthesized audio clip comprising synthesized speech corresponding to the sequence of text; and for each given video frame of a video clip comprising a plurality of video frames: processing the video clip, using the one or more processors, to obtain a given image based on the given video frame; processing the synthesized audio clip, using the one or more processors, to obtain a given segment of audio data corresponding to the given video frame; processing the given segment of audio data, using the one or more processors, to generate a given audio spectrogram image; and generating, using the one or more processors, a frame-level speech-mouth consistency score for the given video frame based on the given image and the given audio spectrogram image using a speech-mouth consistency model.
2. The method of claim 1, further comprising generating, using the one or more processors, an overall score based at least in part on the generated frame-level speech-mouth consistency score corresponding to each given video frame of the plurality of video frames.
3. The method of claim 1, further comprising: identifying, using the one or more processors, a set of the generated frame-level speech-mouth consistency scores corresponding to a given word of the sequence of text; and generating, using the one or more processors, a word-level speech-mouth consistency score for the given word based on the identified set of the generated frame-level speech-mouth consistency scores.
4. The method of claim 3, further comprising generating, using the one or more processors, an overall score based at least in part on the generated word-level speech-mouth consistency score corresponding to each given word of the sequence of text.
5. The method of claim 1, further comprising generating, using the one or more processors, a duration score based on a comparison of a length of the synthesized audio clip and a length of the video clip.
6. The method of claim 1, further comprising: processing, using the one or more processors, the video clip to identify a set of one or more mouth-shapes-of-interest from a speaker visible in the video clip; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlating, using the one or more processors, the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
7. The method of claim 1, wherein the video clip further comprises original audio data, and the method further comprises: processing, using the one or more processors, the original audio data to identify one or more words or phonemes being spoken by a speaker recorded in the original audio data; generating, using the one or more processors, a set of one or more mouth-shapes-of-interest based on the identified one or more words or phonemes; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlating, using the one or more processors, the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
8. The method of claim 1, further comprising: processing, using the one or more processors, a transcript of the video clip to identify one or more words or phonemes; generating, using the one or more processors, a set of one or more mouth-shapes-of-interest based on the identified one or more words or phonemes; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlating, using the one or more processors, the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
9. The method of claim 1, further comprising: processing, using the one or more processors, the synthesized audio clip to identify one or more words or phonemes being spoken in the synthesized speech of the synthesized audio clip; generating, using the one or more processors, a set of one or more mouth-shapes-of-interest based on the identified one or more words or phonemes; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlating, using the one or more processors, the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
10. The method of claim 1, further comprising: processing, using the one or more processors, the sequence of text to identify one or more words or phonemes; generating, using the one or more processors, a set of one or more mouth-shapes-of-interest based on the identified one or more words or phonemes; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlating, using the one or more processors, the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
11. The method of claim 2, further comprising: selecting the synthesized audio clip, using the one or more processors, based on the overall score satisfying a predetermined criteria; combining, using the one or more processors, the synthesized audio clip with the video clip to generate a modified video; and outputting, using the one or more processors, the modified video.
12. The method of claim 4, further comprising: selecting the synthesized audio clip, using the one or more processors, based on the overall score satisfying a predetermined criteria; combining, using the one or more processors, the synthesized audio clip with the video clip to generate a modified video; and outputting, using the one or more processors, the modified video.
13. A system comprising: a memory; and one or more processors coupled to the memory and configured to: using a text-to-speech synthesizer, generate a synthesized audio clip based on a sequence of text, the synthesized audio clip comprising synthesized speech corresponding to the sequence of text; and for each given video frame of a video clip comprising a plurality of video frames: process the video clip to obtain a given image based on the given video frame; process the synthesized audio clip to obtain a given segment of audio data corresponding to the given video frame; process the given segment of audio data to generate a given audio spectrogram image; and using a speech-mouth consistency model, generate a frame-level speech-mouth consistency score for the given video frame based on the given image and the given audio spectrogram image.
14. The system of claim 13, wherein the one or more processors are further configured to generate an overall score based at least in part on the generated frame-level speech-mouth consistency score corresponding to each given video frame of the plurality of video frames.
15. The system of claim 13, wherein the one or more processors are further configured to: identify a set of the generated frame-level speech-mouth consistency scores corresponding to a given word of the sequence of text; and generate a word-level speech-mouth consistency score for the given word based on the identified set of the generated frame-level speech-mouth consistency scores.
16. The system of claim 15, wherein the one or more processors are further configured to generate an overall score based at least in part on the generated word-level speech-mouth consistency score corresponding to each given word of the sequence of text.
17. The system of claim 13, wherein the one or more processors are further configured to generate a duration score based on a comparison of a length of the synthesized audio clip and a length of the video clip.
18. The system of claim 13, wherein the one or more processors are further configured to: process the video clip to identify a set of one or more mouth-shapes-of-interest from a speaker visible in the video clip; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlate the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
19. The system of claim 13, wherein the video clip further comprises original audio data, and wherein the one or more processors are further configured to: process the original audio data to identify one or more words or phonemes being spoken by a speaker recorded in the original audio data; generate a set of one or more mouth-shapes-of-interest based on the identified one or more words or phonemes; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlate the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
20. The system of claim 13, wherein the one or more processors are further configured to: process a transcript of the video clip to identify one or more words or phonemes; generate a set of one or more mouth-shapes-of-interest based on the identified one or more words or phonemes; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlate the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
21. The system of claim 13, wherein the one or more processors are further configured to: process the synthesized audio clip to identify one or more words or phonemes being spoken in the synthesized speech of the synthesized audio clip; generate a set of one or more mouth-shapes-of-interest based on the identified one or more words or phonemes; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlate the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
22. The system of claim 13, wherein the one or more processors are further configured to: process the sequence of text to identify one or more words or phonemes; generate a set of one or more mouth-shapes-of-interest based on the identified one or more words or phonemes; and for each given mouth-shape-of-interest of the set of one or more mouth-shapes-of-interest, correlate the given mouth-shape-of-interest to one or more video frames of the plurality of video frames.
23. The system of claim 14, wherein the one or more processors are further configured to: select the synthesized audio clip based on the overall score satisfying a predetermined criteria; combine the synthesized audio clip with the video clip to generate a modified video; and output the modified video.
24. The system of claim 16, wherein the one or more processors are further configured to: select the synthesized audio clip based on the overall score satisfying a predetermined criteria; combine the synthesized audio clip with the video clip to generate a modified video; and output the modified video.
25. A non-transitory computer readable medium comprising instructions which, when executed, cause one or more processors to perform a method comprising: generating a synthesized audio clip based on a sequence of text using a text-to-speech synthesizer, the synthesized audio clip comprising synthesized speech corresponding to the sequence of text; and for each given video frame of a video clip comprising a plurality of video frames: processing the video clip to obtain a given image based on the given video frame; processing the synthesized audio clip to obtain a given segment of audio data corresponding to the given video frame; processing the given segment of audio data to generate a given audio spectrogram image; and generating a frame-level speech-mouth consistency score for the given video frame based on the given image and the given audio spectrogram image using a speech-mouth consistency model.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
DETAILED DESCRIPTION
[0027] The present technology will now be described with respect to the following exemplary systems and methods.
Example Systems
[0028]
[0029] In the example of
[0030] Further, in some aspects of the technology, the text-to-speech synthesizer 114 may be configured not only to generate synthesized speech corresponding to the input text, but also to allow a user or the processing system to specify one or more aspects of how the input text will be synthesized. For example, in some aspects of the technology, the text-to-speech synthesizer 114 may be configured to allow a user or the processing system to specify: (i) that a pause of a certain duration should be inserted between selected words or phonemes from the input text; (ii) what speech rate should be used when synthesizing the input text, or a specific portion of the input text; and/or (iii) how long the synthesizer should take in pronouncing a particular phoneme or word from the input text.
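By way of non-limiting illustration, the sketch below shows one possible way such controls might be expressed as SSML-style markup passed to a synthesizer. The markup tags shown are drawn from the SSML specification, but support varies by synthesizer, and the function and parameter names are illustrative placeholders rather than part of the described technology.

```python
# Illustrative sketch: expressing pause, speech-rate, and per-word duration hints
# as SSML-style markup. Actual synthesizer APIs and supported tags vary.

def build_markup(words, pause_after=None, rate="medium", stretch=None):
    """Build an SSML-like string for a list of words.

    pause_after: optional dict {word_index: pause_ms} inserting a break after a word.
    rate: overall speaking rate (e.g. "slow", "medium", "fast").
    stretch: optional dict {word_index: seconds} hinting how long a word should take
             (per-word duration hints are not supported by all synthesizers).
    """
    pause_after = pause_after or {}
    stretch = stretch or {}
    parts = []
    for i, word in enumerate(words):
        if i in stretch:
            # Duration hints are synthesizer-specific; shown here only as an example.
            parts.append(f'<prosody duration="{stretch[i]}s">{word}</prosody>')
        else:
            parts.append(word)
        if i in pause_after:
            parts.append(f'<break time="{pause_after[i]}ms"/>')
    return f'<speak><prosody rate="{rate}">{" ".join(parts)}</prosody></speak>'

# Example: pause 300 ms after the second word, slow overall rate.
print(build_markup(["the", "brown", "bird"], pause_after={1: 300}, rate="slow"))
```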
[0031] Processing system 102 may be resident on a single computing device. For example, processing system 102 may be a server, personal computer, or mobile device, and the models and utilities described herein may thus be local to that single computing device. Similarly, processing system 102 may be resident on a cloud computing system or other distributed system, such that one or more of the models and/or utilities described herein are distributed across two or more different physical computing devices. Likewise, in some aspects, one or more of the modules 112, 114, 116, 118 and 120 may be implemented on a computing device, such as a user computing device or personal computer, and others of the modules may be implemented on a server accessible from the computing device.
[0032] In this regard,
[0033] The processing systems described herein may be implemented on any type of computing device(s), such as any type of general computing device, server, or set thereof and may further include other components typically present in general purpose computing devices or servers. Likewise, the memory of such processing systems may be of any non-transitory type capable of storing information accessible by the processor(s) of the processing systems. For instance, the memory may include a non-transitory medium such as a hard-drive, memory card, optical disk, solid-state, tape memory, or the like. Computing devices suitable for the roles described herein may include different combinations of the foregoing, whereby different portions of the instructions and data are stored on different types of media.
[0034] In all cases, the computing devices described herein may further include any other components normally used in connection with a computing device such as a user interface subsystem. The user interface subsystem may include one or more user inputs (e.g., a mouse, keyboard, touch screen and/or microphone) and one or more electronic displays (e.g., a monitor having a screen or any other electrical device that is operable to display information). Output devices besides an electronic display, such as speakers, lights, and vibrating, pulsing, or haptic elements, may also be included in the computing devices described herein.
[0035] The one or more processors included in each computing device may be any conventional processors, such as commercially available central processing units (“CPUs”), graphics processing units (“GPUs”), tensor processing units (“TPUs”), etc. Alternatively, the one or more processors may be a dedicated device such as an ASIC or other hardware-based processor. Each processor may have multiple cores that are able to operate in parallel. The processor(s), memory, and other elements of a single computing device may be stored within a single physical housing, or may be distributed between two or more housings. Similarly, the memory of a computing device may include a hard drive or other storage media located in a housing different from that of the processor(s), such as in an external database or networked storage device. Accordingly, references to a processor or computing device will be understood to include references to a collection of processors or computing devices or memories that may or may not operate in parallel, as well as one or more servers of a load-balanced server farm or cloud-based system.
[0036] The computing devices described herein may store instructions capable of being executed directly (such as machine code) or indirectly (such as scripts) by the processor(s). The computing devices may also store data, which may be retrieved, stored, or modified by one or more processors in accordance with the instructions. Instructions may be stored as computing device code on a computing device-readable medium. In that regard, the terms "instructions" and "programs" may be used interchangeably herein. Instructions may also be stored in object code format for direct processing by the processor(s), or in any other computing device language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. By way of example, the programming language may be C#, C++, JAVA or another computer programming language. Similarly, any components of the instructions or programs may be implemented in a computer scripting language, such as JavaScript, PHP, ASP, or any other computer scripting language. Furthermore, any one of these components may be implemented using a combination of computer programming languages and computer scripting languages.
Example Methods
[0037]
[0038] The audio spectrogram image 304 shows a spectrogram for a period of time that corresponds to the given video frame. The audio spectrogram image 304 may represent all frequencies of the audio data corresponding to that period of time, or a subset thereof (e.g., the range of frequencies generally corresponding to human voice). Likewise, in some aspects of the technology, the audio spectrogram image 304 may represent audio data for any suitable period of time corresponding to the given video frame. For example, the audio spectrogram image 304 may represent audio data for some number of milliseconds preceding the display of the given video frame. Likewise, in some aspects, the audio spectrogram image 304 may represent audio data corresponding to some or all of the period of time during which the video frame is to be displayed. For example, for a video with 24 frames per second (“fps”) where a new frame is shown every 41.67 ms, the audio spectrogram may represent audio data corresponding to the 41.67 ms that the frame is to be displayed, the first 20 ms that the image is to be displayed, etc. Further, in some aspects, the audio spectrogram image 304 may represent audio data which begins n milliseconds before the display of the given video frame to m milliseconds after the display of the given video frame (where n and m may be the same or different). For example, for a 24 fps video, the audio spectrogram may span 20.83 ms before the frame is to be displayed to 20.83 ms after the frame is to be displayed.
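By way of non-limiting illustration, the sketch below shows one possible way to compute the audio window corresponding to a given video frame under the options just described. The function and parameter names are illustrative placeholders, and the "spanning" interpretation (a window centered on the start of the frame's display) is only one of several reasonable readings.

```python
# Illustrative sketch: computing the audio window (in milliseconds) that
# corresponds to a given video frame, under the options described above.

def frame_audio_window(frame_index, fps=24.0, mode="during", n_ms=20.83, m_ms=20.83):
    """Return (start_ms, end_ms) of audio corresponding to frame `frame_index`.

    mode="during":    audio spanning the time the frame is displayed.
    mode="preceding": the n_ms of audio immediately before the frame is displayed.
    mode="spanning":  from n_ms before to m_ms after the start of the frame's display.
    """
    frame_ms = 1000.0 / fps              # e.g. 41.67 ms at 24 fps
    start = frame_index * frame_ms
    if mode == "during":
        return start, start + frame_ms
    if mode == "preceding":
        return start - n_ms, start
    if mode == "spanning":
        return start - n_ms, start + m_ms
    raise ValueError(f"unknown mode: {mode}")

# Example: window spanning 20.83 ms on either side of the start of frame 10 of a 24 fps clip.
print(frame_audio_window(10, fps=24.0, mode="spanning"))
```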
[0039] Likewise, although the example of
[0040] As shown in
[0041]
[0042] In step 402, a processing system (e.g., processing system 102 or 202) extracts a first set of video frames from a given video. This may be all of the frames of the given video, or any subset thereof.
[0043] In step 404, the processing system identifies a second set of video frames from within the first set of video frames, each frame of the second set of video frames showing at least the mouth of a speaker. The processing system may make this identification in any suitable way. For example, the processing system may process each video frame in the first set using a first learned model configured to identify a speaker in a given sample of video, and a second learned model to identify a person’s mouth. Likewise, in some aspects of the technology, the processing system may identify the second set of video frames based on pre-assigned labels. In such a case, the pre-assigned labels may have been applied to the video frames in any suitable way. For example, in some aspects, the pre-assigned labels may have been added to each frame of the first set of frames by human annotators. Further, in some aspects, the pre-assigned labels may be added by another processing system (e.g., one configured to identify speakers and their mouths in each frame of the first set of frames, or in the original video).
[0044] In step 406, for each given frame in the second set of frames, the processing system extracts an image from the given frame. As explained above, these images may be the entire given frame or a portion thereof (e.g., a portion showing only the speaker, the speaker’s face, the speaker’s lips, etc.). Likewise, in some aspects of the technology, the processing system may be configured to extract multiple images from the given frame (e.g., one representing the entire given frame, one showing only the speaker, one showing only the speaker’s face, one showing only the speaker’s lips, etc.).
[0045] In step 408, for each given frame in the second set of frames, the processing system generates an audio spectrogram image representing a period of audio data of the given video, the period corresponding to the given frame. As explained above, the audio data processed for each given frame may be from any suitable period of time corresponding to the given frame (e.g., a period of time preceding display of the given frame, a period of time during which the given frame would be displayed, a period of time spanning before and after the frame is to be displayed, etc.).
[0046] In step 410, for each given frame in the second set of frames, the processing system generates a positive training example comprising the image extracted from the given frame, the audio spectrogram image corresponding to the given frame, and a positive training score. As noted above, the positive training score may be based on any suitable scoring paradigm (e.g., -1.0 to 1.0, 0 to 1.0, A to F, textual labels, etc.). As also noted above, where the image in a positive training example is not isolated to the speaker, the training example will further comprise a label identifying the speaker and/or the speaker’s face or mouth.
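By way of non-limiting illustration, the sketch below assembles positive training examples per steps 406-410, assuming `extract_mouth_image` and `make_spectrogram` helpers exist. Those helpers, and the data structure used for an example, are illustrative placeholders rather than part of the described technology.

```python
# Illustrative sketch of steps 406-410: assemble one positive training example per
# frame in the second set, pairing the frame's image with its audio spectrogram image.

def build_positive_examples(second_set_frames, audio, extract_mouth_image,
                            make_spectrogram, fps=24.0, positive_score=1.0):
    """second_set_frames: iterable of (frame_index, frame) pairs showing a speaker's mouth."""
    frame_ms = 1000.0 / fps
    examples = []
    for frame_index, frame in second_set_frames:
        start_ms = frame_index * frame_ms
        examples.append({
            "image": extract_mouth_image(frame),                                  # step 406
            "spectrogram": make_spectrogram(audio, start_ms, start_ms + frame_ms),  # step 408
            "score": positive_score,                                              # step 410
            "start_ms": start_ms, "end_ms": start_ms + frame_ms,
        })
    return examples
```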
[0047] In step 412, the processing system generates a set of negative training examples, each negative training example of the set of negative training examples being generated by substituting the image or the audio spectrogram image of one of the positive training examples with the image or the audio spectrogram image of another one of the positive training examples, and each negative training example including a negative training score. This may be done in any suitable way. For example, in some aspects of the technology, negative training examples may be generated by randomly selecting a pair of positive training examples, and swapping the audio spectrogram images for the selected positive training examples to generate a pair of negative training examples.
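By way of non-limiting illustration, the sketch below shows one possible realization of step 412, assuming each positive training example is stored as a dict like the ones produced by the preceding sketch. The field names and the choice of a -1.0 negative score are illustrative only.

```python
import random

# Illustrative sketch of step 412: generating negative training examples by
# swapping audio spectrogram images between randomly selected positive examples.

def make_negative_examples(positive_examples, negative_score=-1.0, num_pairs=None):
    """Create negative examples by swapping spectrograms between random pairs."""
    examples = list(positive_examples)
    random.shuffle(examples)
    if num_pairs is None:
        num_pairs = len(examples) // 2
    negatives = []
    for i in range(num_pairs):
        a, b = examples[2 * i], examples[2 * i + 1]
        # Swapping spectrograms yields two mismatched (image, audio) pairs.
        negatives.append({"image": a["image"], "spectrogram": b["spectrogram"],
                          "score": negative_score})
        negatives.append({"image": b["image"], "spectrogram": a["spectrogram"],
                          "score": negative_score})
    return negatives
```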
[0048] Likewise, to avoid the possibility that two randomly selected positive training examples may be too visually similar (e.g., the speaker’s lips forming the same viseme), the processing system may be configured to identify the phonemes being spoken in each positive training example, and to avoid swapping audio spectrograms which have phonemes that tend to correlate to similar lip shapes. For example, in some aspects of the technology, the processing system may be configured to identify the phonemes represented in the audio spectrogram for a given positive training example from a pre-existing transcript corresponding to the same period of time represented by the audio spectrogram. In addition, rather than identifying phonemes from a pre-existing transcript, the processing system may also be configured to process the audio spectrogram using an automated speech recognition (“ASR”) utility to identify the words and/or phonemes being spoken in each positive training example.
[0049] Similarly, the processing system may be configured to analyze lip shapes, facial features, and/or facial landmarks in the images of each positive training example, and to avoid swapping audio spectrograms for examples having lip shapes, facial features, and/or facial landmarks that are deemed too similar. In some aspects of the technology, the processing system may be configured to identify lip shapes, facial features, and/or facial landmarks by processing the images using one or more facial landmark detection utilities. Likewise, in some aspects, the processing system may be configured to identify lip shapes, facial features, and/or facial landmarks based on pre-existing labels (e.g., assigned by human annotators, or by a different processing system).
[0050] In step 414, which is optional, the processing system may be configured to generate one or more degraded training examples based on each given positive training example of a set of positive training examples, each degraded training example comprising the image from the given positive training example, an audio spectrogram image representing a period of audio data of the given video that is shifted by a predetermined amount of time relative to the period represented by the audio spectrogram image of the given positive training example, and a degraded training score that is less than the training score of the given positive training example. For example, the processing system may be configured to generate a first set of degraded training examples for each positive training example in which each degraded training example’s audio spectrogram image begins 30 ms later than the positive training example (and lasts the same duration), and the training score for each degraded training example is reduced by a discount factor (e.g., of 0.15/30 ms) to +0.85. Likewise, the processing system may be configured to generate a second set of degraded training examples for each positive training example in which each degraded training example’s audio spectrogram image begins 60 ms later than the positive training example (and lasts the same duration), and the training score for each degraded training example is reduced to +0.70. Similar sets may be created with 90 ms, 120 ms, 150 ms, and 180 ms shifts, and corresponding training scores of +0.55, +0.40, +0.25, and +0.10, respectively. Of course, any suitable discounting paradigm may be used, including ones that are nonlinear, ones based on predetermined scoring tables, etc. Such time-shifted degraded training examples may be useful for teaching the speech-mouth consistency model to recognize where a voice dubbing may not perfectly sync with a speaker’s lips, but yet may be close enough for a viewer to still consider it to be consistent. In that regard, based on the frame rate of the video, speech which precedes or lags video by less than the timing of one frame will generally be imperceptible to human viewers (e.g., a variance of +/-41.67 ms for 24 fps video). Moreover, in practice, some viewers may not begin to notice such misalignments until they approach 200 ms.
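By way of non-limiting illustration, the sketch below implements the time-shift and linear-discount example values from the preceding paragraph (30 ms steps with a 0.15 discount per 30 ms). The `spectrogram_fn` callable and the example data structure are illustrative stand-ins; any other shift schedule or discounting paradigm could be substituted.

```python
# Illustrative sketch of step 414: degraded training examples whose spectrogram
# window is shifted later in time, with a linearly discounted training score.

def make_degraded_examples(positive, audio, spectrogram_fn,
                           shifts_ms=(30, 60, 90, 120, 150, 180),
                           discount_per_30ms=0.15, base_score=1.0):
    """positive: dict with "image", "start_ms", "end_ms" (as in the earlier sketch).
    audio: full audio clip from which windows can be cut.
    spectrogram_fn: callable (audio, start_ms, end_ms) -> spectrogram image (stand-in).
    """
    degraded = []
    for shift in shifts_ms:
        score = base_score - discount_per_30ms * (shift / 30.0)   # +0.85, +0.70, ..., +0.10
        degraded.append({
            "image": positive["image"],
            "spectrogram": spectrogram_fn(audio,
                                          positive["start_ms"] + shift,
                                          positive["end_ms"] + shift),
            "score": score,
        })
    return degraded
```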
[0051] In step 416, which is also optional, the processing system may be configured to generate one or more modified training examples based on each given training example of a set of positive and negative training examples, each modified training example comprising a training score equal to that of the given training example, and one or both of: (i) an edited version of the image of the given training example; or (ii) an audio spectrogram generated from an edited version of the audio data from which the audio spectrogram image of the given training example was generated. The processing system may be configured to edit the image of a given training example in any suitable way. For example, the processing system may edit the image of a given training example by changing its brightness, color, contrast, sharpness and/or resolution, by adding pixel noise or shadow effects to the image, and/or by flipping the image horizontally to generate a mirror-image copy. Likewise, the processing system may be configured to edit the audio data of a given training example in any suitable way. For example, the processing system may edit the audio data of a given training example by changing its volume or pitch, by adding echo or other acoustic effects (e.g., to make the speech sound as though it is being delivered in a cave or large auditorium), by adding other background noise, etc. Training the speech-mouth consistency model using such modified training examples may help reduce the likelihood that the speech-mouth consistency will be confused by audio effects that change the sound of the audio data, but not the content of the speech.
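By way of non-limiting illustration, the sketch below shows a few of the image- and audio-side edits mentioned in step 416, using common library operations (a brightness change and horizontal flip for the image, a simple gain change for the audio). These particular augmentations are examples only; echo, background noise, pixel noise, and the other edits described above could be applied in a similar fashion.

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageOps

# Illustrative sketch of step 416: modified training examples that keep the
# original training score but perturb the image and/or the underlying audio.

def modify_image(image: Image.Image, brightness=1.2, mirror=True) -> Image.Image:
    out = ImageEnhance.Brightness(image).enhance(brightness)
    if mirror:
        out = ImageOps.mirror(out)        # horizontal flip (mirror-image copy)
    return out

def modify_audio(samples: np.ndarray, gain=0.5) -> np.ndarray:
    # A simple volume change; echo or other acoustic effects could be added similarly.
    return np.clip(samples * gain, -1.0, 1.0)
```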
[0052] In step 418, which is also optional, the processing system may be configured to generate one or more synthetic positive training examples based on each given positive training example of a set of positive training examples, each synthetic positive training example comprising the image and positive score of the given positive training example, and an audio spectrogram image based on a synthetic voice dubbing which reproduces the speech in the audio data from which the audio spectrogram image of the given positive training example was generated.
[0053] The processing system may be configured to generate the synthetic voice dubbing from a pre-existing transcript corresponding to the same period of time represented by the given positive training example’s audio spectrogram image. In addition, where a pre-existing transcript is not available, the processing system may also be configured to process the given positive training example’s audio spectrogram image using an ASR utility to identify the words or phonemes being spoken, and then may generate the synthetic voice dubbing based on those identified words or phonemes.
[0054] The training examples generated according to method 400 may be used to train the speech-mouth consistency model according to any suitable training protocol. In that regard, in some aspects of the technology, the speech-mouth consistency model may be trained using batches comprising positive training examples and negative training examples to generate an aggregate loss value, and one or more parameters of the speech-mouth consistency model may be modified between batches based on the aggregate loss value for the preceding batch. Likewise, in some aspects, the batches (or selected batches) may additionally include one or more of the optional types of training examples described with respect to steps 414-418 of
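By way of non-limiting illustration, the following is a minimal batch-training sketch consistent with the protocol just described, written against PyTorch. The model architecture is not specified here; `model(images, spectrograms)` is simply assumed to return one consistency score per example, and the loss function shown is only one possible choice.

```python
import torch

# Illustrative sketch: batch training of a speech-mouth consistency model.
# `model(images, spectrograms)` is assumed to return a score per example;
# the architecture itself is a placeholder.

def train_epoch(model, batches, optimizer, loss_fn=torch.nn.MSELoss()):
    model.train()
    for images, spectrograms, target_scores in batches:
        optimizer.zero_grad()
        predicted = model(images, spectrograms)      # frame-level consistency scores
        loss = loss_fn(predicted, target_scores)     # aggregate loss over the batch
        loss.backward()                              # one back-propagation step per batch
        optimizer.step()                             # parameters modified between batches
```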
[0055] In that regard,
[0056] In step 502, a processing system (e.g., processing system 102 or 202) generates a plurality of positive training examples. The positive training examples may be generated in any suitable way, including as described above with respect to steps 402-410 of
[0057] In step 504, the processing system generates a first set of negative training examples based on a first subset of the plurality of positive training examples. The first set of negative training examples may be generated in any suitable way. In that regard, the first set of negative training examples may be generated according to any of the options described with respect to step 412 of
[0058] In step 506, the processing system trains a first speech-mouth consistency model based on a first collection of positive training examples from the plurality of positive training examples and the first set of negative training examples. This training may be performed according to any suitable training protocol. For example, in some aspects of the technology, training may be done in a single batch with a single back-propagation step to update the parameters of the first speech-mouth consistency model. Likewise, in some aspects, the first collection of positive and negative training examples may be broken into multiple batches, with separate loss values being aggregated during each batch and used in separate back-propagation steps between each batch. Further, in all cases, any suitable loss values and loss functions may be employed to compare the training score of a given training example to the speech-mouth consistency score generated by the first speech-mouth consistency model for that given training example.
[0059] In step 508, the processing system generates a second set of negative training examples by swapping the images or audio spectrogram images of randomly selected pairs of positive training examples from a second subset of the plurality of positive training examples.
[0060] In step 510, the processing system generates a speech-mouth consistency score for each negative training example in the second set of negative training examples using the first speech-mouth consistency model (as updated in step 506).
[0061] In step 512, the processing system trains a second speech-mouth consistency model based on a second collection of positive training examples from the plurality of positive training examples and each negative training example of the second set of negative training examples for which the first speech-mouth consistency model generated a speech-mouth consistency score below a predetermined threshold value. In this way, step 512 will prevent the second speech-mouth consistency model from being trained using any negative training example which received a speech-mouth consistency score (from the first speech-mouth consistency model) indicating that its image and audio spectrogram image may in fact be consistent. Any suitable threshold value may be used in this regard. For example, for a scoring paradigm from -1.0 to 1.0, the processing system may be configured to use only those negative training examples which received a negative speech-mouth consistency score, or only those which received a score below 0.1, 0.2, 0.5, etc.
[0062] Although exemplary method 500 only involves a first and second speech-mouth consistency model for the sake of simplicity, it will be understood that steps 508-512 may be repeated one or more additional times. For example, the procedure of step 508 may be repeated to generate a third set of negative training examples, the second speech-mouth consistency model may be used according to step 510 to score each negative training example in the third set of negative training examples, and the procedure of step 512 may be repeated to train a third speech-mouth consistency model using those of the third set of negative training examples that scored below the predetermined threshold.
[0063] Further, in some aspects of the technology, the processing system may be configured to use a different predetermined threshold value in one or more successive passes through steps 508-512. For example, to account for the fact that the second speech-mouth consistency model is likely to do a better job of scoring the third set of negative training examples (than the first speech-mouth consistency model did in scoring the second set of negative training examples), the processing system may be configured to apply a lower (i.e., not as negative) predetermined threshold value so that the third speech-mouth consistency model will end up being trained on a broader and more nuanced set of negative training examples.
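By way of non-limiting illustration, the sketch below captures the filtering performed in steps 508-512: each new set of negative training examples is scored with the previously trained model, and only the examples scoring below the threshold are kept for training the next model. The `score_fn` and `train_fn` callables are illustrative placeholders.

```python
# Illustrative sketch of steps 508-512: filter the new negative examples through the
# previously trained model, then train the next model on positives plus kept negatives.

def filter_negatives(negatives, score_fn, threshold=0.0):
    """Keep negatives that the previous model scores below `threshold`.

    Negatives scored above the threshold may in fact be consistent (e.g. the same
    viseme paired with similar audio), so they are excluded from training.
    """
    return [ex for ex in negatives
            if score_fn(ex["image"], ex["spectrogram"]) < threshold]

def cascade_round(positives, raw_negatives, previous_model_score_fn, train_fn, threshold=0.0):
    kept = filter_negatives(raw_negatives, previous_model_score_fn, threshold)
    return train_fn(positives + kept)   # returns the next speech-mouth consistency model
```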
[0064]
[0065] In step 602, a processing system (e.g., processing system 102 or 202) generates a plurality of positive training examples and a plurality of negative training examples. These positive and negative training examples may be generated in any suitable way, including as described above with respect to steps 402-412 of
[0066] In step 604, the processing system generates a speech-mouth consistency score using the speech-mouth consistency model for each training example of a collection of positive training examples from the plurality of positive training examples and negative training examples from the plurality of negative training examples.
[0067] In step 606, the processing system generates one or more loss values based on the training score and the generated speech-mouth consistency score of: (i) each positive training example of the collection; and (ii) each negative training example of the collection for which the generated speech-mouth consistency score is below a predetermined threshold value. In this way as well, step 606 will prevent the speech-mouth consistency model from being trained using any negative training example which received a speech-mouth consistency score indicating that its image and audio spectrogram image may in fact be consistent. Here again, any suitable threshold value may be used in this regard. For example, for a scoring paradigm from -1.0 to 1.0, the processing system may be configured to only generate loss values for those negative training examples which received a negative speech-mouth consistency score, or only for those which received a score below 0.1, 0.2, 0.5, etc. Further, any suitable loss values and loss functions may be employed to compare the training score of a given training example to the speech-mouth consistency score generated by the speech-mouth consistency model for that given training example.
[0068] In step 608, the processing system modifies one or more parameters of the speech-mouth consistency model based on the generated one or more loss values. As above, the training set forth in steps 604-608 may be performed according to any suitable training protocol. For example, in some aspects of the technology, the scoring, generation of loss values, and modification of the speech-mouth consistency model may all be done in a single batch with a single back-propagation step. Likewise, in some aspects, the collection of positive and negative training examples may be broken into multiple batches, with separate loss values being aggregated during each batch and used in separate back-propagation steps between each batch. Further, in some aspects of the technology, steps 604-608 may be repeated for successive batches of training examples, with a different predetermined threshold value used as training continues. For example, to account for the fact that the speech-mouth consistency model’s predictions are expected to improve the more it is trained, the processing system may be configured to apply lower predetermined threshold values to successive batches.
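By way of non-limiting illustration, the sketch below shows one possible way the thresholded loss of steps 604-608 might be computed: every positive example contributes to the loss, while a negative example contributes only if its generated score falls below the threshold. The tensor layout and the use of a mean-squared-error loss are illustrative choices, not requirements.

```python
import torch

# Illustrative sketch of steps 604-608: compute loss over all positives, but only over
# negatives whose generated speech-mouth consistency score falls below the threshold.

def masked_loss(predicted, targets, is_negative, threshold=0.0):
    """predicted, targets: float tensors of scores; is_negative: bool tensor."""
    # Keep every positive; keep a negative only if its predicted score < threshold.
    keep = (~is_negative) | (predicted < threshold)
    if keep.sum() == 0:
        return torch.tensor(0.0, requires_grad=True)
    return torch.nn.functional.mse_loss(predicted[keep], targets[keep])
```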
[0069] As already mentioned, the speech-mouth consistency models of the present technology may be used to more efficiently generate translations and associated lip-matched voice dubbings. In that regard, as will be described further below, the speech-mouth consistency models described herein can be integrated into systems and methods for automatically generating candidate translations for use in creating synthetic or human-acted voice dubbings, aiding human translators in generating translations that match the corresponding video, automatically grading how well a candidate translation matches the corresponding video, suggesting modifications to the speed and/or timing of the translated text to improve the grading of a candidate translation, and suggesting modifications to the voice dubbing and/or video to improve the grading of a candidate translation. Further, the present technology may be used to fully automate the process of generating lip-matched translations and associated voice dubbings, or as an aid for HITL processes that may reduce (or eliminate) the amount of time and effort needed from translators, adapters, voice actors, and/or audio editors to generate voice dubbings.
[0070] In that regard,
[0071] This example also assumes that the processing system has received a video clip from the video (e.g., movie, television show, etc.), and obtained an image from each given video frame of the plurality of video frames in the video clip. The video clip comprises a plurality of video frames corresponding to a period of time in which the given sentence being translated was spoken in the video’s original dialogue.
[0072] Further, this example assumes that the processing system processes the voice dubbing (e.g., synthesized audio clip, human-acted audio clip) to generate a given segment of audio data corresponding to each given video frame, and further processes each given segment of audio data to generate a corresponding audio spectrogram image. However, in some aspects of the technology, a separate processing system may be configured to segment the voice dubbing, and/or to generate the corresponding audio spectrogram images, and provide same to the processing system. As will be understood, just as each segment of audio data has a correspondence to a given video frame in the video clip, each given audio spectrogram image will likewise correspond to a given video frame. In this regard, the processing system may correlate the voice dubbing with the video clip in any suitable way. For example, in some aspects of the technology, the processing system may be configured to correlate the voice dubbing and the video clip such that they each begin at the same time. Likewise, in some aspects, the processing system may be configured to correlate the voice dubbing and the video clip such that the voice dubbing starts at some predetermined amount of time before or after the video clip (e.g., 20 ms, half the length of a video frame, or by an amount that maximizes an overall score or an aggregate speech-mouth consistency score for the voice dubbing). In either case, the voice dubbing may be segmented such that each given segment of audio data has the same length as the given video frame to which it corresponds (e.g., 41.67 ms for 24 fps video).
[0073] Finally, this example assumes that the processing system has used a speech-mouth consistency model to generate frame-level speech-mouth consistency scores corresponding to each given video frame based on its corresponding image and audio spectrogram image. As explained above, these frame-level speech-mouth consistency scores represent the speech-mouth consistency model’s determination of how well the voice dubbing matches each individual frame of the original video. In this regard,
[0074] Specifically, the exemplary layout 700 displays each frame’s speech-mouth consistency score as a separate bar (e.g., bars 702, 704) on a bar graph. The bar graph of
[0075]
[0076] The exemplary layout 800 shows a similar bar graph to that of
[0077]
[0078] The exemplary layout 900 shows a similar bar graph to that of
[0079]
[0080] The exemplary layout 1000 shows a similar bar graph to that of
[0081] Here as well, the processing system may be configured to show the obscured speaker box 1002 despite the speech-mouth consistency model attributing actual scores to these frames. For example, the speech-mouth consistency model may likewise be configured to recognize such obscured speaker situations (e.g., from the absence of a pre-labelled tag identifying the speaker for those frames), and may be further configured to automatically attribute a neutral (e.g., 0) or fully positive (e.g., +1.0) score to any frames falling in such a period. Nevertheless, in order to avoid confusing the translator, the processing system may be configured to ignore those speech-mouth consistency scores and instead display the obscured speaker box 1002 so that the translator will understand that individual speech-mouth consistency scores for those frames can simply be disregarded. In addition, in some aspects of the technology, the processing system may also be configured to simply avoid generating speech-mouth consistency scores for any frames when the speaker or their mouth is not clearly visible, and instead show the obscured speaker box 1002.
[0082]
[0083] The exemplary layout 1100 shows how a candidate translation may be displayed and correlated to the bar graph of
[0084]
[0085] The exemplary layout 1200 shows how mouth shapes identified from the original video may be displayed and correlated to the bar graph and candidate translation of
[0086] In some aspects of the technology, the identified mouth shapes may be identified by a human (e.g., an adapter) or another processing system (e.g., a separate processing system configured to analyze the original video and identify mouth shapes), and provided to the processing system for display in layout 1200.
[0087] Likewise, in some aspects, the mouth shapes may be identified by the processing system itself using one or more facial landmark detection utilities, and/or a visual classifier specifically trained to classify the lip shapes from images. In that regard, the processing system may use the output of the facial landmark detection utility and/or the visual classifier, together with a predetermined list of mouth-shapes-of-interest (e.g., those corresponding to bilabial consonants like "p," "b," and "m," labiodental fricatives like "f" and "v," etc.), to identify which video frames show an identified mouth shape.
[0088] Further, in some aspects, the identified mouth shapes may be identified based on analysis of the words or phonemes spoken in the original video. For example, the processing system may infer the existence of these mouth shapes from the words and/or phonemes of a pre-existing transcript of the speech of the original video (or of the video clip).
[0089] As another example, the processing system may process the audio data of the original video to automatically identify the words or phonemes being spoken in the original video (e.g., using ASR), and may then infer the existence of mouth shapes from those identified words and/or phonemes.
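By way of non-limiting illustration, the sketch below shows one possible way mouth-shapes-of-interest might be inferred from a time-aligned word or phoneme sequence (from a transcript or ASR output) using a predetermined list, and then correlated to the video frames their time spans overlap. The phoneme inventory, class names, and data structures are illustrative placeholders.

```python
# Illustrative sketch: inferring mouth-shapes-of-interest from phonemes using a
# predetermined list (e.g. bilabial closures and labiodental fricatives), and
# correlating each identified shape to the video frames it overlaps.

MOUTH_SHAPES_OF_INTEREST = {
    "p": "bilabial_closure", "b": "bilabial_closure", "m": "bilabial_closure",
    "f": "labiodental_fricative", "v": "labiodental_fricative",
}

def find_mouth_shapes(phonemes_with_times):
    """phonemes_with_times: list of (phoneme, start_ms, end_ms) from a transcript or ASR."""
    shapes = []
    for phoneme, start_ms, end_ms in phonemes_with_times:
        shape = MOUTH_SHAPES_OF_INTEREST.get(phoneme.lower())
        if shape:
            shapes.append({"shape": shape, "start_ms": start_ms, "end_ms": end_ms})
    return shapes

def correlate_to_frames(shapes, fps=24.0):
    """Map each identified mouth shape to the video frames its time span overlaps."""
    frame_ms = 1000.0 / fps
    for s in shapes:
        s["frames"] = list(range(int(s["start_ms"] // frame_ms),
                                 int(s["end_ms"] // frame_ms) + 1))
    return shapes
```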
[0090]
[0091] The exemplary layout 1300 shows how mouth shapes identified from the candidate translation may be displayed and correlated to the bar graph and candidate translation of
[0092] Here as well, these identified mouth shapes of the candidate translation may be identified by a human (e.g., an adapter) or another processing system, and provided to the processing system for display in layout 1300. In such a case, the human or the other processing system may further identify which frames of the video clip each identified mouth shape correlates to.
[0093] Likewise, in some aspects of the technology, the processing system may infer the existence of these mouth shapes from the words and/or phonemes of the text of the candidate translation, and a list of mouth-shapes-of-interest (e.g., those corresponding to bilabial consonants like “p,” “b,” and “m,” labiodental fricatives like “f” and “v,” etc.).
[0094] As another example, the processing system may process the voice dubbing (e.g., synthesized audio clip, human-acted audio clip) to automatically identify the words or phonemes being spoken in the voice dubbing (e.g., using ASR), and may then infer the existence of mouth shapes from those identified words and/or phonemes.
[0095] Further, in some aspects, where the voice dubbing is performed by a human, the processing system may identify mouth shapes of interest from a video recording of the human actor using one or more facial landmark detection utilities.
[0096] Each identified mouth shape may be correlated to one or more of the video frames of the video clip in any suitable way. For example, where the voice dubbing is a synthesized audio clip, and the processing system infers the existence of a given mouth-shape-of-interest from one or more words or phonemes in the text of the candidate translation, the processing system may be configured to identify the segment(s) of audio data in which those one or more words or phonemes are spoken, and to correlate the given mouth-shape-of-interest to whichever video frame(s) the identified segment(s) of audio data have been correlated (as discussed above with respect to
[0097] Identifying mouth-shapes-of-interest from the candidate translation or its voice dubbing may be valuable both in HITL applications as well as fully-automated applications. In this regard, in fully-automated applications, mouth-shapes-of-interest identified from the text of the candidate translation or from a synthesized voice dubbing may be compared to mouth-shapes-of-interest identified from the video clip, and used to generate additional scores or to influence an “overall score” (as discussed below). These additional or enhanced overall scores may be used by the processing system to pick a translation that better matches certain conspicuous mouth-shapes-of-interest in the video, and thus may appear better to a human viewer, even though another translation may score slightly better solely based on speech-mouth consistency scores.
[0098]
[0099] The exemplary layout 1400 shows an alternative way of displaying the candidate translation, identified mouth shapes, and speech-mouth consistency scores of
[0100] Aggregating the frame-level scores in this way may be desirable, for example, to give a translator a better way of assessing and comparing various alternative words when translating. Further, in some aspects of the technology, the processing system may be configured to allow the translator to toggle between viewing the speech-mouth consistency scores on a frame-level and an aggregated word-level. Word-level scores may also be beneficial in automated systems. For example, in some aspects of the technology, the processing system may be configured to generate or request additional automated translations where a given word-level score is below a predetermined threshold. This may help prevent the processing system from selecting a translation that, due to one glaring inconsistency, may appear worse to a human viewer than another translation that might score slightly lower based on frame-level speech-mouth consistency scores but lacks any glaring word-level inconsistencies.
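By way of non-limiting illustration, the sketch below aggregates frame-level speech-mouth consistency scores into word-level scores by averaging over the frames aligned to each word, and flags words whose aggregated score falls below a threshold. The averaging scheme, threshold, and alignment data structure are illustrative choices; the word-to-frame alignment itself is assumed to be available.

```python
# Illustrative sketch: word-level speech-mouth consistency scores computed by
# averaging the frame-level scores of the frames aligned to each word.

def word_level_scores(frame_scores, word_alignments):
    """frame_scores: list of frame-level scores, indexed by frame number.
    word_alignments: list of (word, [frame indices]) pairs.
    """
    results = []
    for word, frames in word_alignments:
        scores = [frame_scores[i] for i in frames]
        results.append((word, sum(scores) / len(scores) if scores else None))
    return results

def words_needing_retranslation(word_scores, threshold=0.0):
    """Words whose aggregated score falls below the threshold (candidates for additional translations)."""
    return [w for w, s in word_scores if s is not None and s < threshold]
```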
[0101] In addition, in exemplary layout 1400, the mouth shapes identified from the original video (1202, 1204, and 1206) have been moved above the bar graph and arranged directly below the mouth shapes identified from the candidate translation (1302, 1304, 1306, and 1308). This may be desirable, for example, so that the translator can more easily see how closely those mouth shapes sync up with each other. Further, although not shown in
[0102] As will be shown and described below, the speech-mouth consistency models described herein, as well as the various visualizations and layouts of
[0103] For example,
[0104] The exemplary layout 1500 displays the original sentence 1502 to be translated, and a text box 1504 directly below it where the translator can enter a translation. Automatically generated translations 1510, 1518, and 1526 are displayed below the text box 1504 as options which may be selected, but the text box 1504 is left blank so as to allow the translator to focus on the original sentence 1502 and have autonomy in choosing how to frame the translation. This can help in reducing an “anchoring effect” which may occur if the translator is instead asked to start from an automatically generated translation and directly edit it to arrive at the final candidate translation.
[0105] In the example of
[0106] As shown in
[0107] Likewise, the contents of the text entry box 1504 are also scored as shown in box 1506. In this case, as no candidate translation has yet been entered into text entry box 1504, the overall score is shown as 0% and the candidate translation is assessed as being 100% short of its target length. In some aspects of the technology, a processing system (e.g., processing system 102 or 202) may be configured to update the scores in box 1506 in real-time as a translator works. Likewise, in some aspects, the processing system may be configured to update the scores in box 1506 on a periodic basis, and/or in response to an update request from the translator.
[0108] The overall scores shown in boxes 1506, 1512, 1520, and 1528 are aggregate values based at least in part on the frame-level scores of the speech-mouth consistency model for each automatically generated translation. Such frame-level speech-mouth consistency scores may be generated by the processing system for each automatically generated translation using a speech-mouth consistency model, according to the processing described above with respect to
[0109] In some aspects of the technology, the overall scores may also be based in part on how many of the original video’s identified mouth shapes are being matched in the translation (e.g., a percentage of how many identified mouth shapes are matched, or a time-weighted average thereof based on how long each mouth shape is on screen). Likewise, in some aspects of the technology, the overall scores may be penalized based on various criteria, such as when the voice dubbing does not match a pause in the video, when the voice dubbing is particularly short or long relative to the original video (e.g., past some predetermined threshold such as 10 ms, 20 ms, 30 ms, 40 ms, etc.), and/or when the speech rate is too fast or too slow (e.g., faster or slower than a predetermined range of “normal” speech rates, faster or slower than the preceding voice dubbing by some predetermined percentage, etc.). Further, the overall scores 1506, 1512, 1520, and 1528 may be based on any combination or subcombination of the options just described.
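By way of non-limiting illustration, the sketch below shows one possible way an overall score of the kind described in paragraphs [0108]-[0109] might be assembled: the mean of the frame-level scores, optionally blended with the fraction of matched mouth shapes and penalized for a duration mismatch. The weights, thresholds, and penalty values are illustrative only and are not taken from the description above.

```python
# Illustrative sketch: combining frame-level scores, mouth-shape matching, and a
# duration penalty into a single overall score. Weights and penalties are examples only.

def overall_score(frame_scores, matched_mouth_shapes=None, total_mouth_shapes=None,
                  dub_len_ms=None, video_len_ms=None,
                  shape_weight=0.3, duration_threshold_ms=30, duration_penalty=0.1):
    score = sum(frame_scores) / len(frame_scores)             # aggregate frame-level score
    if matched_mouth_shapes is not None and total_mouth_shapes:
        match_ratio = matched_mouth_shapes / total_mouth_shapes
        score = (1 - shape_weight) * score + shape_weight * match_ratio
    if dub_len_ms is not None and video_len_ms is not None:
        if abs(dub_len_ms - video_len_ms) > duration_threshold_ms:
            score -= duration_penalty                         # penalize over/under-length dubbings
    return score
```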
[0110] In some fully automated systems, the processing system may be configured to select a given automatically generated translation based at least in part on its overall score satisfying some predetermined criteria. For example, in some aspects of the technology, the processing system may be configured to select a given automatically generated translation based on its overall score being higher than the overall scores for all other automatically generated translations. Likewise, in some aspects, the processing system may be configured to select a given automatically generated translation based on its overall score being higher than a predetermined threshold. The processing system may further be configured to then combine the video clip with a synthesized audio clip corresponding to the selected automatically generated translation to generate a modified video. The modified video (which may be augmented to include the synthesized voice dubbing as well as the original audio data, or which may be modified to replace a portion of the original audio data of the video with the synthesized audio clip), may be stored on the processing system, and/or output for storage, transmission, or display. Likewise, in some aspects of the technology, the processing system may be configured to select a given automatically generated translation based at least in part on its overall score, and then to output the synthesized audio clip to another processing system for storage and/or use in generating a modified video (e.g., as just described). In this way, a voice dubbing may be automatically generated in a resource-efficient manner.
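By way of non-limiting illustration, the sketch below shows one possible fully automated selection path: pick the candidate whose overall score is highest (and above a threshold), then combine its synthesized audio with the video clip, here via an ffmpeg invocation that replaces the audio track. The file paths, threshold, and the choice of ffmpeg are illustrative; the audio could equally be mixed in alongside the original audio data.

```python
import subprocess

# Illustrative sketch: select the best-scoring candidate translation and combine its
# synthesized audio with the video clip to produce the modified video.

def select_and_mux(candidates, video_path, output_path, min_score=0.5):
    """candidates: list of dicts with "audio_path" and "overall_score" (placeholder fields)."""
    best = max(candidates, key=lambda c: c["overall_score"])
    if best["overall_score"] < min_score:
        return None                                  # no candidate satisfies the criteria
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path, "-i", best["audio_path"],
        "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest", output_path,
    ], check=True)
    return best
```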
[0111]
[0112] The exemplary layout 1600 displays the contents of layout 1500 of
[0113]
[0114] The exemplary layout 1700-1 of
[0115] Similarly, the exemplary layout 1700-2 of
[0116] The processing system’s autocompletion utility may be configured to generate suggestions in any suitable way. For example, in some aspects of the technology, the processing system may be configured to base its autocompletion suggestions on the contents of the automatically generated translations 1516, 1524, and 1532 (and, optionally, any other translations that were generated but not chosen for display). In addition, the processing system may also be configured to indicate the basis for any such autocompletion suggestion by highlighting where that suggestion can be found in the automatically generated translations below the text entry box. For example, in
[0117]
[0118] The exemplary layout 1800 displays the contents of layout 1500 of
[0119] Thus, in this case, the translator has typed “The brown bird that I” into text entry box 1504 (as shown by arrow 1802), which is 60% short of the original video and has an updated overall score of 20% (as shown in 1804). Based on this entry, the processing system will issue five separate calls to the translation API to translate the original sentence 1502, each call being based on one of the following five prefixes: (1) “the brown bird that I”; (2) “the brown bird that”; (3) “the brown bird”; (4) “the brown”; and (5) “the.”
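A minimal sketch of this prefix-based querying is shown below. Here, translate and score_candidate are hypothetical stand-ins for the translation API call and the grading step described above, and returning only the best-scoring result reflects one possible display policy.

```python
def best_prefix_suggestion(original_sentence, typed_text, translate, score_candidate):
    """Issue one translation request per prefix of the translator's current entry
    (longest prefix first) and keep the best-scoring result.
    `translate(source, target_prefix=...)` and `score_candidate(text)` are
    hypothetical stand-ins for the translation API and the grading step."""
    words = typed_text.split()
    prefixes = [" ".join(words[:n]) for n in range(len(words), 0, -1)]
    best = None
    for prefix in prefixes:
        candidate = translate(original_sentence, target_prefix=prefix)
        score = score_candidate(candidate)
        if best is None or score > best[0]:
            best = (score, candidate)
    return best  # (score, suggested translation), or None if nothing was typed
```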
[0120] The processing system may be configured to display some or all of the translations returned from the translation API in response to these calls. However, in the example of
[0121] In addition, as the human translator continues to type, the processing system will make additional API calls based on the changing text in box 1504. As a result, the contents of box 1806 will continue to change over time if any of these successive calls results in the translation API returning a translation which scores even better than the one currently shown in box 1806.
[0122] As can be seen, the exemplary layout 1800 also incorporates a visualization showing how well the translator’s candidate translation (entered in text box 1504) matches the original video. This visualization is similar to that shown and described above with respect to
[0123] Further, like
[0124]
[0125] The exemplary layout 1900 displays the contents of the exemplary layout 1800 of
[0126] The processing system may be configured to automatically modify the video to better conform it to the translation, using one or more of the following approaches. For example, where the video must be lengthened to better fit the translation, the processing system may be configured to duplicate one or more video frames in a suitable way. In that regard, where multiple frames must be duplicated, the processing system may be configured to select frames for duplication at predetermined intervals so as to avoid making the video appear to pause. The processing system may also be configured to identify any sequences in which the frames are nearly identical (e.g., where there is very little movement taking place on screen), and duplicate one or more frames within those sequences, as doing so may not be as likely to be noticed by a viewer. In that regard, where the frames in a sequence are nearly identical to one another, it may be possible to repeat that set of frames one or more times (thus “looping” the set of frames) without it being noticeable to most viewers. Further, the processing system may be configured to select which frames to duplicate based on how their duplication will impact the synchronization of various mouth-shapes-of-interest between the translation and the modified video.
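By way of illustration, the sketch below identifies candidate duplication points by measuring frame-to-frame pixel differences; the threshold value, the spacing rule, and the use of mean absolute difference as a proxy for “very little movement” are all assumptions.

```python
import numpy as np

def low_motion_frames(frames, threshold=2.0):
    """Indices of frames that differ very little from the previous frame
    (mean absolute pixel difference below `threshold`), i.e. candidates for
    duplication or removal that a viewer is unlikely to notice."""
    candidates = []
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(np.float32) -
                              frames[i - 1].astype(np.float32)))
        if diff < threshold:
            candidates.append(i)
    return candidates

def frames_to_duplicate(frames, needed):
    """Pick `needed` duplication points, preferring low-motion frames and
    spacing them out so the modified video does not appear to pause."""
    candidates = low_motion_frames(frames) or list(range(1, len(frames)))
    step = max(1, len(candidates) // max(1, needed))
    return candidates[::step][:needed]
```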
[0127] Likewise, where the video must be shortened, the processing system may be configured to remove one or more frames in any suitable way. Here as well, where multiple frames must be removed, the processing system may be configured to do so at predetermined intervals, or in sequences where the frames are nearly identical (e.g., where there is very little movement taking place on screen), as doing so would not be as likely to be noticed by a viewer. The processing system may also be configured to select which frames to remove based on how their removal will impact the synchronization of various mouth-shapes-of-interest between the translation and the modified video.
[0128] Further, in some aspects of the technology, the processing system may be configured to use a balanced approach of modifying the video, in which the duration of the video remains unchanged. In such a case, the processing system may be configured to remove one or more frames from one section of the video, and duplicate an equivalent number of frames in a different section of the video, so that the modified version of the video has the same number of frames as the original video. Here as well, the processing system may be configured to choose how and where to remove and insert frames based on how those frame additions and subtractions will impact the synchronization of various mouth-shapes-of-interest between the translation and the modified video.
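A minimal sketch of such a balanced modification is shown below. It reuses the low_motion_frames helper sketched above, assumes each section contains enough low-motion frames, and the way the removal and insertion sections are specified as frame-index ranges is an illustrative choice.

```python
def balanced_reframe(frames, remove_count, remove_section, insert_section):
    """Remove frames from one section and duplicate the same number in another,
    keeping the total frame count unchanged. Sections are (start, end) frame
    index ranges; candidate frames come from the low_motion_frames helper."""
    removable = [i for i in low_motion_frames(frames)
                 if remove_section[0] <= i < remove_section[1]]
    to_remove = removable[:remove_count]

    duplicable = [i for i in low_motion_frames(frames)
                  if insert_section[0] <= i < insert_section[1]]
    to_duplicate = duplicable[:len(to_remove)]  # keep the counts balanced

    out = []
    for i, frame in enumerate(frames):
        if i in to_remove:
            continue                 # drop this frame
        out.append(frame)
        if i in to_duplicate:
            out.append(frame)        # duplicate this frame in place
    return out
```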
[0129] Moreover, in some aspects of the technology, the processing system may be configured to use a reanimation utility (e.g., reanimation utility 120) to make modifications to individual frames which alter the appearance of a speaker’s lips, face, and/or body. In some aspects, the processing system may be configured to automatically determine how to make such changes based on how they will impact the synchronization of various mouth-shapes-of-interest between the translation and the modified video. Likewise, in some aspects, the processing system may be configured to allow a human user to use the reanimation utility to make such changes. In such a case, the processing system may further be configured to show the user how their changes to a given frame or frames will impact the speech-mouth consistency scores and/or the voice dubbing’s overall score. In all cases, the processing system may be configured to use the reanimation utility alone, and/or in combination with any of the other video or audio modification options discussed herein.
[0130] In addition, in some aspects of the technology, the processing system may be configured to automatically modify the voice dubbing to better conform it to the video. For example, the processing system may be configured to instruct the text-to-speech synthesizer to lengthen or shorten one or more words, and/or to insert one or more pauses in the translation. The processing system may be configured to do this in order to optimize the overall duration of the voice dubbing, and/or to better synchronize the mouth shapes of the translation with those of the original video. Here again, the processing system may give the translator the ability to listen to the resulting modified voice dubbing, so that he or she can assess how natural the final result ends up being. In some aspects of the technology, the modified voice dubbing may be used as the final voice dubbing. However, in some aspects of the technology, the modified voice dubbing may simply be used as a guide for a human actor, who will then attempt to act out the candidate translation using the same cadence, word lengths, and pauses.
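Where the text-to-speech synthesizer accepts standard SSML markup, lengthening words and inserting pauses might be expressed as in the following sketch; the word indices, pause lengths, and prosody rates shown are illustrative only, and other synthesizer-specific control mechanisms could be used instead.

```python
def build_ssml(words, pauses_after=None, slow_words=None):
    """Build an SSML string that inserts pauses (in milliseconds) after selected
    word indices and slows selected words to a given prosody rate (percent).
    Assumes a text-to-speech synthesizer that accepts standard SSML."""
    pauses_after = pauses_after or {}
    slow_words = slow_words or {}
    parts = []
    for i, word in enumerate(words):
        if i in slow_words:
            parts.append(f'<prosody rate="{slow_words[i]}%">{word}</prosody>')
        else:
            parts.append(word)
        if i in pauses_after:
            parts.append(f'<break time="{pauses_after[i]}ms"/>')
    return "<speak>" + " ".join(parts) + "</speak>"

# For example, slow the third word and pause briefly after the fifth:
# build_ssml("el pájaro marrón que yo vi".split(), pauses_after={4: 250}, slow_words={2: 85})
```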
[0131] Moreover, in some aspects of the technology, the processing system may be configured to use one or more of the methods set forth above to make modifications to both the audio and video. For example, the processing system may be configured to modify the speed of the synthesized audio to conform it to the length of the video, and then may employ a balanced approach to modifying the video so as to better synchronize the mouth shapes of the voice dubbing and the modified video. In addition, in some aspects of the technology, further changes to the modified video may be made using a reanimation utility.
[0132] Although
[0133]
[0134] In step 2002, the processing system receives a video clip and a sequence of text. In this example, it is assumed that the video clip represents a portion of a video, and comprises a plurality of video frames. In some aspects of the technology, the video clip may also include a corresponding portion of the video’s original audio data, although that is not necessary for the purposes of exemplary method 2000. The sequence of text may be any combination of two or more words, including a sentence fragment, a full sentence, a full sentence and an additional sentence fragment, two or more sentences or sentence fragments, etc. In some aspects of the technology, the sequence of text may be provided to the processing system by a human. For example, a human translator may input the sequence of text through a keyboard. Likewise, a human translator or voice actor may speak the sequence of text into a microphone, and the processing system or another processing system may be configured to convert the recorded voice input into a sequence of text (e.g., using ASR). Further, in some aspects of the technology, the sequence of text may be generated by the processing system using a translation model (e.g., translation utility 112). For example, the processing system may generate the sequence of text by detecting speech in the video’s original audio data (e.g., using ASR) and generating a translation thereof using a translation model. Likewise, the processing system may generate the sequence of text by using a translation model to translate a preexisting transcript (or portion thereof) of the video’s original dialogue. Further, in some aspects of the technology, another processing system may generate the sequence of text in one of the ways just described, and may provide the sequence of text to the processing system of method 2000.
[0135] In step 2004, the processing system generates a synthesized audio clip based on the sequence of text. For example, the processing system may do this by feeding the sequence of text to a text-to-speech synthesizer (e.g., text-to-speech synthesizer 114), as described above.
[0136] Next, for each given video frame of the plurality of video frames, the processing system will perform steps 2006-2012. In that regard, in step 2006, the processing system obtains an image based on the given video frame. The processing system may obtain this image in any suitable way. For example, in some aspects of the technology, the image may simply be an image extracted directly from the video frame. Likewise, in some aspects, the image may be a processed version (e.g., downsampled, upsampled, or filtered version) of an image extracted directly from the video frame. Further, in some aspects, the image may be a cropped version of an image extracted directly from the video frame, such as a portion that isolates the face or mouth of the speaker.
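As one non-limiting illustration of such cropping, the sketch below uses an OpenCV Haar-cascade face detector and takes the lower half of the detected face box as a crude mouth region; any other face or landmark detector could be used instead, and the cropping rule is an assumption.

```python
import cv2  # opencv-python

_FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_crop(frame):
    """Return a crop around the speaker's mouth, or the full frame if no face
    is detected. The lower half of the largest detected face box is used as a
    rough mouth region."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = _FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
    return frame[y + h // 2 : y + h, x : x + w]
```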
[0137] In step 2008, the processing system processes the synthesized audio clip to obtain a given segment of audio data corresponding to the given video frame. As discussed above with respect to
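The mapping from a video frame to a corresponding window of audio samples might be computed as in the following sketch; centering a fixed 200 ms window on the frame's timestamp is an illustrative choice, not a required one.

```python
import numpy as np

def audio_segment_for_frame(audio, sample_rate, frame_index, fps, window_s=0.2):
    """Return the audio samples in a window centered on the given video frame's
    timestamp. `audio` is a 1-D NumPy array of samples."""
    center = int(round((frame_index + 0.5) / fps * sample_rate))
    half = int(window_s * sample_rate / 2)
    start = max(0, center - half)
    end = min(len(audio), center + half)
    return audio[start:end]
```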
[0138] In step 2010, the processing system processes the given segment of audio data to generate a given audio spectrogram image. This audio spectrogram image may take any suitable form, and may be generated by the processing system in any suitable way, as described in more detail above.
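As one simple illustration, the sketch below computes a log-magnitude spectrogram with a Hann-windowed short-time Fourier transform implemented in NumPy; the FFT size and hop length are illustrative, and a mel-scaled or otherwise processed spectrogram could equally be used.

```python
import numpy as np

def log_spectrogram(segment, n_fft=512, hop=128):
    """Compute a log-magnitude spectrogram image (frequency x time) from a 1-D
    array of audio samples, using a simple Hann-windowed STFT."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, max(1, len(segment) - n_fft + 1), hop):
        chunk = segment[start:start + n_fft]
        if len(chunk) < n_fft:
            chunk = np.pad(chunk, (0, n_fft - len(chunk)))  # zero-pad the last chunk
        spectrum = np.abs(np.fft.rfft(chunk * window))
        frames.append(np.log1p(spectrum))
    return np.stack(frames, axis=1)  # shape: (n_fft // 2 + 1, num_time_steps)
```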
[0139] In step 2012, the processing system generates a frame-level speech-mouth consistency score for the given video frame based on the given image and the given audio spectrogram image using a speech-mouth consistency model. The processing system and speech-mouth consistency model may generate this frame-level speech-mouth consistency score in any suitable way, as described in more detail above.
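Tying steps 2006 through 2012 together, the per-frame loop might look like the sketch below, which reuses the mouth_crop, audio_segment_for_frame, and log_spectrogram helpers sketched above; speech_mouth_consistency_model is a placeholder for whatever trained model is used and is assumed to return a score for each image/spectrogram pair.

```python
def frame_level_scores(frames, audio, sample_rate, fps, speech_mouth_consistency_model):
    """Score every video frame against the synthesized audio clip.
    `speech_mouth_consistency_model(image, spectrogram)` stands in for the
    trained model and is assumed to return a score in [0, 1]."""
    scores = []
    for i, frame in enumerate(frames):
        image = mouth_crop(frame)                                        # step 2006
        segment = audio_segment_for_frame(audio, sample_rate, i, fps)    # step 2008
        spectrogram = log_spectrogram(segment)                           # step 2010
        scores.append(speech_mouth_consistency_model(image, spectrogram))  # step 2012
    return scores
```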
[0140] Unless otherwise stated, the foregoing alternative examples are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of exemplary systems and methods should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as “such as,” “including,” “comprising,” and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only some of the many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.