Methods for hybrid GPU/CPU data processing
09558748 ยท 2017-01-31
Assignee
Inventors
Cpc classification
G10L15/22
PHYSICS
G10L15/34
PHYSICS
International classification
G10L15/22
PHYSICS
Abstract
The present invention describes methods for performing large-scale graph traversal calculations on parallel processor platforms. The invention describes methods for on-the-fly hypothesis rescoring that utilizes graphic processing units (GPUs) in combination with utilizing central processing units (CPUs) of computing devices. The invention is described in one embodiment as applied to the task of large vocabulary continuous speech recognition.
Claims
1. A computer-implemented statistical-inference method for graph-traversal used for speech recognition, comprising the method steps of: providing a computing platform comprising at least one Central Processing Unit (CPU) and at least one Graphical Processing Unit (GPU); initializing likelihoods; capturing a sample for an input signal, wherein the sample comprises a unit of speech; computing sample features from the sample; using the GPU, computing observation likelihoods for each of the sample features based on at least one of acoustic model and a language model; using the GPU, updating likelihoods based on observation likelihoods for the each of the sample features and likelihoods in previous time step to form a set of partial hypotheses, wherein each partial hypothesis of the set of partial hypotheses comprises a sequence of tokens from the at acoustic model or language model that potentially matches the unit of speech; using the CPU, computing likelihood corrections for each partial hypothesis of the set of partial hypotheses; using the GPU, updating likelihoods for the each partial hypothesis of the set of partial hypotheses; synchronizing a first back-track table on the CPU to match a second back-track table on the GPU: and performing back-track on the set of updated partial hypotheses and determining a most likely partial hypothesis of the set of updated partial hypotheses.
2. The method of claim 1, wherein the step of computing observation likelihoods for each of the sample features comprises: computing the observation likelihoods for each of the sample features using a Gaussian mixture model.
3. The method of claim 1, wherein the step of computing observation likelihoods for each of the sample features comprises: computing the observation likelihoods for each of the sample features using a neural network.
4. The method of claim 1, wherein the step of computing likelihood corrections for each partial hypothesis of the set of partial hypotheses comprises: computing likelihood corrections for each partial hypothesis of the set of partial hypotheses using an n-gram language model.
5. The method of claim 1, wherein the step of computing likelihood corrections for each partial hypothesis of the set of partial hypotheses comprises: computing likelihood corrections for each partial hypothesis of the set of partial hypotheses using a neural network.
6. The method of claim 1, further comprising: sorting each partial hypothesis of the set of partial hypotheses; and removing partial hypotheses that fall below a ranking threshold.
7. The method of claim 1, further comprising: creating a data structure to store a probability of a history as an out-going arc, wherein the history represents a token sequence, wherein the data structure used to store an out-going set of arcs changes based on a number of arcs exiting a state.
8. The method of claim 7, further comprising: storing the out-going set of arcs in a first database; creating a search model based on the out-going set of arcs; revising the search model based on a state of the search model; and saving a revised search model in a second database.
9. The method of claim 1, wherein the step of performing back-track on the set of updated partial hypotheses and determining a most likely partial hypothesis of the set of updated partial hypotheses comprises: merging the set of updated partial hypotheses using an atomic compare-and-swap operation.
10. A computer-implemented statistical-inference method for graph-traversal, comprising the method steps of: providing a parallel processor computing platform comprising a first processing unit and a second processing unit; initializing likelihoods; capturing a sample for an input signal, wherein the sample comprises a unit of speech; computing sample features from the sample; using the second processing unit computing observation likelihoods for each of the sample features based on at least one of an acoustic model and a language model; using the second processing unit updating likelihoods based on observation likelihoods for the each of the sample features and likelihoods in previous time step to form a set of partial hypotheses, wherein each partial hypothesis of the set of partial hypotheses comprises a sequence of tokens from the at acoustic model or language model that potentially matches the unit of speech; using the first processing unit computing likelihood corrections for each partial hypothesis of the set of partial hypotheses; using the second processing unit updating likelihoods for the each partial hypothesis of the set of partial hypotheses; synchronizing a first back-track table on the first processing unit to match a second backtrack table on the second processing unit; and performing back-track on the set of updated partial hypotheses and determining a most likely partial hypothesis of the set of updated partial hypotheses.
11. The method of claim 10, further comprising: sorting each partial hypothesis of the set of partial hypotheses; and removing partial hypotheses that fall below a ranking threshold.
12. The method of claim 10, further comprising: creating a data structure to store a probability of a history as an out-going arc, wherein the history represents a token sequence, wherein the data structure used to store an out-going set of arcs changes based on a number of arcs exiting a state.
13. The method of claim 12, further comprising: storing the out-going set of arcs in a first database; creating a search model based on the out-going set of arcs; revising the search model based on a state of the search model; and saving a revised search model in a second database.
14. The method of claim 10, wherein the step of performing back-track on the set of updated partial hypotheses and determining a most likely partial hypothesis of the set of updated partial hypotheses comprises: merging the set of updated partial hypotheses using an atomic compare-and-swap operation.
Description
BRIEF DESCRIPTION OF THE FIGURES
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
(23)
DETAILED DESCRIPTION OF THE INVENTION
(24) The present invention describes methods for performing large-scale graph traversal calculations on parallel processor platforms. The invention describes methods for on-the-fly hypothesis rescoring that utilizes graphic processing units (GPUs) in combination with utilizing one or more central processing units (CPUs) of computing devices. The invention is described in one embodiment as applied to the task of large vocabulary continuous speech recognition (LVCSR). Those skilled in the art will recognize that the methods of the present invention are applicable to other statistical inference tasks in which large-scale graph traversal computation is required; examples include hand-writing recognition, image-based gesture recognition, and image understanding.
(25) In this invention, we describe a novel on-the-fly hypotheses rescoring algorithm for Hybrid GPU/CPU architectures. An overview of the approach of the present invention is shown in
(26) Given an input signal (item 1) and a set of statistical models (items 15, 16 and 18), search is performed to find the best token-sequence (item 2) jointly through the models that most closely matches the input signal.
(27) In one embodiment of this invention, namely for the task of Large Vocabulary Speech Recognition (LVCSR), the input signal (item 1) could be digital audio captured from a microphone and the output (item 2) the word sequence most closely matching what the user uttered. For the Large Vocabulary Speech Recognition (LVCSR) task, item 15 would map to an Acoustic Model (for example a Gaussian Mixture Model or Deep Neural Network-based Stochastic Model), item 16 would be a fully composed H-level Weighted Finite State Transducer (WFST)-based search graph (a demonstrative example of such a graph is shown in
(28) In the presented invention, search is performed in the following manner. On initialization (
(29) When the INPUT SIGNAL is first received a sample is captured (step 4) and a set of representative features are computed or extracted from the sample (step 5). The resulting feature vector is a low dimension representation of the most informative aspects of the captured signal. One or more feature vectors could be computed per sample depending on the application task.
(30) In one embodiment of this invention, namely LVCSR, the input signal (item 1) is digital audio. For this task, an audio sample of 25 mS could be captured (steps 4 and 13) every 10 mS allowing overlap of the input signal. Acoustic features could then computed for each 25 mS sample of Audio. Standard acoustic features that could be used for this task include log-Mel filterband energy and Mel-Frequency-Cepstral-Coefficient (MFCC) features. A similar approach can also be used when an image or audio-visual signal is the input signal. Sample Capture (steps 4 and 13) and Feature Computation (step 5) can be performed on any compute device, using either CPU or GPU compute architectures. They do not necessarily need to be performed on the same compute device as the other steps in this invention.
(31) Once a feature vector has been generated (step 5), an N-best search is then performed on the GPU (steps 6, 7, 8, 11 and 12), leveraging the CPU to compute incorporate model likelihood corrections (steps 9 and 10), using a large token sequence model stored in CPU main memory (item 18). During search, first observation likelihoods are computed (step 6) for each new feature vector using a statistical observation model (item 15), for example a Gaussian Mixture Model or Deep Neural Network-based Model. In step 7, state-likelihoods, using equation (1), based on the observation likelihood computed in step 6, and the state-likelihoods in the previous time step. The state likelihood [g] of a new partial hypothesis g is calculated as:
[g]=[g]+[e]+[e].(1)
(32) where [e] is the observation likelihood of the input symbol i[e] (computed using item 15), [e] is the state transition probability (from item 16), and [g] is the state likelihood from the previous time-synchronous (step 6)
(33) In the presented invention, a model likelihood correction, c[e, h[g]], is introduced, to enable a significantly larger model to be applied on-the-fly:
[g]=[g]+[e]+[e]+c[e,h[g]](2)
(34) The model likelihood correction factor, c[e, h[g]], is the difference between the model likelihoods of a smaller model P.sub.uni(o[e]) (i.e. a language model which was applied during WFST composition) and a much larger model, P.sub.ngm(o[e]|h[g]) (item 18) which is applied during search:
c[e,h[g]]=log(P.sub.uni(o[e]))log(P.sub.ngm(o[e]|h[g])).(3)
(35) where h[g] is the output symbol sequence of the hypothesis g
(36) During the search process, if the output symbol sequence of any partial hypothesis g (h[g]) changes (step 8) rescoring is then performed for that hypothesis. In the rescoring stage, first, the model likelihood correction factor, c[e, h[g]] (equation 3) is computed (step 9). Here, a large model (item 18), stored in CPU memory is used to compute P.sub.ngm(o[e]|h[g]). Next, the state likelihood [g] of the partial hypothesis is updated using the model likelihood correction factor, as described in equation 2 (step 10).
(37) After rescoring is performed, partial hypotheses are ranked based on their state likelihoods [g] and those that fall below a ranking threshold are removed from consideration (step 11).
(38) This search process (steps 6, 7, 8, 9, 10, 11, 12, 13) is iterated until the last sample is encountered (step 12). At which point back-track is performed on the CPU (step 14) generating the final 1-best hypothesis output (item 2). A back-track table (a table storing the list of active partial hypotheses, and back-pointers to earlier frames in which the partial hypothesis changed) is shared in working memory between the GPU (item 17) and CPU (item 19). To keep this table synchronized, the CPU copies across back-track information from the GPU to the CPU (item 20) after each sample has been processed.
(39) In one embodiment, rather than waiting for the last sample to be encountered before performing back-track (step 12), back-track is performed on the CPU (step 14) every 20 samples, or at a point where the single-best output will not change in the future.
(40) In the presented invention, n-best Viterbi search is conducted by assigning an n-best hypotheses list to each arc and state. When multiple arcs meet at the same state, n-best hypotheses lists are merged and the Viterbi search chooses n minimally weighted hypotheses across all n-best hypotheses lists. In
(41) The present invention has been described in accordance with several exemplary embodiments, which are intended to be illustrative in all aspects rather than restrictive. In particular, the invention has been described in one embodiment as applied to large vocabulary continuous speech recognition computations. Those skilled in the art will recognize that the invention is applicable to 5 other types of large-scale graph traversal computations. Thus, the present invention is capable of many variations in detailed implementation, which may be derived from the description contained herein by a person of ordinary skill in the art. All such variations and other variations are considered to be within the scope and spirit of the present invention.
Demonstration Example
(42) To demonstrate the effectiveness of the proposed on-the-fly rescoring method, we present a set of example models for the task of speech recognition in
(43) First, in order to prepare the search graph (
(44) The methods to select the optimal model entries from a large model to generate a small model and the search are shown in
(45)
(46) Experimental Evaluation
(47) We evaluated effectiveness of our presented invention on for the task of Large Vocabulary Continuous Speech Recognition (LVCSR) using a large vocabulary version of the WSJ task. We used a combined evaluation set consisting of the November 1992 ARPA WSJ 5 k evaluation set (330 sentences) and the November 1993 ARAP WSJ 20 k evaluation set (213 sentences, open vocabulary).
(48) Our acoustic model was trained using the Kaldi toolkit [22] on the WSJ data set with 39 dimensional MFCCs feats with a LDA transformation. The resulting acoustic model contained 240K Gaussians and 2,946 phonetic states.
(49) TABLE-US-00001 TABLE 1 Size of WFSTs for various language models. Vocab. n-gram # of n-gram WFST (MB) LM binary (MB) 5K 1 5.0K 2 1 2 840.0K 96 13 3 4.3M 684 76 1M 1 1.0M 91 1 2(pruned) 15.1M 2,342 357 3(pruned) 10.1M 3,583 407 4 769.9M 14,407
(50) WFSTs were composed and optimized offline for efficient parallel time-synchronous graph traversal for GPU-accelerated platform as described in [5, 6]. Table. 1 shows the size of the final fully-composed WFSTs for 5 k and 1000 k vocabulary language models. We evaluate the proposed algorithm using a single GPU in a 4-way NVIDIA Tesla K20 GPU server with two Intel Xeon E5-2640 6-core CPUs. The NVIDIA Tesla K20 GPU contains 13 Streaming Multiprocessors (SMX) with 2,496 CUDA cores and 5 GB GDDR5 memory. The operating system was Ubuntu 12.04 LTS (64 bits) and the decoder was compiled with g++4.6 and nvcc5.0 [23]. In the following section we compare the following composition schemes: Standard approach using WFST composed with 3-gram language model (STD-3), the lattice generation rescored 2-gram/3-gram language model combination (LATR-2.3) using a lattice beam as lbw2.3, and proposed on-the-fly rescored 2-gram/3-gram combination (OTFR-2.3) conducting n2.3-best search with m2.3 threads.
(51) Accuracy Performance:
(52) In the first evaluation, we attempted to validate the accuracy of the proposed rescoring approach (OTFR) using a small vocabulary 5K test set. We generated 3 fully composed WFST using the same knowledge sources but different language models. For the 1-gram and 2-gram case, we applied our proposed on-the-fly hypothesis rescoring approach. Decoding with N1, N2 and rescored on-the-fly using a 3-gram language model obtained similar Word Error Rate (WER) to the STD-3 case and LATR-2.3 when n1.3, n2.3 were 9, 3 respectively. n1.3 is larger than n2.3 to achieve comparable WER, 5.4% compared with n2.3 as explained in 3. Similarly, LATR-2.3 requires wider lbw2.3 to achieve the comparable accuracy for the given global beam width across all decoding approaches. In the large vocabulary (1M vocabulary) evaluation, OTFR-2.4 showed 0.3% absolute accuracy degradation compared with OTFR-3.4 case when the n2.4, n3.4 was 3. This reveals lower order language model as well as larger vocabulary requires more n to achieve comparable accuracy with standard and lattice rescoring algorithms.
(53) Speed Performance:
(54) In the second evaluation, we evaluated the decoding speed using both the single core implementation and GPU-accelerated multicore implementation. The baseline implementation is optimized using Automatically Tuned Linear Algebra Software (ATLAS) to accelerate the acoustic weight computation process. The GPU/CPU evaluations in
(55) A Method for Merging Sorted Lists on Highly Parallel Compute Architectures: Atomic Merge-and-Sort
(56) Maintaining n-best hypotheses during the Viterbi search has many parallelization challenges. The most important challenge is the merging of n-best lists on reconvergent paths. We chose to maintain a sorted n-best list with the minimally weighted hypothesis on top. This simplifies the process of merging n-best hypotheses lists into a process of merge-and-sore. On a CPU, this process can be performed quite efficiently.
(57) On a GPU, however, there may be hundreds of arcs, each with an n-best list, trying to write into the n-best list at the destination state at the same time. We developed a new method to merge the n-best list atomically on a highly parallel platform using atomic Compare-And-Swap operations as shown in
(58) An Architecture for Fast and Memory Efficient Look-Up of Markov Model Probabilities
(59) The on-the-fly hypothesis rescoring approach, described above, enables large models (typically Markov models, which provide a probability for token sequence) to be introduced into WFST search on the GPU. When a token boundary is encountered during search on the GPU, the partial hypotheses (which consists of a history stateID, WFST weight and current token) are rescored using a larger model stored on the CPU. The efficiency of model look-up (shown in
(60) To reduce the time required for this step we developed a novel graph-based model structure optimized for rapid hypothesis rescoring. An overview of this model is shown in
(61)
(62)
(63) To reduce the memory foot-print required for this model we use a novel method where the data structure used to store the out-going sets of arcs changes depending on the number of arcs exiting a state. We implemented a prototype that used four data structures, namely: HASH-MAP (default), STATIC ARRAY, BINARY-SEARCH TREE, or a single INTEGER, however other data structures could also be used for this purpose.
(64) The method described in
(65) Experimental Evaluation
(66) In a prototype, we evaluated using heterogeneous data structures as described above to store probabilities from a very large Language Models (1 million words, 488 Million probabilities). First, we replaced HASH-MAPS that only had a single entry with a single INTEGER. This reduced the required memory footprint by 21.4%. From 28.64 GBytes to 22.50 GBytes.
(67) Next, we replaced HASH-MAPS that had more that 90% of the vocabulary in the out-going arcs with a STATIC ARRAY data structure that matched the size of the total vocabulary. In this case, tokens that did not occur in the out-going arc set are identified (with one-lookup) within this array. Memory usage was reduced from 22.50 to 22.47 GBytes using this approach.
(68) Finally, we replaced HASH-MAPS that had more that T.sub.arc out-going arcs with a BINARY-SEARCH TREE. By altering the value of T.sub.arc from 2 to 10,000 (<1% of the total vocabulary size), we could reduce the memory usage to 10.5 GBytes, a 63.4% reduction in memory usage compared to the original HASH-MAP-based implementation. T.sub.arc can be optimized for a specific platform or data set, memory footprint vs. lookup speed. Experimental results are shown in
(69) While the disclosure has been described in detail and with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the embodiments. Thus, it is intended that the present disclosure cover the modifications and variations of this disclosure provided they come within the scope of the appended claims and their equivalents.