EFFICIENT TRANSFORMER LANGUAGE MODELS WITH DISENTANGLED ATTENTION AND MULTI-STEP DECODING

20230222295 · 2023-07-13


    Abstract

    Systems and methods are provided for facilitating the building and use of natural language understanding models. The systems and methods identify a plurality of tokens and use them to generate one or more pre-trained natural language models using a transformer. The transformer disentangles the content embedding and positional embedding in the computation of its attention matrix. Systems and methods are also provided to facilitate self-training of the pre-trained natural language model by utilizing multi-step decoding to better reconstruct masked tokens and improve pre-training convergence.

    Claims

    1. (canceled)

    2. A computing system configured to account for position bias while encoding data with a transformer to improve pre-training convergence of the transformer, the computing system comprising: one or more processors; and one or more computer-readable hardware storage devices that store computer executable instructions that are executable by the one or more processors to cause the computer system to at least: apply input data comprising tokens having position bias embedding to an encoder of a transformer; generate and apply an attention weight for disentangling the position bias embedding and for modifying one or more learnable parameters used by the encoder, the attention weight comprising each of: a first attention component based on a content embedding of the first token in a token pair as well as a content embedding of a second token in the token pair; a second attention component based on the content embedding of the first token in the token pair and a relative position embedding of the first and second tokens in the token pair; and a third attention component based on the relative position embedding of the first and second tokens in the token pair and the second content embedding of the second token in the token pair; and modify the one or more learnable parameters used by the encoder based on output generated by the encoder applying the attention weight to the input data.

    3. The computing system of claim 1, wherein the attention weight further includes a fourth attention component comprising a product of an absolute position embedding of the first token in the token pair, the first learn-able position parameter, the second learnable position parameter, and the relative positional embedding of the first and second tokens in the token pair.

    4. The computing system of claim 1, wherein the encoder includes a plurality of encoding layers within the encoder.

    5. The computing system of claim 4, wherein a final encoding layer of the encoder is a task specific decoding layer, and wherein the computing system applies the output of the task specific decoding layer as new input to the task specific decoding layer for one or more iterations in order to generate new output from the task specific decoding layer.

    6. The computing system of claim 5, wherein the computing system applies one or more hidden vector outputs associated with masked tokens from the task specific decoding layer as additional input to the task specific decoding layer to generate additional new output from the decoding layer.

    7. The computing system of claim 5, wherein the computing system applies a query vector output of the task specific decoding layer as additional input to the task specific decoding layer to generate additional new output from the decoding layer.

    8. The computing system of claim 1, wherein the computing system refrains from applying position bias embedding prior to the encoder.

    9. The computing system of claim 1, wherein the computing system generates a separate attention weight for each of a plurality of token pairs.

    10. The computing system of claim 9, wherein the computing system applies a maximum relative distance between tokens in each token pair.

    11. A storage device having stored computer-executable instructions which are executable by one or more processors of a computing system for causing the computing system to implement a method for improving pre-training convergence while encoding data with a transformer, the computing system comprising: one or more processors; and one or more computer-readable hardware storage devices that store computer executable instructions that are executable by the one or more processors to cause the computer system to at least: identify a plurality of tokens to be encoded from a sequence; obtain a transformer that includes an encoder with a plurality of encoding layers; embed the plurality of tokens to generate input data; apply the input data to the encoder by at least disentangling position bias embedding from content embedding associated with the plurality of tokens; and apply output of a final encoding layer as additional input to the final encoding layer for one or more iterations in order to generate new output from the final encoding layer.

    12. The computing system of claim 11, wherein the final encoding layer is a decoding layer.

    13. The computing system of claim 12, wherein a portion of the tokens are masked prior to generating the input data, and wherein the output of the final encoding layer corresponding to the portion of the tokens that are masked prior to generating the input data is replaced with a corresponding absolute position embedding vector prior to being applied as the additional input.

    14. The computing system of claim 12, wherein the computing system applies one or more hidden vector outputs from the decoding layer as the additional input, wherein the one or more hidden vectors correspond to tokens that are masked prior to generating the input data.

    15. The computing system of claim 13, wherein the computing system applies a query vector output of the decoding layer as additional input to the decoding layer for one or more iterations in order to generate new output from the final decoding layer.

    16. The computing system of claim 11, wherein the encoder includes a self-attention sub-layer and a feed forward sub-layer.

    17. The computing system of claim 16, wherein the computing system generates and applies an attention score at the self-attention sub-layer for disentangling position bias embedding from content embedding associated with the plurality of the tokens.

    18. The computing system of claim 17, wherein the attention score comprises a summation of at least: a first attention score component comprising a product of a content embedding of a first token in a token pair, a first learn-able content parameter, a second learn-able content parameter, and a content embedding of a second token in a token pair; a second attention score component comprising a product of the content embedding of the first token in the token pair, the first learn-able content parameter, a first learnable position parameter, and a relative position embedding of the first and second tokens in a token pair; and a third attention score component comprising a product of the relative position embedding of the first and second tokens in the token pair, a second learn-able position parameter, the second learn-able content parameter, and the second content embedding of a second token in a token pair.

    19. The computing system of claim 18, wherein the computing system generates a separate attention score for each token pair of a plurality of token pairs.

    20. A method of encoding data with a transformer that is configured to account for position bias and improve pre-training convergence, the method including: identifying a plurality of tokens to be encoded from a sequence; obtaining a transformer that includes an encoder comprising a plurality of encoding layers, wherein each of the plurality of encoding layers includes a self-attention sub-layer and a feed forward sub-layer; embedding the plurality of tokens to generate input data, the plurality of tokens having position bias embedding; applying the input data to the encoder; generating and applying an attention weight for disentangling the position bias embedding and for modifying one or more learnable parameters used by the encoder; and modifying the one or more learnable parameters used by the encoder based on output generated by the encoder applying the attention weight to the input data.

    21. The method of claim 20, wherein the attention weight comprises a summation of at least the following: a first attention component based on a content embedding of the first token in a token pair as well as a content embedding of a second token in the token pair; a second attention component based on the content embedding of the first token in the token pair and a relative position embedding of the first and second tokens in the token pair; and a third attention component based on the relative position embedding of the first and second tokens in the token pair and the second content embedding of the second token in the token pair.

    Description

    BRIEF DESCRIPTION OF THE DRAWINGS

    [0033] In order to describe the manner in which the above-recited and other advantages and features can be obtained, a more particular description of the subject matter briefly described above will be rendered by reference to specific embodiments which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments and are not therefore to be considered to be limiting in scope, embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

    [0034] FIG. 1 illustrates absolute position of tokens in a series.

    [0035] FIG. 2 illustrates relative position of tokens in a series.

    [0036] FIG. 3 illustrates a table demonstrating the performance of several natural language processors in standard tests.

    [0037] FIG. 4 illustrates a table demonstrating the performance of several natural language processors in standard tests.

    [0038] FIG. 5 illustrates a computer system that includes and/or that can be used to perform the disclosed functionality.

    [0039] FIG. 6 illustrates an encoder of a simplified transformer.

    [0040] FIG. 7A illustrates an encoder of a transformer where position bias is disentangled at the self-attention level.

    [0041] FIG. 7B illustrates an encoder of a transformer where position bias is disentangled at the self-attention level.

    [0042] FIG. 8 illustrates a flowchart of acts associated with methods performed by a computing system.

    [0043] FIG. 9 illustrates an encoder composed of multiple encoding layers.

    [0044] FIG. 10 illustrates an encoder with a final decoding layer.

    [0045] FIGS. 11-12 illustrate flowcharts of acts associated with methods performed by a computing system.

    DETAILED DESCRIPTION

    [0046] The embodiments disclosed herein introduce multiple techniques for improvement of a pre-trained language model that utilizes a transformer to encode data. At least one disclosed embodiment proposes a new model structure for pre-training a language model, referred to as Decoding-enhanced BERT with disentangled attention (DeBERTa).

    [0047] As described herein, some of the disclosed embodiments capture the position information of a token within a series while encoding data with a transformer. Some embodiments disentangle content and positional information of each token within a series by applying an attention score that independently accounts for relative position bias embedding and content embedding. In at least one embodiment, the computing system disentangles the position bias embedding from the content embedding by applying an attention weight which is a summation of three attention score components with disentangled projection matrices, namely, a content-to-content component, a content-to-position component, and a position-to-content component. Although not required, in some instances, a position-to-position component is also included in the attention weight summation.

    [0048] Some disclosed embodiments utilize self-training to pre-train and modify the models generated by the encoder. In some instances, self-training is conducted through masked language modeling. In some embodiments, the self-training is enhanced by utilizing a multi-step decoding to better reconstruct masked tokens and improve pre-training convergence. In at least one embodiment, multi-step decoding is performed by obtaining output from a final layer of an encoder and applying the output as additional input to the final layer of the encoder to generate new output.

    [0049] The disclosed embodiments provide technical benefits in the industry by providing improved methods and systems for utilizing a transformer to facilitate the analysis of sequential dependencies existing between tokens in a series being processed by machine learned models. These embodiments disentangle the position information from the content information within the attention matrices used by the machine learned models so that the models can account for position information more accurately and to thereby improve the relevance and usefulness of the models.

    [0050] Technical benefits of the disclosed embodiments also include improvements in the pre-training convergence that occurs during self-training of the models by utilizing a multi-step decoder layer with a transformer that is better equipped to reconstruct masked tokens more effectively (e.g. by using fewer resources) than existing systems, thereby improving the self-training processes applied with/to the corresponding models.

    [0051] Some embodiments combine one or more features of the foregoing embodiments to further promote improvements in both pre-training efficiency (e.g. convergence) and model effectiveness (e.g. accuracy, relevance, or usefulness) for downstream tasks.

    [0052] FIGS. 3 and 4 highlight some of the performance gains that can be made with the disclosed embodiments, DeBERTa, relative to other conventional PLMs that are used to perform NLP tasks. FIG. 3 specifically contrasts performance of DeBERTa (the disclosed embodiment that disentangles position information) with the performance of other conventional BERT type large models. FIG. 4, on the other hand, contrasts the performance of DeBERTa with other conventional BERT type base models. Testing has shown, for example, that the disclosed embodiments can consistently and significantly outperform RoBERTa.sub.large on a wide range of established tasks. For example, the disclosed embodiments demonstrate improvement on the MNLI task by 0.9% (90.2 vs. 91.1), the SQuAD v2 task by 2.3% (88.4 vs. 90.7), and the RACE task by 3.6% (83.2 vs. 86.8).

    [0053] FIG. 5 illustrates one implementation of a computer system that incorporates and/or that can be used to perform the disclosed embodiments, such as, generating an NLP or PLM by encoding data with a transformer. As shown, the computing system 500 includes one or more processors 510 and one or more hardware storage devices 520 storing computer executable instructions that, when executed by the one or more processors 510, cause the computing system to perform the functionality described herein.

    [0054] In some instances, the computing system 500 obtains training data 522 comprising a plurality of tokens, which may include any combination of words, non-words, parts of words, or multiple words (including phrases and sentences). In some instances, the tokens are text or speech. In some instances, the computing system 500 obtains the plurality of tokens from a third-party computing system 550 through a network connection 555. In some embodiments, the training data 522 comprises publicly available resources, such as Wikipedia, or a dataset of books from a third party source.

    [0055] The computing system 500 then identifies the plurality of tokens from the training data 522 and embeds the plurality of tokens with an embedder 530 to generate input data 524. The input data, in some instances, comprises embedded tokens, with each embedded token comprising a vector or matrix representation of the original token. Additionally, in some embodiments, the input data comprises multiple vectors or matrices which represent different aspects of the token, such as a position embedding vector or a content embedding vector of the token. In some instances, the input data for a token comprises a single vector which is a summation of the token's content embedding and position embedding. The computing system 500 then applies the input data 524 to a transformer 540 with one or more encoders 542, 544.
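The single-vector case described above (input data as the sum of a token's content embedding and position embedding) can be sketched as follows. This is a minimal numpy illustration with hypothetical names, not the claimed embedder 530:

```python
import numpy as np

def embed_tokens(token_ids, E_content, E_position):
    """Embed a token sequence as the sum of content and absolute
    position embeddings.

    token_ids : sequence of vocabulary indices.
    E_content : (V, d) content embedding table.
    E_position: (N_max, d) absolute position embedding table.
    Returns an (N, d) array, one summed vector per token.
    """
    ids = np.asarray(token_ids)
    positions = np.arange(len(ids))
    return E_content[ids] + E_position[positions]
```

The multi-vector variants described above would instead return the content and position embeddings separately rather than summing them.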

    [0056] The transformer 540 with one or more encoders 542, 544 processes the input data to generate an output, as disclosed in more detail in reference to FIGS. 6, 7, 9, and 10. In some embodiments, the output is applied to a softmax layer which is used to predict missing tokens by generating probabilities of one or more likely desired tokens. In some instances, the computing system utilizes the output generated by the transformer to build or train a PLM. For example, the output from the transformer may be used to provide a pre-trained language model 526.

    [0057] FIG. 6 illustrates a simplified encoder 610 that may be utilized by the computing system 500. A computing system identifies a plurality of tokens in a series as input 601 to be embedded as the input embedding 602.

    [0058] The input embedding 602 is applied to the encoder 610 comprising a self-attention sub-layer 612 and a feed-forward sub-layer 614. The self-attention sub-layer 612 includes an attention mechanism that generates and applies an attention score 620. In some instances, the encoder 610 includes a plurality of encoding layers, each with its own feed-forward sub-layer and self-attention sub-layer. Encoders with multiple encoding layers will be discussed in more detail in regard to FIGS. 9 and 10.

    [0059] In some systems, position information is encoded as the positional encoding 603. In these systems, the positional encoding 603 is embedded with the input embedding 602 prior to applying the input embedding 602 to the encoder 610.

    [0060] The attention score in a typical transformer is calculated as shown below:

    [00001] Q=HW.sub.q, K=HW.sub.k, V=HW.sub.v, A=QK.sup.T/√d, H.sub.o=softmax(A)V

    where H∈R.sup.N×d represents hidden input vectors; H.sub.o∈R.sup.N×d represents the outputs of self-attention including content embedding; Q, K, and V denote the query, key, and value vectors or matrices; W.sub.q,W.sub.k,W.sub.v∈R.sup.d×d represent the projection matrices; A∈R.sup.N×N represents the attention matrix; N represents the length of the input token sequence; and d represents the dimensions of the hidden state.
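For readers implementing this, the standard attention computation above can be sketched in numpy as follows (function and variable names are illustrative, not the claimed implementation):

```python
import numpy as np

def standard_attention(H, W_q, W_k, W_v):
    """Standard scaled dot-product self-attention.

    H           : (N, d) hidden input vectors.
    W_q/W_k/W_v : (d, d) projection matrices.
    Returns H_o : (N, d) self-attention outputs.
    """
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d = H.shape[-1]
    A = Q @ K.T / np.sqrt(d)                  # (N, N) attention matrix
    # row-wise softmax (shifted for numerical stability)
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V
```

Note that here the position information, if any, must already be folded into H, which is exactly the limitation the disentangled attention below addresses.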

    [0061] However, conventional systems that apply the position embedding to the input data, prior to applying the input data to the encoder, using the aforementioned attention score formulas, weaken the position information, as discussed previously.

    [0062] To address the potential weakening of the positional information, some embodiments apply the positional bias information within the encoder 610 by applying the position information at the self-attention sub-layer 612 through the use of the attention score 620, which incorporates both content and position embedding. Some existing systems attempt to provide disentanglement of the attention score 620 by dividing the attention score 620 into four components (e.g., a content-to-content attention score, a content-to-position attention score, a position-to-content attention score, and a position-to-position attention score). Some existing systems implement an attention score that utilizes position information in one or more attention score components. However, even though the attention score 620 has been at least partially disentangled into four components, those components still utilize the same projection matrices. More particularly, the same projection matrices W.sub.q and W.sub.k are used for both the content embeddings and the position embeddings. Thus, the content information and position information still remain relatively entangled with this conventional implementation.

    [0063] In at least one embodiment, the attention weight is calculated using the formula below. In at least one embodiment, the embedder 530 generates two embedded input vectors which represent a single token: a content vector {H.sub.i} and a relative position vector {P.sub.i|j}. In this manner, the attention weight of a token pair can be calculated as the sum of four attention score components, namely, content-to-content, content-to-position, position-to-content, and position-to-position, as shown below:


    A.sub.i,j={H.sub.i,P.sub.i|j}×{H.sub.j,P.sub.j|i}.sup.T=H.sub.iH.sub.j.sup.T+H.sub.iP.sub.j|i.sup.T+P.sub.i|jH.sub.j.sup.T+P.sub.i|jP.sub.j|i.sup.T

    [0064] As described in the foregoing, a content embedding signal is typically stronger than a position embedding signal. By utilizing the same projection matrices for both content encoding and position encoding, the positional bias information may be overwhelmed or lost within an encoder, particularly an encoder that utilizes stacked encoding layers, which thereby weakens the sequence dependency of the attention mechanism and effectively limits the accuracy of the resulting model. Accordingly, improvements over such conventional systems are needed.

    [0065] FIG. 7A illustrates one embodiment of an improved encoder 710a that effectively disentangles the content embedding from the position embedding within the attention score 720a. The attention score 720a is fully disentangled in this disclosed implementation by introducing learn-able projection matrices specific to the content embedding (e.g., W.sub.q,H, W.sub.k,H), as well as to the position embedding (e.g., W.sub.q,P, W.sub.k,P).

    [0066] The attention score 720a of FIG. 7A, for example, with a full set of disentangled projection matrices, is calculated with the following formula, in which P.sub.i,j denotes the relative distance (e.g. relative position) between tokens i and j, such as referenced in FIG. 2:


    A.sub.i,j=H.sub.iW.sub.q,HW.sub.k,H.sup.TH.sub.j.sup.T+H.sub.iW.sub.q,HW.sub.k,P.sup.TP.sub.i,j.sup.T+P.sub.i,jW.sub.q,PW.sub.k,H.sup.TH.sub.j.sup.T+P.sub.i,jW.sub.q,PW.sub.k,P.sup.TP.sub.i,j.sup.T

    [0067] As shown in this implementation, the attention score 720a is a summation of four distinct attention score components, namely, a content-to-content component H.sub.iW.sub.q,HW.sub.k,H.sup.TH.sub.j.sup.T, a content-to-position component H.sub.iW.sub.q,HW.sub.k,P.sup.TP.sub.i,j.sup.T, a position-to-content component P.sub.i,jW.sub.q,PW.sub.k,H.sup.TH.sub.j.sup.T, and a position-to-position component P.sub.i,jW.sub.q,PW.sub.k,P.sup.TP.sub.i,j.sup.T. By utilizing this set of distinct components, the attention score is more fully disentangled relative to prior systems, such that the consideration of the position information persists throughout the encoding processing, independently of the corresponding content information, without being negated or muted during processing due to entanglement within the transformer.
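The four-component summation above can be illustrated for a single token pair as follows. This is a minimal numpy sketch with illustrative names; in practice the learn-able matrices would be trained parameters:

```python
import numpy as np

def disentangled_attention_score(H_i, H_j, P_ij, Wq_H, Wk_H, Wq_P, Wk_P):
    """Attention score for one token pair with fully disentangled
    projections.

    H_i, H_j    : (d,) content embeddings of tokens i and j.
    P_ij        : (d,) relative position embedding for the pair (i, j).
    Wq_H, Wk_H  : (d, d) content-specific query/key projections.
    Wq_P, Wk_P  : (d, d) position-specific query/key projections.
    Returns the summed score and its four components.
    """
    c2c = H_i @ Wq_H @ Wk_H.T @ H_j      # content-to-content
    c2p = H_i @ Wq_H @ Wk_P.T @ P_ij     # content-to-position
    p2c = P_ij @ Wq_P @ Wk_H.T @ H_j     # position-to-content
    p2p = P_ij @ Wq_P @ Wk_P.T @ P_ij    # position-to-position
    return c2c + c2p + p2c + p2p, (c2c, c2p, p2c, p2p)
```

Because each component uses its own projection pair, zeroing the position projections leaves the content-to-content term unchanged, which is what "disentangled" means here.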

    [0068] Additionally, in at least one embodiment, the computing system applies learn-able relative position encoding instead of fixed sinusoid encoding. In at least one embodiment, P.sub.i,j is a learn-able parameter. In at least one embodiment, the fourth component (P.sub.i,jW.sub.q,PW.sub.k,P.sup.TP.sub.i,j.sup.T) is a global position-to-position bias which is independent of the content embedding. It will be appreciated that this independent global position-to-position bias is distinct from other models that incorporate relative position bias. In at least one additional or alternative embodiment, the global position-to-position bias utilizes the absolute position of token i, denoted as P.sub.i, in the application of the aforementioned formula.

    [0069] In some embodiments, the encoder 710a generates an output that can be utilized by the computing system to train or build a PLM model. In some embodiments, the encoder output is applied to a decoder to make NLP predictions and/or to perform other NLP or machine learning operations.

    [0070] In at least one embodiment, the attention weight utilizes a maximum relative distance for the relative position embedding, as shown in the formula below:

    [00002] δ(i,j) = 0 for i−j≤−k; 2k−1 for i−j≥k; i−j+k otherwise

    where k represents the maximum relative distance and δ(i,j)∈[0, 2k) represents the relative distance from token i to token j.
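The bucketing function δ(i,j) above is straightforward to implement; the following sketch uses an illustrative name:

```python
def relative_distance(i, j, k):
    """Bucketed relative distance delta(i, j) in [0, 2k).

    k is the maximum relative distance; token pairs farther apart
    than k are clipped to the boundary buckets 0 and 2k - 1.
    """
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k
```

For example, with k=3, tokens ten positions ahead or behind both clip to a boundary bucket, so only 2k distinct relative position embeddings are ever needed.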

    [0071] FIG. 7B illustrates a related embodiment in which the encoder 710b utilizes an attention score 720b that omits the global position-to-position component from the summation of the other three components. For instance, the disentangled self-attention score in this embodiment is calculated by the following formula:


    Ã.sub.i,j=Q.sub.i.sup.cK.sub.j.sup.cT+Q.sub.i.sup.cK.sub.δ(i,j).sup.rT+K.sub.j.sup.cQ.sub.δ(j,i).sup.rT

    where Ã.sub.i,j is the element of the attention matrix representing the attention score from token i to token j. In addition, Q.sub.i.sup.c is the i.sup.th row of Q.sub.c, K.sub.j.sup.c is the j.sup.th row of K.sub.c, K.sub.δ(i,j).sup.r is the δ(i,j).sup.th row of K.sub.r with regard to relative distance δ(i,j), and Q.sub.δ(j,i).sup.r is the δ(j,i).sup.th row of Q.sub.r with regard to relative distance δ(j,i).

    [0072] In this formula, Q.sub.c=HW.sub.q,c, K.sub.c=HW.sub.k,c, V.sub.c=HW.sub.v,c, Q.sub.r=PW.sub.q,r, K.sub.r=PW.sub.k,r, wherein Q.sub.c, K.sub.c, and V.sub.c are the projected content vectors generated using projection matrices W.sub.q,c, W.sub.k,c, W.sub.v,c∈R.sup.d×d, respectively; P∈R.sup.2k×d represents the relative position embedding vectors shared across all layers (i.e. staying fixed during forward propagation); and Q.sub.r and K.sub.r are the projected relative position vectors generated using projection matrices W.sub.q,r, W.sub.k,r∈R.sup.d×d, respectively.
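Putting the three-component score and the δ(i,j) bucketing together, a naive (unvectorized) numpy sketch might look like the following; all names are illustrative, and an efficient implementation would gather over δ in bulk as discussed in paragraph [0074]:

```python
import numpy as np

def disentangled_attention_matrix(H, P_rel, Wq_c, Wk_c, Wq_r, Wk_r, k):
    """Three-component disentangled attention scores
    (global position-to-position term omitted, as in FIG. 7B).

    H     : (N, d) content hidden states.
    P_rel : (2k, d) shared relative position embedding table.
    Returns A_tilde: (N, N) unnormalized attention scores.
    """
    N, d = H.shape
    Qc, Kc = H @ Wq_c, H @ Wk_c          # projected content vectors
    Qr, Kr = P_rel @ Wq_r, P_rel @ Wk_r  # projected relative-position vectors

    def delta(i, j):                     # bucketed distance in [0, 2k)
        if i - j <= -k:
            return 0
        if i - j >= k:
            return 2 * k - 1
        return i - j + k

    A = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            A[i, j] = (Qc[i] @ Kc[j]                 # content-to-content
                       + Qc[i] @ Kr[delta(i, j)]     # content-to-position
                       + Kc[j] @ Qr[delta(j, i)])    # position-to-content
    return A
```

The output would then be scaled by 1/√(3d) and passed through a row-wise softmax before multiplying by V.sub.c, per the stabilization step described in paragraph [0073].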

    [0073] Finally, in at least one embodiment, a scaling factor of

    [00003] 1/√(3d)

    is also applied to Ã to stabilize model training for large-scale PLMs, as represented by

    [00004] H.sub.o=softmax(Ã/√(3d))V.sub.c

    [0074] With regard to the calculation of attention weight and attention score, it is desirable in some instances to reduce the space complexity and memory requirements needed by the computing system 500 to complete the attention weight or attention score computations. In some embodiments, improvements to the space complexity and reductions in memory requirements include causing the computing system 500 to refrain from storing a relative position embedding for each query. Instead, in some instances, the computing system utilizes one or more subsets of the key, value, or query vectors (e.g. K, V, or Q) to extract the relative position embedding and calculate the attention score for all queries. In at least one embodiment, the computing system 500 utilizes the relative distance δ as an index in extracting attention weights while utilizing a subset of either the key, value, or query vector. An example of an efficient implementation of a disentangled attention, and a corresponding algorithm, is included in U.S. Provisional Patent Application Ser. No. 63/035,315, filed on Jun. 5, 2020, and entitled “DEBERTA: DECODING-ENHANCED BERT WITH A DISENTANGLED ATTENTION,” which has been incorporated by reference in its entirety.

    [0075] FIG. 8 illustrates a flow chart of the various acts associated with the disclosed systems and methods where the computing system disentangles position information from content information in the self-attention sublayer 720b while encoding data with a transformer.

    [0076] It will be appreciated, with regard to the flow charts shown in FIGS. 8, 11 and 12, that the following discussion now refers to a number of methods and method acts that may be performed. Although the method acts may be discussed in a certain order or illustrated in a flow chart as occurring in a particular order, no particular ordering is required unless specifically stated or required because one act is dependent on another act being completed prior to the act being performed.

    [0077] As shown, the disclosed embodiments include a computer system identifying a plurality of tokens to be encoded (act 810) and embedding the plurality of tokens to generate input data (act 820). The computing system will obtain a transformer with one or more encoders (act 830). The computing system then applies the input data to the encoder, the encoder comprising a self-attention sub-layer and a feed-forward sub-layer (act 840). Finally, the computing system obtains and uses the encoder output, after generating and applying an attention score to the self-attention sub-layer, by at least disentangling position bias information from content information that is associated with the input data (act 850).

    [0078] FIG. 9 illustrates one embodiment of an encoder 910 that comprises a plurality of encoding layers 911, 912, 913, 914, 915, etc. Each encoding layer 911, 912, 913, 914, 915 comprises a self-attention sub-layer and a feed-forward sub-layer as disclosed in regard to FIGS. 6 and 7. The ellipsis 916 indicates that the encoder could have any number of encoding layers. In some embodiments, the encoder has six encoding layers. In some embodiments, the encoder has twelve encoding layers. In some embodiments, the encoder has 24 encoding layers.

    [0079] In some embodiments, the first encoding layer 911 applies, as input, the input embedding as illustrated in FIG. 6 and described in the corresponding discussion. In some embodiments, each encoding layer (912, 913, 914, 915, etc.) following the first encoding layer 911 applies, as input, the output of the encoding layer below it. For example, encoding layer 912 applies, as input, the output of encoding layer 911, and so forth. In some embodiments, one or more encoding layers may apply, as input or as additional input, the output of any encoding layer.

    [0080] In many instances, a computing system may utilize the output of the final encoding layer 915 to self-train a model produced by the transformer. For example, BERT utilizes the output of final encoding layer 915 to self-train the model. Generally, BERT utilizes a masked language model (MLM) that enables a computing system to learn bi-directional representations of natural language and to self-train the PLM. Standard BERT pre-training consists of applying the final hidden vectors from the final encoding layer 915, namely the hidden vectors corresponding to the masked tokens, to an output softmax 917 over the vocabulary to reconstruct the masked tokens. The computing system then trains and updates the model based on the accuracy of the predictions (e.g. correct and incorrect predictions).

    [0081] The following discussion is a more detailed description of how BERT reconstructs masked tokens in self-training. The typical output of encoder layer l∈[0, L) is shown below:


    Ō.sup.l=Attention(H.sup.l-1W.sub.q.sup.l,H.sup.l-1W.sub.k.sup.l,H.sup.l-1W.sub.v.sup.l)


    O.sup.l=LayerNorm(Linear(Ō.sup.l)+H.sup.l-1)


    H.sup.l=LayerNorm(PosFNN(O.sup.l)+O.sup.l)

    where L is the total number of transformer layers and H.sup.l={h.sub.i.sup.l} is the output of the l.sup.th layer, with h.sub.i.sup.l corresponding to the hidden state of the i.sup.th token.
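The per-layer equations above can be sketched as follows: a simplified single-head numpy version with illustrative names, in which PosFNN is modeled as a two-layer position-wise feed-forward network with a ReLU activation (an assumption; the exact activation is not specified here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-row layer normalization (no learned scale/shift, for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(H_prev, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    """One transformer encoder layer: attention, residual + LayerNorm,
    position-wise FFN, residual + LayerNorm.

    H_prev : (N, d) output of the previous layer (H^{l-1}).
    Wo     : (d, d) projection of the Linear(...) step.
    W1/b1, W2/b2 : parameters of the position-wise FFN.
    """
    d = H_prev.shape[-1]
    Q, K, V = H_prev @ Wq, H_prev @ Wk, H_prev @ Wv
    A = Q @ K.T / np.sqrt(d)
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    O_bar = A @ V                                  # attention output
    O = layer_norm(O_bar @ Wo + H_prev)            # residual + LayerNorm
    ffn = np.maximum(O @ W1 + b1, 0) @ W2 + b2     # PosFNN (ReLU assumed)
    return layer_norm(ffn + O)                     # H^l
```

Stacking this function L times over an input embedding reproduces the H.sup.l recursion above.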

    [0082] For example, where a portion of a sequence X is randomly corrupted, with the corrupted tokens denoted {circumflex over (x)}.sub.i and the corrupted sequence denoted X̄, the objective of the computing system is to reconstruct the corrupted tokens from X̄ utilizing the output of the final encoding layer 915. Masked token reconstruction is shown below:

    [00005]

$$\max_{\theta} \log p_{\theta}(X \mid \bar{X}) = \sum_{i \in K} \log p_{\theta}(\hat{x}_{i} = x_{i} \mid \bar{X}) = \sum_{i \in K} \log \frac{e^{h_{i}^{L-1} \cdot e_{i}}}{\sum_{j} e^{h_{i}^{L-1} \cdot e_{j}}}$$

    where $K$ is the set of indices of the masked tokens in the sequence, $e_{i}$ is the embedding of the $i^{\text{th}}$ token in the sequence, $e_{j}$ is the embedding of the $j^{\text{th}}$ token in the whole vocabulary, and $h_{i}^{L-1}$ is the hidden state of the masked token in the output of the last transformer layer.
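    For illustration only, the masked-token reconstruction objective above can be sketched in NumPy as shown below. The vocabulary size, hidden size, and the particular masked indices and target tokens are hypothetical:

```python
import numpy as np

def mlm_log_likelihood(h, E, masked_ids, masked_targets):
    """Sum over masked positions i in K of
    log( exp(h_i . e_{x_i}) / sum_j exp(h_i . e_j) )."""
    logits = h[masked_ids] @ E.T                      # (|K|, vocab)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return log_probs[np.arange(len(masked_ids)), masked_targets].sum()

rng = np.random.default_rng(1)
vocab, d = 20, 8
E = rng.standard_normal((vocab, d))                   # token embedding table
h = rng.standard_normal((6, d))                       # final-layer hidden states
ll = mlm_log_likelihood(h, E, masked_ids=[1, 4], masked_targets=[3, 7])
print(float(ll) <= 0.0)  # True: a log-likelihood is never positive
```

    Maximizing this quantity with respect to the model parameters corresponds to the training and updating of the model described in paragraph [0080].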

    [0083] Some embodiments are directed at an Enhanced Mask Decoder (EMD) that improves pre-training of the PLM by introducing task-specific decoding layers in order to mitigate the mismatch between pre-training and fine-tuning (e.g., task-specific training of a pre-trained model). An EMD is a task-specific decoder designed to reconstruct the masked tokens of an MLM. In some embodiments, the EMD has a plurality of decoding layers. In at least one embodiment, one or more hidden vectors that were output from a decoder layer are reapplied to the decoder layer to generate a new output from the decoder layer. In some embodiments, when the desired final output is probabilities, the output of the decoding layers is applied to a softmax layer.

    [0084] In some embodiments, features of BERT and the EMD are combined, so that the resulting model has an encoder-decoder transformer structure with multiple layers. For example, in some instances, the transformer has 12 layers with 12 attention heads. In other instances, the transformer has 24 layers and 16 attention heads. In such instances, the encoder may have the same quantity of layers as the decoder or more layers than the decoder.

    [0085] FIG. 10 illustrates one embodiment of a transformer architecture with an EMD. In at least one embodiment, the first N−1 layers of the encoder are e-encoder layers 1012, 1013, 1014, 1015, and 1016, where N denotes the total number of encoding layers within the encoder 1010, and the final layer is an EMD or e-decoder layer 1011. Ellipsis 1017 illustrates that other embodiments may have any number of encoding layers 1012, 1013, 1014, 1015, 1016. In at least one embodiment, there are eleven e-encoding layers. In at least one embodiment, there are twenty-three (23) e-encoding layers 1012, 1013, 1014, 1015, 1016.

    [0086] In at least one embodiment, the e-decoder 1011 is used to improve the self-training of the PLM. In at least one embodiment, the e-decoder 1011 is used to improve the self-training of an MLM. In at least one embodiment, the e-decoder is used to produce token-wise contextual embeddings that are used to reconstruct the masked tokens in the MLM.

    [0087] In at least one embodiment, some or all of the output of the e-decoding layer 1011 is reapplied to the e-decoding layer 1011 as additional input. Therefore, in at least one embodiment, the e-decoding layer 1011 obtains its inputs from the output of the e-decoding layer 1011 as well as from the output of one or more encoding layers (e.g., encoding layer 1016). In at least one embodiment, the computing system 500 obtains output from the e-decoder layer 1011 comprising hidden vectors corresponding to the masked tokens, and applies one or more of the hidden vectors as additional input to the e-decoder 1011 to generate new output from the e-decoder 1011.

    [0088] In some embodiments, the computing system obtains one or more projected content vectors and/or projected relative position vectors from the output of the e-decoder 1011 and applies the one or more projected content vectors and/or projected relative position vectors to the input of the e-decoder 1011 to generate new output from the e-decoder 1011. For example, in one embodiment, the computing system obtains a queries matrix (Q) from the output of the e-decoding layer 1011, and computing system 500 obtains the key (K) and value (V) matrices from the final e-encoder layer 1016. The Q, K, and V matrices are then applied as input to the e-decoding layer 1011. In this manner, the Q output of the e-decoder layer 1011 can be used by the computing system to generate a new Q from the e-decoder layer, or the Q output can be used by the computing system to self-train a PLM. In at least one embodiment, the Q output from the e-decoder is utilized to reconstruct masked tokens for an MLM.

    [0089] In some embodiments, the system iteratively applies the Q output of the e-decoder as additional input to the e-decoder 1011. This may occur numerous times (e.g., 2, 3, 4 . . . 10 . . . 20, 20+ times, or any number of times). In this manner, the same K and V of the final e-encoder layer 1016 will be applied to the e-decoder in each iteration of the multi-step method, while the Q output from the e-decoder 1011 will update during each iteration. Each updated Q is applied as additional input to the e-decoding layer 1011 to generate a new output from the e-decoder 1011. In at least one embodiment, the Q output from the e-decoder 1011 is applied only once to the e-decoder as additional input to generate new e-decoder output.

    [0090] In at least one embodiment, only the hidden states of the masked tokens $h_{i}^{L-1}$ are used during the calculation of the MLM loss. By ignoring the last two components of the formula, the output is effectively a weighted sum of the outputs of the e-encoder layers 1012, 1013, 1014, 1015, 1016, with an attention score as the weight. From this point of view, at least one embodiment includes a multi-step e-decoder that causes the e-encoding layers 1012, 1013, 1014, 1015, 1016 to learn a better representation of the input sequence $X=\{x_{i}\}$ and that causes the e-decoding layer 1011 to reconstruct the corrupted tokens more accurately through multiple steps, as shown in FIG. 10.

    [0091] During this multi-step decoding, the e-decoder can utilize the following formulas:


$$Q^{s-1}=H_{de}^{s-1}$$

$$\bar{O}^{s}=\mathrm{Attention}(Q^{s-1}W_{q}^{L-1},\;H_{en}^{n-1}W_{k}^{L-1},\;H_{en}^{n-1}W_{v}^{L-1})$$

$$O^{s}=\mathrm{LayerNorm}(\mathrm{Linear}(\bar{O}^{s})+Q^{s-1})$$

$$H_{de}^{s}=\mathrm{LayerNorm}(\mathrm{PosFNN}(O^{s})+O^{s})$$

    where $H_{de}^{s}=\{h_{de,i}^{s}\}_{i\in K}$ is the output of decoding step $s$, and when $s=0$, $H_{de}^{s-1}$ is the output of the last layer of the encoder, which has a total of $n=L-1$ layers.
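    By way of illustration only, the multi-step decoding formulas above can be sketched in NumPy as follows. The sketch assumes single-head attention and hypothetical weight shapes; the key point it shows is that K and V come from the final encoder layer and remain static, while Q is recomputed from the decoder output at every step:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def multi_step_decode(H_en, Wq, Wk, Wv, Wo, W1, W2, steps=3):
    """K and V are projected once from the final encoder output and held
    fixed; Q^{s-1} = H_de^{s-1} is re-projected and reapplied each step."""
    K, V = H_en @ Wk, H_en @ Wv                  # static across all steps
    H_de = H_en                                  # s = 0: decoder input is encoder output
    for _ in range(steps):
        Q = H_de @ Wq
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        O = layer_norm((A @ V) @ Wo + H_de)      # O^s = LayerNorm(Linear(O_bar) + Q^{s-1})
        H_de = layer_norm(np.maximum(0.0, O @ W1) @ W2 + O)   # PosFNN + residual
    return H_de

rng = np.random.default_rng(2)
d = 8
H_en = rng.standard_normal((4, d))               # output of the last encoder layer
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(6)]
H_de = multi_step_decode(H_en, *Ws, steps=3)
print(H_de.shape)  # (4, 8)
```

    In back-propagation, the static K and V participate in every step's forward computation, so their accumulated gradients reflect all decoding steps, consistent with the advantage described in paragraph [0095].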

    [0092] When this formula is applied to the pre-trained model for downstream task adaptation, at least one embodiment uses a one-step task head to query over the output of the last encoder layer, $H^{n-1}$.

    [0093] In at least one embodiment, the e-decoder layer 1011 is used to reconstruct the masked tokens of an MLM in denoise mode, as opposed to the typical auto-regressive mode. “Denoise mode” is the masked language modeling training method described earlier, and “auto-regressive mode” is a training method in which a PLM attempts to predict missing tokens sequentially.

    [0094] In at least one embodiment, the output of the e-decoder 1011 is applied to a softmax layer to provide probabilities and reconstruct masked tokens.

    [0095] The multi-step e-decoding layer 1011 has multiple technological advantages over existing systems. First, when compared with the single-step approach, the final output Q has a deeper understanding of the original K and V from varied and different perspectives, similar to the idea of multi-step reasoning. This can lead to better self-training, such as better prediction of the masked tokens and improved convergence of the model. Second, it can push more objective-oriented information back to the static K and V during training and modification of the model (e.g., back-propagation). Because K and V have interacted with Q multiple times in forward propagation, the accumulated gradients of K and V better capture the feedback signal in the objective function from all the steps. The multi-step e-decoder layer 1011 can thereby help the computing system learn a better representation for all the e-encoder layers.

    [0096] The following discussion relates to the specifics of how an MLM is typically implemented and to embodiments which improve on the MLM. Generally, BERT utilizes the MLM by masking fifteen percent (15%) of the tokens or words within a sequence, processing the language of the sequence, attempting to predict the masked words, and then unmasking the words and updating the model to make better predictions. Most of the randomly selected masked tokens are replaced with a [MASK] token. However, ten percent (10%) of the tokens selected for masking remain unchanged in order to mitigate the mismatch between pre-training (self-training) and fine-tuning. This method is limited by information leaking (i.e., predicting a masked token conditioned on the token itself). Accordingly, improvements over such masking techniques are desired.
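    For illustration only, a typical BERT-style corruption scheme of the kind described above can be sketched as follows. The 80/10/10 split among [MASK] replacement, random replacement, and keeping the token unchanged is the conventional BERT recipe, assumed here for concreteness:

```python
import random

MASK = "[MASK]"

def corrupt(tokens, mask_rate=0.15, seed=0):
    """Select ~15% of positions for prediction; replace most selected
    tokens with [MASK], some with a random token, and keep ~10%
    unchanged (which is the source of the information-leaking issue)."""
    rng = random.Random(seed)
    out = list(tokens)
    masked = []                                  # positions the model must predict
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            masked.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = MASK                    # 80%: [MASK] token
            elif r < 0.9:
                out[i] = rng.choice(tokens)      # 10%: random replacement
            # else: 10% left unchanged but still predicted
    return out, masked

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 10
out, masked = corrupt(tokens)
print(len(out) == len(tokens))  # True: corruption preserves sequence length
```

    Positions left unchanged are still scored by the MLM objective, which is precisely the leakage that the embodiment of paragraph [0097] addresses.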

    [0097] Disclosed embodiments can provide improvements over the foregoing techniques, in some instances, by replacing a portion of the output from the final layer e-encoder 1016 with new inputs. For instance, in at least one embodiment, the portion of the output of layer 1016 is replaced with the corresponding absolute position embedding vectors prior to being applied to the e-decoder layer 1011, wherein the portion of the output of layer 1016 corresponds to the masked (and unchanged) tokens in an MLM. This can help prevent the aforementioned information leaking.
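    By way of illustration only, the replacement described in paragraph [0097] can be sketched as follows. The array names and shapes are hypothetical; the sketch simply swaps the final-encoder outputs at the masked (and kept-unchanged) positions for their absolute position embeddings before they reach the e-decoder:

```python
import numpy as np

def replace_with_abs_pos(H_enc, abs_pos_emb, masked_ids):
    """Replace final-encoder outputs at masked positions with the
    corresponding absolute position embedding vectors, so the
    e-decoder cannot condition on the token's own identity."""
    H = H_enc.copy()
    H[masked_ids] = abs_pos_emb[masked_ids]
    return H

rng = np.random.default_rng(4)
H_enc = rng.standard_normal((6, 4))              # output of final e-encoder layer
abs_pos = rng.standard_normal((6, 4))            # absolute position embedding table
H_in = replace_with_abs_pos(H_enc, abs_pos, [1, 3])
print(H_in.shape)  # (6, 4)
```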

    [0098] FIG. 11 illustrates a flow chart of the various acts associated with the disclosed methods in which the computing system utilizes a multi-step e-decoding layer 1011 while encoding data with a transformer to improve self-training of a PLM.

    [0099] The computing system first identifies a plurality of tokens to be encoded (act 1110) and embeds the plurality of tokens to generate encoder input data (act 1120). The computing system also obtains a transformer with one or more encoders (act 1130). The computing system then applies the input data to the encoder comprising a plurality of encoding layers (act 1140). The computing system then applies an output of a final encoding layer as additional input to the final encoding layer to generate a new output from the encoding layer (act 1150). Finally, the computing system will obtain and use the new output from the encoder (act 1160).

    [0100] FIG. 12 illustrates a flow chart of the various acts associated with the disclosed methods in which the computing system disentangles position information from content information in the self-attention sublayer 720b and utilizes a multi-step e-decoding layer 1011 while encoding data with a transformer to capture position information and improve self-training of a PLM.

    [0101] The computing system first identifies a plurality of tokens to be encoded (act 1210) and embeds the plurality of tokens to generate encoder input data (act 1220). The computing system also obtains a transformer with one or more encoders (act 1230). The computing system then applies the input data to the encoder comprising a plurality of encoding layers, where each of the plurality of encoding layers comprises a self-attention sub-layer and a feed-forward sub-layer (act 1240). The computing system then generates and applies an attention score to one or more self-attention sub-layers of one or more encoding layers by at least disentangling position bias information from content information that is associated with the input data (act 1250). The computing system then applies an output of a final encoding layer as additional input to the final encoding layer to generate a new output from the encoding layer (act 1260). Finally, the computing system obtains and uses an output from the encoder (act 1270).
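    For illustration only, the disentangled attention score of act 1250, combining the three components recited in the claims (content-to-content, content-to-position, and position-to-content), can be sketched in NumPy as follows. The relative-position indexing scheme and all shapes are hypothetical simplifications:

```python
import numpy as np

def disentangled_scores(Hc, P_rel, Wq_c, Wk_c, Wq_r, Wk_r):
    """Score for token pair (i, j) as a sum of three components:
    content-to-content + content-to-position + position-to-content,
    so positional bias is disentangled from content."""
    Qc, Kc = Hc @ Wq_c, Hc @ Wk_c                # projected content vectors
    Qr, Kr = P_rel @ Wq_r, P_rel @ Wk_r          # projected relative-position vectors
    n = Hc.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = i - j + (n - 1)                  # index into relative-position table
            A[i, j] = (Qc[i] @ Kc[j]             # content-to-content
                       + Qc[i] @ Kr[d]           # content-to-position
                       + Kc[j] @ Qr[d])          # position-to-content
    return A

rng = np.random.default_rng(3)
n, d = 5, 8
Hc = rng.standard_normal((n, d))                 # content embeddings for 5 tokens
P_rel = rng.standard_normal((2 * n - 1, d))      # relative position embeddings
Ws = [rng.standard_normal((d, d)) for _ in range(4)]
A = disentangled_scores(Hc, P_rel, *Ws)
print(A.shape)  # (5, 5)
```

    These scores would then be scaled and passed through a softmax to produce the attention weights applied in the self-attention sub-layer.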

    [0102] It will be appreciated, with regard to the foregoing, that the disclosed embodiments may be incorporated in and/or by a computer system that includes one or more processors and computer-readable media such as computer memory or other hardware storage devices that store computer-executable instructions that when executed by one or more processors cause the various disclosed functions to be performed.

    [0103] The disclosed embodiments also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are hardware storage media. Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the invention can comprise at least two distinctly different kinds of computer-readable media: (1) physical computer-readable hardware storage media and (2) transmission computer-readable media, which are distinct and different from each other.

    [0104] Physical computer-readable hardware storage media includes RAM, ROM, EEPROM, CD-ROM or other optical disk storage (such as CDs, DVDs, etc.), magnetic disk storage or other magnetic storage devices, or any other hardware storage devices which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.

    [0105] A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above are also included within the scope of computer-readable media.

    [0106] Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission computer-readable media to physical computer-readable storage media (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer-readable physical storage media at a computer system. Thus, computer-readable physical storage media can be included in computer system components that also (or even primarily) utilize transmission media.

    [0107] Computer-executable instructions comprise, for example, instructions and data which cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

    [0108] Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like. The invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

    [0109] Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.

    [0110] The present invention may be embodied in other specific forms without departing from its spirit or characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.