Structured adversarial training for natural language machine learning tasks
11544472 · 2023-01-03
Assignee
Inventors
- Edward P. Stabler (San Jose, CA, US)
- Benjamin Goldsmith (San Jose, CA, US)
- Hendrik Harkema (Santa Clara, CA, US)
Cpc classification
G06F18/214
PHYSICS
G06F18/2148
PHYSICS
G06V10/768
PHYSICS
International classification
Abstract
A method includes obtaining first training data having multiple first linguistic samples. The method also includes generating second training data using the first training data and multiple symmetries. The symmetries identify how to modify the first linguistic samples while maintaining structural invariants within the first linguistic samples, and the second training data has multiple second linguistic samples. The method further includes training a machine learning model using at least the second training data. At least some of the second linguistic samples in the second training data are selected during the training based on a likelihood of being misclassified by the machine learning model.
Claims
1. A method comprising: obtaining first training data comprising multiple first linguistic samples, the first linguistic samples contained in first dialogue samples associated with a context; generating second training data using the first training data and multiple symmetries, the symmetries identifying how to modify the first linguistic samples while maintaining structural invariants within the first linguistic samples, the second training data comprising multiple second linguistic samples; generating second dialogue samples associated with the context, at least some of the second dialogue samples containing the second linguistic samples; and training a machine learning model using at least the second training data and the second dialogue samples, wherein at least some of the second linguistic samples in the second training data are selected during the training based on a likelihood of being misclassified by the machine learning model.
2. The method of claim 1, wherein the multiple symmetries comprise: substitution symmetries in which words or phrases in the first linguistic samples are replaced with other words or phrases; permutation symmetries in which words or phrases in the first linguistic samples are moved within the first linguistic samples; insertion/deletion symmetries in which words, phrases, or punctuations are added to or removed from the first linguistic samples; and character-level or word-level symmetries in which characters or words in the first linguistic samples are manipulated to create typographical or grammatical errors.
3. The method of claim 2, wherein: the substitution symmetries comprise at least one of: (i) replacing words or phrases in the first linguistic samples with equivalent words or phrases and (ii) substituting details in the first linguistic samples that are irrelevant to a task; the permutation symmetries comprise switching an order of words or phrases in the first linguistic samples; the insertion/deletion symmetries comprise at least one of: (i) inserting or removing articles or adjuncts in the first linguistic samples, (ii) inserting or removing politeness words or phrases in the first linguistic samples, and (iii) inserting or removing punctuation in the first linguistic samples; and the character-level or word-level symmetries comprise at least one of: (i) swapping characters or words in the first linguistic samples and (ii) adding or removing blank spaces in the first linguistic samples.
4. The method of claim 2, wherein the substitution symmetries comprise (i) replacing words or phrases in the first linguistic samples with equivalent words or phrases and (ii) substituting details in the first linguistic samples that are irrelevant to a task.
5. The method of claim 1, wherein generating the second training data comprises: applying the symmetries to the first linguistic samples to produce intermediate linguistic samples; filtering the intermediate linguistic samples to remove unnatural linguistic samples; and selecting one or more of the intermediate linguistic samples as the second linguistic samples for use in training the machine learning model, wherein the one or more selected intermediate linguistic samples (i) are relevant to a task associated with the first linguistic samples, (ii) lack new annotations relative to the first linguistic samples, and (iii) correct one or more misclassifications made by a prior version of the machine learning model.
6. The method of claim 1, wherein each of the structural invariants represents a linguistic object that cannot be replaced by another linguistic object while preserving how a meaning of an expression is determined, the structural invariants defined at least partially by a task to be learned by the machine learning model.
7. The method of claim 1, wherein generating the second dialogue samples comprises at least one of: reordering at least some of the first linguistic samples in the first dialogue samples while maintaining the context to provide permutation symmetry; and inserting one of the first dialogue samples into another of the first dialogue samples to provide interruption symmetry.
8. An apparatus comprising: at least one memory configured to store first training data comprising multiple first linguistic samples, the first linguistic samples contained in first dialogue samples associated with a context; and at least one processor configured to: generate second training data using the first training data and multiple symmetries, the symmetries identifying how to modify the first linguistic samples while maintaining structural invariants within the first linguistic samples, the second training data comprising multiple second linguistic samples; generate second dialogue samples associated with the context, at least some of the second dialogue samples containing the second linguistic samples; and train a machine learning model using at least the second training data and the second dialogue samples; wherein the at least one processor is configured to select at least some of the second linguistic samples in the second training data during the training based on a likelihood of being misclassified by the machine learning model.
9. The apparatus of claim 8, wherein the multiple symmetries comprise: substitution symmetries in which words or phrases in the first linguistic samples are replaced with other words or phrases; permutation symmetries in which words or phrases in the first linguistic samples are moved within the first linguistic samples; insertion/deletion symmetries in which words, phrases, or punctuations are added to or removed from the first linguistic samples; and character-level or word-level symmetries in which characters or words in the first linguistic samples are manipulated to create typographical or grammatical errors.
10. The apparatus of claim 9, wherein: the substitution symmetries comprise at least one of: (i) replacing words or phrases in the first linguistic samples with equivalent words or phrases and (ii) substituting details in the first linguistic samples that are irrelevant to a task; the permutation symmetries comprise switching an order of words or phrases in the first linguistic samples; the insertion/deletion symmetries comprise at least one of: (i) inserting or removing articles or adjuncts in the first linguistic samples, (ii) inserting or removing politeness words or phrases in the first linguistic samples, and (iii) inserting or removing punctuation in the first linguistic samples; and the character-level or word-level symmetries comprise at least one of: (i) swapping characters or words in the first linguistic samples and (ii) adding or removing blank spaces in the first linguistic samples.
11. The apparatus of claim 9, wherein the substitution symmetries comprise (i) replacing words or phrases in the first linguistic samples with equivalent words or phrases and (ii) substituting details in the first linguistic samples that are irrelevant to a task.
12. The apparatus of claim 8, wherein, to generate the second training data, the at least one processor is configured to: apply the symmetries to the first linguistic samples to produce intermediate linguistic samples; filter the intermediate linguistic samples to remove unnatural linguistic samples; and select one or more of the intermediate linguistic samples as the second linguistic samples for use in training the machine learning model, wherein the one or more selected intermediate linguistic samples (i) are relevant to a task associated with the first linguistic samples, (ii) lack new annotations relative to the first linguistic samples, and (iii) correct one or more misclassifications made by a prior version of the machine learning model.
13. The apparatus of claim 8, wherein, to generate the second dialogue samples, the at least one processor is configured to at least one of: reorder at least some of the first linguistic samples in the first dialogue samples while maintaining the context to provide permutation symmetry; and insert one of the first dialogue samples into another of the first dialogue samples to provide interruption symmetry.
14. The apparatus of claim 8, wherein each of the structural invariants represents a linguistic object that cannot be replaced by another linguistic object while preserving how a meaning of an expression is determined, the structural invariants defined at least partially by a task to be learned by the machine learning model.
15. A non-transitory computer readable medium containing instructions that when executed cause at least one processor to: obtain first training data comprising multiple first linguistic samples, the first linguistic samples contained in first dialogue samples associated with a context; generate second training data using the first training data and multiple symmetries, the symmetries identifying how to modify the first linguistic samples while maintaining structural invariants within the first linguistic samples, the second training data comprising multiple second linguistic samples; generate second dialogue samples associated with the context, at least some of the second dialogue samples containing the second linguistic samples; and train a machine learning model using at least the second training data and the second dialogue samples; wherein the instructions that when executed cause the at least one processor to generate the second training data comprise: instructions that when executed cause the at least one processor to select at least some of the second linguistic samples in the second training data during the training based on a likelihood of being misclassified by the machine learning model.
16. The non-transitory computer readable medium of claim 15, wherein the multiple symmetries comprise: substitution symmetries in which words or phrases in the first linguistic samples are replaced with other words or phrases; permutation symmetries in which words or phrases in the first linguistic samples are moved within the first linguistic samples; insertion/deletion symmetries in which words, phrases, or punctuations are added to or removed from the first linguistic samples; and character-level or word-level symmetries in which characters or words in the first linguistic samples are manipulated to create typographical or grammatical errors.
17. The non-transitory computer readable medium of claim 16, wherein: the substitution symmetries comprise at least one of: (i) replacing words or phrases in the first linguistic samples with equivalent words or phrases and (ii) substituting details in the first linguistic samples that are irrelevant to a task; the permutation symmetries comprise switching an order of words or phrases in the first linguistic samples; the insertion/deletion symmetries comprise at least one of: (i) inserting or removing articles or adjuncts in the first linguistic samples, (ii) inserting or removing politeness words or phrases in the first linguistic samples, and (iii) inserting or removing punctuation in the first linguistic samples; and the character-level or word-level symmetries comprise at least one of: (i) swapping characters or words in the first linguistic samples and (ii) adding or removing blank spaces in the first linguistic samples.
18. The non-transitory computer readable medium of claim 15, wherein the instructions that when executed cause the at least one processor to generate the second training data comprise: instructions that when executed cause the at least one processor to: apply the symmetries to the first linguistic samples to produce intermediate linguistic samples; filter the intermediate linguistic samples to remove unnatural linguistic samples; and select one or more of the intermediate linguistic samples as the second linguistic samples for use in training the machine learning model, wherein the one or more selected intermediate linguistic samples (i) are relevant to a task associated with the first linguistic samples, (ii) lack new annotations relative to the first linguistic samples, and (iii) correct one or more misclassifications made by a prior version of the machine learning model.
19. The non-transitory computer readable medium of claim 15, wherein the instructions that when executed cause the at least one processor to generate the second dialogue samples comprise: instructions that when executed cause the at least one processor to at least one of: reorder at least some of the first linguistic samples in the first dialogue samples while maintaining the context to provide permutation symmetry; and insert one of the first dialogue samples into another of the first dialogue samples to provide interruption symmetry.
20. The non-transitory computer readable medium of claim 15, wherein each of the structural invariants represents a linguistic object that cannot be replaced by another linguistic object while preserving how a meaning of an expression is determined, the structural invariants defined at least partially by a task to be learned by the machine learning model.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1) For a more complete understanding of this disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:
(2)
(3)
(4)
(5)
(6)
(7)
(8)
DETAILED DESCRIPTION
(9)
(10) As noted above, many natural language machine learning models cannot be trained effectively without the use of an extensive amount of annotated training data, which can be expensive to collect. As a result, effectively training a natural language machine learning model is often difficult, even if a domain associated with the machine learning model is restricted. With limited training data, the machine learning model can incorrectly classify many inputs and frustrate users of an application that operates based on the model.
(11) Consider the following example in which a semantic parser for a home Internet of Things (IoT) application uses a language model. In this type of application, users provide commands to automation controllers or other devices in order to initiate performance of actions related to IoT devices in the users' homes. The commands may relate to various actions, such as turning lights on or off, activating or deactivating security systems, playing or pausing music or video content, controlling televisions or speaker systems, answering doorbells, increasing or decreasing air conditioning or heating temperatures, or performing other functions. Parsing the users' commands correctly here is important in order to satisfy the users' intents, but it is extremely common for users (or even the same user) to express common commands in different ways.
(12) One common type of command provided by IoT users is “if-then” commands, each of which includes (i) a condition and (ii) a command to be performed only if the condition is satisfied. As a particular example, a user may provide a verbal or typed command such as “if the door is open, turn on the light.” Ideally, a semantic parser would parse this command into something like “[CONDITION if the [DEVICENAME door] is [DEVICESTATE open]] [GOAL POWERON turn on the [DEVICENAME light]].” This parsing effectively identifies the condition as being related to the state of the door and the action as being related to the state of the light. However, if the user provides a verbal or typed command such as “if the door is open turn on the light” (without a comma separating the condition and the action), even state-of-the-art parsers can incorrectly parse this command. For instance, a parser may actually identify the door as being the subject of the “turn on” action, which fails to properly identify the condition and action in the command.
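The target parse described above can be illustrated with a minimal Python sketch. The `Span` type, the `TARGET_PARSE` list, and the `normalize` and `parse_applies` helpers are hypothetical names introduced only for illustration and are not part of this disclosure; the sketch simply shows that the same labeled spans apply to the command with and without the comma.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """Hypothetical labeled span: a parse label and the text it covers."""
    label: str
    text: str


# Labeled spans mirroring the bracketed parse quoted in the text above.
TARGET_PARSE: List[Span] = [
    Span("CONDITION", "if the door is open"),
    Span("DEVICENAME", "door"),
    Span("DEVICESTATE", "open"),
    Span("GOAL POWERON", "turn on the light"),
    Span("DEVICENAME", "light"),
]


def normalize(utterance: str) -> str:
    """Drop commas and periods, lowercase, and collapse whitespace."""
    cleaned = utterance.replace(",", " ").replace(".", " ")
    return " ".join(cleaned.lower().split())


def parse_applies(utterance: str, spans: List[Span]) -> bool:
    """Return True if every labeled span's text occurs in the utterance."""
    text = normalize(utterance)
    return all(span.text in text for span in spans)


with_comma = "if the door is open, turn on the light."
without_comma = "if the door is open turn on the light"
print(parse_applies(with_comma, TARGET_PARSE))
print(parse_applies(without_comma, TARGET_PARSE))
```

Because both variants support the same labeled spans, the annotation of the first command can be reused for the second without new manual labeling.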
(13) Training a parser's language model with a dataset, even a large dataset, may not resolve these types of problems. While this particular problem (the lack of a comma) may be fixed by manually supplementing training data to include a command without a comma, there are numerous similar types of problems that can be experienced in day-to-day usage of natural language. While humans are easily capable of ignoring structurally-small variations in speech or text, extensively-trained cloud-based parsers or other natural language machine learning systems can routinely fail to properly classify inputs having structurally-small variations, even when a domain of interest is quite narrow.
(14) Not only that, many applications could be improved by supporting language interfaces that recognize dialogue contexts, particularly when a dialogue introduces new terms. A natural language model that forgets context from one utterance to the next can frustrate users. Also, it may be desirable in extreme cases for a natural language system to be competent in its interactions with users based on very little training data (sometimes just one use of a new phrase). Unlike the problems discussed above with respect to improperly-classified inputs caused by structurally-small variations, the problem here involves the need to train a machine learning model over unboundedly many dialogue contexts and user updates. This is typically not possible even with a large collection of annotated training data.
(15) This disclosure provides various techniques related to structured adversarial training for natural language machine learning tasks. As described in more detail below, these techniques recognize that initial linguistic samples used in initial training data for a natural language machine learning model have structural invariants that should be respected during the training. This information can be used to generate additional linguistic samples in additional training data for the natural language machine learning model, where the additional linguistic samples maintain the structural invariants. Moreover, these techniques recognize that it is often useful to identify the additional linguistic samples that are very similar to the initial linguistic samples in the initial training data but that are misclassified by a natural language machine learning model. These are referred to as “adversarial examples” and could be based on only structurally-small variations in the initial training data. By combining these features, certain additional linguistic samples may be generated with three general properties: (i) they are relevant to a machine learning task associated with original training data, (ii) they need no new annotations relative to the original training data, and (iii) they correct errors made by a machine learning model being trained for that task.
(16) Based on this, the techniques described below generate additional linguistic samples in additional training data for a natural language machine learning model based on initial linguistic samples in initial training data, and the additional training data is used to train the machine learning model. This is accomplished by incorporating a precise definition of structural invariants to be preserved in discrete domains of linguistic tasks. As a result, structural invariants used in the initial linguistic samples of the initial training data can be preserved in the automatically-generated additional linguistic samples used in the additional training data. These techniques can also incorporate a naturalness metric that allows unnatural additional linguistic samples to be excluded from use during training, such as when an additional linguistic sample that is automatically generated uses words or phrases not typically seen together.
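One possible way to picture such a naturalness filter is sketched below in Python. This disclosure does not specify a particular metric; the bigram-coverage score, the 0.75 threshold, and the function names here are assumptions chosen only to illustrate excluding samples whose words are not typically seen together.

```python
from collections import Counter
from typing import List, Tuple


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))


def build_bigram_counts(corpus: List[str]) -> Counter:
    """Count word pairs seen in a reference corpus (assumed to be natural)."""
    counts: Counter = Counter()
    for sentence in corpus:
        counts.update(bigrams(sentence.lower().split()))
    return counts


def naturalness(sentence: str, counts: Counter) -> float:
    """Assumed metric: fraction of the sentence's bigrams seen in the corpus."""
    pairs = bigrams(sentence.lower().split())
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if counts[pair] > 0) / len(pairs)


def filter_natural(candidates: List[str], counts: Counter,
                   threshold: float = 0.75) -> List[str]:
    """Keep only candidates whose score meets the (assumed) threshold."""
    return [c for c in candidates if naturalness(c, counts) >= threshold]


corpus = ["if the door is open turn on the light", "please turn on the lamp"]
counts = build_bigram_counts(corpus)
candidates = [
    "if the door is open turn on the lamp",
    "door the if open is turn light on the",
]
print(filter_natural(candidates, counts))
```

A practical system would likely use a stronger language model to score naturalness; the bigram count is used here only to keep the sketch self-contained.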
(17) These techniques also support searching for natural invariant-preserving additional linguistic samples that are “worst case” adversarial examples in the sense that the additional linguistic samples are very similar structurally to the original training data and yet maximally unexpected by a machine learning model. In other words, the additional linguistic samples may provide a higher likelihood of being misclassified by the machine learning model (even though they should be classified similarly to the initial linguistic samples), which provides a higher likelihood of training for the machine learning model. In some cases, the additional linguistic samples can be generated by making modifications to the initial linguistic samples, where the modifications are based on various “symmetries” applied to the initial linguistic samples. Example symmetries may include substitution symmetries (such as by replacing words or phrases with equivalent words or phrases or by substituting details that are irrelevant to a task), permutation symmetries (such as by switching the order of words or phrases), insertion/deletion symmetries (such as by inserting or removing articles or adjuncts, by inserting or removing politeness words or phrases, or by inserting or removing punctuation), and character-level or word-level symmetries (such as by swapping characters or words or adding or removing blank spaces to create typographical or grammatical errors). This allows known compositional adjustments to be made to the initial linguistic samples in order to generate the additional linguistic samples. Also, this can be accomplished without the need for new annotations to the additional linguistic samples, since the annotations of the initial linguistic samples can be preserved or updated based on the known compositional adjustments and used with the additional linguistic samples.
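The combination of symmetries and worst-case selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the four symmetry functions are hand-written stand-ins for the symmetry families named in the text, each sample carries a single utterance-level label that the symmetries leave unchanged, and `toy_prob_correct` is a hypothetical placeholder for a trained model's probability of the correct label.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (utterance, task label); the label is preserved

# Illustrative stand-ins for the symmetry families named in the text.
SYMMETRIES: Dict[str, Callable[[str], str]] = {
    "delete_comma": lambda t: t.replace(",", ""),            # insertion/deletion
    "add_politeness": lambda t: "please " + t,               # insertion/deletion
    "synonym": lambda t: t.replace("turn on", "switch on"),  # substitution
    "swap_chars": lambda t: t.replace("light", "lihgt"),     # character-level
}


def generate_variants(sample: Sample) -> List[Sample]:
    """Apply each symmetry once and reuse the original label unchanged."""
    text, label = sample
    variants = []
    for transform in SYMMETRIES.values():
        new_text = transform(text)
        if new_text != text:  # skip symmetries that did not apply
            variants.append((new_text, label))
    return variants


def select_adversarial(variants: List[Sample],
                       prob_correct: Callable[[str, str], float],
                       k: int) -> List[Sample]:
    """Keep the k variants the model is most likely to misclassify."""
    return sorted(variants, key=lambda s: prob_correct(s[0], s[1]))[:k]


def toy_prob_correct(text: str, label: str) -> float:
    """Hypothetical model stand-in that relies on the comma, as in the example."""
    return 0.9 if "," in text else 0.2


seed = ("if the door is open, turn on the light", "CONDITIONAL_POWERON")
variants = generate_variants(seed)
hardest = select_adversarial(variants, toy_prob_correct, k=1)
print(hardest)
```

In an actual training loop, the selection step would be repeated as the model changes, so that the samples chosen are those the current model is most likely to misclassify.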
(18) In this way, it is possible to significantly increase the amount of training data available to train a natural language machine learning model. In some cases, very little initial training data (such as a few initial linguistic samples) may be needed, and numerous additional linguistic samples may be generated based on the initial linguistic samples and used during training. Among other things, this allows a machine learning model to be trained and operate more effectively even in the presence of structurally-small variations in inputs. With a much larger collection of training data used for training, a machine learning model can classify its inputs correctly much more frequently, improving user satisfaction with an application that operates based on the model. In addition, this can be accomplished without the time and expense typically associated with using large collections of manually-generated annotated training data.
(19) In some cases, this functionality can be extended to supplement training dialogues, which represent collections of linguistic samples used to train a machine learning model in understanding task-oriented, multi-step dialogues or other types of multi-utterance dialogues. In these embodiments, knowledge of context can be used when generating the additional linguistic samples based on the initial linguistic samples. Also, original training dialogues can be modified (including based on the additional linguistic samples) to generate additional training dialogues. Again, various symmetries may be applied when generating the additional training dialogues. Example symmetries may include permutation symmetries (such as by switching the order of steps or subtasks in a dialogue) and interruption symmetries (such as by inserting a different dialogue into the steps of a current dialogue). Another naturalness metric can be used to allow unnatural dialogue to be excluded from use during training, and selected additional training dialogues may be used during training of a dialogue-based machine learning model. Again, this can significantly increase the amount of training data available to train a machine learning model, and this can be accomplished without the time and expense typically associated with using large collections of manually-generated annotated training data.
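The two dialogue-level symmetries mentioned above can be illustrated with a short Python sketch. The representation of a dialogue as a list of utterance strings, the caller-supplied list of order-independent step positions, and the function names are all assumptions made for illustration; how the context is tracked and which steps may safely be reordered are left to the caller.

```python
from itertools import permutations
from typing import List

Dialogue = List[str]  # assumed representation: an ordered list of utterances


def permutation_variants(dialogue: Dialogue,
                         independent: List[int]) -> List[Dialogue]:
    """Reorder the steps at the given positions; all other steps stay fixed."""
    variants = []
    for order in permutations(independent):
        if list(order) == list(independent):
            continue  # skip the original ordering
        new_dialogue = list(dialogue)
        for slot, source in zip(independent, order):
            new_dialogue[slot] = dialogue[source]
        variants.append(new_dialogue)
    return variants


def interruption_variants(dialogue: Dialogue,
                          other: Dialogue) -> List[Dialogue]:
    """Insert another dialogue at each interior position of the current one."""
    return [dialogue[:i] + other + dialogue[i:]
            for i in range(1, len(dialogue))]


movie_night = ["set up movie night", "dim the lights", "turn on the tv", "done"]
side_question = ["what time is it"]

reordered = permutation_variants(movie_night, independent=[1, 2])
interrupted = interruption_variants(movie_night, side_question)
print(reordered)
print(len(interrupted))
```

The generated dialogues would then be passed through a dialogue-level naturalness check before any of them are used for training.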
(20)
(21) According to embodiments of this disclosure, an electronic device 101 is included in the network configuration 100. The electronic device 101 can include at least one of a bus 110, a processor 120, a memory 130, an input/output (I/O) interface 150, a display 160, a communication interface 170, a sensor 180, or a speaker 190. In some embodiments, the electronic device 101 may exclude at least one of these components or may add at least one other component. The bus 110 includes a circuit for connecting the components 120-190 with one another and for transferring communications (such as control messages and/or data) between the components.
(22) The processor 120 includes one or more of a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), or a communication processor (CP). The processor 120 is able to perform control on at least one of the other components of the electronic device 101 and/or perform an operation or data processing relating to communication. In some embodiments of this disclosure, the processor 120 may execute or otherwise provide structured adversarial training for one or more natural language machine learning tasks. In other embodiments of this disclosure, the processor 120 may interact with an external device or system that executes or otherwise provides structured adversarial training for one or more natural language machine learning tasks. In either case, the one or more machine learning tasks may be used to support interactions with users, including a user of the electronic device 101.
(23) The memory 130 can include a volatile and/or non-volatile memory. For example, the memory 130 can store commands or data related to at least one other component of the electronic device 101. According to embodiments of this disclosure, the memory 130 can store software and/or a program 140. The program 140 includes, for example, a kernel 141, middleware 143, an application programming interface (API) 145, and/or an application program (or “application”) 147. At least a portion of the kernel 141, middleware 143, or API 145 may be denoted an operating system (OS).
(24) The kernel 141 can control or manage system resources (such as the bus 110, processor 120, or memory 130) used to perform operations or functions implemented in other programs (such as the middleware 143, API 145, or application 147). The kernel 141 provides an interface that allows the middleware 143, the API 145, or the application 147 to access the individual components of the electronic device 101 to control or manage the system resources. The application 147 includes one or more applications for providing (or for interacting with a device or system that provides) structured adversarial training for one or more natural language machine learning tasks. These functions can be performed by a single application or by multiple applications that each carries out one or more of these functions. The middleware 143 can function as a relay to allow the API 145 or the application 147 to communicate data with the kernel 141, for instance. A plurality of applications 147 can be provided. The middleware 143 is able to control work requests received from the applications 147, such as by allocating the priority of using the system resources of the electronic device 101 (like the bus 110, the processor 120, or the memory 130) to at least one of the plurality of applications 147. The API 145 is an interface allowing the application 147 to control functions provided from the kernel 141 or the middleware 143. For example, the API 145 includes at least one interface or function (such as a command) for filing control, window control, image processing, or text control.
(25) The I/O interface 150 serves as an interface that can, for example, transfer commands or data input from a user or other external devices to other component(s) of the electronic device 101. The I/O interface 150 can also output commands or data received from other component(s) of the electronic device 101 to the user or the other external device.
(26) The display 160 includes, for example, a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a quantum-dot light emitting diode (QLED) display, a microelectromechanical systems (MEMS) display, or an electronic paper display. The display 160 can also be a depth-aware display, such as a multi-focal display. The display 160 is able to display, for example, various contents (such as text, images, videos, icons, or symbols) to the user. The display 160 can include a touchscreen and may receive, for example, a touch, gesture, proximity, or hovering input using an electronic pen or a body portion of the user.
(27) The communication interface 170, for example, is able to set up communication between the electronic device 101 and an external electronic device (such as a first electronic device 102, a second electronic device 104, or a server 106). For example, the communication interface 170 can be connected with a network 162 or 164 through wireless or wired communication to communicate with the external electronic device. The communication interface 170 can be a wired or wireless transceiver or any other component for transmitting and receiving signals, such as images.
(28) The wireless communication is able to use at least one of, for example, long term evolution (LTE), long term evolution-advanced (LTE-A), 5th generation wireless system (5G), millimeter-wave or 60 GHz wireless communication, Wireless USB, code division multiple access (CDMA), wideband code division multiple access (WCDMA), universal mobile telecommunication system (UMTS), wireless broadband (WiBro), or global system for mobile communication (GSM), as a cellular communication protocol. The wired connection can include, for example, at least one of a universal serial bus (USB), high definition multimedia interface (HDMI), recommended standard 232 (RS-232), or plain old telephone service (POTS). The network 162 or 164 includes at least one communication network, such as a computer network (like a local area network (LAN) or wide area network (WAN)), Internet, or a telephone network.
(29) The electronic device 101 further includes one or more sensors 180 that can meter a physical quantity or detect an activation state of the electronic device 101 and convert metered or detected information into an electrical signal. For example, one or more sensors 180 can include one or more microphones, which may be used to capture utterances from one or more users. The sensor(s) 180 can also include one or more buttons for touch input, one or more cameras, a gesture sensor, a gyroscope or gyro sensor, an air pressure sensor, a magnetic sensor or magnetometer, an acceleration sensor or accelerometer, a grip sensor, a proximity sensor, a color sensor (such as a red green blue (RGB) sensor), a bio-physical sensor, a temperature sensor, a humidity sensor, an illumination sensor, an ultraviolet (UV) sensor, an electromyography (EMG) sensor, an electroencephalogram (EEG) sensor, an electrocardiogram (ECG) sensor, an infrared (IR) sensor, an ultrasound sensor, an iris sensor, or a fingerprint sensor. The sensor(s) 180 can further include an inertial measurement unit, which can include one or more accelerometers, gyroscopes, and other components. In addition, the sensor(s) 180 can include a control circuit for controlling at least one of the sensors included here. Any of these sensor(s) 180 can be located within the electronic device 101.
(30) In addition, the electronic device 101 may include one or more speakers 190 that can convert electrical signals into audible sounds. For example, one or more speakers 190 may be used to audibly interact with at least one user. As a particular example, one or more speakers 190 may be used to provide verbal communications associated with a virtual assistant to at least one user. Of course, interactions with users may occur in any other or additional manner, such as via the display 160.
(31) The first external electronic device 102 or the second external electronic device 104 can be a wearable device or an electronic device-mountable wearable device (such as an HMD). When the electronic device 101 is mounted in the electronic device 102 (such as the HMD), the electronic device 101 can communicate with the electronic device 102 through the communication interface 170. The electronic device 101 can be directly connected with the electronic device 102 to communicate with the electronic device 102 without involving a separate network. The electronic device 101 can also be an augmented reality wearable device, such as eyeglasses, that includes one or more cameras.
(32) The first and second external electronic devices 102 and 104 and the server 106 each can be a device of the same or a different type from the electronic device 101. According to certain embodiments of this disclosure, the server 106 includes a group of one or more servers. Also, according to certain embodiments of this disclosure, all or some of the operations executed on the electronic device 101 can be executed on another or multiple other electronic devices (such as the electronic devices 102 and 104 or server 106). Further, according to certain embodiments of this disclosure, when the electronic device 101 should perform some function or service automatically or at a request, the electronic device 101, instead of executing the function or service on its own or additionally, can request another device (such as electronic devices 102 and 104 or server 106) to perform at least some functions associated therewith. The other electronic device (such as electronic devices 102 and 104 or server 106) is able to execute the requested functions or additional functions and transfer a result of the execution to the electronic device 101. The electronic device 101 can provide a requested function or service by processing the received result as it is or additionally. To that end, a cloud computing, distributed computing, or client-server computing technique may be used, for example. While
(33) The server 106 can include the same or similar components 110-190 as the electronic device 101 (or a suitable subset thereof). The server 106 can support the electronic device 101 by performing at least one of the operations (or functions) implemented on the electronic device 101. For example, the server 106 can include a processing module or processor that may support the processor 120 implemented in the electronic device 101. In some embodiments, the server 106 provides or otherwise supports structured adversarial training for one or more natural language machine learning tasks, and the one or more machine learning tasks may be used to support interactions with users, including a user of the electronic device 101.
(34) Although
(35)
(36) As noted above, an adversarial example is a sample of input data that has been modified (typically very slightly) in a way intended to cause a machine learning model to misclassify the sample. When such examples exist, they indicate that the machine learning model has missed some kind of generalization that can otherwise be used to process input samples and classify them correctly. The notion of adversarial examples in the language domain can refer to situations where one or more modifications made to input examples (the initial linguistic samples) are “close” in terms of linguistic structure. The modifications may not necessarily be slight at the character-by-character or word-by-word level, at the word or phrase embedding level, or in terms of generally preserved meaning. However, because the adversarial examples are close in terms of linguistic structure, slight perturbations in linguistic structure can be made and allow for correct classifications to be computed compositionally, allowing misclassifications to be detected without new manually-created annotations.
(37) The process of identifying adversarial examples that can be used as additional linguistic samples for use in training a natural language machine learning model is summarized in
(38) In the context of
(39) The expression “M(˜x) != p(˜x) = ≈p(x)” indicates that the set AM includes adversarial examples, since the classifications by the machine learning model do not match the known correct classifications. As a result, when the functions 208 and 212 are chosen to make the set AM non-empty, the members of the set AM represent a set of adversarial examples for the machine learning model as defined by the functions 208 and 212. Note that the requirement here that p(˜x) = ≈p(x) simply indicates that the perturbation operators (˜, ≈) commute with p in the standard sense. If the domain and range of p are disjoint and if ˜ and ≈ satisfy the condition above, the union of ˜ and ≈ is a homomorphism, which represents a symmetry that preserves p. For any two such homomorphisms f = (˜, ≈) and g = (˜′, ≈′) that preserve p, their composition f∘g is also a homomorphism that preserves p. Thus, given any set S^1 of such homomorphisms, it is possible to define S^{i+1} = {f∘g | f ∈ S^i, g ∈ S^1}. Intuitively, elements of S^i map an element (x, p(x)) for x ∈ I to a new element (˜x, p(˜x)) that is i steps away, preserving p.
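The composition of symmetries described above can be illustrated with a minimal Python sketch. The toy task `p`, the two example edits, and all function names here are assumptions made for illustration only and are not taken from this disclosure; each symmetry is modeled as an (input-edit, output-edit) pair of functions.

```python
from itertools import product
from typing import Callable, List, Tuple

# A symmetry is an (input-edit, output-edit) pair, written (~, ≈) in the text.
Symmetry = Tuple[Callable[[str], str], Callable[[str], str]]


def compose(f: Symmetry, g: Symmetry) -> Symmetry:
    """Return f∘g: apply g first, then f, on both the input and output side."""
    (f_in, f_out), (g_in, g_out) = f, g
    return (lambda x: f_in(g_in(x)), lambda v: f_out(g_out(v)))


def next_level(s_i: List[Symmetry], s_1: List[Symmetry]) -> List[Symmetry]:
    """Build the next level {f∘g | f in S^i, g in S^1}."""
    return [compose(f, g) for f, g in product(s_i, s_1)]


def p(x: str) -> str:
    """Hypothetical toy task: label an utterance by whether it mentions traffic."""
    return "TRAFFIC" if "traffic" in x else "OTHER"


identity = lambda v: v  # output side is unchanged for both toy edits
good_to_bad: Symmetry = (lambda x: x.replace("good", "bad"), identity)
drop_article: Symmetry = (lambda x: x.replace("the ", "", 1), identity)

s_1 = [good_to_bad, drop_article]
s_2 = next_level(s_1, s_1)  # every two-step perturbation
```

Because both toy edits leave the label untouched, every element of `s_2` still satisfies p(˜x) = ≈p(x) on inputs such as "Is the traffic good today?".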
(40) Training with samples from one or more adversarial sets AM can sometimes help a machine learning model to recognize missed generalizations about p. As a result, training with samples from {(y, p(y)) | y ∈ AM} can help train the machine learning model more effectively. When ≈ is easy to calculate, there is little or no need to manually collect new values of p for elements of AM since {(˜x, p(˜x)) | x ∈ I} = {(˜x, ≈p(x)) | x ∈ I}, so no new manual annotations may be needed. Given this, it is possible to find the “closest” adversarial set AM of n examples in this discrete setting with an initial set S^1 of perturbations to explore. To do this, results obtained from S^i (for i = 1, 2, . . . ) can be searched, as long as incrementing i adds new examples. In settings where the correspondence between M(˜x) and p(˜x) is scored, the “worst” examples can be preferred at each stage i. This means that the cases where the score for M(˜x) is maximally worse than the score for M(x) can be selected, since these are the adversarial examples most likely to result in misclassification by the machine learning system. Again, however, no manually-created annotations may be needed here, since the annotations of the initial data 204 can be used (or modified) automatically.
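The preference for the “worst” examples at each stage can be sketched as a ranking by score drop. This is only an illustrative sketch: the `score` callable (higher meaning better agreement between the model and the known classification) and the toy score table are hypothetical stand-ins, not part of this disclosure.

```python
from typing import Callable, List, Tuple


def worst_cases(
    pairs: List[Tuple[str, str]],    # (x, ~x): original and perturbed inputs
    score: Callable[[str], float],   # hypothetical per-sample score of the model
    n: int,
) -> List[str]:
    """Return up to n perturbed inputs whose score dropped the most."""
    drops = [(score(x) - score(x_tilde), x_tilde) for x, x_tilde in pairs]
    drops = [d for d in drops if d[0] > 0]           # keep only real degradations
    drops.sort(key=lambda d: d[0], reverse=True)     # largest drop first
    return [x_tilde for _, x_tilde in drops[:n]]


# Toy score table standing in for model scores (assumed values).
toy_scores = {"a": 0.9, "a1": 0.2, "a2": 0.8, "b": 0.7, "b1": 0.9}
toy_pairs = [("a", "a1"), ("a", "a2"), ("b", "b1")]
selected = worst_cases(toy_pairs, toy_scores.get, 2)
```

In this toy run, "b1" is never selected because the model scores it better than its original, so it is not adversarial.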
(41) Note that this broad definition of adversarial examples does not require that ˜x be perceptually similar to an original training element x or that ˜x and x even mean the same thing. The functions 208, 212 may represent any functions that preserve the linguistic structure imposed by p. In some cases, the function 208 may be chosen in such a way that ˜x is natural in the intended domain, meaning an element that further real-world sampling could have eventually discovered. Also, the function 208 may be chosen in such a way that ≈p(x) is easily computed. This can be achieved by restricting the function 208 to structural edits of the initial training data 204, which allows compositional meaning calculation. With these policies, manually-created annotations for the additional training data 210 are not needed, and the additional training data 210 may use the same annotations as the initial training data 204 or modified versions of the initial training data's annotations (based on known compositional meaning changes). This subsumes the case where the function 212 is an identity function but extends to a much broader range of cases, examples of which are discussed below.
(42) It should be noted here that many languages are associated with a linguistic structure containing a number of structural invariants. Structural invariants generally represent linguistic objects (meaning expressions, properties of expressions, and relations between expressions) that are mapped to themselves, so one linguistic object cannot be replaced by another linguistic object while preserving how the meaning of an expression is determined. More formally, this can be expressed by defining structural invariants as representing linguistic objects that are preserved by stable automorphisms. Essentially, a structural invariant represents a property of linguistic samples that remains robust even under slight modifications. In the context here, a structural invariant may be defined based (at least in part) on the specific task or tasks to be learned by a natural language model, meaning the structural invariant may be based at least partially on the associated domain. For instance, in an IoT domain, structural invariants may relate to the types of commands to be supported in the IoT domain.
(43) Two example implementations are provided below for using the technique 200 to supplement initial linguistic samples in initial training data (the data 204) with additional linguistic samples in additional training data (the output values 214). The additional linguistic samples in the additional training data represent adversarial examples since they are likely to be misclassified by a natural language machine learning model. Moreover, the additional linguistic samples in the additional training data can be generated such that annotations of the additional linguistic samples match or are based on annotations of the initial linguistic samples. As a result, the additional linguistic samples may be used to improve the training of the natural language machine learning model, without requiring that the additional linguistic samples be collected and annotated manually.
(44) Although
(45)
(46) As shown in
(47) A machine learning algorithm 308 is executed and used to train a machine learning model using (among other things) the initial training data 306, plus additional training data that is generated as described below. The training performed using the machine learning algorithm 308 is typically iterative in nature. In this type of training process, the machine learning algorithm 308 receives training data and generates an intermediate language model based on that training data, and a determination is made whether the intermediate language model is adequately accurate. If not, the machine learning algorithm 308 performs another training iteration (possibly using more or different training data) to generate another intermediate language model, and a determination is made whether that intermediate language model is adequately accurate. Accuracy here can be measured in any suitable manner, such as by using F.sub.1 scores. This process can be repeated over any number of iterations, typically until a language model is trained that has at least some desired level of accuracy. The language model can then be output as a final machine learning model 310, and the model 310 may then be used in a desired natural language application. The model 310 may be used in any suitable natural language application, such as a conversational assistant, a question answering (QA) system, an IoT home automation interface, a video game system, or an educational system. Note that any suitable machine learning algorithm 308 (now known or later developed) may be used here depending on the application.
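As one hedged illustration of how an F1 score might be computed when judging accuracy, the sketch below compares a set of predicted labels against a set of gold labels. The label strings and the set-based formulation are assumptions for illustration; the disclosure does not specify how the scores are computed.

```python
from typing import Set


def f1_score(predicted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall over two label sets."""
    if not predicted and not gold:
        return 1.0                       # nothing to find, nothing predicted
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for one utterance: the intent is right, the slot is wrong.
gold_labels = {"intent:GET_INFO_TRAFFIC", "slot:today"}
predicted_labels = {"intent:GET_INFO_TRAFFIC", "slot:tomorrow"}
example_f1 = f1_score(predicted_labels, gold_labels)
```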
(48) As described above, if the initial training data 306 is small or otherwise inadequate, it may be difficult to obtain a model 310 having the desired level of accuracy. In order to help overcome these or other problems, the functional architecture 300 supports the technique 200 described above with respect to
(49) The symmetries selected with the symmetries definition process 312 are used by an adversarial sample generation process 314, which represents an automated process that implements the symmetries selected by the symmetries definition process 312. In this example, the adversarial sample generation process 314 receives an intermediate model 316 (denoted Model.sub.i) generated by the machine learning algorithm 308 in one iteration of the training process. The adversarial sample generation process 314 uses the intermediate model 316 to produce additional training data 318 (denoted Data.sub.i+=1) for use during a subsequent iteration of the training process. The additional training data 318 includes additional linguistic samples, which represent initial linguistic samples from the initial training data 306 that have been modified in accordance with one or more of the symmetries selected by the symmetries definition process 312. At least some of the additional linguistic samples selected (based on the model 316) for use in the subsequent training iteration are adversarial examples, meaning the additional linguistic samples are selected based on their likelihood of being misclassified by the machine learning algorithm 308 during the subsequent iteration of the training process.
(50) Note that the adversarial sample generation process 314 can make one or multiple changes to the initial linguistic samples from the initial training data 306 in order to generate the additional linguistic samples in the additional training data 318. In some embodiments, for example, the adversarial sample generation process 314 may make single changes to the initial linguistic samples in order to generate additional linguistic samples. If more adversarial examples are needed, the adversarial sample generation process 314 may then make two changes to the initial linguistic samples in order to generate more additional linguistic samples. The number of changes may continue to increase until a desired number of adversarial examples is obtained or some threshold number of changes is met. Note, however, that the additional linguistic samples may be generated in any other suitable manner based on any suitable number of changes to the initial linguistic samples. Also note that the additional linguistic samples selected for use in the additional training data 318 may represent adversarial examples that are as close as possible to the initial linguistic samples from the initial training data 306 while still causing the machine learning algorithm 308 to misclassify the additional linguistic samples.
(51) The adversarial sample generation process 314 may be performed in any suitable manner. For example, the adversarial sample generation process 314 may be implemented using software instructions that are executed by the processor 120 of the electronic device 101, server 106, or other component(s) in
(52) Naturalness evaluation data 320 may be used to provide information to the adversarial sample generation process 314 for use in excluding unnatural automatically-generated additional linguistic samples. For example, the naturalness evaluation data 320 may represent or include n-grams associated with the language used in the initial training data 306. The adversarial sample generation process 314 can use the naturalness evaluation data 320 to identify additional linguistic samples that are generated from the initial linguistic samples using the identified symmetries but that might appear unnatural (and therefore unlikely to be found during real-world use). Among other things, the adversarial sample generation process 314 may use the naturalness evaluation data 320 to identify additional linguistic samples that contain words or phrases typically not used next to each other in a sentence.
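A naturalness check of this kind might be sketched with bigrams, as below. The tiny reference corpus, the `max_unseen` threshold, and the function names are hypothetical choices for illustration; a practical system would likely use frequency counts from a much larger corpus.

```python
from typing import Iterable, List, Set, Tuple

Bigram = Tuple[str, str]


def bigrams(tokens: List[str]) -> List[Bigram]:
    """Return adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def build_bigram_set(corpus: Iterable[str]) -> Set[Bigram]:
    """Collect every bigram seen in the reference corpus."""
    seen: Set[Bigram] = set()
    for sentence in corpus:
        seen.update(bigrams(sentence.lower().split()))
    return seen


def is_natural(candidate: str, known: Set[Bigram], max_unseen: int = 0) -> bool:
    """Accept a candidate only if it has at most max_unseen unknown bigrams."""
    unseen = sum(1 for bg in bigrams(candidate.lower().split()) if bg not in known)
    return unseen <= max_unseen


# Hypothetical reference corpus standing in for the naturalness evaluation data.
known_bigrams = build_bigram_set(
    ["is the traffic good today", "was the traffic bad today"]
)
```

With this corpus, "is the traffic bad today" is accepted because each adjacent word pair has been seen, while "is the good traffic today" is rejected because "the good" never occurs.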
(53) In some embodiments, the components of the functional architecture 300 may operate according to the following pseudocode. It should be noted, however, that other implementations of the functional architecture 300 are also possible. The operations of the machine learning algorithm 308 may be expressed as follows.
    Data := Seed Data, (Input, Value) Pairs
    Model := ML(Data)
    While Accuracy(Model) < Requirement:
        Data += Worst Case Adversaries(Ext, Rate, Model)
        Model := ML(Data)
This indicates that the initial data 306 used for training (referred to as Data) includes the seed data and any associated input data 204/output value 206 pairs. The machine learning algorithm 308 is applied to this data in order to train an initial model (referred to as Model). If the accuracy of the initial model is below some threshold (referred to as Requirement), an iterative process is performed. During the iterative process, the training data is supplemented using worst-case adversarial examples provided by the adversarial sample generation process 314 as the additional training data 318, and another model is trained using the supplemented set of training data (which includes the adversarial examples). The adversarial examples produced here are a function of naturalness evaluation data 320 (referred to as Ext), a specified number of adversarial examples injected into the training data (referred to as Rate), and a model from the prior training iteration (which is used to identify adversarial examples likely to be misclassified by the machine learning algorithm 308). In some cases, the Rate value may vary during different iterations of the training process and can be tuned for each application. As a particular example, the Rate value may be set to 100% for the first iteration (meaning the adversarial sample generation process 314 can double the number of linguistic samples used for training), and other values (such as smaller values) can be used in subsequent iterations.
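The iterative loop described here can be sketched in Python as follows. The callables `train`, `accuracy`, and `worst_case_adversaries`, the `max_rounds` safety limit, and the toy lookup-table "model" are all assumptions introduced for illustration; any real learner and adversary generator could be substituted.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (input, value) pair


def adversarial_training(
    seed: List[Sample],
    train: Callable[[List[Sample]], Dict[str, str]],
    accuracy: Callable[[Dict[str, str]], float],
    worst_case_adversaries: Callable[[Dict[str, str], List[Sample]], List[Sample]],
    requirement: float,
    max_rounds: int = 10,   # safety limit, not part of the pseudocode
):
    """Retrain with injected adversarial samples until accuracy is adequate."""
    data = list(seed)
    model = train(data)
    rounds = 0
    while accuracy(model) < requirement and rounds < max_rounds:
        new_samples = worst_case_adversaries(model, data)
        if not new_samples:          # nothing left to add, stop early
            break
        data.extend(new_samples)
        model = train(data)
        rounds += 1
    return model, data


# Toy stand-ins: a memorizing lookup table plays the role of the learner.
pool = [
    ("is the traffic bad today", "GET_INFO_TRAFFIC"),
    ("is traffic good", "GET_INFO_TRAFFIC"),
    ("hey is the traffic good", "GET_INFO_TRAFFIC"),
]
toy_train = lambda data: dict(data)
toy_accuracy = lambda model: sum(model.get(x) == v for x, v in pool) / len(pool)
toy_adversaries = lambda model, data: [
    (x, v) for x, v in pool if model.get(x) != v
][:1]                                # inject one misclassified sample per round

seed_data = [("is the traffic good today", "GET_INFO_TRAFFIC")]
final_model, final_data = adversarial_training(
    seed_data, toy_train, toy_accuracy, toy_adversaries, requirement=1.0
)
```

The Rate value from the text is folded into the adversary callback here (one sample per round), which keeps the loop itself independent of how many samples are injected.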
(54) The operations of the sample symmetries definition process 312 may be expressed as follows. Given the task mapping p: inputs → values:
    Define a set S^1 of symmetries of p that are minimal input-edit, output-edit mappings (˜, ≈) that commute with p such that: p(˜x) = ≈p(x).
    Define S^{i+1} = {f∘g | f ∈ S^i, g ∈ S^1} so that elements of S^i map (input-value) pairs to other (input-value) pairs in i steps.
(55) The operations of the adversarial sample generation process 314 may be expressed as follows. Here, S^1 is obtained from the sample symmetries definition process 312, and the current intermediate model is obtained from the machine learning algorithm 308.
(56)
    Adversaries A := ∅
    For d ∈ [1, 2, ...]:
        For (~, ≈) ∈ S^d:
            For (x, ν) ∈ Data:
                If Natural(~x, ext) and F_1(p(~x), ≈p(x)) << F_1(x, p(x)):
                    A := A ∪ {(~x, ≈p(x))}
                If |A| ≥ Rate(Data): break
    Return A
Here, a set of adversarial examples (referred to as Adversaries) is initially empty. New adversarial examples are added to the set if the adversarial examples are natural (as defined by the naturalness evaluation data 320) and are selected based on their likelihood of being misclassified by the machine learning algorithm 308. The likelihood of being misclassified is defined here using F_1 scores, which are computed using the model obtained from the machine learning algorithm 308. This is defined in the expression “F_1(p(˜x), ≈p(x)) << F_1(x, p(x))”, where “F_1(x, p(x))” represents the score of an initial linguistic sample with a known classification (based on its annotation) and “F_1(p(˜x), ≈p(x))” represents the score of an additional linguistic sample that should be classified similarly to the initial linguistic sample but is not when a misclassification occurs.
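The generation procedure above might be sketched in Python as follows. Several details are assumptions rather than part of the pseudocode: the `score` callable stands in for the model-based F_1 comparison, the `margin` parameter approximates the “<<” test, unchanged and duplicate candidates are skipped, and `budget` plays the role of Rate(Data).

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]
Symmetry = Tuple[Callable[[str], str], Callable[[str], str]]


def worst_case_adversaries(
    data: Sequence[Sample],
    symmetries_by_depth: Sequence[Sequence[Symmetry]],  # [S^1, S^2, ...]
    natural: Callable[[str], bool],
    score: Callable[[str, str], float],   # hypothetical model-based score
    budget: int,
    margin: float = 0.5,                  # assumed reading of the "<<" test
) -> List[Sample]:
    """Collect natural perturbed samples that the model scores much worse."""
    adversaries: List[Sample] = []
    seen = set()
    for level in symmetries_by_depth:     # fewest edits first
        for tilde, approx in level:
            for x, v in data:
                x_new, v_new = tilde(x), approx(v)
                if x_new == x or x_new in seen:
                    continue              # edit did not apply, or duplicate
                if natural(x_new) and score(x_new, v_new) + margin <= score(x, v):
                    adversaries.append((x_new, v_new))
                    seen.add(x_new)
                if len(adversaries) >= budget:
                    return adversaries
    return adversaries


# Toy run: a lookup table stands in for the intermediate model.
toy_seed = [("is the traffic good today", "GET_INFO_TRAFFIC")]
toy_lookup = dict(toy_seed)
toy_score = lambda x, v: 1.0 if toy_lookup.get(x) == v else 0.0
toy_s1 = [
    (lambda x: x.replace("good", "bad"), lambda v: v),
    (lambda x: x.replace("the ", "", 1), lambda v: v),
]
found = worst_case_adversaries(
    toy_seed, [toy_s1], natural=lambda s: True, score=toy_score, budget=2
)
```

Note that the function only calls `score`, so it needs no knowledge of how the underlying model works.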
(57) Note that the adversarial sample generation process 314 here can operate to produce additional training data 318 for the machine learning algorithm 308 while having little or no knowledge of how the machine learning algorithm 308 actually operates. Thus, the machine learning algorithm 308 may appear as a “black box” to the adversarial sample generation process 314, since the adversarial sample generation process 314 receives models 316 from the machine learning algorithm 308 and provides additional training data 318 to the machine learning algorithm 308 without knowledge of the machine learning algorithm's actual operation. Based on this, the actual implementation of the machine learning algorithm 308 may be immaterial, at least with respect to the adversarial sample generation process 314.
(58)
(59) This substitution is natural because “bad” is a very common adjectival modifier for traffic (and may even be more common than “good” in some n-gram counts). Also, “really” is a very common domain-neutral intensifier of “bad.” Further, no change is needed on the output side because neither GET_INFO_TRAFFIC nor any other tag in this dataset distinguishes between positive and negative sentiments. Thus, the tree 404 can be formed in two steps, namely replacement of “good” with “bad” and insertion of the “really” adjunct.
(60) Some relevant considerations for (string-edit, tree-edit) pairs involve syntactic analyses of the inputs (such as what modifies what) and naturalness assessments (such as those based on external n-gram frequencies or other naturalness evaluation data 320). Even when a semantic task in this rather sparse formalism is restricted (such as to a limited number of intents and a limited number of slots), hundreds of training examples may not be adequate. Thus, baseline F_1 scores may be relatively low, and adversarial examples close to the training dataset can be very easy to find.
(61) Since traffic and weather share many contexts in a training set (such as “Is there bad ______”), the naturalness of each word can be assessed in contexts that are not yet shared with the other. For example, the word “weather” is natural in the context “Is there good ______”, just as “traffic” is natural in that context. This may require only a change of the root tag on the output side from GET_INFO_TRAFFIC to GET_INFO_ROAD_CONDITION, which appears in a tree 406 shown in
(62) This strategy of recursively maximizing overlapping substring sets and context sets maintains proximity to the original training dataset and is familiar from learning-theoretic approaches to language identification. Expanding with most frequent expressions first, it is possible to converge on a coarsest congruence of substrings that preserves the mapping p, where attention is restricted to modifications that do not cross the constituent boundaries of any element on the output side, modifications that preserve natural slot and intent semantics, and modifications that can be defined in terms of syntactic constituents (head plus contiguous dependents).
(63) In some embodiments, various types of symmetries can be supported to recursively maximize overlapping substring sets and context sets while maintaining proximity to the original training dataset. The space defined by (string-edit, tree-edit) mappings of each of the following kinds can be considered and used, where these mappings are designed to preserve a semantic parse mapping and naturalness in the domain. For each kind of symmetry discussed below, an example input and some of its perturbations are provided (showing only the input since the needed changes on the output side are obvious). As discussed above, the perturbations do not need to preserve character-by-character identity or embeddings or meanings. Rather, the choice of these edits is to map expressions to other expressions that are “close” in terms of linguistic structure, so tree-edits corresponding to input-edits are easy to define. The following rules may be trivial and easy to implement for fluent English speakers, and similar types of symmetries may be used for various other languages, as well.
(64) A first type of symmetry is referred to as substitution symmetries, where words or phrases are replaced with equivalent words or phrases or where details that are irrelevant to a task are substituted. Using this type of symmetry, only substitutions that are adversarial (cause a machine learning system misclassification) may be used. Also, the relevant sense of “equivalent” may depend on the specific machine learning task, and equivalents may come in (string-edit, tree-edit) pairs so even dominating intents or slots can change. In addition, all changes can be assessed for syntactic compatibility and naturalness, such as based on the naturalness evaluation data 320. Intuitively, these changes are “structurally small” but may not generally be small in surface form, meaning, or embeddings. Examples of this type of symmetry are as follows:
    Is the traffic good today?
    Is the traffic bad today?
    Is the weather good today?
    Is the traffic good between 5 and 6 pm during weekdays?
    Is the traffic good on route 7 or route 40?
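A substitution symmetry with a paired output edit might be sketched as below. The substitution table and function name are hypothetical, and the annotation is simplified to a single root tag rather than a full tree; the tag change for the “weather” substitution follows the GET_INFO_TRAFFIC to GET_INFO_ROAD_CONDITION example mentioned earlier in this document.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical table: word -> (replacement, {old root tag: new root tag}).
SUBSTITUTIONS: Dict[str, Tuple[str, Dict[str, str]]] = {
    "good": ("bad", {}),  # no output-side change needed
    "traffic": ("weather", {"GET_INFO_TRAFFIC": "GET_INFO_ROAD_CONDITION"}),
}


def substitute(text: str, root_tag: str, word: str) -> Optional[Tuple[str, str]]:
    """Apply one (string-edit, tag-edit) pair, or return None if it does not apply."""
    replacement, tag_map = SUBSTITUTIONS[word]
    # Whole-word match so that substrings of longer words are left alone.
    new_text, count = re.subn(rf"\b{re.escape(word)}\b", replacement, text)
    if count == 0:
        return None
    return new_text, tag_map.get(root_tag, root_tag)


sentiment_swap = substitute("Is the traffic good today?", "GET_INFO_TRAFFIC", "good")
topic_swap = substitute("Is the traffic good today?", "GET_INFO_TRAFFIC", "traffic")
```

Keeping the input edit and the output edit in the same table entry is one way to ensure that each new sample receives its annotation automatically.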
(65) A second type of symmetry is referred to as permutation symmetries, where the order of words or phrases is switched or otherwise altered. This type of symmetry can involve changing verb-particle placements, preposing or postposing phrases, or swapping phrases. Examples of this type of symmetry are as follows:
    Is the traffic good today?
    The traffic is good today?
    The traffic, is it good today?
    The traffic, good today?
    Is it good today, the traffic?
    Is it good, the traffic today?
(66) A third type of symmetry is referred to as article or adjunct insertion/removal symmetries, where articles, adjuncts, or other modifiers can be added or removed. Examples of this type of symmetry are as follows:
    Is the traffic good today?
    Is traffic good today?
    Is the traffic good?
    Is the traffic unusually good today?
    Is the traffic good today on route 280?
    Hey, is the traffic good?
(67) A fourth type of symmetry is referred to as politeness insertion/removal symmetries, where politeness words or phrases can be added or removed. This may be particularly useful in task-oriented applications, although it is also useful in other applications. Examples of this type of symmetry are as follows:
    Is the traffic good today?
    Tell me, is the traffic good today?
    Please check, is the traffic good today?
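Politeness insertion and removal can be sketched as a pair of inverse string edits. The prefix list is taken from the examples above, while the function names and the capitalization handling are assumptions for illustration; the output annotation is left unchanged by both edits.

```python
POLITE_PREFIXES = ("Tell me", "Please check")  # from the examples above


def add_politeness(text: str, prefix: str) -> str:
    """Prepend a politeness phrase and lowercase the old first letter."""
    return f"{prefix}, {text[:1].lower()}{text[1:]}"


def remove_politeness(text: str) -> str:
    """Strip a known politeness phrase and restore the capital letter."""
    for prefix in POLITE_PREFIXES:
        lead = prefix + ", "
        if text.startswith(lead):
            rest = text[len(lead):]
            return rest[:1].upper() + rest[1:]
    return text  # no known prefix found


polite = add_politeness("Is the traffic good today?", "Tell me")
plain = remove_politeness("Please check, is the traffic good today?")
```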
(68) A fifth type of symmetry is referred to as punctuation insertion/removal symmetries, where punctuation can be added or removed. This may be particularly useful in applications where inputs are typed, although it is also useful in other applications. Examples of this type of symmetry are as follows:
    Is the traffic good today?
    Is the traffic good today
    Is the traffic good ? today?
    Is. the. traffic. good. TODAY?
(69) A sixth type of symmetry is referred to as character-level or word-level symmetries, where characters or words are manipulated to create typographical or grammatical errors. This can include swapping characters or words, as well as adding or removing blank spaces. Examples of this type of symmetry are as follows:
    Is the traffic good today?
    Is thetraffic good today?
    Is the rtaffic good tday/?
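The punctuation and character-level edits can be sketched with a few small string helpers. The helper names and index-based interface are hypothetical and chosen only so that the edits reproduce some of the examples listed above.

```python
import string


def strip_punctuation(text: str) -> str:
    """Remove all ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def swap_adjacent(text: str, i: int) -> str:
    """Swap the characters at positions i and i + 1 (a simple typo)."""
    if i < 0 or i + 1 >= len(text):
        return text
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def drop_space(text: str, n: int) -> str:
    """Remove the n-th blank space (0-based), merging two words."""
    positions = [i for i, ch in enumerate(text) if ch == " "]
    if n < 0 or n >= len(positions):
        return text
    i = positions[n]
    return text[:i] + text[i + 1:]


original = "Is the traffic good today?"
no_punct = strip_punctuation(original)
typo = swap_adjacent(original, 7)
merged = drop_space(original, 1)
```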
(70) Any or all of these example types of symmetries (as well as other or additional types of symmetries) may be used by the adversarial sample generation process 314 to create adversarial examples used to train a machine learning model. While certain types of symmetries are familiar in the data expansion literature, the adversarial use of those symmetries can be more beneficial than standard brute-force expansions. Thus, the approach described here can be used to improve model accuracy and can be more effective than standard data augmentation since (i) this approach identifies and trains using adversarial examples and (ii) this approach adapts to the model during training. As noted above, there are numerous perturbations of these sorts that might be used to generate the additional training data 318 based on the initial training data 306. In some cases, the perturbations can be ranked by the number of structure-based edits used in their derivation, so perturbations requiring fewer structure-based edits may be used before perturbations requiring more structure-based edits. Also, the use of the naturalness evaluation data 320 can help to ensure that the perturbations are natural, such as in the sense of not introducing uncommon n-grams. With this strategy, it is possible to identify any number of adversarial examples using original training data, whether the adversarial examples are generated in one or more steps.
(71) The functional architecture 300 shown in
(72) Although
(73)
(74) As shown in
(75) Additional training data for use in adversarial training of the machine learning model is generated at step 506. In this example, the additional training data is generated using various steps. Here, symmetries to be used to modify the first linguistic samples are identified at step 508. This may include, for example, the processor 120 of the electronic device 101 or server 106 obtaining information identifying (or itself identifying) one or more symmetries to be used to modify the initial linguistic samples in the initial training data 306. As described above, the symmetries here can be selected so that minor perturbations in linguistic structure lead to misclassifications by the machine learning algorithm 308. Intermediate linguistic samples are generated at step 510. This may include, for example, the processor 120 of the electronic device 101 or server 106 applying one or more of the identified symmetries to the initial linguistic samples in the initial training data 306. In some cases, the processor 120 may apply a single symmetry to create single changes to the initial linguistic samples in the initial training data 306, and the number of changes applied may increase until an adequate number of intermediate linguistic samples are generated. Unnatural intermediate linguistic samples are filtered and thereby removed from consideration at step 512. This may include, for example, the processor 120 of the electronic device 101 or server 106 using n-grams or other naturalness evaluation data 320 to identify any of the intermediate linguistic samples that might contain unnatural or unexpected content. Worst-case intermediate linguistic samples are identified as adversarial examples and selected for use as second linguistic samples at step 514. 
This may include, for example, the processor 120 of the electronic device 101 or server 106 using scores associated with the intermediate linguistic samples (such as F.sub.1 scores) to identify which of the intermediate linguistic samples are likely to be improperly classified by the machine learning algorithm 308, while their corresponding initial linguistic samples are properly classified by the machine learning algorithm 308. This can be determined based on the initial model that has been trained.
(76) The natural language machine learning model is adversarially trained using the second linguistic samples as additional training data at step 516. This may include, for example, the processor 120 of the electronic device 101 or server 106 executing the machine learning algorithm 308 to train an intermediate language model. Ideally, the intermediate model is more accurate than the prior version of the model, since the machine learning algorithm 308 has used adversarial examples generated specifically to help the machine learning algorithm 308 learn at least one additional generalization that can be used to classify input samples correctly. A determination is made whether the training is adequate at step 518. This may include, for example, the processor 120 of the electronic device 101 or server 106 determining whether the current intermediate language model is sufficiently accurate (such as based on an F.sub.1 or other score). If not, the process returns to step 510 to perform another iteration of the training process with more additional training data. Otherwise, the current model is output as a trained machine learning model at step 520. This may include, for example, the processor 120 of the electronic device 101 or server 106 providing the current model as the final machine learning model 310, which can be used in a desired natural language application.
(77) Although
(78)
(79) The approach described above with respect to
(80) Human task-oriented dialogue is typically collaborative, plan-based, and flexible, and it often introduces new terms and predicates that human speakers can typically track unproblematically. Ideally, a well-trained natural language model can track what happens at each step of a dialogue relative to the dialogue's context. However, the flexibility of dialogue makes end-to-end training of natural language machine learning models difficult, even when the domain is restricted and large amounts of training data are available. Structured adversarial training of a natural language machine learning model for dialogue, respecting the invariants of a domain, can address these shortcomings. As described below, this approach can treat utterance-level invariants similar to semantic parsing invariants as described above with respect to
(81) As shown in
(82) As described above, if the initial training data 606 is small or otherwise inadequate, it may be difficult to obtain a model 610 having the desired level of accuracy. Also, it may be desirable for a natural language system to be competent in its interactions with users based on very little training data, which can involve training a machine learning model over many dialogue contexts and user updates. In order to help overcome these or other problems, the functional architecture 600 supports the technique 200 described above with respect to
(83) In this example, this is accomplished by performing a sample symmetries definition process 612, which is used to identify different symmetries that can be applied to the initial linguistic samples in the initial training data 606 in order to generate additional linguistic samples. This also includes using an adversarial sample generation process 614, which represents an automated process that implements the symmetries selected by the symmetries definition process 612 in order to generate the additional linguistic samples. The additional linguistic samples generated here can represent utterances that may be used to generate additional dialogue-based training data 618 for use by the machine learning algorithm 608. These components 612, 614 may be the same as or similar to the corresponding components 312, 314 in
(84) The functional architecture 600 here also includes a dialogue symmetries definition process 622, which is used to identify different symmetries that can be applied to the dialogue-based initial training data 606 in order to generate additional dialogue-based training data 618. The different symmetries applied here can be used to help maintain various characteristics of the dialogue-based initial training data 606 while still allowing adversarial examples to be created. The dialogue symmetries definition process 622 may be performed manually or in an automated manner.
(85) Examples of the types of symmetries that may be identified and selected in the dialogue symmetries definition process 622 can include permutation symmetries, which involve switching the order of steps or subtasks in a dialogue. For example, assume that a training dialogue includes subtasks T.sub.1, . . . , T.sub.k (also sometimes referred to as “turns”) and T.sub.i and T.sub.i+1 involve subtasks that are unordered (meaning subtasks whose order is immaterial). Given this, a permutation symmetry can generate an additional training dialogue with the T.sub.i and T.sub.i+1 subtasks reversed. If multiple sets of subtasks in the same training dialogue are unordered, various combinations of subtask permutations can be created here based on a single training dialogue.
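As a rough illustration of a permutation symmetry, the following Python sketch swaps adjacent subtasks that have been marked as unordered. The function name `permutation_variants`, the representation of a dialogue as a list of subtask labels, and the `unordered` index set are assumptions made for this example only. The sketch produces one variant per swap and does not enumerate combinations of several swaps.

```python
from typing import List, Sequence, Set


def permutation_variants(subtasks: Sequence[str], unordered: Set[int]) -> List[List[str]]:
    """Return one new dialogue per index i in `unordered`,
    with subtasks at positions i and i+1 (0-based) reversed."""
    variants = []
    for i in sorted(unordered):
        if 0 <= i < len(subtasks) - 1:  # ignore indices with no successor
            swapped = list(subtasks)    # copy so the original is untouched
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            variants.append(swapped)
    return variants
```

For example, calling `permutation_variants(["ask_date", "ask_party_size", "confirm"], {0})` yields a single dialogue in which the first two subtasks are reversed.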
(86) Another example of the types of symmetries that may be identified and selected in the dialogue symmetries definition process 622 can include interruption symmetries, which involve inserting a second dialogue into the steps of a first dialogue (effectively interrupting the execution of the first dialogue). For example, assume a first training dialogue includes subtasks T.sub.1, . . . , T.sub.k and a second training dialogue includes subtasks S.sub.1, . . . , S.sub.j. Given that, an interruption symmetry can generate an additional training dialogue with subtasks T.sub.1, . . . , T.sub.i, S.sub.1, . . . , S.sub.j, T.sub.i+1, . . . , T.sub.k. If the first training dialogue can be interrupted at various points within the first training dialogue, multiple additional training dialogues may be generated by inserting the second training dialogue at different points into the first training dialogue.
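An interruption symmetry of this kind might be sketched in Python as follows. The function name `interruption_variants` and the `points` argument (the set of positions i after which the first dialogue may be interrupted) are hypothetical and are introduced only for illustration.

```python
from typing import List, Sequence, Set


def interruption_variants(
    first: Sequence[str], second: Sequence[str], points: Set[int]
) -> List[List[str]]:
    """For each allowed point i, build T_1..T_i, S_1..S_j, T_(i+1)..T_k."""
    variants = []
    for i in sorted(points):
        if 0 < i < len(first):  # interrupt strictly inside the first dialogue
            variants.append(list(first[:i]) + list(second) + list(first[i:]))
    return variants
```

Each allowed interruption point yields one additional training dialogue, so a first dialogue with several interruptible points produces several variants from a single pair of dialogues.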
(87) Other types of symmetries that may be identified and selected in the dialogue symmetries definition process 622 could include those defined using rules for interruption removal, subtask addition or deletion, and various kinds of engagement maintenance. As before, changes on the input side may entail changes on the output dialogue representation. Also, utterance-level adjunct insertion may use one or more adjuncts from at least one other utterance (thereby “coarsening” the lattice of utterance-context pairs), and interruption insertion may find one or more interruptions to insert in at least one other dialogue.
(88) The symmetries selected by the dialogue symmetries definition process 622 are used by an adversarial dialogue generation process 624, which represents an automated process that implements the symmetries selected by the dialogue symmetries definition process 622. In this example, the adversarial dialogue generation process 624 receives the intermediate model 616 generated by the machine learning algorithm 608 in one iteration of the training process, and the adversarial dialogue generation process 624 uses the intermediate model 616 to produce additional training data 618 (denoted Dialogue Data.sub.i+1) for use during a subsequent iteration of the training process. The additional training data 618 includes additional dialogue training samples, each of which includes utterances that may be from the original training data 606 or from the adversarial sample generation process 614. At least some of the additional dialogue training samples selected (based on the model 616) for use in the subsequent training iteration are adversarial examples, meaning the additional dialogue training samples are selected based on their likelihood of being misclassified by the machine learning algorithm 608 during the subsequent iteration of the training process.
(89) At least some of the additional training dialogues generated by the adversarial dialogue generation process 624 may include initial linguistic samples (utterances) from the dialogues in the initial training data 606 after one or more of the symmetries selected by the dialogue symmetries definition process 622 have been applied. Thus, some additional training dialogues may include the initial linguistic samples from the initial training dialogues, but several of the initial linguistic samples may be reordered in the additional training dialogues (compared to the initial training dialogues) to provide permutation symmetry. Also, some additional training dialogues may include the initial linguistic samples from the initial training dialogues, but the initial linguistic samples from one initial training dialogue may be inserted at one or more points into another initial training dialogue to provide interruption symmetry. In this way, the adversarial dialogue generation process 624 may generate a number of additional dialogue-based training samples for the machine learning algorithm 608.
(90) In addition, at least some of the additional training dialogues generated by the adversarial dialogue generation process 624 include additional linguistic samples (utterances) generated by the adversarial sample generation process 614 based on the initial linguistic samples contained in the dialogues of the initial training data 606. For example, the adversarial dialogue generation process 624 can replace initial linguistic samples in the dialogues of the initial training data 606 with additional linguistic samples generated by the adversarial sample generation process 614. Here, the adversarial sample generation process 614 uses the symmetries selected by the symmetries definition process 612 in order to generate the additional linguistic samples that can be inserted in place of or used with the initial linguistic samples in the training data 606 to generate additional training dialogues.
(91) In this process, the adversarial sample generation process 614 may support various symmetries for generating a large number of additional linguistic samples. Also, the adversarial dialogue generation process 624 may support various symmetries for generating a large number of additional dialogue samples. By combining these approaches, it is possible to generate an enormous amount of dialogue-based additional training data 618 for the machine learning algorithm 608, even if only a limited amount of initial training data 606 is available. For example, the adversarial dialogue generation process 624 may reorder the linguistic samples contained in initial training dialogues and also replace the linguistic samples in the initial training dialogues with corresponding additional linguistic samples from the adversarial sample generation process 614. Similarly, the adversarial dialogue generation process 624 may interrupt the linguistic samples contained in initial training dialogues with the linguistic samples contained in other initial training dialogues and also replace the linguistic samples in the initial training dialogues with corresponding additional linguistic samples from the adversarial sample generation process 614. Essentially, this approach multiplies the number of permutations used to generate the additional linguistic samples with the number of permutations used to modify the initial training dialogues.
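The multiplicative effect described here can be illustrated with a short Python sketch. The function name `combine_variants` and its data layout (a list of dialogue-level variants plus a dictionary mapping an utterance to its utterance-level variants) are assumptions for this example and are not taken from this disclosure.

```python
from itertools import product
from typing import Dict, List


def combine_variants(
    dialogue_variants: List[List[str]],
    utterance_variants: Dict[str, List[str]],
) -> List[List[str]]:
    """Cross every dialogue-level variant with every choice of
    utterance-level replacement for each of its turns."""
    combined = []
    for dialogue in dialogue_variants:
        # Each turn may keep its original utterance or use any replacement.
        options = [[u] + utterance_variants.get(u, []) for u in dialogue]
        combined.extend(list(choice) for choice in product(*options))
    return combined
```

With two orderings of a two-turn dialogue and one alternative for one of the utterances, the sketch returns four dialogues, which shows how the counts multiply rather than add.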
(92) Naturalness evaluation data 620 can be used here to provide information to the adversarial sample generation process 614 and to the adversarial dialogue generation process 624 for use in excluding unnatural automatically-generated additional linguistic samples and unnatural automatically-generated additional dialogues, respectively. For example, the naturalness evaluation data 620 may represent or include n-grams associated with the language used in the initial training data 606, as well as knowledge graphs, dialogue models, or other information identifying what types of dialogues are deemed natural or unnatural. The adversarial sample generation process 614 can use at least some of the naturalness evaluation data 620 to exclude unnatural additional linguistic samples from being provided to the adversarial dialogue generation process 624. The adversarial dialogue generation process 624 can use at least some of the naturalness evaluation data 620 to determine dialogue naturalness based on, for instance, dialogue properties, knowledge-based reasoning, or background knowledge about the tasks involved. As a particular example of this, a reservation dialogue would typically not involve a user making a reservation and then providing information about the reservation, such as its time and location. Thus, the reordering of the subtasks associated with the reservation dialogue may be allowed, but certain orderings may be excluded as being unnatural. In some cases, changes at the utterance level of a dialogue may be allowed to alter a previous context, as long as the result is still natural as assessed using the naturalness evaluation data 620.
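One simple way to picture the two naturalness checks is sketched below in Python. Both checks are deliberately simplified assumptions: utterance-level naturalness is approximated by requiring every bigram of a candidate to occur somewhere in a reference corpus, and dialogue-level naturalness is approximated by a list of hypothetical precedence constraints standing in for knowledge-based reasoning. The function names and the `must_precede` parameter are not part of this disclosure.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def bigrams(tokens: Sequence[str]) -> Set[Tuple[str, str]]:
    """Return the set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))


def natural_utterance(utterance: str, corpus: Iterable[str]) -> bool:
    """Accept an utterance only if all of its bigrams were seen in the corpus."""
    seen: Set[Tuple[str, str]] = set()
    for sentence in corpus:
        seen |= bigrams(sentence.lower().split())
    return bigrams(utterance.lower().split()) <= seen


def natural_dialogue(subtasks: List[str], must_precede: List[Tuple[str, str]]) -> bool:
    """Reject orderings that violate a constraint (a must come before b)."""
    position = {task: i for i, task in enumerate(subtasks)}
    return all(
        position[a] < position[b]
        for a, b in must_precede
        if a in position and b in position
    )
```

In the reservation example, a constraint such as `("give_time", "make_reservation")` would allow other subtasks to be reordered freely while excluding any dialogue in which the reservation is made before its time is provided.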
(93) In some embodiments, the components of the functional architecture 600 may operate according to the following pseudocode. It should be noted, however, that other implementations of the functional architecture 600 are also possible. The operations of the machine learning algorithm 608 may be expressed using the same pseudocode described above with respect to the operations of the machine learning algorithm 308. While the machine learning algorithm 608 in this example uses training data based on dialogues, the overall process performed by the machine learning algorithm 608 can be the same as (or substantially similar to) the operations of the machine learning algorithm 308 described above.
(94) The operations of the sample symmetries definition process 612 and the dialogue symmetries definition process 622 may be expressed as follows.
Given the task mapping p: dialogues → values:
Define sets S.sup.1.sub.D, S.sup.1.sub.U of dialogue and utterance symmetries of p that are minimal input-edit, output-edit mappings (˜, ≈) that commute with p such that: p(˜x)=≈p(x).
Define S.sup.i+1={f∘g|f∈S.sup.i.sub.a, g∈S.sup.1.sub.b for a, b∈{D, U}} so that elements of S.sup.i map (dialogue-value) pairs to other (dialogue-value) pairs in i steps.
Note that utterance (U) and dialogue (D) perturbations here can be applied in any suitable order, and both are contained in or accounted for by the “distance” index i. The perturbations may be weighted differently, although they may also be weighted equally for simplicity.
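The composition of symmetries into larger "distances" can be sketched in Python as follows. Here, a symmetry is represented as a pair of functions (an input edit and an output edit), and the names `compose` and `next_level` are hypothetical. The sketch does not distinguish utterance symmetries from dialogue symmetries and does not apply any weighting.

```python
from typing import Any, Callable, List, Tuple

# A symmetry is a pair (input edit, output edit); hypothetical representation.
Symmetry = Tuple[Callable[[Any], Any], Callable[[Any], Any]]


def compose(f: Symmetry, g: Symmetry) -> Symmetry:
    """Return f∘g: apply g first and then f, on both the input and output sides."""
    f_in, f_out = f
    g_in, g_out = g
    return (lambda x: f_in(g_in(x)), lambda v: f_out(g_out(v)))


def next_level(level_i: List[Symmetry], level_1: List[Symmetry]) -> List[Symmetry]:
    """Build the distance-(i+1) set from the distance-i set and the base set."""
    return [compose(f, g) for f in level_i for g in level_1]
```

If every base symmetry commutes with the task mapping p, each composed symmetry commutes with p as well, since the input edits and output edits are applied in the same order on both sides.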
(95) The operations of the adversarial sample generation process 614 and the adversarial dialogue generation process 624 may be expressed as follows, where S.sup.1 is obtained from the sample symmetries definition process 612 and the current intermediate model is obtained from the machine learning algorithm 608.
(96) TABLE-US-00002
Adversaries A := ∅
For d ∈ [1, 2, ...]:
    For (~, ≈) ∈ S.sup.d:
        For (x, ν) ∈ Data:
            If Natural.sub.U(~x, Context, n-gram) and Natural.sub.D(~x, Model, KB) and F.sub.1(p(~x), ≈p(x)) << F.sub.1(x, p(x)):
                A := A ∪ {(~x, ≈p(x))}
            If |A| ≥ Rate(Data): break
Return A
Again, the set of adversarial examples is initially empty, and new dialogue-based adversarial examples are added to the set, assuming (i) the linguistic samples in the dialogue-based adversarial examples are natural at the utterance-level and (ii) the dialogue-based adversarial examples are natural at the dialogue-level. Here, utterance-level naturalness can be assessed based on the context and n-grams, and dialogue-level naturalness can be assessed based on the properties of the intermediate model 616 and other knowledge (such as knowledge graphs or dialogue models). Also, the scores here (which in this example are F.sub.1 scores) may be generated per dialogue, since the full prior context may be needed for scoring most dialogue state tracking annotations.
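The selection procedure can be approximated by the following Python sketch. Several details are assumptions rather than features of this disclosure: the two naturalness tests are collapsed into a single `is_natural` callable, the "much less than" comparison is interpreted as a score drop of at least `margin`, the gold value `v` is used as the reference for scoring the original sample, and the limit `rate` is passed directly instead of being computed from the data.

```python
from typing import Any, Callable, List, Tuple

Symmetry = Tuple[Callable[[Any], Any], Callable[[Any], Any]]  # (input edit, output edit)


def find_adversaries(
    data: List[Tuple[Any, Any]],
    symmetries_by_distance: List[List[Symmetry]],
    predict: Callable[[Any], Any],
    score: Callable[[Any, Any], float],
    is_natural: Callable[[Any], bool],
    rate: int,
    margin: float = 0.5,
) -> List[Tuple[Any, Any]]:
    """Collect natural perturbed samples on which the current model scores
    much worse than on the original sample."""
    adversaries: List[Tuple[Any, Any]] = []       # A := empty set
    for level in symmetries_by_distance:          # d = 1, 2, ...
        for edit_in, edit_out in level:           # each symmetry at distance d
            for x, v in data:
                new_x, new_v = edit_in(x), edit_out(v)
                if not is_natural(new_x):         # naturalness filter
                    continue
                # Assumed reading of "<<": a drop of at least `margin`.
                if score(predict(new_x), new_v) + margin <= score(predict(x), v):
                    adversaries.append((new_x, new_v))
                if len(adversaries) >= rate:      # enough adversaries collected
                    return adversaries
    return adversaries
```

Because smaller distances are tried first, the sketch prefers adversarial examples that differ from the original samples by as few edits as possible.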
(97) Once again, the adversarial sample generation process 614 and the adversarial dialogue generation process 624 here can operate to produce additional training data 618 for the machine learning algorithm 608 while having little or no knowledge of how the machine learning algorithm 608 actually operates. Based on this, the actual implementation of the machine learning algorithm 608 may be immaterial, at least with respect to the adversarial sample generation process 614 and the adversarial dialogue generation process 624.
(98) The functional architecture 600 shown in
(99) Although
(100)
(101) As shown in
(102) Additional training data for use in adversarial training of the machine learning model is generated at step 706. In this example, the additional training data is generated using various steps. Here, second linguistic samples are generated based on the first linguistic samples and first symmetries at step 708. This may include, for example, the processor 120 of the electronic device 101 or server 106 performing steps 508-514 as described above. Second symmetries to be used to modify the training dialogues are identified at step 710. This may include, for example, the processor 120 of the electronic device 101 or server 106 obtaining information identifying (or itself identifying) one or more symmetries to be used to modify the initial training dialogues in the initial training data 606. As described above, the symmetries here can be selected so that utterances in the initial training dialogues are reordered, interrupted, or otherwise used in a manner likely to lead to misclassifications by the machine learning algorithm 608. Intermediate dialogues are generated at step 712. This may include, for example, the processor 120 of the electronic device 101 or server 106 applying one or more of the identified dialogue symmetries to the initial training dialogues in the initial training data 606 and, in at least some cases, inserting the second linguistic samples in place of the first linguistic samples in the training dialogues. Unnatural dialogues are filtered and thereby removed from consideration at step 714. This may include, for example, the processor 120 of the electronic device 101 or server 106 using properties of the intermediate model 616 and knowledge graphs, dialogue models, or other naturalness evaluation data 620 to identify any of the intermediate dialogues that might contain unnatural or unexpected content. Worst-case intermediate dialogues are identified as adversarial examples and selected for use during training at step 716. This may include, for example, the processor 120 of the electronic device 101 or server 106 using scores associated with the intermediate dialogues (such as F.sub.1 scores) to identify which of the intermediate dialogues are likely to be improperly classified by the machine learning algorithm 608, while their corresponding initial dialogues are properly classified by the machine learning algorithm 608. This can be determined based on the initial model that has been trained.
(103) The natural language machine learning model is adversarially trained using the selected dialogues as additional training data at step 718. This may include, for example, the processor 120 of the electronic device 101 or server 106 executing the machine learning algorithm 608 to train an intermediate dialogue model. Ideally, the intermediate model is more accurate than the prior version of the model, since the machine learning algorithm 608 has used adversarial examples generated specifically to help the machine learning algorithm 608 learn at least one additional generalization that can be used to classify input samples correctly. A determination is made whether the training is adequate at step 720. This may include, for example, the processor 120 of the electronic device 101 or server 106 determining whether the current intermediate dialogue model is sufficiently accurate (such as based on an F.sub.1 or other score). If not, the process returns to step 708 to perform another iteration of the training process with more additional training data. Otherwise, the current model is output as a trained machine learning model at step 722. This may include, for example, the processor 120 of the electronic device 101 or server 106 providing the current model as the final machine learning model 610, which can be used in a desired dialogue-based natural language application.
(104) Although
(105) Although this disclosure has been described with example embodiments, various changes and modifications may be suggested to one skilled in the art. It is intended that this disclosure encompass such changes and modifications as fall within the scope of the appended claims.