Learning and using programming styles
11243746 · 2022-02-08
Inventors
- Georgios Evangelopoulos (Venice, CA, US)
- Olivia Hatalsky (San Jose, CA, US)
- Bin Ni (Fremont, CA, US)
- Qianyu Zhang (Sunnyvale, CA, US)
Abstract
Techniques are described herein for using artificial intelligence to “learn,” statistically, a target programming style that is imposed in and/or evidenced by a code base. Once the target programming style is learned, it can be used for various purposes. In various implementations, one or more generative adversarial networks (“GANs”), each including a generator machine learning model and a discriminator machine learning model, may be trained to facilitate learning and application of target programming style(s). In some implementations, the discriminator(s) and/or generator(s) may operate on graphical input, and may take the form of graph neural networks (“GNNs”), graph attention neural networks (“GANNs”), graph convolutional networks (“GCNs”), etc., although this is not required.
Claims
1. A method implemented using one or more processors, comprising: processing an abstract syntax tree (AST) that represents a first source code snippet based on a generator machine learning model of a generative adversarial network (GAN) to generate a synthetic AST, wherein the synthetic AST corresponds to a transformation of the first source code snippet from another programming style to a target programming style via one or more candidate edits, and wherein the generator machine learning model comprises a first graph neural network (GNN); processing the synthetic AST based on a discriminator machine learning model of the GAN to generate style output, wherein the discriminator machine learning model comprises a second GNN trained based on training data that includes source code snippets in the target programming style, some of which are labeled as genuine and others that are labeled synthetic, and the style output indicates that, if the one or more candidate edits were applied to the first source code snippet, the edited first source code snippet would fail to conform with the target programming style; and based on the style output, training the generator machine learning model.
2. The method of claim 1, wherein the discriminator GNN is coupled with a prediction layer.
3. The method of claim 2, wherein the prediction layer comprises a softmax layer or a sigmoid function layer.
4. A method implemented using one or more processors, comprising: processing an abstract syntax tree (AST) that represents a first source code snippet based on a generator machine learning model of a generative adversarial network (GAN) to generate a synthetic AST, wherein the synthetic AST corresponds to a transformation of the first source code snippet from another programming style to a target programming style via one or more candidate edits, and wherein the generator machine learning model comprises a first graph neural network (GNN); processing the synthetic AST based on a discriminator machine learning model of the GAN to generate style output, wherein the discriminator machine learning model comprises a second GNN trained based on training data that includes source code snippets in the target programming style, some of which are labeled as genuine and others that are labeled synthetic, and the style output indicates that the synthetic second source code snippet conforms with the target programming style; and based on the style output, training the discriminator machine learning model.
5. The method of claim 4, wherein the discriminator GNN is coupled with a prediction layer.
6. The method of claim 5, wherein the prediction layer comprises a softmax layer or a sigmoid function layer.
7. A system comprising one or more processors and memory storing instructions that, in response to execution of the instructions by the one or more processors, cause the one or more processors to: process an abstract syntax tree (AST) that represents a first source code snippet based on a generator machine learning model of a generative adversarial network (GAN) to generate a synthetic AST, wherein the synthetic AST corresponds to a transformation of the first source code snippet from another programming style to a target programming style via one or more candidate edits, and wherein the generator machine learning model comprises a first graph neural network (GNN); and cause to be presented, at one or more output devices, output that includes the one or more candidate edits to be made to the first source code snippet; wherein the generator machine learning model is trained using the following operations: processing a training AST that represents a training source code snippet based on the generator machine learning model to generate a training synthetic AST, wherein the training synthetic AST corresponds to a transformation of the first source code snippet from another programming style to a target programming style via one or more candidate edits; processing the training synthetic AST based on a discriminator machine learning model of the GAN to generate style output, wherein the discriminator machine learning model comprises a second GNN trained based on training data that includes source code snippets in the target programming style, some of which are labeled as genuine and others that are labeled synthetic, and the style output indicates whether, if the one or more candidate edits were applied to the training source code snippet, the edited training source code snippet would conform to the target programming style; and based on the style output, training the generator machine learning model.
8. The system of claim 7, wherein the discriminator machine learning model is trained using the following operations: processing the synthetic training AST based on the discriminator machine learning model to generate the style output; and based on the style output, training the discriminator machine learning model.
9. The system of claim 7, further comprising instructions to generate a synthetic source code snippet based on the synthetic AST.
10. The system of claim 9, wherein the output comprises a graphical user interface that conveys whether the synthetic source code snippet conforms to the target programming style.
11. The system of claim 7, wherein the output comprises a graphical user interface that presents one or more edit suggestions corresponding to the one or more candidate edits to be made to the first source code snippet.
Description
BRIEF DESCRIPTION OF THE DRAWINGS
(1)
(2)
(3)
(4)
(5)
(6)
(7)
DETAILED DESCRIPTION
(8)
(9) Code knowledge system 102 may be configured to perform selected aspects of the present disclosure in order to help one or more clients 110.sub.1-P manage one or more corresponding code bases 112.sub.1-P. Each client 110 may be, for example, an entity or organization such as a business (e.g., financial institution, bank, etc.), non-profit, club, university, government agency, or any other organization that operates one or more software systems. For example, a bank may operate one or more software systems to manage the money under its control, including tracking deposits and withdrawals, tracking loans, tracking investments, and so forth. An airline may operate one or more software systems for booking/canceling/rebooking flight reservations, managing delays or cancellations of flights, managing people associated with flights, such as passengers, air crews, and ground crews, managing airport gates, and so forth.
(10) Code knowledge system 102 may be configured to leverage knowledge of multiple different programming styles in order to aid clients 110.sub.1-P in imposing particular programming styles on their code bases 112.sub.1-P. For example, code knowledge system 102 may be configured to recommend specific changes to various snippets of source code as part of an effort to align the overall code base 112 with a particular programming style. In some implementations, code knowledge system 102 may even implement source code changes automatically, e.g., if there is sufficient confidence in a proposed source code change.
(11) In various implementations, code knowledge system 102 may include a machine learning (“ML” in
(12) In some implementations, code knowledge system 102 may also have access to one or more programming-style-specific code bases 108.sub.1-M. In some implementations, these programming-style-specific code bases 108.sub.1-M may be used, for instance, to train one or more of the machine learning models 106.sub.1-N. In some such implementations, and as will be described in further detail below, a given programming-style-specific code base 108 may be used in combination with other data, such as other programming-style-specific code bases 108, to train machine learning models 106.sub.1-N to jointly learn transformations between programming styles.
(13) In various implementations, a client 110 that wishes to enforce a programming style on all or part of its code base 112 may establish a relationship with an entity (not depicted in
(14)
(15) Beginning at the top left, a style-B code base 208.sub.1 may include one or more source code snippets 230.sub.1 written in a particular programming style (“B” in this example) that is different than a target programming style (“A” in this example). For example, each source code snippet 230.sub.1 may be obtained from a particular library, entity, and/or application programming interface (“API”). Each of style-B source code snippets 230.sub.1 may comprise a subset of a source code file or an entire source code file, depending on the circumstances. For example, a particularly large source code file may be broken up into smaller snippets (e.g., delineated into functions, objects, etc.), whereas a relatively short source code file may be kept intact throughout processing.
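As a non-limiting illustration of how a source code file might be delineated into snippets, the following minimal sketch keeps a short file intact and splits a longer file into one snippet per top-level function or class. It assumes Python source and uses the standard `ast` module; the function name `split_into_snippets` and the `max_lines` cutoff are hypothetical and do not appear in the disclosure.

```python
import ast


def split_into_snippets(source, max_lines=50):
    """Return [source] for a short file, else one snippet per top-level def/class."""
    # A relatively short file is kept intact (max_lines is an assumed cutoff).
    if len(source.splitlines()) <= max_lines:
        return [source]

    tree = ast.parse(source)
    snippets = []
    for node in tree.body:
        # Only top-level functions and classes are delineated in this sketch;
        # other top-level statements (imports, constants) are skipped.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                snippets.append(segment)
    # Fall back to the whole file if nothing could be delineated.
    return snippets or [source]


example = "def f():\n    return 1\n\ndef g():\n    return 2\n"
function_snippets = split_into_snippets(example, max_lines=2)
```

Other delineations (e.g., per object, per method, or per fixed-size window) may be equally suitable, depending on the circumstances.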
(16) At least some of the style-B source code snippets 230.sub.1 of code base 208.sub.1 may be converted into alternative forms, such as a graph or tree form, in order for them to be subjected to additional processing. For example, in
(17) In addition to the top pipeline in
(18) The style-B ASTs 232.sub.1 and the style-A ASTs 232.sub.2 may then be used, one at a time and/or in batches, to train a generative adversarial network (“GAN”) that includes a generator 240 and a discriminator 242. Generator 240 and/or discriminator 242 may take various forms, which may or may not be the same as each other. These forms may include, but are not limited to, a feed-forward neural network, a GNN, GANN, GCN, sequence-to-sequence model (e.g., an encoder-decoder), etc.
(19) In some implementations, generator 240 is applied to style-B ASTs 232.sub.1 to generate what will be referred to herein as “edit output.” Edit output is so named because it may indicate one or more edits to be made to the style-B source code snippet 230.sub.1 under consideration to conform with programming style A. Depending on the configuration of the machine learning model(s) used for generator 240, this edit output may take various forms. In some implementations, including that in
(20) In other implementations, the edit output generated by generator 240 may take the form of an edit script identifying one or more edits of the style-B source code snippet 230.sub.1 that would transform style-B source code snippet 230.sub.1 from programming style B to style A. These edits may be implemented automatically or may be suggested for implementation to one or more programmers. In yet other implementations, the edit output may take the form of a latent space embedding, or feature vector. In some such implementations, the feature vector may then be applied as input across a decoder machine learning model (not depicted) that is trained to decode from latent space embeddings into style-A source code.
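To make the notion of an edit script concrete, the following minimal sketch lists the line-level edits that turn one version of a snippet into another. It is only an illustration of what an edit script may look like: the script here is computed by diffing two already-available snippets with the standard `difflib` module, whereas in the implementations described above the edits would be produced by generator 240. The function name `edit_script` and the two sample snippets are hypothetical.

```python
import difflib


def edit_script(before, after):
    """List (operation, old_lines, new_lines) tuples that turn `before` into `after`."""
    old = before.splitlines()
    new = after.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    script = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Unchanged runs are not edits, so only replace/delete/insert are kept.
        if tag != "equal":
            script.append((tag, old[i1:i2], new[j1:j2]))
    return script


# Hypothetical snippets: two-space indentation and no operator spacing ("style B")
# versus four-space indentation with operator spacing ("style A").
style_b = "def f(x):\n  y=x+1\n  return y\n"
style_a = "def f(x):\n    y = x + 1\n    return y\n"
edits = edit_script(style_b, style_a)
```

Each tuple in `edits` could be applied automatically or surfaced to a programmer as a suggested change.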
(21) Meanwhile, and referring back to
(22) During training, the style output generated by discriminator 242 may be provided to a training module 244, which may be implemented using any combination of hardware or software. Training module 244 may be configured to compare the style output to label(s) associated with the upstream input data. For example, during training, generator 240 or another component may label its edit output as “synthetic” or something similar. Meanwhile, style-A AST(s) 232.sub.2 may be labeled as “genuine” or something similar.
(23) Training module 244 may compare these labels to style output generated by discriminator 242 for respective training examples. If the style output indicates that a particular training example (i.e., a particular synthetic style-A AST 234) conforms to programming style A but is actually associated with a label identifying the training example as “synthetic,” then discriminator 242 has been “fooled.” In response, training module 244 may train discriminator 242 as shown by the arrow in
(24) By contrast, suppose the style output from discriminator 242 indicates that a particular training example (i.e., a particular synthetic style-A AST 234) labeled as “synthetic”—i.e., it was generated by generator 240—does not conform with programming style A. This means the attempt by generator 240 to “fool” discriminator 242 failed. In response, training module 244 may train generator 240 as shown by the arrow in
(25) After generator 240 and discriminator 242 are trained with a sufficient number of training examples, generator 240 may be adept at generating synthetic style-A AST(s) 234 that are virtually indistinguishable by discriminator 242 from “genuine” style-A AST(s) 232.sub.2. And discriminator 242 may be adept at spotting all but the best imitations of style-A AST(s). In some implementations, generator 240 may be usable moving forward to generate edit output that can be used to transform style-B source code snippets to style-A source code snippets. For example, the edit output may include an edit script with one or more proposed or candidate changes to be made to the style-B source code snippet, a style-A AST that can be converted to a style-A source code snippet, etc. Discriminator 242 may be usable moving forward to, for instance, notify a programmer whether their source code conforms to a target programming style.
(26) As noted previously, in some implementations, generator 240 and/or discriminator 242 may be implemented using machine learning models that operate on graph input. With GNNs, for example, ASTs 232.sub.1-2 may be operated on as follows. Features (which may be manually selected or learned during training) may be extracted for each node of the AST to generate a feature vector for each node. Each node of an AST may represent a variable, object, or other programming construct. Accordingly, the feature vector generated for a node may include features such as variable type (e.g., int, float, string, pointer, etc.), name, operator(s) that act upon the variable as an operand, etc. A feature vector for a node at any given point in time may be deemed that node's “state.” Meanwhile, each edge of the AST may be assigned a machine learning model, e.g., a particular type of machine learning model or a particular machine learning model that is trained on particular data.
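The per-node feature extraction described above may be sketched as follows, under several assumptions that are not part of the disclosure: the source is Python, the features are manually selected (a one-hot node kind plus two name-based features), and the names `NODE_KINDS`, `node_features`, and `ast_to_graph` are hypothetical. Learned features could be substituted for the hand-picked ones.

```python
import ast

# Hypothetical, hand-picked vocabulary of node kinds; all others map to "other".
NODE_KINDS = ["Name", "Constant", "BinOp", "Assign", "FunctionDef", "Return", "other"]


def node_features(node):
    """Initial state of a node: one-hot node kind plus two name-based features."""
    kind = type(node).__name__
    if kind not in NODE_KINDS:
        kind = "other"
    one_hot = [1.0 if kind == k else 0.0 for k in NODE_KINDS]
    # Variables expose `id`, functions and classes expose `name`.
    name = getattr(node, "id", None) or getattr(node, "name", None) or ""
    has_underscore = 1.0 if "_" in name else 0.0
    return one_hot + [has_underscore, float(len(name))]


def ast_to_graph(source):
    """Return (states, edges): one feature vector per node, parent->child index pairs."""
    states, edges = [], []

    def visit(node, parent_index):
        index = len(states)
        states.append(node_features(node))
        if parent_index is not None:
            edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(ast.parse(source), None)
    return states, edges


states, edges = ast_to_graph("total_sum = a + 1")
```

The edge list is kept separate from the node states so that a model (or model type) can later be associated with each edge, as described above.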
(27) Then, for each time step of a series of time steps, feature vectors, or states, of each node may be propagated to their neighbor nodes along the edges/machine learning models, e.g., as projections into latent space. In some implementations, incoming node states to a given node at each time step may be summed (which is order-invariant), e.g., with each other and the current state of the given node. As more time steps elapse, a radius of neighbor nodes that impact a given node of the AST increases.
(28) Intuitively, knowledge about neighbor nodes is incrementally “baked into” each node's state, with more knowledge about increasingly remote neighbors being accumulated in a given node's state as the machine learning model is iterated more and more. In some implementations, the “final” states for all the nodes of the AST may be reached after some desired number of iterations is performed. This number of iterations may be a hyper-parameter of the GNN. In some such implementations, these final states may be summed to yield an overall state or embedding of the AST.
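The propagation and summation described in the two preceding paragraphs may be illustrated by the following minimal sketch. It makes simplifying assumptions that are not part of the disclosure: a single shared linear map (`weight`) stands in for the per-edge machine learning models, edges are treated as undirected, and no nonlinearity or gating is applied. The names `propagate` and `readout` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def propagate(states, edges, weight, steps):
    """Run `steps` rounds of sum-aggregated message passing; return final node states."""
    neighbors = {i: [] for i in range(len(states))}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    for _ in range(steps):  # the number of steps is a hyper-parameter
        new_states = []
        for i, state in enumerate(states):
            total = list(state)  # the node's current state is part of the sum
            for j in neighbors[i]:
                message = matvec(weight, states[j])  # projection along the edge
                total = [t + m for t, m in zip(total, message)]  # order-invariant sum
            new_states.append(total)
        states = new_states
    return states


def readout(states):
    """Sum the final node states into one overall embedding of the graph."""
    return [sum(column) for column in zip(*states)]


# A three-node path 0 - 1 - 2 in which only node 0 starts with a non-zero state.
initial = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
path = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]
after_one = propagate(initial, path, identity, steps=1)
after_two = propagate(initial, path, identity, steps=2)
embedding = readout(after_two)
```

In this toy graph, node 2 is two edges away from node 0, so its state is still all zeros in `after_one` and becomes non-zero only in `after_two`, which mirrors the growing radius of influence described above.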
(29) When generator 240 is implemented using a GNN, the overall state or embedding of the AST may be applied as input across one or more additional machine learning models and/or other processing streams to generate a synthetic style-A AST 234 and/or style-A source code. For example, an encoder-decoder network, or “autoencoder,” may be trained so that an encoder portion generates a latent space embedding from an input AST or source code, and a decoder portion translates that latent space embedding back into the original input. Once such an encoder-decoder network is trained, the decoder portion may be separated and applied to the latent space embedding generated by the GNN used for generator 240 to generate a style-A AST 234 and/or style-A source code. In some implementations in which discriminator 242 is implemented at least in part using a GNN, the GNN may be coupled with a prediction layer, e.g., a softmax layer or a sigmoid function layer, that outputs yes or no (or one or zero, or a probability) based on the latent space embedding generated by discriminator 242.
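As a minimal sketch of such a prediction layer, the following shows a sigmoid head that maps an embedding to a single conformance probability and a softmax that maps per-class scores to a probability distribution. The weights, bias, and function names are hypothetical placeholders; in practice they would be learned jointly with the GNN.

```python
import math


def sigmoid_head(embedding, weights, bias):
    """Map an embedding to a probability that it conforms to the target style."""
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def softmax(logits):
    """Map per-class scores to probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probability = sigmoid_head([0.0, 0.0], weights=[1.0, 1.0], bias=0.0)
class_probabilities = softmax([0.0, 0.0])
```

For a two-class decision the two forms are interchangeable: a softmax over the scores (0, z) assigns the second class the same probability as a sigmoid applied to z. A yes/no answer can be obtained by thresholding the probability.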
(30) As noted previously, when jointly learning transformations between two programming styles, it may be unlikely that the code bases underlying each programming style can be aligned perfectly, or even approximately, into pairs for training. For example, a particular function in one code base may not necessarily have an equivalent in the other code base. Accordingly, in some implementations, techniques such as CycleGAN may be employed to facilitate relatively (or entirely) unsupervised learning of generator/discriminator pairs for each programming style. With such techniques it is possible to learn domain transformations between the two programming styles without requiring paired training data.
(31)
(32) A second GAN at bottom includes a B2A generator 340.sub.2 and a style-A discriminator 342.sub.2. B2A generator 340.sub.2 may be trained to operate on programming style-B input (e.g., source code snippet(s), AST(s), etc.) to generate edit output that is indicative of changes to be made to the style-B input to transform it to programming style A. Style-A discriminator 342.sub.2 may be trained to classify input (e.g., source code snippet(s), AST(s), etc.) as conforming or not conforming to programming style A.
(33) Similar to
(34) In addition, as indicated by the dashed arrow from A2B generator 340.sub.1 to B2A generator 340.sub.2, synthetic style-B AST(s) generated by A2B generator 340.sub.1 may be conditionally applied as input across B2A generator 340.sub.2. This conditional application may turn on the style output of style-B discriminator 342.sub.1. If the style output of style-B discriminator 342.sub.1 indicates that the synthetic style-B AST conforms to programming style-B (i.e., style-B discriminator 342.sub.1 has been “fooled”), then the synthetic style-B AST may be applied as input across B2A generator 340.sub.2 to generate a synthetic style-A AST, which may then be applied as input across style-A discriminator 342.sub.2.
(35) Similarly, as indicated by the dashed arrow from B2A generator 340.sub.2 to A2B generator 340.sub.1, synthetic style-A AST(s) generated by B2A generator 340.sub.2 may be conditionally applied as input across A2B generator 340.sub.1. This conditional application may turn on the style output of style-A discriminator 342.sub.2. If the style output of style-A discriminator 342.sub.2 indicates that the synthetic style-A AST conforms to programming style-A (i.e., style-A discriminator 342.sub.2 has been “fooled”), then the synthetic style-A AST may be applied as input across A2B generator 340.sub.1 to generate a synthetic style-B AST, which may then be applied as input across style-B discriminator 342.sub.1. Thus, a training cycle is formed that enables joint learning of transformations between programming styles A and B without having paired data.
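The conditional hand-off between the two GANs may be sketched as follows. The sketch is deliberately simplified and rests on stated assumptions: generators and discriminators are passed in as plain callables, discriminators return a score in [0, 1], "fooled" is taken to mean a score at or above a hypothetical `threshold`, and the toy stand-ins operate on strings rather than ASTs. None of these names come from the disclosure.

```python
def cycle_step(style_a_input, a2b, b2a, disc_b, disc_a, threshold=0.5):
    """One A -> B -> A pass; the second hop runs only if disc_b was fooled."""
    synthetic_b = a2b(style_a_input)
    score_b = disc_b(synthetic_b)
    result = {"synthetic_b": synthetic_b, "score_b": score_b,
              "synthetic_a": None, "score_a": None}
    if score_b >= threshold:  # style-B discriminator judged the output as style B
        synthetic_a = b2a(synthetic_b)
        result["synthetic_a"] = synthetic_a
        result["score_a"] = disc_a(synthetic_a)
    return result


# Toy stand-ins (hypothetical): style A indents with four spaces, style B with tabs.
def toy_a2b(text):
    return text.replace("    ", "\t")


def toy_b2a(text):
    return text.replace("\t", "    ")


def toy_disc_b(text):
    return 1.0 if "    " not in text else 0.0


def toy_disc_a(text):
    return 1.0 if "\t" not in text else 0.0


snippet = "def f():\n    return 1\n"
outcome = cycle_step(snippet, toy_a2b, toy_b2a, toy_disc_b, toy_disc_a)
```

The opposite direction (B to A and back to B) follows by swapping the roles of the arguments. In a CycleGAN-style setup, the round-trip result may additionally be compared with the original input, which is what allows training without paired data.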
(36) Techniques described herein may be utilized to provide programmers, e.g., operating client devices 110.sub.1-P, with tools that facilitate conformance with target programming styles. These tools may be provided, for instance, as features or plugins associated with a software development tool. These tools may enable programmers to see whether their source code conforms to a target programming style (e.g., one color of text may indicate conforming code whereas another color of text may indicate non-conforming code), to receive suggestions as to how their source code can be modified to conform to the target programming style (e.g., for training purposes), and/or to automatically transform their source code to the target programming style.
(37)
(38) In this example, some snippets, such as RECONCILE_DEBIT_ACCOUNTS.CC, RECONCILE_CREDIT_ACCOUNTS.CC, and ACQUISITION_ROUNDUP.PHP, conform to programming style A. The remaining source code snippets do not. In other examples, rather than simply indicating whether or not a source code snippet conforms to programming style A, a probability or grade that indicates how well the source code snippet conforms to programming style A may be provided. An interface such as GUI 450 may allow a programmer to focus on those source code snippets that do not yet conform to the target programming style.
(39) In some implementations, the programmer may be able to select a source code snippet from GUI 450 to receive more specific information about why the selected source code snippet doesn't conform to the target programming style. For example, in some implementations, by clicking a non-conforming source-code snippet, the programmer may be presented with a list of potential edits that can be made to the source code snippet to bring it into conformance with the target programming style.
(40)
(41)
(42) At block 502, the system may apply data associated with a first source code snippet, such as the source code snippet itself, an AST generated from the source code snippet, or a latent space embedding generated from the snippet or from the AST (e.g., using a GNN), as input across a generator machine learning model to generate edit output. In various implementations, the edit output may be indicative of one or more edits to be made to the first code snippet to conform to a target programming style.
(43) At block 504, the system may apply data indicative of the edit output as input across a discriminator machine learning model to generate style output. As noted previously, the discriminator machine learning model may be trained to detect conformance with the target programming style. At block 506, the system may determine whether the style output indicates that the edit output conforms to the target programming style.
(44) If it is determined at block 506 that the edit output conforms to the target programming style, then method 500 may proceed to block 508, at which point the next training example is selected. However, if at block 506 the system determines that the style output indicates nonconformance of the edit output with the target programming style, then method 500 may proceed to block 510. At block 510, the system, e.g., by way of training module 244, may train the generator machine learning model, e.g., using techniques such as gradient descent, back propagation, etc.
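The control flow of blocks 502-510 may be summarized by the following minimal sketch. The models are passed in as plain callables, and the actual parameter update (e.g., gradient descent with back propagation) is abstracted behind a hypothetical `update_generator` callback; the `threshold` used to decide conformance is likewise an assumption.

```python
def train_generator(examples, generator, discriminator, update_generator,
                    threshold=0.5):
    """Loop over training examples; update the generator only when it fails to fool."""
    updates = 0
    for example in examples:                        # block 508: next training example
        edit_output = generator(example)            # block 502
        style_output = discriminator(edit_output)   # block 504
        if style_output >= threshold:               # block 506: conforms, nothing to do
            continue
        update_generator(example, edit_output, style_output)  # block 510
        updates += 1
    return updates


# Toy stand-ins (hypothetical): integers play the role of snippets, and the
# discriminator accepts only odd values as conforming.
updated_on = []
count = train_generator(
    examples=[1, 2, 3],
    generator=lambda value: value,
    discriminator=lambda value: 1.0 if value % 2 else 0.0,
    update_generator=lambda example, output, score: updated_on.append(example),
)
```

Only the second example triggers an update here, because it is the only one the toy discriminator rejects.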
(45)
(46) At block 602, the system may apply data associated with a first source code snippet, such as the source code snippet itself, an AST generated from the source code snippet, or a latent space embedding generated from the snippet or from the AST (e.g., using a GNN), as input across a generator machine learning model (e.g., 240, 340.sub.1-2) to generate edit output. This edit output may take the form of an edit script, synthetic AST, a latent space embedding, etc.
(47) Based on the edit output generated at block 602, at block 604, the system may generate a synthetic second source code snippet. For example, if the edit output generated at block 602 was an AST, generating the synthetic second source code snippet may be a simple matter of converting the AST to source code using known techniques. In other implementations in which the edit output comprises a latent space embedding, the latent space embedding may be applied across a trained decoder machine learning model to generate source code output.
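For the case in which the edit output is an AST, converting it back to source code may be sketched as follows for Python, using `ast.unparse` from the standard library (available in Python 3.9 and later). The hand-written `SnakeCaseNames` transformer is a hypothetical stand-in for a trained generator and is included only so that there is an edited AST to convert; it naively renames every identifier it visits.

```python
import ast
import re


def to_snake_case(name):
    """Convert a camelCase identifier to snake_case."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


class SnakeCaseNames(ast.NodeTransformer):
    """Toy stand-in for a generator: rewrites variable names in the AST."""

    def visit_Name(self, node):
        node.id = to_snake_case(node.id)
        return node


def synthesize_snippet(source):
    """Edit the AST of `source`, then convert the edited AST back to source code."""
    tree = SnakeCaseNames().visit(ast.parse(source))
    return ast.unparse(tree)


synthetic = synthesize_snippet("totalSum = firstValue + 1")
```

One caveat with this particular round trip is that `ast.unparse` discards comments and the original whitespace, so style conventions that concern layout would need a representation that preserves such detail.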
(48) At block 606, the system may apply data indicative of the synthetic second source code snippet as input across a discriminator machine learning model (e.g., 242, 342.sub.1-2) to generate style output. As described previously, the discriminator machine learning model may be trained to detect conformance with the target programming style. Thus, the style output may indicate that the synthetic second source code snippet conforms to, or does not conform to, the target programming style.
(49) Based on the style output generated at block 606, at block 608, the system (e.g., by way of training module 244) may train the discriminator machine learning model, e.g., using techniques such as gradient descent, back propagation, etc. For example, if the discriminator classifies the synthetic second source code snippet as genuine, that may serve as a negative training example for the discriminator. By contrast, if the discriminator correctly classifies the synthetic second source code snippet as synthetic, that may serve as a positive training example.
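One way the update at block 608 could look is sketched below for a deliberately simple discriminator: a logistic model trained with one gradient-descent step on binary cross-entropy. This is an assumption made for illustration, not the architecture of the disclosure; the label convention (1.0 for genuine, 0.0 for synthetic), the feature vector, and the learning rate are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discriminator_step(weights, features, label, learning_rate=0.1):
    """One gradient-descent step on binary cross-entropy for a logistic discriminator.

    label is 1.0 for a genuine example and 0.0 for a synthetic one.
    Returns the updated weights and the prediction made before the update.
    """
    prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
    error = prediction - label  # gradient of the loss with respect to the logit
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    return new_weights, prediction


# A synthetic example that the untrained discriminator scores at 0.5.
features = [1.0, 2.0]
weights, before = discriminator_step([0.0, 0.0], features, label=0.0)
_, after = discriminator_step(weights, features, label=0.0)
```

After the update, the same synthetic example receives a lower score (`after` is smaller than `before`), i.e., the discriminator has moved toward classifying it as synthetic.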
(50)
(51) User interface input devices 722 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computing device 710 or onto a communication network.
(52) User interface output devices 720 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computing device 710 to the user or to another machine or computing device.
(53) Storage subsystem 724 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 724 may include the logic to perform selected aspects of the method of
(54) These software modules are generally executed by processor 714 alone or in combination with other processors. Memory 725 used in the storage subsystem 724 can include a number of memories including a main random access memory (RAM) 730 for storage of instructions and data during program execution and a read only memory (ROM) 732 in which fixed instructions are stored. A file storage subsystem 726 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 726 in the storage subsystem 724, or in other machines accessible by the processor(s) 714.
(55) Bus subsystem 712 provides a mechanism for letting the various components and subsystems of computing device 710 communicate with each other as intended. Although bus subsystem 712 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
(56) Computing device 710 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 710 depicted in
(57) While several implementations have been described and illustrated herein, a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein may be utilized, and each of such variations and/or modifications is deemed to be within the scope of the implementations described herein. More generally, all parameters, dimensions, materials, and configurations described herein are meant to be exemplary, and the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the teachings are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific implementations described herein. It is, therefore, to be understood that the foregoing implementations are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, implementations may be practiced otherwise than as specifically described and claimed. Implementations of the present disclosure are directed to each individual feature, system, article, material, kit, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, kits, and/or methods, if such features, systems, articles, materials, kits, and/or methods are not mutually inconsistent, is included within the scope of the present disclosure.