ML on Code is a rapidly developing field, both in academia and industry, that source{d} was set out to systematically explore throughout the last years. So far the results published by our Data Retrieval, Machine Learning, and Infrastructure teams who collect and store millions of Git repositories were based on large-scale applications of advanced NLP techniques such as: Identifiers Embedding, Topic Modeling and sequence model for Identifier Splitting. Current research avenues, driven by the applications for assisted code review, include models using more structured representations of the source code, based on Universal Abstract Syntax Trees and graphs.

That is why the second paper that is covered here will be on recent advances in program representations suitable for Machine Learning that go beyond syntax and traditional NLP techniques.

Source code as a graph

“Learning to Represent Programs with Graphs” — a paper from “Deep Program Understanding” group at Microsoft Research was presented at ICLR 2018 earlier this year.

This work is particularly interesting example of research for a few reasons:

  • has an interesting background, rooted in physics research,
  • explores structured, graph-based representations,
  • includes but goes beyond purely syntactic features,
  • model has official open source implementation (open science!),
  • and this knowledge was actually applied in industry, to build a real product

We’ll summarize briefly the paper itself in the next sections, but first some background on the main Machine Learning model used in the paper — “Gated Graph Neural Networks”. 🕵

History: chemistry -> message passing -> ML on Code

In the official model’s repository README the authors share that the inspiration for this work comes from another field of ML research: Quantum Chemistry.

Interestingly enough, in a few recent talks, Jeff Dean while talking about most exciting advances in ML applications mentions Quantum Chemistry as well:

ML application in quantum physics. Source: Jeff Dean slides

The fundamental idea is well-understood — given a Schrödinger equation, one can get information about the state of a single particle, thus solving it for composition would allow to model properties of more complex structures, including molecules and in general, solid state matter. But solving the many-body Schrödinger equation requires huge computational efforts.

Instead, a Nobel prize winning modeling approach called Density Functional Theory or DFT can be applied, reducing the problem of many-body interacting system to a series of single-body problems and although still slow, it is highly valuable for many tasks in physics, chemistry and material science. Many approximation methods have been developed to make this more feasible by getting estimates instead of exact answers.

Neural Networks for predicting properties of molecules

Many DFT calculation software simulators already exist, but are slow and still computationally intensive. In 2014 those simulators were used to build a reference dataset — QM9, suitable for appling supervised learning algorithms.

So in 2017, several researchers at the Google Brain Residence program spent time applying Neural Networks for predicting properties of molecules on that dataset:

  • a new “featurization” method was proposed, for looking at molecular structure as a graph: atoms as nodes and bonds as edges
  • a new variation of the Gated Graph Neural Network architecture was proposed, particularly suited for summarizing properties of such graphs

Every such graph representing a molecule could be treated as a “computational graph”, thus the usual Neural Network training techniques could be applied to build node embeddings and given such a model the desired properties of the molecule could be learned in a supervised fashion, as a function of the whole graph.

MPNN illustration from “Neural Message Passing for Quantum Chemistry”

The blog post titled “Predicting properties of molecules” by Google Research dives deeper into details, but this work, in particular, has several valuable meta-lessons to teach on conducting a novel research in applied Machine Learning:

  • a single, shared benchmark QM9 was used (based on DFT, previous simulation approach),
  • a systematic assessment of existing machine learning methods on the QM9 benchmark was conducted, and a new featurization method was proposed in the first paper “Machine learning prediction errors better than DFT accuracy”,
  • a general model family of “Message Passing Neural Networks” (MPNNs), with a particular model that improves results by the factor of ~4 was proposed in the second paper on “Neural Message Passing for Quantum Chemistry”,
  • research was not only a leaderboard-driven - a high-level interpretation/hypothesis was also provided: models that can leverage inherent symmetries in data will tend to generalize better

Paper highlights

Now back to “Learning to Represent Programs with Graphs”

A new “featurization” of code was proposed: a “program graph”, or a single unified graph containing AST + data flow + types information.

model architecture: GG-NN (code)

data: 2.9m LoC, 29 big projects (data)

evaluation on tasks:

  • new task proposed: VarMissuse, benchmarked against BiRNN baseline
  • old task used VarName, benchmarked against previous models


  • 32.9% accuracy on the VarNaming
  • 85.5% accuracy on the VarMisuse task
  • real bugs in OSS projects fixed

constraints: statically typed languages, C# (only a subset, buildable with dotnet/roslyn)

A brief summary

Graph construction

A “program graph” with syntax information, data-flow information, and type information was introduced.

Graph structure illustration from ‘Learning to Represent Programs with Graphs’

The program graph consists of an syntactic information from AST plus semantic information about data and control flows, using 10 different types of edges (contributes proportionally to runtime complexity):

  • Child/NextToken — edges to model AST on tokens
  • LastRead/LastWrite/ComputedFrom — edges for variables. Model control flow/data flows
  • LastLexicalUse — chain uses of the same variable
  • GuardedBy/GuardedByNegation — enclosing guard expression that uses this variable
  • FormalArgName — connect method call arguments to it’s name/type declaration
  • ReturnsTo — links return tokens to name/type in method declaration

Model details: GG-NN

The concrete architecture used was [Gated Graph Sequence Neural Networks or GG-NN:

Below is a very brief recap of GG-NN, a recurrent network from a family of “Message Passing Neural Networks”.

Input: a graph, Output: a sequence. Proposed implementation uses GRU, unrolls the recurrence for a fixed number of steps and use truncated backpropagation through time in order to compute gradients.

The idea is:

source: tutorial ‘Representation Learning on Networks’, WWW 2018.
  • an input graph can be treated as “computation graph”
  • messages between nodes, as a way of “summarizing” the neighborhood for every node
  • node embeddings thus can be learned, by propagating messages between connected nodes
  • propagation happens step by step: the first step propagate information from direct neighbors, the second step from nodes 2 steps away, etc
  • as NNs can be deep, to make training stable and avoid exploding/vanishing gradients use same ideas as in RNN: at every step, combine previous state + new input. Initial step concatenated embeddings of “node label” + “node type”.

Initial node state: node name embedding

source: ICLR 2018 poster by MSR
  • average embeddings of the sub-tokens, split by CamelCase and snake_case
  • concatenated with type embedding
  • pass through a linear layer

The authors have also published a reference implementation of the GG-NN model in TensorFlow.

Microsoft/gated-graph-neural-network-samples - Sample Code for Gated Graph Neural Networks

To dive deeper into this method as well as other graph-based learning methods, we recommend checking Stanford’s SNAP tutorial on Representation learning on graphs.


For each of 2 tasks used in this paper a slightly different “program graph” and GG-NN model architecture was proposed.


source: poster at ICLR 2018

VarNaming generates a sequence of identifier sub-tokens, as a function of the whole graph.

That is what authors call an example of “graph2seq architecture”:

  • 8 time steps propagating GG-NN for each var occurrence, starting from the “initial state” described above
  • average of all variable occurrences is input to one-layer GRU, trained with max likelihood, that outputs final name as a sequence of sub-tokens

Accuracy for predicting the exact name and the F1 score for predicting its subtokens is reported.


source: poster at ICLR 2018

VarMisuse is “fill-in box” for variable name: predict which one of the existing variables can be used in a given slot.

A single variable is “blanked out” from the graph by adding a synthetic node. The model is asked to predict the originally used variable, out of the all known vars used.

This task is different from the seemingly close “code completion” task, as it deals only with variables and in “mostly complete” programs.

Training for this task is a bit more involved:

  • compute context representation for each slot, where we want to predict the used variable: insert a new node corresponding to a “hole”, connect it to the remaining graph using all applicable edges that do not depend on the chosen variable at the slot (everything but LastUse, LastWrite, LastLexicalUse, and GuardedBy)
  • then compute usage representation of each candidate variable at the target slot: insert a candidate node and connect it by inserting the LastUse, LastWrite and LastLexicalUse edges that would be used if the variable were to be used at this slot
  • use initial node representations, concatenated with an extra bit that is set to one for the candidate nodes
  • 8 time-steps propagating GG-NN, to get context and usage representation as the final states of those nodes
  • a linear layer that uses the concatenation of context and usage representations
  • train using a max-margin objective

Implementation insights

Few practical insides that were discovered, while building a model in TensorFlow include:

  • use of SparseTensors for representing adjacency list, in order to batch efficiently
  • represent batch-of-graphs as a one single graph \w disconnected components, in order to benefit from GPU parallelization

Summary, as a poster on ICLR 2018 conference by paper authors

CODE: MSR open sourced a generic GG-NN implementation on TensorFlow with example usage “on a simpler task”, so it does not include functions for building the “program graphs” for any of the two tasks above.

DATA: MSR has recently also published a dataset of graphs from the parsed source code used for this paper.

EXPOSITION: MSR members and original paper authors also did a really nice explanatory blog post, in particular on a graph construction part at Microsoft Research Blog.


One thing that makes this research particularly interesting, is to see how a new, revamped Microsoft + Open Source as a company managed to:

Here is an example of one of the gems from previous Build conference: Thinking for Programmers where Leslie Lamport, inventor of Paxos and developer of LaTeX introduces techniques and tools that help programmers think.

For Microsoft, this is not the first attempt to add AI features to its market-leading code editor product: i.e there is another VS extension called Developer Assistant from 2016. But this work, to the best of our knowledge, looks like a first one when such features are grounded in a published scientific research.

Here it is 🍾🍾🍾 for the more companies to follow this path of building products!

By the way, a good work does not need to happen only at the big companies: source{d} is a startup that has also published 3 academic papers over the last year in ML-on-Code field:

If you are interested in working on cutting-edge ML research and putting its results to production as an application for assisted code reviews — please come join us, source{d} is hiring!

This post was written by Alexander Bezzubov. Follow him on Twitter: @seoul_engineer.