ML on Code is the field of research that attempts to capture statistical properties/patterns in source code using ML models,and apply those patterns for software engineering tasks/tools. Very much like NLP with text embedding, the question of obtaining useful vector representation of programs is at the heart of ML on Code research.

2018 was a fruitful year for ML on Code: several different approaches for building such representations were published, each producing state-of-the-art result on one of the following tasks: code summarization (predicting function names) and algorithm classification (“Does this function implement a QuickSort?”).

Researchers have moved beyond looking at the code as plain text and are using structured representations of the program from different stages of the compilation process, as is common in more general Programming Languages literature. We previously examined the graph-based approach published by MSR in a review of  Learning to Represent Programs with Graphs where ASTs augmented with types, data-flow, and control-flow information were used as input for a graph-based model called Gated Graph Neural Network (GGNN).

Today we are going to look at code2seq: Generating Sequences from Structured Representations of Code.

This is the latest work of the research group at the Computer Science Faculty of the Technion university that in two previous publications—A General Path-Based Representation for Predicting Program Properties and code2vec: Learning Distributed Representations of Code—came up with a different approach to build program representations suitable for machine learning: sampling paths in ASTs.

This paper is also a good example of reproducible research which contributes to the rise of  publication quality in this area. It includes:

- the paper, published at ICLR'2019,

- the website, with visualization examples at http://code2seq.org,

- the code https://github.com/tech-srl/code2seq, and

- the datasets (C# and Java) https://github.com/tech-srl/code2seq#datasets

The authors have put an impressive effort to make the results “digestible” and reproducible and put it one step forward, towards a web-first, participatory science 👏.

Their efforts have allowed us to create this notebook, so you can get a deeper first-hand experience and play with this model easily and quickly.

Tasks

The family of tasks this paper focuses on can be described as: given a piece of code, generate a sequence of words describing that code. There are different ways to attack this problem, and thus it was broken down by the authors into 3 specific tasks from the literature: Code Summarization, Code Captioning, and Code Documentation.

Code Summarization

This task—also known in the literature as “Method Naming”— is an example of “extreme summarization”: given a function/method body, the model is asked to propose a short, descriptive name for the function.

That is a supervised task where the function names, as ground-truth labels, can be mined from open-source code. The function names are represented as a sequence of case-insensitive sub-tokens, split by regex following 2 common naming conventions. That way setMaxConnectionsPerServer is predicted as the sequence [set, max, connections, per, server]. That is also one of the major differences with their previous code2vec work, where labels in the target vocabulary were not tokenized and thus resulted in much bigger vocabulary size. Previous research done by source{d} ML team also suggest that incorporating more advanced tokenization methods could further reduce the vocabulary size by 60%.

Dataset: java-small, java-med, java-large

Measure: Precision, Recall, F1

Baselines: Models specifically designed for code: ConvAttention (Allamanis et al., 2016), CRFs (Alon et al., 2018), code2vec (prev. work by the same authors) and TreeLSTM (Tai et al., 2015) as well as multiple NMT models: bidi-LSTM and Transformer (Vaswani et al., 2017) were used.

Code Captioning

This task is about predicting a natural language sentence given a short C# code snippet.

Dataset: 66,015 pairs of Q&A from StackOverflow, C# (Iyer et al., 2016)

Measure: BLEU score

Baseline: Some recent structural approaches of CodeNN, TreeLSTMs with attention, as well as seq2seq architectures from NMT of 2-layer bidi-LSTMs with attention, and the Transformer were chosen as a baseline for this task.

Code Documentation

An additional task—generating source code comments from the method-level documentation (JavaDoc)— was added to the paper in the later sections in order to make the results directly comparable to Deep Code Comment Generation. The code2seq architecture is reported to have 62% relative gain of the BLEU score over that prior work.

Datasets

A new Java programming language dataset was collected from Github and published by the authors.

Table 1 from the original paper)

Main Idea

The main idea is quite similar to code2vec: leverage the syntactic structure of programming languages to get a vector representation of the code. This approach is language-independent: after the AST extraction step, the same preprocessing and training will work for different languages, although no comparison of the quality of results across different languages has been published yet.

Architecture

Compared to code2vec, the main difference is that in code2seq, authors used the seq2seq approach as inspiration for the model architecture, so it is formulated in the Encoder/Decoder framework with an attention mechanism.

Figure 3 from the original paper

The training process consists of the following steps:

  • Parse text to AST
  • Extract syntactic paths: terminal-to-terminal paths in the AST (capped at a maximum length of 9)
  • Create a Bag of Contexts: sample N such paths (N=200), and use them for further steps
  • Model Encoder
  • learn embeddings:  for paths, using LSTM hidden state; for terminal nodes, as a sum of sub-token embeddings
  • Combine those to a single vector: concatenate and apply a fully connected layer
  • Model Decoder
  • Attention-based: look at all N paths + prev. hidden state
  • Predicts next sub-token (capped at a maximum length of 6)

Although this learned representation could also be further used as pre-training for downstream tasks, to the best of our knowledge no results were reported on the performance of doing that for any such additional tasks so far.

Evaluation

For the Code Summarization task of method name generation, the code2seq model outperforms the baselines on precision/recall/F1 metrics.

On top of that, all models were compared for data efficiency by reporting results on the 3 different datasets of increasing sizes. The code2seq architecture consistently outperforms all baselines on both small and large datasets.

Method sensitivity to input length was also reported for all the models, with code2seq having the smallest drop in quality for snippets of code up to the length of 30 lines.

For the Code Captioning task where a full natural language sentence should be predicted given a short C# code snippet, code2seq model outperforms previous work with 23.04 BLEU: an improvement of 2.51 points (12.2% relative) over 20.53, set in 2016 by CodeNN.

An ablation study was also conducted in order to understand the relative importance of the model components on the code summarization task, using the validation set:

Caption: Table 3 from the original paper

One can see that the biggest performance improvements on top of having tokens and AST nodes come from a new architecture with its sequential decoder and the token splitting.

Conclusion

Path-based vector representation using compositional AST path encoding looks very promising,  and is being actively researched by multiple groups. The code2vec/seq approach with its Bag Of Context representation is more robust than previous attempts of modeling the syntactic structure of the source code. Until very recently, this approach also held state of the art (SotA) results on several tasks related to code summarization, such as MethodNaming.

However, despite the advantages of being language-independent and requiring only source code parsing (and in particular, not requiring a partial compilation), path-based approaches still seem to require a task-specific architecture and have no clear path to incorporating more structural information beyond syntax e.g. an interprocedural control flow that spans multiple files.

Recently, new studies about Graph Neural Networks have been published—specifically, Structured Neural Summarization—that make it possible to directly compare both paradigms (graph based approaches and path based approaches) on exactly the same dataset. By the end of 2018, GGNN-based architecture set a new SotA result on the Method Naming task.

Based on these results, a graph-based approach, with its capacity for relational inductive bias and ability to incorporate arbitrary relational information, seems to be the most promising direction to a unified architecture capable of handling structured representations of source code on several different tasks. But exploring that idea further is subject for another blog post!

Meanwhile, please stay tuned for more paper reviews! And, in the spirit of web-first participatory science, please go ahead and try to reproduce the results using this notebook containing original code from the authors of the paper.

Acknowledgements: this post would not be possible without the support from source{d} and amazing colleagues. In particular, thanks to Jan Hula and Hugo Mougard from ML team and M. J. Fromberger, who is leading Language Analysis team, for their invaluable feedback!

This post was written by Alex Bezzubov . Follow him on Twitter: @seoul_engineer.

More about source{d} and MLonCode