Welcome to source{d} bi-weekly, a newsletter with the latest news, resources and events related to Code as Data and Machine Learning on Code. Sign up for the source{d} bi-weekly newsletter.

Import2vec - Learning Embeddings for Software Libraries [Paper Review]

Dependencies are becoming crucial to managing complexity, reducing technical debt, ensuring compliance, addressing security issues and many other critical tasks for most software companies. Bart Theeten, Frederik Vandeputte and Tom Van Cutsem, at Nokia Bell Labs in Antwerp, Belgium, have recently published a paper at MSR 2019 to help solve this problem.

source{d} News

source{d} EE: The data platform for the Software Development Life Cycle [Ebook]
by Victor Coisne

source{d}’s premise is that enterprises should have visibility across their SDLC to enable better decision making. The source{d} platform discovers, extracts, transforms and loads this data to be quickly analyzed with modern Data Science and Machine Learning techniques.

code2seq: generating sequences from structured representations of code [Blog]
by Alex Bezzubov

Today we are going to look at code2seq: Generating Sequences from Structured Representations of Code. This is the latest work from the research group at the Technion that, in two previous publications (A General Path-Based Representation for Predicting Program Properties and code2vec: Learning Distributed Representations of Code), came up with a different approach to building program representations suitable for machine learning: sampling paths in ASTs.
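To give a rough feel for what "sampling paths in ASTs" means, here is a minimal sketch that pairs up terminals in a Python syntax tree and records the chain of node types connecting them. It uses Python's standard ast module, not the authors' extractor; the helper names (terminal_token, collect_terminals, path_contexts), the choice of which nodes count as terminals, and the max_length cutoff are all assumptions made for illustration.

```python
import ast
from itertools import combinations


def terminal_token(node):
    """Return a token for nodes treated as terminals here, else None (assumed choice)."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.arg):
        return node.arg
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return None


def collect_terminals(node, ancestors=()):
    """Yield (token, root-to-terminal tuple of nodes) for every terminal under node."""
    trail = ancestors + (node,)
    token = terminal_token(node)
    if token is not None:
        yield token, trail
        return  # terminals are not descended into
    for child in ast.iter_child_nodes(node):
        yield from collect_terminals(child, trail)


def path_contexts(source, max_length=8):
    """Return (token, path of node type names, token) for each pair of terminals."""
    terminals = list(collect_terminals(ast.parse(source)))
    contexts = []
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(terminals, 2):
        # Count the ancestors shared by both terminals (at least the Module root).
        common = 0
        while (common < min(len(trail_a), len(trail_b))
               and trail_a[common] is trail_b[common]):
            common += 1
        # Walk up from the first terminal to the lowest common ancestor, then down.
        up = [type(n).__name__ for n in reversed(trail_a[common:])]
        top = type(trail_a[common - 1]).__name__
        down = [type(n).__name__ for n in trail_b[common:]]
        path = tuple(up + [top] + down)
        if len(path) <= max_length:
            contexts.append((tok_a, path, tok_b))
    return contexts


if __name__ == "__main__":
    for context in path_contexts("x = y + 1"):
        print(context)
```

In this sketch every pair of terminals is enumerated; a model would typically be fed only a random sample of these path contexts per snippet, which keeps the input size manageable on larger functions.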

MSR Interview #7: Massimiliano Di Penta [Blog]
by Victor Coisne

The seventh episode of the source{d} MSR Interview blog series is out. This time we interview Massimiliano Di Penta, associate professor at the Department of Engineering of the University of Sannio in Benevento.

source{d} datasets, a blog series [Blog]
by Vadim Markovtsev

We've always given back to the community at source{d}. Our data engineers have done an incredible job at fetching repositories from GitHub and packaging them into something portable and easily usable, so that MLonCode researchers or otherwise interested folks can avoid the nightmare of running a custom Git retrieval pipeline.

Community News

Program Understanding, Synthesis & Verification with Graph Neural Networks [Slides]
by Alex Polozov

Graph-structured representations are widely used as a natural and powerful way to encode information such as relations between objects or entities, interactions between online users (e.g., in social networks), etc. Learning and reasoning with graph-structured representations is gaining increasing interest in both academia and industry, due to its fundamental advantages over more traditional unstructured methods in supporting interpretability, causality, transferability, etc.

Program Synthesis & Semantic Parsing with Learned Code Idioms [Research Paper]
by Richard Shin, Miltiadis Allamanis, Marc Brockschmidt, Oleksandr Polozov

In this work, the authors present PATOIS, a system that allows a neural program synthesizer to explicitly interleave high-level and low-level reasoning at every generation step. It accomplishes this by automatically mining common code idioms from a given corpus, incorporating them into the underlying language for neural synthesis, and training a tree-based neural synthesizer to use these idioms during code generation.

Modeling Vocabulary for Big Code Machine Learning [Research Paper]
by Hlib Babii, Andrea Janes, Romain Robbes

This paper lists important modeling choices for source code vocabulary, and explores their impact on the resulting vocabulary on a large-scale corpus of 14,436 projects. The authors show that a subset of decisions have decisive characteristics, making it possible to train accurate Neural Language Models quickly on a large corpus of 10,106 projects.


Events

September 6th: source{d} paper reading club (Online)

September 16th: UseDataConf (Moscow, Russia)

September 19-20th: Open Core Summit (San Francisco, US)

October 9-11th: DevFest (Nantes, France)

Featured Community Member

Massimiliano Di Penta is an associate professor at the Department of Engineering of the University of Sannio in Benevento (Italy). He is a member of the IEEE, of the IEEE Computer Society, and of the ACM. Make sure to follow Massimiliano on Twitter or visit his website to stay up to date with his latest publications.