Welcome to source{d} bi-weekly, a newsletter with the latest news, resources and events related to Code as Data and Machine Learning on Code. Sign up for the source{d} bi-weekly newsletter.

Deduplicating files in Public Git Archive

This summer, we announced the release of Public Git Archive, a dataset with 3TB of Git data from the most starred repositories on GitHub. Now it’s time to tell how we tried to deduplicate files in the latest revision of the repositories in PGA using our research project for code deduplication, src-d/apollo.

Before diving deep, let’s quickly see why we created it. To the best of our knowledge, the only effort to detect code clones at massive scale was made by Lopes et al., who leveraged a huge corpus of over 428 million files in 4 languages to map code clones on GitHub (the DéjàVu project). They relied on syntactic features, i.e. identifiers (my_list, your_list, …) and literals ("foo", 42, …), to compute the similarity between pairs of files. PGA has fewer files in the latest (HEAD) revision - 54 million - and we did not want to give our readers a DéjàVu by repeating the same analysis. So we aimed at something different: detecting not only copy-paste between files, but also involuntary rewrites of the same abstractions. Thus we extracted and used semantic features from Universal Abstract Syntax Trees. Learn More
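The pairwise-similarity idea can be sketched with a toy example: represent each file as a bag of extracted features and compare bags with weighted Jaccard similarity. This is only an illustration, not apollo's actual pipeline — apollo scales the comparison with hashing techniques over millions of files, and the feature names below are invented:

```python
from collections import Counter

def weighted_jaccard(a: Counter, b: Counter) -> float:
    """Weighted Jaccard similarity between two bags of features."""
    keys = set(a) | set(b)
    inter = sum(min(a[k], b[k]) for k in keys)
    union = sum(max(a[k], b[k]) for k in keys)
    return inter / union if union else 0.0

# Toy "files" as bags of extracted features (hypothetical names).
file_a = Counter({"id:my_list": 3, "call:append": 2, "loop:for": 1})
file_b = Counter({"id:your_list": 3, "call:append": 2, "loop:for": 1})

sim = weighted_jaccard(file_a, file_b)
```

Two files that rename my_list to your_list but keep the same calls and loops still score well above zero — the kind of involuntary near-duplicate that a purely textual comparison would miss.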

source{d} News

Paper review: “Learning to Represent Programs with Graphs” by Alexander Bezzubov

“Learning to Represent Programs with Graphs” — a paper from the “Deep Program Understanding” group at Microsoft Research — was presented at ICLR 2018 earlier this year. This work is a particularly interesting example of research for a few reasons:

  • it has an interesting background, rooted in physics research,
  • it explores structured, graph-based representations,
  • it includes but goes beyond purely syntactic features,
  • the model has an official open source implementation (open science!),
  • and the approach was actually applied in industry, to build a real product.
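As a rough sketch of what a graph-based program representation looks like (hypothetical helper functions, not the paper's implementation), a snippet can be encoded as typed edges over its token stream: syntactic NextToken edges, plus semantic LastUse edges linking each variable occurrence back to its previous one:

```python
def next_token_edges(tokens):
    """Syntactic edges linking each token to its successor."""
    return [("NextToken", i, i + 1) for i in range(len(tokens) - 1)]

def last_use_edges(tokens, variables):
    """Semantic edges linking each variable use to its previous occurrence."""
    last_seen = {}
    edges = []
    for i, tok in enumerate(tokens):
        if tok not in variables:
            continue
        if tok in last_seen:
            edges.append(("LastUse", i, last_seen[tok]))
        last_seen[tok] = i
    return edges

# Token stream for the snippet: x = 1 ; y = x + x
tokens = ["x", "=", "1", ";", "y", "=", "x", "+", "x"]
edges = next_token_edges(tokens) + last_use_edges(tokens, {"x", "y"})
```

The paper feeds such typed edges into a gated graph neural network, so the model sees data flow (which `x` this use refers to) and not just the token sequence.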

Introduction to source{d} Engine and Lookout [Slides] by Eiso Kant

This talk is an intro presentation of both source{d} Engine and source{d} Lookout. By combining code retrieval, language-agnostic parsing, and Git management tools with familiar APIs, source{d} Engine simplifies code analysis. source{d} Lookout is a service for assisted code review that enables running custom code analyzers on GitHub pull requests.

Machine Learning on Code DevRoom at FOSDEM 2019 by Alexander Bezzubov

With recent advances in the AI/ML field, many new ambitious approaches to source code analysis have become possible. New projects are started and new papers are published almost daily, so we decided it’s the right time to step up again and help FOSDEM do what it does best: create focal points that bring together multiple communities in exciting fields. Please join us to enjoy a full day of talks, demos and interesting discussions on applications of Machine Learning to Source Code Analysis.

Comparison of Babelfish with alternative software projects [Documentation] by Denys Smirnov

A documentation page where Babelfish is briefly compared with related projects such as Kythe, Language Server Protocol, srclib, ctags, ANTLR, Tree-sitter, srcML and SmaCC. This page highlights the key differences between these projects and gives potential users a better understanding of the solutions available in the source code parsing landscape.

Community News

source{d} Engine: A Simple, Elegant Way to Analyze your Code [News Article] by Lizzie Turner

With the recent advances in machine learning technology, it is only a matter of time before developers can expect to run full diagnostics and information retrieval on their own source code. This can include autocompletion, auto-generated user tests, more robust linters, automated code reviews and more. I recently reviewed a new product in this sphere: the source{d} Engine.

Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces [Research Paper] by Jordan Henkel, Shuvendu K. Lahiri, Ben Liblit and Thomas Reps

In this paper, researchers from Microsoft Research and the University of Wisconsin-Madison use abstractions of traces obtained from symbolic execution of a program as a representation for learning word embeddings.
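A crude sketch of the underlying idea (with an invented trace vocabulary, not the authors' pipeline): treat each abstracted trace as a "sentence" of events and count how often events co-occur within a window — the statistic that word-embedding methods like word2vec implicitly factorize:

```python
from collections import Counter

def cooccurrence(traces, window=2):
    """Count how often two trace events appear within `window` positions."""
    counts = Counter()
    for trace in traces:
        for i, ev in enumerate(trace):
            for j in range(i + 1, min(i + 1 + window, len(trace))):
                counts[frozenset((ev, trace[j]))] += 1
    return counts

# Hypothetical abstracted symbolic traces: calls and branch conditions.
traces = [
    ["call:open", "check:ret==NULL", "call:fclose"],
    ["call:open", "check:ret==NULL", "ret:error"],
]
counts = cooccurrence(traces)
```

Events that consistently co-occur — here, `call:open` followed by a NULL check — end up with similar embeddings, which is what lets the learned vectors capture API-usage conventions.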

GitHub wants AI to help developers code [News Article] by Khari Johnson

GitHub is used by more than 30 million developers around the world and hosts repositories for some of the biggest ML-driven open source projects on the planet, but is perhaps less well known for the creation of ML-driven tools to help developers do their jobs. That’s starting to change.

An Empirical Investigation into Learning Bug-Fixing Patches in the Wild via Neural Machine Translation by Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk

In this paper, researchers performed an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. They first mined millions of bug-fixes from the change histories of GitHub repositories to extract meaningful examples of such bug-fixes.
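The mining step can be sketched as a line-level diff between consecutive revisions of a file, where each replaced hunk yields a (buggy, fixed) pair — a simplification of the paper's method, which works on abstracted method bodies rather than raw lines:

```python
import difflib

def changed_hunks(before, after):
    """Extract (buggy, fixed) line pairs from two versions of a file."""
    matcher = difflib.SequenceMatcher(None, before, after)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # lines rewritten by the fix commit
            pairs.append((before[i1:i2], after[j1:j2]))
    return pairs

# A toy bug-fix commit: assignment used where a comparison was intended.
before = ["if (x = 1) {", "  run();", "}"]
after  = ["if (x == 1) {", "  run();", "}"]
pairs = changed_hunks(before, after)
```

Collected at scale, such pairs form the parallel corpus on which a sequence-to-sequence translation model is trained to map buggy code to its fix.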

Upcoming Events

October 18th: source{d} Paper reading club (Online)

October 20th: Women Who Go workshop (South SF Bay, CA)

November 1st: Machine Learning on Code meetup (San Francisco, CA)

November 2nd: source{d} Paper reading club (Online)

November 7th: source{d} Online Meetup (Online)

November 15th: source{d} talk at Scale by the Bay (San Francisco, CA)

Featured Community Member

Fei-Fei Li is the director of the Stanford Artificial Intelligence Lab and an Associate Professor of Computer Science at Stanford. Her lab has produced some groundbreaking research, including the revolutionary ImageNet dataset and surrounding project. Make sure to follow her on Twitter @drfeifei to stay up to date with her latest Machine Learning publications and projects.