Welcome to source{d} bi-weekly, a newsletter with the latest news, resources, and events related to Code as Data and Machine Learning on Code. Sign up for the source{d} bi-weekly newsletter.

Splitting millions of source code identifiers with Deep Learning

If you grab our Public Git Archive dataset of almost 180,000 Git repositories, take the latest revision of each, and extract all the identifiers (e.g. variable, function, and class names), you end up with close to 60 million unique strings. They include “FooBar”-, “foo_bar”-, and “foobar”-like concatenations of indivisible identifiers, or “⚛ atoms” as we sometimes call them. Several problems we have worked on need the number of distinct atoms to be as small as possible, for both performance and quality; they include topic modeling of GitHub repositories, identifier embeddings, and even the recent study of file duplication on GitHub. So we decided to reduce that number by carefully splitting the original concatenations. The result was a 64% reduction in atom vocabulary. Learn More.

The architecture of the BiLSTM network to split identifiers.
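
To see why a learned splitter is worth the effort, here is a minimal sketch of a purely rule-based splitter. It is not source{d}'s implementation: the function names `heuristic_split` and `atom_vocabulary` are hypothetical, and the rules (split on case changes, digits, and non-alphanumeric separators) are an assumption chosen for illustration. It handles “FooBar” and “foo_bar”, but leaves “foobar” untouched, which is the kind of case the BiLSTM network is meant to cover.

```python
import re

# Hypothetical rule-based tokenizer (illustration only, not source{d}'s code).
# Alternatives, in order: an acronym followed by a capitalized word,
# a lowercase or Capitalized word, a remaining uppercase run, a digit run.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def heuristic_split(identifier):
    """Split an identifier on case boundaries, digits, and separators."""
    return [m.group(0).lower() for m in _TOKEN.finditer(identifier)]


def atom_vocabulary(identifiers):
    """Collect the distinct atoms produced by splitting every identifier."""
    atoms = set()
    for identifier in identifiers:
        atoms.update(heuristic_split(identifier))
    return atoms


if __name__ == "__main__":
    names = ["FooBar", "foo_bar", "bar_foo", "foobar"]
    for name in names:
        print(name, "->", heuristic_split(name))
    # "foobar" carries no casing or separator hint, so it stays one atom.
    print(sorted(atom_vocabulary(names)))
```

In this toy input, four unique strings collapse to three atoms; the unsplit “foobar” is what keeps the vocabulary from shrinking to two.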

source{d} News

Michael Fromberger joins source{d} to lead Language Analysis efforts [Blog]
by Victor Coisne

Michael Fromberger, former senior software engineer at Google, and technical lead/manager for the Kythe open source project (internally named Grok at Google), will join source{d} to lead its Language Analysis efforts.

Submit talks to our FOSDEM 2019 Devrooms [Blog]
by Maartje Eyskens and Alex Bezzubov

For the second year in a row, source{d} is organizing two FOSDEM Devrooms: Go and Machine Learning on Code. Check out this blog post for more information about the CfP and how to get involved.

It’s a wrap! source{d}’s participation in Hacktoberfest [Blog]
by Victor Coisne

We’re happy to share that over the past 30 days we’ve seen 112 pull requests opened and 50 merged across all of our repos. These PRs were submitted by approximately 45 unique contributors, 13 of whom won one of our brand new t-shirts!

source{d} Engine in five minutes [Video]
by Francesc Campoy

Check out this five-minute video to learn everything you need to know about source{d} Engine and how to get started easily.

Come talk to us at the following events! [Blog]
by Victor Coisne

One of our favorite activities at source{d} is engaging with our users. We invite you to reach out if you are attending one of these events and want to chat with us about Code as Data, Machine Learning on Code, or Open Source!

Community News

The Road to Semantic Indexing [Article]
by Michael Fromberger

Semantic indexing is a powerful technique for understanding relationships within code. Like a textual index, a semantic index maps documents (in this case, source files) to terms—but in addition to lexical structures, the terms of a semantic index also include program entities defined by the source language—such as types, functions, and variables.
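
As a rough illustration of the idea, the sketch below builds a toy semantic index for Python source using the standard `ast` module. It is not Kythe's schema or API: the function `semantic_terms` and the term kinds (`"type"`, `"function"`, `"variable"`) are hypothetical names chosen for this example. The point is only that the terms come from program entities found by parsing, not from raw text.

```python
import ast


def semantic_terms(source):
    """Return the (kind, name) program entities defined in Python source.

    Hypothetical helper for illustration; a real semantic index would also
    record references, locations, and cross-file relationships.
    """
    terms = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            terms.add(("function", node.name))
        elif isinstance(node, ast.ClassDef):
            terms.add(("type", node.name))
        elif isinstance(node, ast.Assign):
            # Only simple name targets are recorded as variables.
            for target in node.targets:
                if isinstance(target, ast.Name):
                    terms.add(("variable", target.id))
    return terms


def build_index(files):
    """Map each document (file name) to the semantic terms it defines."""
    return {name: semantic_terms(text) for name, text in files.items()}


if __name__ == "__main__":
    files = {"shapes.py": "class Shape:\n    pass\n\ndef area(shape):\n    return 0\n\nunit = 'cm'\n"}
    for name, terms in build_index(files).items():
        print(name, sorted(terms))
```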

Google launches AI Hub and Kubeflow Pipelines [Article]
by Khari Johnson

Google Cloud announced the launch of Kubeflow Pipelines to foster collaboration within businesses and further democratize access to artificial intelligence. Kubeflow Pipelines is available for free and is being open-sourced.

Neural Translation Model for Learning Source Code Changes [Research Paper]
by Saikat Chakraborty, Miltiadis Allamanis, Baishakhi Ray

In this paper, the authors propose a novel Tree2Tree Neural Machine Translation system to model source code changes and learn code change patterns from the wild.

Open Vocabulary Learning on Code with a Graph-Structured Cache [Research Paper]
by Milan Cvitkovic, Badal Singh, Anima Anandkumar

Code is written using an open, rapidly changing vocabulary. Most NLP methods are not designed to reason over such a vocabulary. The authors of this research paper introduce a Graph-Structured Cache to address the problem.

Getafix: How Facebook tools learn to fix bugs automatically [Blog]
by Johannes Bader, Satish Chandra, Eric Lippert and Andrew Scott

Modern production codebases are extremely complex and are updated constantly. To create a system that can automatically find fixes for bugs without help from engineers, Facebook built a tool that learns from engineers’ previous changes to the codebase.


Upcoming Events

November 15th: source{d} talk at Scale by the Bay (San Francisco, CA)

November 16th: Deep Learning for Programming Language Type Inference (Online)

November 17th: source{d} talk at DevFest (Krasnodar, Russia)

November 18th: source{d} talk at GOTO night (Copenhagen, Denmark)

November 21st: source{d} talk at GOTO conference (Copenhagen, Denmark)

November 28th: The Road to Semantic Indexing: An introduction to Kythe (Online)

December 5th: source{d} talk at ML conference (Berlin, Germany)

December 8th: source{d} talk at DevFest (Lisbon, Portugal)

Featured Community Member

Holden Karau is an Open Source Big Data Developer Advocate at Google, focusing on improvements to Apache Spark's Core, ML, and Python components. Holden is also an Apache Spark Committer and PMC member. Check out her website to see her impressive list of talks and projects. Make sure to follow Holden on Twitter @holdenkarau to stay up to date with her latest publications and projects.