Welcome to source{d} bi-weekly, a newsletter with the latest news, resources, and events related to Code as Data and Machine Learning on Code. Sign up for the source{d} bi-weekly newsletter.

Assisted Code Review with source{d} Lookout

Ensuring that a codebase is consistent in style is both hard and costly, yet it is extremely important for maintainability and to reduce technical debt. This problem is one of the many pain points we are currently tackling with source{d} Lookout, our brand new assisted code review framework.

The purpose of source{d} Lookout is to bring assisted code review to anyone in a way that is easy to set up, use, and extend. To achieve that, source{d} Lookout watches GitHub repos and triggers a set of analyzers when new code is sent for review or pushed. Watch the video recording and slides.

source{d} News

MSR Interview #1: Abram Hindle [Blog]
by Victor Coisne

This article is the first episode of our MSR Interview blog series. This week, we’re publishing our interview with Abram Hindle, a professor in the Computer Science department at the University of Alberta in Edmonton, AB, and a member of the program committee for the MSR’18 Technical papers.

Scale By The Bay 2018: Machine Learning on Source Code [Video]
by Francesc Campoy

In this video, Francesc talks about Machine Learning techniques that can be applied to source code: embeddings over identifiers, structural embeddings over source code, measuring how similar two fragments of code are, recurrent neural networks for code completion, and future directions of research in this field.

Swivel Algorithm and its applications [Slides]
by Konstantin Slavnov

These are the slides from a talk Konst gave at the Moscow Python meetup group about Machine Learning applied to source code.

The Road to Semantic Indexing: An introduction to the Kythe project and schema [Video]
by Michael Fromberger

Kythe is an open-source, mostly language-agnostic semantic indexing schema based on a Google-internal project called Grok, which was founded by Steve Yegge in 2008. Kythe was released on GitHub as an open-source project in early 2015. Since 2017, Kythe has provided semantic cross-references for Google's internal codebase, which spans millions of lines of code across over a dozen different languages, as well as for the open-source Chromium project.

Community News

Source{d} turns code into actionable insights [Podcast]
by the Changelog

Adam caught up with Francesc Campoy at KubeCon + CloudNativeCon 2018 in Seattle, WA to talk about the work he’s doing at source{d} to apply Machine Learning to source code and turn codebases into actionable insights.

Looking Back at Google’s AI Research Efforts in 2018 [Blog]
by Jeff Dean

In this blog post, the author highlights some of Google's AI efforts from 2018, including fundamental computer science research results and publications, the application of that research to emerging areas new to Google (such as healthcare and robotics), open source software contributions, and strong collaborations with Google product teams, all aimed at providing useful tools and services.

The Adverse Effects of Code Duplication in ML Models of Code [Research Paper]
by Miltiadis Allamanis

In this paper, the author studies the effect of code duplication on machine learning models, showing that reported metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora, which more accurately represent how machine learning models of code are used by software engineers.

Automatically assessing vulnerabilities found in compositional analysis [Research Paper]
by Saahil Ognawala, Ricardo Nales Amato, Alexander Pretschner, Pooja Kulkarni

In this paper, the authors present a framework to analyze vulnerabilities discovered by an existing compositional analysis tool and assign CVSS3 (Common Vulnerability Scoring System v3.0) scores to them, based on various heuristics such as interaction with related components, ease of reachability, complexity of design and likelihood of accepting unsanitized input.

Neural Code Comprehension: A Learnable Representation of Code Semantics [Research Paper]
by Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler

In this paper, the authors propose a novel processing technique to learn code semantics and apply it to a variety of program analysis tasks. In particular, they stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, they define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language.

Events


January 25th: source{d} Paper reading club (Online)

January 26th: source{d} talk at DevFest (San Francisco, CA)

February 2-3: source{d} talks at FOSDEM (Brussels, Belgium)

February 2nd: source{d} beer Payback (Brussels, Belgium)

February 8th: source{d} Paper reading club (Online)

February 11th: source{d} talk at GopherCon Israel (Tel Aviv, Israel)

Featured Community Member

Abram Hindle is a professor in the CS department at the University of Alberta and part of the program committee for the MSR’18 Technical papers. He researches software engineering, mining software repositories, software process recovery, and Green Mining. Check out his website to see his impressive list of papers and projects. Make sure to follow Abram on Twitter @abramh to stay up to date with his latest publications and projects.