Welcome to source{d} bi-weekly, a newsletter with the latest news, resources and events related to Code as Data and Machine Learning on Code. Sign up for the source{d} bi-weekly newsletter.

The Public Git Archive Story

Public Git Archive is the result of months of effort curating a dataset suitable for training Machine Learning on Source Code (aka MLonCode) models. It contains 182,000 top-starred repositories on GitHub and takes 3 TB on disk. The repositories were cloned in February-March 2018. Check out the announcement post for more information. You should also check out Engine, which lets you run SQL queries on top of the PGA and do other cool things. Learn More.

source{d} News

Why we chose advanced scientific data format for ml models [Blog]
by Vadim Markovtsev

Advanced Scientific Data Format (ASDF) is a next-generation serialization format for scientific data. This means that it focuses on storing sparse and dense tensors in an efficient way. The ASDF project was started by Perry Greenfield (astropy), Michael Droettboom (matplotlib; astropy) and Erik M. Bray (astropy) at the Space Telescope Science Institute in 2014.

Celebrating FLOSS in ML at FOSDEM 2019 [Blog]
by Alex Bezzubov

Applying ML to find patterns in source code is a cutting-edge research topic in academia and industry known as ML on Code. But there cannot be any ML without data, and that is where FLOSS projects present a trove of opportunities.

MLonCode San Francisco Meetup recap [Blog]
by Victor Coisne

On November 1st, we hosted our first Machine Learning on Code San Francisco meetup at Holberton School. We had a pretty good turnout for a first event and got really good feedback from participants. Here is a quick post including talk abstracts, slides and video recordings!

The Road to Semantic Indexing: An introduction to the Kythe project [Video]
by Michael Fromberger

Today, large-scale semantic indexing is not (yet) widely available, and indeed has only relatively recently become practical at all. The Kythe project is one attempt to address this: Kythe is an open-source, mostly language-agnostic semantic indexing schema, based on a Google-internal project called Grok that was founded by Steve Yegge in 2008.

Cheers to small beginnings! [Blog]
by Ricardo Baeta

In this article we cover the process that sparked the source{d} rebranding, what we have achieved so far, and the road ahead of us.

Community News

Search-Based Generalization and Refinement of Code Templates [Research Paper]
by Tim Molderez and Coen De Roover

Several tools support code templates as a means to specify searches within a program’s source code. Despite their ubiquity, code templates can often prove difficult to specify, and may produce too many or too few match results. In this paper, we present a search-based approach to support developers in specifying templates.

The Case for Learned Index Structures [Research Paper]
by Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis

In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records.

Advanced Deep Learning & Reinforcement Learning [Video]
by DeepMind

This course, taught originally at UCL and recorded for online access, has two interleaved parts that converge towards the end of the course. One part is on machine learning with deep neural networks, the other part is about prediction and control using reinforcement learning.

Amazon opens its internal machine learning courses to all for free [Article]
by Connie Loizos

AWS just announced that it has made available, for free, the same machine learning courses that it uses to train its own engineers. It’s a lot of information to digest — there are more than 45 hours across 30 different courses that developers, data scientists, data platform engineers and business professionals can take gratis.


Events

November 30th: source{d} paper reading club (Online)

December 5th: source{d} talk at ML conference (Berlin, Germany)

December 6th: source{d} talk at Metis Data Science (San Francisco, CA)

December 8th: source{d} talk at DevFest (Lisbon, Portugal)

December 11-14: source{d} at KubeCon / CloudNativeCon (Seattle, WA)

December 14th: source{d} paper reading club (Online)

Featured Community Member

Jürgen Cito is a postdoc at MIT CSAIL. He received a PhD in February 2018 from the University of Zurich (Switzerland) and was a research intern at IBM Watson Research Center in New York. Prior to his PhD, he received his master's and bachelor's degrees in Computer Science from the Technical University of Vienna. Make sure to follow Jürgen on Twitter @citostyle to stay up to date with his latest publications and projects.