Welcome to source{d} bi-weekly, a newsletter with the latest news, resources and events related to Code as Data and Machine Learning on Code. Sign up for source{d} bi-weekly newsletter.

An Analysis of the Kubernetes codebase

The Kubernetes community was in Seattle last week for the biggest KubeCon + CloudNativeCon ever. With 8,000 attendees and a very impressive list of sponsors, it seems obvious that the Kubernetes project has moved beyond the hype to widespread enterprise adoption. To test this assumption and identify emerging trends, we used source{d} Engine to retrieve and analyze all the Kubernetes git repositories through SQL queries. Learn More.

Number of public APIs in the Kubernetes project over time

source{d} News

Knowledge Graphs, Sequence Translation and Machine Learning on Code [Blog]
by Victor Coisne

This month, we organized our second in-person MLonCode Meetup in partnership with Neo4j and METIS. David Mack from Octavian.ai opened with a talk on how to get started with Machine Learning on graphs, and Francesc followed with a talk on Machine Learning on Code.

Announcing the schedule for the Go and MLonCode FOSDEM Devrooms [Blog]
by Francesc Campoy

At source{d} we’re big fans of FOSDEM. We like the conference so much that, for the second year, we’re organizing two FOSDEM Devrooms and flying in the whole company to attend in person. In case you missed it, the schedules for the Go and MLonCode Devrooms are now live!

The Case for Data-Driven Open Source Development [Blog]
by Eiso Kant

Every year the number of Open Source companies and developer communities continues to grow. Open Source is becoming the de facto standard for software development as companies realize the cost, agility and innovation benefits. There is, however, one major problem that needs to be addressed: the lack of standardized metrics, datasets, methodologies and tools for extracting insights from Open Source projects.

Machine Learning on Code in the Open Source Show [Video]
by Francesc Campoy

In this video, Francesc talks about ML-assisted code review (Lookout) and the Public Git Archive. You’ll learn how and why source{d} uses a dataset built from many GitHub repos, available as a public dataset, to train its models, and how “assisted code reviews” apply ML, image processing, and NLP concepts, like word2vec, to code.

Community News

source{d} in the top 10 open source technologies of 2018 [Article]
by Joseph Tsidulko

source{d} Engine takes a Code as Data approach, turning lines of code into actionable insights, while source{d} Lookout brings Machine Learning on Code to assisted code review.

Getting started with Machine Learning on Graphs [Blog]
by David Mack

In this blog post, David shares resources and approaches for getting started with Machine Learning on graphs. He shows a system that takes an English-language question, converts it into Cypher using a neural network, then runs that query against a Neo4j graph database to produce an answer.

Improving Automatic Source Code Summarization via Deep Reinforcement Learning [Research Paper]
by Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu and Philip S. Yu

Code summarization provides a high-level natural language description of the function performed by code, which can benefit software maintenance, code categorization and retrieval. In this paper, the authors incorporate an abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework (i.e., an actor-critic network).

Recurrent Neural Network for Code Clone Detection [Research Paper]
by Arseny Zorin and Vladimir Itsykson

Code clones are duplicated code that degrades software quality and hence increases maintenance cost. Many researchers have investigated techniques to automatically detect duplicate code in programs exceeding thousands of lines of code. In this paper, the authors propose an AI-based approach for detecting method-level clones in Java projects.

Kubernetes open-source project matures as commercialization accelerates [Article]
by James Kobielus

The Kubernetes open-source platform shows signs of maturing. According to recent source-code analysis by source{d}, the core Kubernetes codebase, currently in version 1.13, is stabilizing. The total number of contributions to the core Kubernetes project has slowed down in 2018. Commit velocity is decelerating for the core Kubernetes projects.


Events

January 11: source{d} Paper reading club (Online)

January 16: Assisted code review with source{d} Lookout (Online)

January 25: source{d} Paper reading club (Online)

February 1-2: source{d} talks at FOSDEM (Brussels, Belgium)

Featured Community Member

David Mack is co-founder of Octavian, a research lab aiming to create software that can reason over and understand information as well as a human can. Prior to Octavian, David co-founded SketchDeck, a Y Combinator-backed technology startup providing design as a service. Check out his website to see his impressive list of talks and projects. Make sure to follow David on Twitter @DavidHHMack to stay up to date with his latest publications and projects.