Welcome to source{d} bi-weekly, a newsletter with the latest news, resources and events related to Code as Data and Machine Learning on Code. Sign up for the source{d} bi-weekly newsletter.

source{d} is Officially Cool (According to Gartner!)

We are excited to announce that we are cool, at least according to the latest Gartner “Cool Vendors in Application Development and Platforms” report. Learn More.

source{d} News

Data retrieval pipeline at source{d} [Blog]
by Alex Bezzubov and Javier Fontan

Data collection and processing might be less sexy than Machine Learning, but they are nevertheless crucial for any progress, and they are something that source{d} as a company was built upon and has invested a lot in. They were briefly highlighted in several conference talks (go-git, gitbase, gitbase indexes). Now it is time for a full-length blog post with the details.

What's new in the latest source{d} releases [Slides]
by Victor Coisne

We recently announced source{d} 0.11, 0.12 and 0.13, three releases with lots of new features and performance improvements. From Windows support to port management, C# language support and new SQL querying, there is a lot for you to get excited about. We also discussed why you should care about Engineering Observability and what some of the top use cases for source{d} in enterprises are.

style-analyzer: fixing code style inconsistencies with interpretable unsupervised algorithms [Research Paper]
by Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, Egor Bulychev

Source code reviews are manual, time-consuming, and expensive. Human involvement should be focused on analyzing the most relevant aspects of the program, such as logic and maintainability, rather than amending style, syntax, or formatting defects. Some tools with linting capabilities can format code automatically and report various stylistic violations for supported programming languages. They are based on rules written by domain experts; hence, their configuration is often tedious, and it is impractical for the given set of rules to cover all possible corner cases. Some machine learning-based solutions exist, but they remain uninterpretable black boxes.

How IT conferences can be better for speakers [Blog]
by Vadim Markovtsev

As in many other startups, sometimes you fuse several roles together. In my case, they are a machine learning engineer (official) and a developer relations grunt (unofficial). source{d}'s policy for public speaking and other advocacy has always been very encouraging, and our employees stand at the front of the IT stage quite often. I've personally spoken more than 30 times since I joined in mid-2016. I enjoyed some of those talks, and I went through some bizarre ones. This post reflects my excitements and hiccups, and gives advice to future software conference organizers on how to improve their speakers' experience.

Identifying collaborators in large codebases [Research Paper]
by Waren Long, Vadim Markovtsev, Hugo Mougard, Egor Bulychev, Jan Hula

The way developers collaborate inside and particularly across teams often escapes management's attention, despite a formal organization with designated teams being defined. Observability of the actual, organically formed engineering structure provides decision makers with invaluable additional tools to manage their talent pool. To identify existing inter- and intra-team interactions - and suggest relevant opportunities for suitable collaborations - this paper studies contributors' commit activity, usage of programming languages, and code identifier topics by embedding and clustering them.

Community News


Graph Matching Networks for Learning the Similarity of Graph Structured Objects [Research Paper]
by Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, Pushmeet Kohli

Using abstract interpretation to build a scalable tool from scratch is a daunting engineering task that generally requires a protracted development effort led by an expert. To streamline that process, we built SPARTA, a C++ library of software components for building high-performance static analyzers that can run in a production environment. SPARTA provides the building blocks (a set of components that have a simple API, are highly performant, and can be easily assembled) so an engineer can focus solely on the logic that extracts the desired information from the program.

What the Vec? Towards Probabilistically Grounded Embeddings [Research Paper]
by Carl Allen, Ivana Balaževic and Timothy Hospedales

Word2Vec (W2V) and Glove are popular word embedding algorithms that perform well on a variety of natural language processing tasks. The algorithms are fast and efficient, and their embeddings are widely used. However, despite their ubiquity and the relative simplicity of their common architecture, what the embedding parameters of W2V and Glove learn, and why that is useful in downstream tasks, largely remains a mystery. We show that different interactions of PMI vectors encode semantic properties that can be captured in low dimensional word embeddings by suitable projection, theoretically explaining why the embeddings of W2V and Glove work, and, in turn, revealing an interesting mathematical interconnection between the semantic relationships of relatedness, similarity, paraphrase and analogy.

Events

June 1st: Machine Learning for Software Engineering (Montreal, Canada)

June 17-19th: ML conference (Munich, Germany)

May 31st: source{d} paper reading club (Online)

June 14th: source{d} paper reading club (Online)

Featured Community Member

Dr. Margaret-Anne Storey is a Professor of Computer Science at the University of Victoria. She holds a Canada Research Chair in Human and Social Aspects of Software Engineering and the Lise Meitner Guest Professorship at Lund University in Sweden. Make sure to follow Margaret on Twitter @margaretstorey or visit her website to stay up to date with her latest publications.