Why we chose Advanced Scientific Data Format for ML models

What ASDF is, why it is awesome, why you should probably use it, and how. Why we adopted ASDF in source{d} ML projects.

Cheers to small beginnings!

A closer look at source{d}’s rebranding.

The Public Git Archive Story

We have recently released a large dataset for MLonCode: Public Git Archive (PGA). It contains 182,000 top-starred repositories on GitHub and takes up 3 TB on disk. This post tells the story of how PGA emerged: why, how, and what’s next.

Splitting Millions of Source Code Identifiers with Deep Learning

The Machine Learning team at source{d} wrote another paper this spring, which was presented at the ML4P workshop in Oxford. It compares different ML models for splitting source code identifiers into their constituent parts, e.g. ‘foobar’ is split into ‘foo’ and ‘bar’. This post summarizes our paper.

Paper review: “Learning to Represent Programs with Graphs”

A review of the recent ML-on-Code paper from Microsoft Research.

Deduplicating files in Public Git Archive

We describe how we ran apollo on PGA to find communities of duplicate files.

Paper review: “Lessons from Building Static Analysis Tools at Google”

Review of a recent scientific paper by Google on the experience of building large-scale static analysis tools.

Machine Learning on Git: introducing Hercules v4

Hercules is an open source project started in late 2016 with the goal of speeding up the collection of line burndown statistics from Git repositories. It has since grown into a general-purpose Git repository mining framework with several cool use cases: ownership through time, file and people embeddings, structural hotness, and even comment sentiment estimation. This post presents the latest ‘v4’ release of Hercules and gives some insights into how Git works.

Announcing Public Git Archive

Announcing Public Git Archive, the largest dataset of Git repositories in the world.

Detecting licenses in code with Go and ML

Detecting the license of an open source project is harder than it seems. We have created go-license-detector, a Go library and command line application that solves this task.
