At source{d}, we are proud sponsors and attendees of conferences for researchers in the fields of software engineering, programming language analysis, and machine learning on source code. One of the conferences that really stands out for us is Mining Software Repositories (MSR), for the consistent quality of its content over the years.

This year, in addition to attending, speaking, and sponsoring, Hugo, Waren, and I decided to write blog posts about our favorite research papers presented at MSR 2019. Antoine Pietri spoke about the Software Heritage Graph Dataset (SHGD) in the Large-Scale Mining track on the first day. Software Heritage is a non-profit organization based in France that aims to archive all the software source code in the world, both freely licensed and not. Here are the links to the presentation and the preprint.

SHGD is, to our knowledge, the largest publicly available dataset of source code metadata. It contains essentially everything about the world's open-source code except the code itself. For example, for every fetched Git repository there are commits, trees, and so on, together with all their hashes, but there are no blobs. Even without them, the dataset is already huge: ~1TiB. The stored entities form a Directed Acyclic Graph (DAG) in which every node's hash depends on its children. Such a DAG is called a Merkle DAG, by analogy with the Merkle Trees we described in an earlier blog post. Since each node is identified by its hash, the whole graph deduplicates naturally, and that's exactly what Software Heritage did.
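The Merkle construction can be sketched in a few lines of Python (a toy layout, not the actual Software Heritage schema): each node's identifier is a hash over its children's identifiers, so identical subtrees hash to the same node and only need to be stored once.

```python
import hashlib

def blob_id(data: bytes) -> bytes:
    """Content-address a file by its bytes."""
    return hashlib.sha1(data).digest()

def node_id(entries) -> bytes:
    """Hash a directory from its (name, child_id) pairs, Merkle-DAG style:
    the identifier depends only on the children's identifiers."""
    h = hashlib.sha1()
    for name, child in sorted(entries):
        h.update(name.encode() + b"\x00" + child)
    return h.digest()

# Two repositories that contain the same subdirectory...
lib = node_id([("util.py", blob_id(b"def f(): pass"))])
repo_a = node_id([("README", blob_id(b"A")), ("lib", lib)])
repo_b = node_id([("README", blob_id(b"B")), ("lib", lib)])
# ...share the `lib` node: since its hash is identical in both,
# the archive stores that subtree exactly once.
```

Deduplication then comes for free: looking up a node's hash before inserting it is all that is needed.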

SHGD contains details about 85 million projects, 1.1 billion commits, 4.4 billion directories, and 5 billion files. Those constitute a graph with ~10 billion nodes and ~100 billion edges. This is a very sparse graph, so keep sparse matrices in mind when you decide to experiment. We can calculate the memory required to fit the whole graph in Compressed Sparse Row (CSR) format without the redundant "values" part: (100B + 10B + 1) * sizeof(int64) = 880GB. So you cannot fit the whole graph in memory, unfortunately. The good news is that you probably don't want to anyway.
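As a quick sanity check of that estimate (a back-of-the-envelope sketch: CSR without the values array stores one 64-bit column index per edge and one 64-bit row pointer per node, plus one):

```python
edges = 100 * 10**9  # ~100 billion edges
nodes = 10 * 10**9   # ~10 billion nodes
INT64 = 8            # bytes per index

# CSR without the redundant "values" array:
# column indices (one per edge) + row pointers (one per node, plus one).
total_bytes = (edges + nodes + 1) * INT64
print(round(total_bytes / 10**9))  # -> 880 (GB)
```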

The sources of the dataset include GitHub and GitLab. Personal information is discarded, so you cannot use it for easy spamming; that is responsible behavior nowadays, when recruiters abuse Git signatures and GitHub profiles at an unprecedented scale. Besides, this is what the GDPR requires, and Software Heritage is registered in France.

The authors propose several use cases for SHGD, such as studying how software evolves or hacking the social graph. Regarding what we do at source{d}, it could help with expanding our own Commit Messages dataset, and serve as a great evaluation benchmark for identity matching.

SHGD is shipped in two formats, PostgreSQL and Parquet, for those who favor SQL and Hadoop respectively. Starting to explore the dataset with Parquet is faster because you don't have to wait for the SQL dumps to load into the database. It is also possible to query Software Heritage's cloud service directly.

According to the metadata in the Parquet files, they were generated with pandas. This is good news: there shouldn't be any issues reading them from Python. However, many of them are big, unfriendly ~1GB chunks. Our recommendation for the next version of the dataset is to split the Parquet files into evenly sized parts of ~200MB. Anyway, below is a simple example of how to play with parquet_release.tar in Python:

from collections import Counter
from spacy.lang.en import English
import pandas
from tqdm import tqdm

df = pandas.read_parquet("c15c76d7064f48cd851b3c8252919fe0.parquet")
nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
words = Counter()
for comment in tqdm(df["comment"]):
    # Release messages are stored as bytes and may be null.
    words.update(t.text.lower()
                 for t in nlp((comment or b"").decode(errors="ignore"))
                 if t.is_alpha)
print(sorted(((v, k) for k, v in words.items()), reverse=True)[:20])

You should see something like:

[(2276688, 'version'),
(1589114, 'pgp'),
(1498394, 'from'),
(1487342, 'request'),
(1321349, 'merge'),
(1278460, 'pull'),
(1239091, 'sequenceiq'),
(1149600, 'release'),
(987284, 'for'),
(824326, 'to'),
(776710, 'gnupg'),
(764302, 'on'),
(652199, 'linux'),
(640074, 'gnu'),
(594937, 'the'),
(566769, 'fix'),
(526121, 'tag'),
(523117, 'based'),
(503373, 'added'),
(486982, 'package')]

This example takes the release descriptions, tokenizes them with spaCy, and prints the 20 most frequent words.

Stay tuned for more recap blog posts about our favorite research papers from MSR 2019. If you’d like to receive email notifications about upcoming blog posts, we invite you to sign up for the source{d} bi-weekly newsletter.

More about source{d} and MLonCode

This post was written by Vadim Markovtsev. Follow him on Twitter: @vadimlearning.