We've always given back to the community at source{d}. Our data engineers have done an incredible job of fetching repositories from GitHub and packaging them into something portable and easy to use, so that MLonCode researchers and other interested folks can avoid the nightmare of running a custom Git retrieval pipeline. I am talking about Public Git Archive (PGA), which we have already mentioned in our posts several times: 1, 2, 3. PGA was the main driver behind launching src-d/datasets, a dedicated GitHub repository to track our emerging datasets that may be of interest to people outside the company.

src-d/datasets augments our existing datasets collection on data.world. data.world is great for releasing moderately sized CSVs: it lets you run powerful queries and build beautiful plots on top of them. However, some of our datasets range from tens of gigabytes to terabytes in size, which renders all the perks data.world provides useless. Our solution was therefore to maintain the two sites in parallel. Whenever we manage to compile a nice CSV with distilled insights, we push it to both data.world and src-d/datasets; otherwise, only the latter is updated.

We hit the issue of choosing where to store the datasets at the very beginning. These are the options we considered:

  • Google Cloud Storage
  • Amazon S3 with Requester Pays
  • Google Drive
  • Serving directly from our servers
  • Sending hard drives, flash sticks or SD cards by physical mail

We did the math and discovered that Google Cloud Storage was prohibitively expensive for our use case. S3 with Requester Pays would have been free for us, but few people are so dedicated to MLonCode that they would pay to download our datasets. Besides, it requires an Amazon account, which may feel too invasive and inconvenient. The Google Drive option is also effectively free because we already pay for the corporate suite; however, it limits the amount of data served, so publishing terabytes on Google Drive did not look like a good idea. Desperate, we even considered accepting memory devices by mail, writing the data onto them, and sending them back. Eventually, we decided to put PGA behind nginx on our own servers and dump everything else to Google Drive. This has proved feasible and good enough.
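
To make the "prohibitively expensive" part concrete, here is the kind of back-of-envelope estimate involved. The numbers below are illustrative assumptions rather than actual Google Cloud Storage prices: egress is billed per gigabyte served, so a multi-terabyte dataset downloaded by even a modest number of people quickly adds up.

```python
# Back-of-envelope egress cost estimate (illustrative numbers, not a quote).
# Assumptions: ~$0.10 per GB of network egress, PGA weighs ~6 TB,
# and 100 people download it in a month.
PRICE_PER_GB = 0.10          # assumed egress price, USD per GB
DATASET_SIZE_GB = 6 * 1024   # Public Git Archive, ~6 TB
DOWNLOADS_PER_MONTH = 100    # assumed number of full downloads

monthly_cost = PRICE_PER_GB * DATASET_SIZE_GB * DOWNLOADS_PER_MONTH
print(f"~${monthly_cost:,.0f} per month")  # ~$61,440 per month
```

Even if the real prices differ, the shape of the formula is what matters: egress cost scales linearly with both dataset size and popularity, which is exactly the wrong property for an open dataset.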

We hit another problem which was not obvious at first: releasing new versions of the same dataset was quite unstructured. We have an ambitious plan to integrate src-d/datasets with DVC, an add-on to Git for managing big files from third-party sources. You may have heard about Git LFS; DVC works differently, addresses the main limitations of LFS, and explicitly specializes in machine learning. There is ongoing work to integrate DVC with Google Drive, which, once merged, will allow us to organize a proper data versioning scheme and hide the extra complexity of working with Google Drive.
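
The integration has not landed yet, so the snippet below is only a sketch of what consuming a DVC-tracked dataset could look like once it does; the file path is hypothetical, and we assume a DVC release that ships the dvc.api Python helpers.

```python
import dvc.api

# Hypothetical example: stream a DVC-tracked file straight from the
# src-d/datasets repository at a pinned revision, without cloning the
# data manually. The tracked path below is made up for illustration.
with dvc.api.open(
    "commit-messages/messages.csv.xz",          # hypothetical tracked path
    repo="https://github.com/src-d/datasets",
    rev="master",
    mode="rb",
) as f:
    chunk = f.read(1024)                        # read 1 KB as a smoke test
    print(len(chunk))
```

The appeal over raw Google Drive links is that the revision pins both the code and the data, so a paper or a model can reference an exact dataset version.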

Most of the files on Google Drive are compressed with xz. xz provides the best compression ratio among the widespread general-purpose compression utilities on Linux, at the expense of slow performance. There are ready-to-use parallel variants of xz which leverage the many cores in modern CPUs, e.g. pxz, so the performance problem is mitigated.
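
For reference, the xz files can be consumed directly from Python without unpacking them to disk first, using the standard library's lzma module. The file name below is a placeholder; substitute the dataset you actually downloaded.

```python
import csv
import lzma

# Stream an xz-compressed CSV without decompressing it to disk first.
# "commit_messages.csv.xz" is a placeholder file name for illustration.
with lzma.open("commit_messages.csv.xz", mode="rt", encoding="utf-8", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    print(header)
    for i, row in enumerate(reader):
        if i >= 5:   # only peek at the first few rows
            break
        print(row)
```

Decompression with xz is comparatively fast; the parallel variants mentioned above mainly help on the compression side, when we produce the archives.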

It is important to describe all the datasets following the same schema, so that users are not confused and do not miss valuable details. We try to include the following:

  • Download size
  • Download link
  • General description
  • File structure with a description for each file
  • Format description, as detailed as possible; sample code to load the dataset if the format is unique
  • Suggested dataset use cases
  • Origin - how, when, and from what the dataset was created
  • Source code used to create the dataset
  • Known limitations or bugs of the dataset
  • Separate licenses for the sample code, the source data transformation artifacts, and the source data

To date, src-d/datasets contains eight high-quality items:

| Name | Description | Size |
| --- | --- | --- |
| Commit Messages | 1.3 billion GitHub commit messages up to March 2019 | 46 GB |
| DockerHub Metadata | 1.46 million Docker image configuration and manifest files on DockerHub, fetched in June 2019 | 1.4 GB |
| Code Duplicates | 2k Java file and 600 Java function pairs manually labeled as similar or different by several programmers | 250 MB |
| Programming Language Identifiers | ~49M distinct identifiers extracted from 10+ programming languages in PGA | 1 GB |
| Public Git Archive | 260k+ top-bookmarked repositories on GitHub (early 2019) | 6 TB |
| Review Comments | 25.3 million GitHub PR review comments from January 2015 to December 2018 | 1.5 GB |
| Structural Commit Features | AST diffs for 1.6 million commits in 622 Java repositories on GitHub | 1.9 GB |
| Typos | 7375 typos in source code identifier names found in GitHub repositories | 1 MB |

We are going to introduce these datasets one by one in our future posts.

More about source{d} and MLonCode

This post was written by Vadim Markovtsev. Follow him on Twitter: @vadimlearning.