Data collection and processing might be less sexy than Machine Learning, but it is nevertheless crucial for any progress, and it is also something that source{d} as a company was built upon and has invested a lot into. It has been briefly highlighted in several conference talks (go-git, gitbase, gitbase indexes). Now it is time for a full-length blog post with the details.

Before we begin, a small reminder: as with most of what we do at source{d}, all the tools described in this blog post are available as Open Source software and packaged in source{d}, our end-user product.

>  “A story of the Data Retrieval pipeline at source{d}”

Motivation: no Data, no ML

Most of the recent progress in ML, and Deep Learning in particular, is attributed to the abundance of data and the computing resources available for training large Neural Network models. Therefore, having more and better data can be a strong advantage and is well worth investing in.

In the field of ML on Code, the data is all the Open Source software in the world, so the task of large-scale Git repository collection appeared early on, and Data Retrieval was one of the first teams in the company.

Overall architecture

What would it take to make all public source code in the world accessible to a Researcher?

A full solution can be divided into two parts: the “write” and “read” paths, or “collection” and “processing”.

This post discusses the details of the Collection part, which lies at the foundation of the source{d} technology stack and is the bread and butter of the Data Retrieval team.

The task of “storing all public source code in the world” can be broken down into several sub-tasks.

Version Control System = Git

Most of the source code we want to analyze is stored under version control, along with tons of useful metadata about it, so at the lowest level we have chosen to work with Git, the de facto standard VCS in the Open Source world.

A pipeline that collects and processes VCS data belongs to the world of server-side software, so early on the Go language was chosen as the foundation for this work. After several initial experiments with a Go binding to the awesome libgit2, we decided to implement the Git protocol and storage format support in pure Go.

The biggest rationale for doing so was the ability to easily put together a storage layer that keeps a whole Git repository in memory and processes it only once, before it hits the disk in the form of a packfile, which is expensive to read and decode again.

It took a great effort from multiple teams and has resulted in a popular OSS project in its own right: go-git, now used by several large companies around the world.
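To give a feel for what this enables, here is a minimal sketch (not borges itself) that uses go-git to clone a repository straight into memory and inspect it without ever touching the disk; the URL is just a placeholder:

```go
package main

import (
	"fmt"

	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/storage/memory"
)

func main() {
	// Clone the repository into an in-memory storer; nothing is written to disk.
	repo, err := git.Clone(memory.NewStorage(), nil, &git.CloneOptions{
		URL: "https://github.com/src-d/go-git", // placeholder URL
	})
	if err != nil {
		panic(err)
	}

	// The full object graph is now available for processing in memory.
	head, err := repo.Head()
	if err != nil {
		panic(err)
	}
	fmt.Println("HEAD is at", head.Hash())
}
```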

Find all Git repositories

The first step is to build a list of Git repositories. To gather the URLs of all public Git repositories, we created a tool called Rovers. In essence, each rover (or “provider”, as we call it) is a crawler for a particular source code hosting service. So far, we have implemented:

  • github: uses GitHub API to get public repositories.
  • bitbucket: also uses Bitbucket API to get repositories.
  • cgit: uses Bing search to find public cgit repositories.

The data gathered from each provider is stored in a PostgreSQL database, both for future replay-ability and to provide the starting point the next time the tool is started. For every URL found, Rovers also generates and sends a message to a RabbitMQ queue. This queue, called the mentions queue, is the communication path to the next pipeline step.

All the providers are able to wait for new repositories, so we can leave Rovers running and it will keep feeding the pipeline with newly created repositories.
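Rovers' actual message format is its own, but the handoff to the next stage boils down to publishing each discovered URL to the mentions queue. Below is a simplified, hypothetical sketch of that step using the streadway/amqp client; the queue name, payload, and connection URL are illustrative only:

```go
package main

import "github.com/streadway/amqp"

// publishMention sends a discovered repository URL to the mentions queue.
// Simplified: the real Rovers tool also persists the result to PostgreSQL.
func publishMention(ch *amqp.Channel, repoURL string) error {
	return ch.Publish(
		"",         // default exchange
		"mentions", // routing key: the queue name (illustrative)
		false,      // mandatory
		false,      // immediate
		amqp.Publishing{
			ContentType: "text/plain",
			Body:        []byte(repoURL),
		},
	)
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		panic(err)
	}
	defer ch.Close()

	// Make sure the queue exists before publishing to it.
	if _, err := ch.QueueDeclare("mentions", true, false, false, false, nil); err != nil {
		panic(err)
	}

	if err := publishMention(ch, "https://github.com/src-d/go-git"); err != nil {
		panic(err)
	}
}
```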

Fetch Git repositories

Borges is the tool that downloads and archives repositories. It follows a producer-consumer model where the producers generate jobs, and the consumers do the work, in this case downloading and storing repositories.

There are several types of producers:

  • mentions: Generates jobs from the mentions queue that rovers are filling with messages. This is the main way of downloading new repositories.
  • files: Generates jobs from a text file, one Git repository URL per line. Useful for downloading a specific set of repositories.
  • republish: Re-enqueues jobs that had problems downloading. It stops after a maximum number of retries is reached.

Both the mentions and files producers do two things: create a new entry for the URL in the repositories database, assigning it a UUID, and publish a new message with that UUID to a job queue, also in RabbitMQ.

There can be several Borges consumers waiting for new jobs in the queue. For each job, the consumer downloads the repository, stores it in the configured filesystem, and updates the repositories database with its state, references, and their location. If an error is not fatal (it didn’t make the downloader crash), the job is returned to the queue and its retry counter is decreased.
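As a rough, hypothetical sketch of that consumer loop (not Borges' actual code): downloadAndArchive below stands in for the real fetch-and-store logic, and in the real tool the retry counter is kept in the database rather than simply requeueing forever.

```go
package sketch

import (
	"log"

	"github.com/streadway/amqp"
)

// downloadAndArchive is a stand-in for the real work: fetching the repository
// identified by the job UUID and storing its packfiles and references.
func downloadAndArchive(jobID string) error {
	// ... fetch with go-git, write the archive, update the repositories DB ...
	return nil
}

// consume processes jobs from the queue; jobs that fail with a non-fatal
// error are requeued so they can be retried later.
func consume(ch *amqp.Channel) error {
	deliveries, err := ch.Consume("jobs", "", false, false, false, false, nil)
	if err != nil {
		return err
	}

	for d := range deliveries {
		jobID := string(d.Body)
		if err := downloadAndArchive(jobID); err != nil {
			log.Printf("job %s failed: %v, requeueing", jobID, err)
			_ = d.Nack(false, true) // return the job to the queue
			continue
		}
		_ = d.Ack(false)
	}
	return nil
}
```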

The architecture we use to download public repositories consists of one Rovers instance, two Borges producers (mentions and republish), and several Borges consumers.

Deduplicate Git repositories

We experimented with many different formats and tools, and the takeaway was: there is no better on-disk format than .pack files.

So we need to store hundreds of millions of packfiles. That includes both the initial fetch discussed above and subsequent updates, as software is not static and the most interesting projects keep improving.

Code duplication is a well-established fact at the scale of GitHub, and it applies not only at the file level but also at the repository level. Since a packfile stores a commit graph, one of the requirements from early on was to avoid storing “forks” twice and, in general, to avoid duplicating branches that share the same history across different repositories.

To solve the duplication problem, we use what we call “rooted repositories”. These repositories contain all the objects that are reachable by a specific initial commit. They also contain all the references (branches and tags) from all the repositories contained in it.

In the previous image, we can see the repository “github.com/src-d/go-git” and a fork made in the past called “github.com/mcuadros/go-git” that added a branch with one more commit. In both cases, the initial commit is “bfa09af” so the objects will be stored in the same rooted repository.

It can happen that a repository contains more than one commit tree, with no shared history between them. In this case, the repository is stored across several rooted repositories. This is common when a repository contains a “gh-pages” branch or when tools like Gerrit Code Review are used.
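Borges' own implementation aside, the underlying idea can be sketched with go-git: walk the history behind every reference and collect the parent-less commits; each such root commit identifies the rooted repository its objects belong to. A minimal, illustrative example on a local clone:

```go
package main

import (
	"fmt"

	git "gopkg.in/src-d/go-git.v4"
	"gopkg.in/src-d/go-git.v4/plumbing"
	"gopkg.in/src-d/go-git.v4/plumbing/object"
)

func main() {
	repo, err := git.PlainOpen(".") // any local clone
	if err != nil {
		panic(err)
	}

	roots := map[plumbing.Hash]bool{}

	refs, err := repo.References()
	if err != nil {
		panic(err)
	}

	// Walk the history behind every branch and tag and remember the
	// parent-less commits we reach; a "gh-pages" branch, for example,
	// leads to a different root commit than the main history.
	_ = refs.ForEach(func(ref *plumbing.Reference) error {
		if ref.Type() != plumbing.HashReference {
			return nil
		}
		commits, err := repo.Log(&git.LogOptions{From: ref.Hash()})
		if err != nil {
			return nil // e.g. the reference does not point to a commit
		}
		_ = commits.ForEach(func(c *object.Commit) error {
			if c.NumParents() == 0 {
				roots[c.Hash] = true
			}
			return nil
		})
		return nil
	})

	for h := range roots {
		fmt.Println("root commit:", h)
	}
}
```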

You can get more information about rooted repositories in the borges documentation.

Store Git repositories

To store tens of millions of Git repositories on disk, even counting only the immutable part of the information kept in packfiles (the mutable part lives in the database), we need a distributed file system.

We started off using a hosted solution from one of the cloud providers but very soon realized that economies of scale work against our use case: storage costs quickly dominate the bill, and paying for the redundancy and reliability of such storage was just not worth it. After all, if some packfiles go missing after a failure, it is very cheap to fetch them again using Borges, as long as the metadata database is intact.

That, plus the fact that owning our infrastructure would let us amortize the cost of the hardware and eventually have it “for free”, led us to opt out of the cloud and move to a bare-metal Kubernetes infrastructure.

So we started looking for a storage option we could host ourselves. Every repository takes at least one file, and many of them are quite small; such a data distribution is not very friendly to a file system, which brought up the need for a better file format.

File format

Instead of storing the repositories directly in the filesystem, we use an archive format created specifically for this use case. It is called “siva”, and its main advantages are that it is append-only and that its index can be read from the end of the file. Each time a rooted repository is modified, we append the new or modified files (packfiles, config, and references) and then write a new index. We do this atomically, so the archive can still be read while it is being updated.
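As a rough illustration of writing a siva archive, here is a sketch using the go-siva library. We assume its tar-like Writer API (NewWriter, WriteHeader, Write, Close); the file name and contents are placeholders, and in production Borges appends new blocks to existing archives rather than creating them from scratch.

```go
package main

import (
	"os"
	"time"

	siva "gopkg.in/src-d/go-siva.v1"
)

func main() {
	f, err := os.Create("rooted-repo.siva") // placeholder archive name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	w := siva.NewWriter(f)
	defer w.Close() // writing the index at the end of the file

	// Add one file entry to the archive (assumed tar-like header fields).
	if err := w.WriteHeader(&siva.Header{
		Name:    "objects/pack/pack-1234.pack", // illustrative path inside the archive
		Mode:    0644,
		ModTime: time.Now(),
	}); err != nil {
		panic(err)
	}
	if _, err := w.Write([]byte("...packfile bytes...")); err != nil {
		panic(err)
	}
}
```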

The go-git library is able to read Git repositories directly from siva files without unpacking them. This is possible because go-git works on top of a filesystem abstraction library called go-billy, and there is a go-billy filesystem implementation that understands this format.

If you want to know more about siva files, there is also a more detailed blog post describing it.

Distributed file system

Our previous stack used Apache Spark to do source code analysis. At that point, we were using HDFS to store the siva files with repositories as it is the natural choice to use with Apache Spark. Using this file system also enabled us to read locally from the disks while processing the data to maximize throughput.

When we moved to gitbase, the advantages we got from HDFS were no longer as obvious. Local access to the files was no longer possible, as gitbase is written in Go and the HDFS library it uses does not support it. We needed a new storage system with the following requirements:

  • Be able to access siva files locally through a POSIX filesystem
  • Run storage on the same node as gitbase without fighting it for CPU
  • Scale to several terabytes
  • Be a mature and actively maintained software project

Another nice-to-have feature was integration with Kubernetes, as it is where we will run both the downloading pipeline and gitbase.

Ceph was discarded from the beginning because of its CPU consumption on the nodes and the fact that files are striped into smaller chunks when stored on the physical disks. The latter also made us discard other filesystems like LizardFS, as it would not be easy for gitbase to locally access the siva files as a whole for processing.

After careful investigation and testing, we chose GlusterFS. Used without striping, it stores the files as-is in the filesystem, so they are straightforward to read. It also allows us to add new disks and let the filesystem take care of rebalancing the data and using all the available storage. As a bonus, the Kubernetes integration already exists, so spinning up new pods with Borges and Gluster storage is only a matter of configuration.

Currently, Borges accesses GlusterFS through a FUSE mount, just like any other POSIX filesystem, to store the repositories.

Conclusion

Given the infrastructure above, source{d} can store and keep up to date 100 million Git repositories on a 32-node cluster.

From now on, with a few simple commands, anyone can use the same Open Source software to go from a list of repositories or an organization name on GitHub to a set of siva files on disk, in a way that also scales to many machines using modern container infrastructure.

In further posts, we are going to talk about the Processing part of the pipeline and how a simple SQL interface exposes this data for analysis - stay tuned!

Meanwhile, if you have any further questions please feel free to join the community or subscribe to the newsletter to get notified about industry news on ML on Code.

And if you are interested in working on challenges like this, the Data Retrieval team is hiring!

--

Javier and Alex

software engineers at source{d}, on behalf of the Data Retrieval team.