For each repository, we store, at least, 5 files (config, 1 packfile, 1 packfile index, HEAD reference and a master reference). This grows, at least, by two (1 incremental packfile and its index) every time with update a repository.
For a fresh fetch of our current ~16M repositories, that adds up to 80M files and 32M more for each update. That is without considering new repositories being created every month.
We store everything in a distributed file system. Currently Google Cloud Storage and we are evaluating HDFS for our bare metal migration. Either way, whenever we need to update or analyze a repository, we need to read most of the repository files. Neither GCS or HDFS are low-latency systems, so the latency of each file fetch from the distributed file system dominates the time to fetch a repository to a computing node.
At this point, it was clear to us that we need to archive repositories. That way, when processing a repository, we would need to fetch a single file from the distributed file system.
our ideal archive format
We started evaluating different archiving formats to use. That should be easy! These are the features we are looking for:
No compression. Most data is contained in packfiles, which are already compressed.
Seekable. We need random access to arbitrary positions of the archived files.
Indexed. Fast access to specific files.
Cheap Add. With every update, we need to add two files to the archive, so ideally, we should be able to add these files to the archive without completely rewriting it.
Concatenable. Ideally, it should be possible to add new files to the archive with a single append operation. We could use GCS compose or HDFS append. Since most of the data in a repository is in immutable files, we do not need to clean up older files.
Here is a table with a comparison of different archive formats with respect our requirements.
Our first thought was we could extend or abuse some features in an existing format to make it more suitable for us.
Being tar the most widely used archiving format in UNIX-like systems, we decided to give it a chance.
Let’s have a look at the tar file structure, in a simplified way:
A tar archive is essentially a sequence of files. Each file contains a header, padding (files are stored in blocks of 512 bytes) and file content. The end of the tar file is marked with, at least, two zeroed blocks.
It is possible to concatenate multiple tar archives. We just need to ignore the zeroed blocks that are used as EOF markers:
This is supported by GNU tar and FreeBSD tar with a command line switch, but it is not widely supported in other tar implementations, which just ignore everything after the first EOF block.
There is no index that can be used for random access to tar contents. This is natural since tar (*t*ape *ar*chive) was designed to write and read sequentially on tape drives. Given the ubiquity of the tar format, many people have added indexing to tar, either by using a separate file (e.g. tarindexer) or embedding the index in the tar file itself (e.g. rat). The later approach enables us to use the embedded index while preserving compatibility with standard tar tooling, since standard tar will ignore data after the EOF mark. The result would be as follows:
We can combine both indexing and concatenation. The result would be something like this:
Our conclusion is that we could use the tar format with some tricks to fulfill all our needs, and still be compatible with GNU tar or FreeBSD tar. But it would have the following drawbacks:
Adding indexing and concatenation support would mean writing a custom tar implementation for every language we use.
It would not be space efficient: metadata is duplicated both in the index and headers, plus the useless padding. This is not a big issue for us, but if we are going to write our own tar implementations, we would rather avoid cruft and legacy.
Zip seems to come pretty close to our requirements. Let’s see its structure:
In principle, this is close to our tar+index format. However, concatenation is not possible at all without modifying the format in a completely incompatible way.
As an additional drawback, we are not aware of any implementation that provides random access to uncompressed files in a zip archive.
No other format got as close as zip and tar. For us, cpio was essentially tar with less extension possibilities. 7z apparently did not fit either and we did not explore the possibility of extending it, since it is unnecessarily complex for our purposes and implementations in multiple languages are scarce. RAR was discarded since it’s a proprietary format and there are no full open source implementations.
And then there is a myriad of more obscure archiving formats (e.g. mar, zpaq), most of which lack adequate documentation or multiple implementations. We decided that doing the full research to find an adequate format was not going to pay off, since we would need to write our own implementations in multiple languages anyway.
Here it is! The format that has all the features we want: śiva (seekable indexed block archiver). It fulfills all our requirements in an efficient way with minimal storage overhead.
A śiva file is a collection of one or more blocks. Each block can contain zero or more files. There is nothing more than a collection of blocks, no global header or footer. That means it is possible to concatenate two śiva files and the result is a valid śiva file.
Inside each block, the content of all files comes first. Files are simply concatenated without any kind of separator or metadata. After all file contents, there is an index entry for each file and finally the index footer.
When reading a śiva file, implementations must start by reading the index footer of the last block. The full index can be built by jumping backwards from index footer to index footer.
If an entry is repeated, the later occurrence has precedence, so it is possible to perform a logical overwrite. There is also the possibility of adding an index entry marking a file as logically deleted. The combination of logical overwrites and deleted facilitate implementing file system abstractions on top of śiva files.
The format has some interesting properties that we have not explored yet, but are possible to implement without any modification to the format:
Appending a block with a consolidated index for the whole file, so that the reader does not need to jump over all block footers to get the full index.
Add a file to an existing block, overwriting the index. That is, the same add strategy that zip or tar use.
Padding both entries and blocks, to align with blocks of the underlying file system.
The reference implementation of śiva is written in Go. Its API is similar to the standard Go archive/tar API. It has ~450 LOC, significantly less than the standard Go tar (~1300 LOC) and zip (1000 LOC). So it is easy to understand and maintain.