While doing topic modeling of GitHub repositories, to which we dedicated the recent article Topic Modeling of GitHub Repositories, we gradually realized that our dataset should be filtered to achieve better results. Particularly, we should throw away “extreme forks”, the Large Unidentified Flying Copy-Pasted Codebases. The typical example of such LUCPCs are web site engines, e.g. many people copy Wordpress into their repository and build their open-source blogs. Numerous github.io sites are the same copy-pasted examples. Folks like to learn web programming from books or online manuals copy-pasting the sample boilerplates. We are not saying it is wrong; sometimes it is even inevitable, sometimes git submodules seem a bigger evil in spite of all GitHub’s efforts to support them. Such duplicates are bad for specifically topic modeling since a lot of “garbage” bags are introduced.

We are working with repositories using the bag-of-names model, which treats each codebase as a sparse vector. The number of dimensions equals to the number of unique names occurring in all the source code we’ve got, the values are equal to the corresponding TF-IDF-s (the derivatives from frequencies). Names are extracted using the source highlight library pygments, and additionally refined with several tricky heuristics. For example, consider the following Python snippet:

class WebServer(ServerBase):
  def route(self, path):
    raise NotImplementedError()

It is transformed into

{"web": 1, "server": 2, "base": 1, "route": 1, "path": 1}

Then we apply frequency filtering, TF-IDF, stemming, etc.

We expect the resulting sparse vectors for duplicate repositories to be nearly the same. Of course, some files may contain minor edits and we cannot seek for the exact same bag-of-names; we have to make up something smarter. The problem which we’ve just stated is by no means unique: many Natural Language Processing tasks involve the same challenges. Let’s revise the common approach: thresholding by Jaccard Similarity.



Imagine that we have two sets (in mathematical sense) - A and B . The Jaccard Similarity Measure reflects the extent to which our two sets are similar to each other.

That is, the power of the set intersection divided by the power of set union. The idea is to calculate pairwise J matrix for all our samples, treat it as a mutual similarity graph’s incidence matrix and then pick the cliqueswhich have each edges weight above the fixed threshold. There is the problem: we’ve got too many samples, more than 10 million and we cannot calculate a 10Mx10M matrix in any observable future. Luckily for us there is a nice solution to that problem: MinHash algorithm.

There are several modifications of MinHash-ing, they are basically equivalent to each other. Let’s consider the easiest one. We Single-linked list, folded into a circle to fit. permute the elements of the union of all sets, and enumerate them. We pick the minimum index throughout contained elements for every set.We repeat this procedure multiple times, and take the average of the indices, h, we call h  MinHash”, it is indeed a hash function, but unlike other hashes, it is consistent: similar items tend to yield near hash values. The difference between two MinHash values can be proved to approximate the Jaccard Similarity. There is a good blog post which explains how to apply the MinHash algorithm to find duplicates: On removing duplicates from a set of documents. Basically, we sort all the hash values and scan them using the window of the size which is specially tailored for the tolerated level of false-positives. MinHash has an awesome property: there are always no false-negatives, so we end up with the same result which we would end up in 10Mx10M brute force.

Having calculated all the hashes in the dataset, we can then conduct locality sensitive hashing, an algorithm from Chapter 3, Mining of Massive Datasets. Briefly, we define several hash tables, each for it’s own subhash, depending on the target level of false positives. Same elements will appear in the same bucket; union of the bucket sets across all the hash tables for a specific sample yields all the similar samples. If we’d like to determine the sets of mutually similar samples a.k.a. cliques, we should consider the set intersection instead.


The LSH algorithm is actually pretty fast by design, does not require much memory and works perfectly, even when implemented in Python (datasketch), whereas calculation of the hashes themselves is resource consuming. We’ve developed an efficient Weighted MinHash calculation library MinHashCuda. It allows to offload the heavy-lifting to the GPU(s).

MinHashCuda has two interfaces, C and Python, much like kmcuda, and actually borrowed quite a lot of boilerplate code from kmcuda. I will use the Python API throughout this section. Suppose that we’ve got out dataset in the compressed sparse row format, particularly scipy.sparse.csr_matrix. In order to feed it to MinHashCuda, first you must initialize the hasher object:

import libMHCUDA
# m is our scipy.sparse.csr_matrix
# m.shape[-1] is the dimensionality - number of distinct elements in the weighted sets
hasher = libMHCUDA.minhash_cuda_init(m.shape[-1], 128, seed=1, verbosity=1)

Internally, it will leverage cuRAND, NVIDIA’s stock random generation library, to generate r, c and β, If the precious GPU memory fits those arrays, then you may move along to the hash calculation, otherwise some dimensionality reduction is needed. For example, the Titan X 2016 has 12GB GDDR5X, so the theoretical limit is:

dimensions. Of course, there must be some room for the samples, so practically the maximum dimensionality is about 500 million.

Next, we calculate the hashes. Depending on whether the whole matrix fits into the memory, there are two options:

# fits
hashes = libMHCUDA.minhash_cuda_calc(hasher, m)

# does not fit
hashes = []
for i in range(0, m.shape[0], chunk):
  hashes.append(libMHCUDA.minhash_cuda_calc(hasher, m, row_start=i, row_finish=i + chunk))
hashes = numpy.vstack(hashes)

In the latter case, we avoid using m[i:i + chunk] because it creates the new matrix and introduces much overhead.

Finally, we free all the allocated resources:



As usual, CUDA kernel performance dramatically varies depending on the chosen parallelization scheme, occupancy and the benevolence of the evil overlords. I had to rewrite it several times until I reached a satisfying result.

The naive method of parallelization would be to assign rows to threads on the equivalent basis: each thread takes a row, enters the argmin loop and finally writes the whole hash. There are several problems with this approach:

  1. Sparse rows may be severely unbalanced, e.g. one consists of 10 elements and the other has 100000. CUDA is SIMT and the performance is going to suffer.
  2. There is no caching, every iteration includes reading the whole row and reading the corresponding random values.

The solution to (1) is to split the incoming matrix into row intervals with roughly the same number of elements, not rows. (2) is solved by caching the partial minimum ak and (k∗,tk∗) into the shared memory, so that the row’s elements are read only once. There is no ability to accurately cache the random values as well because different rows refer to arbitrary different column indices. Apart from solving (1) and (2), I had the special pleasure to implement multiple GPUs support.

(1) appeared to be harder than it seems. The point is, (2) sets the limit on the maximum amount of shared memory per block, so the maximum number of elements consumed by each thread is upper bounded. We end up with a discrete optimization problem. It resembles bin packing, but instead of minimizing the number of bins, it focuses on the overall equal load and grouping the bins by CUDA blocks. Here is my solution:

Battle tested

We successfully applied MinHashCuda to find duplicate repositories on GitHub. The size of our dataset was initially 13.6 million but later was filtered down to 9.6 million. To be precise, the matrix was 9624276 x 2422260 with the sparsity 0.00014, which is roughly equivalent to 9624276 x 340. We ran the hash calculation on two Titan Xs (Maxwell), it took 40 minutes. According to my estimation, the achieved speed is 600 times faster than the implementation in datasketch using 12 CPU cores with Intel’s MKL.

The result after LSH with a similarity threshold of 0.9 was about 467,000 duplicate groups with 1.66 million repositories. 110,000 repositories appeared to be *.github.io. Here are the examples:


The complete dataset is published on data.world.

This post was written by Vadim Markovtsev. Follow him on Twitter: @vadimlearning.