This summer, we announced the release of Public Git Archive, a dataset with 3TB of Git data from the most starred repositories on GitHub. Now it’s time to tell how we tried to deduplicate files in the latest revision of the repositories in PGA using our research project for code deduplication, src-d/apollo. Before diving deep, let’s quickly see why we created it. To the best of our knowledge, the only effort to detect code clones at massive scale has been made by Lopes et al., who leveraged a huge corpus of over 428 million files in 4 languages to map code clones on GitHub (the DéjàVu project). They relied on syntactic features, i.e. identifiers (my_list, your_list, …) and literals (42, "hello", …), to compute the similarity between a pair of files. PGA has fewer files in the latest (HEAD) revision - 54 million - and we did not want to give our readers a DéjàVu by repeating the same analysis. So we aimed at something different: detecting not only copy-paste between files, but also involuntary rewrites of the same abstractions. Thus we extracted and used semantic features from Universal Abstract Syntax Trees (UASTs).
Detect the connected components in that graph and run community detection analysis on each component (community detection step).
identifiers, such as variable or function names.
literals, e.g. integer or string constant values.
graphlets, a feature composed of the internal type of a UAST node and the types of its children.
children, a feature composed of the node’s type and its quantized number of children.
uast2seq - sequences of UAST node types of limited length, extracted with a depth-first tree traversal.
node2vec - a sequence of UAST node types produced from random walks in the tree.
The latter two features are not explicitly defined since they depend on several hyperparameters: the maximum length of a feature, the maximum number of random walks, etc. We set the maximum length to 3 and 4 for tree traversals and to 2 and 3 for random walks (the outputs are combined). The maximum number of random walks was 2 and the number of random steps was 10. Those values represent a trade-off between the vocabulary size and the representativeness.
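To make the uast2seq feature concrete, here is a minimal sketch. The real features are extracted from Babelfish UASTs by src-d/ml; the `Node` class and the `>` separator below are our own toy stand-ins, not the actual API. The idea is simply a sliding window of a fixed length over the depth-first traversal order of node types:

```python
class Node:
    """Toy stand-in for a UAST node: an internal type and a list of children."""
    def __init__(self, internal_type, children=()):
        self.internal_type = internal_type
        self.children = list(children)

def uast2seq(root, seq_len):
    """Yield sequences of `seq_len` node types taken from a depth-first
    traversal of the tree (a sliding window over the DFS order)."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node.internal_type)
        stack.extend(reversed(node.children))
    for i in range(len(order) - seq_len + 1):
        yield ">".join(order[i:i + seq_len])

# A tiny tree: If -> (Compare, Block -> Call); DFS order: If, Compare, Block, Call
tree = Node("If", [Node("Compare"), Node("Block", [Node("Call")])])
print(list(uast2seq(tree, 3)))  # ['If>Compare>Block', 'Compare>Block>Call']
```

node2vec features are produced the same way, except that the sequences come from random walks in the tree instead of the deterministic DFS order.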
Houston, we have a problem
Sadly, it turned out that extracting those features crashed our cluster many, many times. We faced several issues:
Running Spark on Siva files through jgit-spark-connector led to a massive amount of temporary data during feature extraction on the worker pods of Kubernetes, which were further killed without notice by the cluster manager.
While we were converting the data to Parquet format to avoid this, workload imbalance caused by funny repos made individual tasks last incredibly long.
Once the conversion was done, applying TF-IDF to 1TB of cached data on disk led Spark to spend more time on garbage collection than on the actual work.
Finally, after we found a way to apply TF-IDF with manual batching (sic!), Babelfish failed to parse a considerable amount of the files.
And that does not even take into account the continuous refactoring and bug fixing in src-d/ml, src-d/jgit-spark-connector and src-d/apollo that any young project faces. We had to adapt rapidly to upstream changes.
[Table: files processed per language - files in PGA, % of all PGA, and ratio processed/PGA, %]
We observed that over 65% of the distinct features were literals, owing to their high variance; apparently they were relevant enough to be retained for most of the files. Conversely, due to the quantization, the number of distinct children features was minuscule compared to the rest.
[Table: distinct feature counts per feature type, with ratio to all, %]
The average number of features per file seemed roughly stable for all the considered languages. Features with the most semantic information, uast2seq and node2vec, had the highest count on average. Note: their numbers depend on the chosen hyperparameters, see the previous section.
The rest of the apollo pipeline ran much more smoothly. Thanks to MinhashCuda, the hashing stage took less than a day to finish. Our previous blog post provides more details about MinhashCuda; in particular, it describes the Locality Sensitive Hashing procedure in depth - the same one that we employ for the deduplication. See also ekzhu/datasketch. In a few words, we scan the buckets of several hash tables and treat files that hash to at least one common bucket as similar. Once we gather the lists of files per bucket, we can easily find the connected components of the similarity graph. Groups of files which hash to all the same buckets form cliques.
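The scheme can be sketched in a few lines of plain Python. Note the simplifications: apollo actually hashes TF-IDF-weighted features with Weighted MinHash on GPU via MinhashCuda, while the toy below uses an unweighted MinHash built on crc32; the file names and feature strings are invented for illustration.

```python
import random
import zlib
from collections import defaultdict

def minhash(features, num_perm=12, seed=0):
    """MinHash signature: for each of `num_perm` salted hash functions,
    keep the minimum hash value over the file's feature set."""
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(32) for _ in range(num_perm)]
    return tuple(
        min(zlib.crc32(f.encode()) ^ salt for f in features)
        for salt in salts
    )

def lsh_buckets(signatures, bands=3):
    """Split each signature into `bands` hash tables; files landing in the
    same bucket of any table are treated as candidate duplicates."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    return buckets

files = {
    "a.py": {"id:foo", "id:bar", "lit:42", "graphlet:If>Compare"},
    "b.py": {"id:foo", "id:bar", "lit:42", "graphlet:If>Compare"},  # near clone
    "c.py": {"id:baz", "lit:'x'", "graphlet:For>Call"},
}
sigs = {name: minhash(feats) for name, feats in files.items()}
pairs = {tuple(sorted(b)) for b in lsh_buckets(sigs).values() if len(b) > 1}
print(pairs)  # a.py and b.py share every bucket and thus form a clique
```

Tuning the number of bands against the rows per band is what sets the effective similarity threshold of the whole scheme.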
We decided to use two similarity thresholds for the hashing process, a stricter 95% and a looser 80% - the same ones the DéjàVu authors had used with SourcererCC. They yielded the following results after the pairwise similarity graph was split into connected components (CCs):
[Table: connected-component statistics at both thresholds - count of CCs with 1 file, count of CCs with >1 file, average number of files per CC with >1 file, and the maximum and second-maximum file counts across all CCs]
How do we interpret these results? Given the number of cliques, especially at the stricter threshold, about 40% of files are nearly exact clones of other files in the corpus. Even though the histograms below show that the majority of the connected components contain a small number of files (distributed exponentially), the amount of duplication this implies in GitHub’s most starred repositories is impressive.
[Table: duplication per language - % of files in CCs of size bigger than 1 and average file count per CC of size bigger than 1, at the 80% and 95% thresholds]
We also calculated the ratio of the number of unique files to the total number of files. The number of unique files is obtained by subtracting the sum of the sizes of the connected components with more than one file from the total. That’s our very rough estimate of the uniqueness in PGA.
[Table: estimated uniqueness per language - % unique at the 80% and 95% thresholds]
👽 Anomalous connected component 👽
As mentioned previously, we hit one huge CC with 878,186 files from all 5 languages at both thresholds. We looked inside and saw that this CC did not simply consist of similar files: it was built from cliques of similar files joined together. Only 0.1% of those cliques had files in more than one language at the 80% threshold (0.03% at 95%). The files were small and contained very similar identifiers. We also observed that this CC’s size collapsed with other hyperparameters which were less biased towards syntactic features: it then split into large monolingual communities. So our conclusion on the cause was the “hashing noise” and the “butterfly effect” of the “bridges” between the cliques.
The final step of the apollo pipeline is community detection, which gives us a sense of the number of non-similar files in our corpus. Consider the word ladder game invented by Lewis Carroll, which justifies coding for food:
CODE -> COLE -> COLD -> FOLD -> FOOD
In the same way, CCs with many files suffer from “bridges” between densely connected clusters of files. Community detection solves this problem nicely by assigning files to several clusters with different degrees of confidence. We decided to use igraph’s implementation of the Walktrap algorithm, since it runs reasonably fast and its quality is reasonably good on our data. We had two ways to build the graphs on which to run the community detection:
Include artificial vertices which represent the buckets; files are star-connected to the buckets they were hashed to.
Directly connect each pair of files if they were hashed to the same bucket.
While the second method is ideal, it scales quadratically with the number of files in a bucket, whereas the first one scales linearly. Depending on the chosen similarity threshold, the first method produces graphs with more artificial nodes than real ones. Therefore we went with the first method at the 95% threshold (3 LSH tables) and with the second method at the 80% threshold (9 LSH tables, 3x more buckets).
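The two builders can be sketched from the same bucket-to-files mapping produced by the LSH stage. The bucket keys and file names below are invented for the example; the point is the linear edge count of the star construction versus the quadratic clique expansion:

```python
from itertools import combinations

# Toy bucket -> files mapping produced by the LSH stage.
buckets = {
    ("t0", "b0"): ["f1", "f2", "f3"],
    ("t1", "b4"): ["f2", "f3"],
    ("t2", "b9"): ["f4"],
}

def star_edges(buckets):
    """Method 1: keep an artificial vertex per bucket and star-connect the
    files to it - one edge per (bucket, file), linear in the bucket size."""
    return [(bucket, f) for bucket, files in buckets.items() for f in files]

def clique_edges(buckets):
    """Method 2: connect every pair of files sharing a bucket - quadratic
    in the bucket size, but no artificial vertices."""
    edges = set()
    for files in buckets.values():
        edges.update(combinations(sorted(files), 2))
    return sorted(edges)

print(len(star_edges(buckets)))  # 6 edges plus 3 artificial bucket vertices
print(clique_edges(buckets))     # [('f1', 'f2'), ('f1', 'f3'), ('f2', 'f3')]
```

With 9 LSH tables at the 80% threshold, the bucket vertices of method 1 would outnumber the file vertices, which is why the clique expansion wins there despite its worse asymptotics.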
We could not detect communities in the anomalously large CC of 878,186 files at either threshold because none of the algorithms we tried finished within one day. Overall, we detected 918,333 communities at the 80% threshold and 666,692 at the 95% threshold.
This post continues with the visualizations of some of the connected components and the detected communities using Gephi.
Of course, we would have to review each of the found groups of similar files to ensure that they make sense, an impossible task given the number of files.
We manually labelled 2,000 pairs of Java files as almost the same, similar or different using the 🐈 src-d/code-annotation tool kindly developed by our Applications team. We sampled the pairs in a special way to cover all possible labels; the labeling process was tricky and fun, and we will certainly describe it in a future blog post. We then ran hyperparameter optimization with hyperopt to determine the feature weights, the optimal threshold and the rest of the variables. However, the resulting hyperparameters are biased by our ML team’s opinions and are not necessarily the best for everybody. They are also specific to relatively verbose Java code.
There is one metric, however, which is likely to be correlated with the similarity: the average ratio of distinct file names in the detected communities. We believe that files named the same way probably do the same thing. If you measure that ratio naively, you face a lot of noise which ruins the whole idea: file names like concatstring.js and concatstrings.js or syntax-update-20.rb and syntax-update-10.rb should not be considered distinct. So we applied FuzzyWuzzy to the sets of file names per community, with the minimum ratio parameter set to 80. The average ratio of distinct file names per group turned out to be below 0.1, which is a clear indicator that our pipeline is indeed sensible.
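The metric can be reproduced with the standard library alone: FuzzyWuzzy’s basic ratio is itself built on difflib’s SequenceMatcher, so the sketch below uses that directly. The greedy merging strategy and the sample community are our own simplifications, not necessarily what apollo does internally:

```python
from difflib import SequenceMatcher

def fuzzy_ratio(a, b):
    """Rough stand-in for FuzzyWuzzy's ratio (0-100), which is itself
    based on difflib's SequenceMatcher."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def distinct_names(names, min_ratio=80):
    """Greedily fold file names whose similarity is >= min_ratio into one
    representative, and return the representatives that survive."""
    kept = []
    for name in sorted(set(names)):
        if all(fuzzy_ratio(name, k) < min_ratio for k in kept):
            kept.append(name)
    return kept

community = ["concatstring.js", "concatstrings.js",
             "syntax-update-20.rb", "syntax-update-10.rb", "index.js"]
kept = distinct_names(community)
ratio = len(kept) / len(set(community))
print(kept, ratio)  # 3 distinct names out of 5 -> 0.6
```

The real measurement averages this ratio over every detected community; values below 0.1 mean that most groups are dominated by a single (fuzzy) file name.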
[Figure: connected component visualization; legend shows the distinct filenames count]
This is just one of the 79,000 connected components with a single distinct Go file name, even without the FuzzyWuzzy reduction. A quick investigation revealed that the files probably come from vendoring different versions of protobuf.
[Figure: connected component visualization; legend shows the distinct filenames count]
Not only did we notice that CCs tended to group files with matching names, we also noticed multiple CCs with variations of the same file name. For example, we found 14 CCs with different blends of jquery.js, e.g. jquery-1.7.1.min.js. Those CCs typically had a small number of communities relative to their size; however, they clustered by file name to a very limited extent.
Are file names so important?
[Figure: connected component visualization; legend shows the distinct filenames count]
This CC does not contain artificial vertices - buckets - because, as we wrote in the “Landing” section, the “all to all” graph builder is used at the 80% threshold. The number of edges relative to the number of vertices has exploded. To show that apollo didn’t group only identical files, the groups of vertices which share the same FuzzyWuzzy-matched file name are indicated on the graph. While vertices with the same name often hashed to the same buckets, some didn’t even end up in the same community. Furthermore, communities tend to group files with distinct names, e.g. look at the two largest communities in the graph.
[Figure: connected component visualization; legend shows the distinct filenames count]
This illustrates another pattern - CCs with many files from few projects. Here the CC is composed of Python files stemming from the Microsoft Azure SDK for Python, with 23 files located in 2 other Azure projects (the CLI and the IoT SDK). Given the repetitive structure of the SDK, this is no surprise.
[Figure: connected component visualization; legend shows the distinct filenames count]
This is one of the 4 CCs which together contain 42% of all the Java files in GoogleAds java-lib. While the other three CCs each correspond to a single project, this one is more diverse. However, the files are still clustered by project and by company. The red community is made of the GoogleAds files (32% of all files), the light blue one is from the AWS Java SDK, and the yellow one to the upper right is the AWS Android SDK (the latter two account for 23% of all files). Green files are from Plutext docx4j (6% of all files), mixed with eBay’s developer program projects (also 6%). The orange community is devoted to YaviJava (4.5%), and the others represent a wide range of projects from Facebook, Paypal, Apache, etc. Those codebases have much in common, e.g. coding conventions for the structure or the vocabulary of identifiers.
All the data used to prepare this blog post is open and available for download. Depending on whether you wish to repeat all the steps or run different hashing or just look at the groups of similar files, you should download the corresponding parts.