GitHub Contributions Graph: Analyzing PageRank & Proving The 6 Handshakes Theory

source{d} has recently published a dataset with metadata on 462,000,000 commits: data.world. It allows you to build the contributions graph. For example, these are the neighbors around Armin Ronacher:

Armin Ronacher's neighbors

Neighbors around Armin Ronacher, 2 generations - the repositories he contributed to and their contributors. Armin is in the center. 8k nodes, 11k edges. The graph was produced with Gephi.

The fans on the above image are communities around some popular open source projects, e.g. rust-lang/rust on the bottom.

Armin Ronacher's neighbors

Neighbors around Armin Ronacher, 3 generations. Armin is again somewhere in the center. 200k nodes, 300k edges.

These are the neighbors around Rob Pike:

Rob Pike's neighbors zoomed

Neighbors around Rob Pike, 3 generations, zoomed center region. Nodes are highlighted with the heatmap tool to reflect the distance from Rob. Hubs: upper-left: golang/go*, upper-right: onef9day/gowiki, lower-left: cmars/oo, lower-right: cmars/tools.

Rob Pike's children

Schematic abstraction of the previous graph. Edge weights are the number of commits.

Actually, Rob Pike has never contributed to cmars/oo and cmars/tools. The owner of those repos must have used git filter-branch or something similar to forge the history. This is why it is so hard to distinguish real contributions in the Git world!

Adjacency Matrix

The contributions graph is bipartite and is represented by an extremely sparse adjacency matrix:

Adjacency matrix

The square adjacency matrix of the contributions graph. \(C_ {ij}\) is the number of commits made by developer \(i\) to repository \(j\); \(C_ {ij} = C_ {ji}\).
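
To make the structure concrete, here is a minimal sketch of how such a bipartite matrix could be assembled from (developer, repository, commits) triples with scipy.sparse; the triples and sizes below are made up for illustration and are not part of the released dataset:

import numpy
from scipy.sparse import coo_matrix

# Hypothetical (developer index, repository index, commit count) triples.
ndevs, nrepos = 3, 2
triples = [(0, 0, 5), (1, 0, 2), (1, 1, 7), (2, 1, 1)]

rows, cols, vals = [], [], []
for dev, repo, commits in triples:
    i, j = dev, ndevs + repo  # repositories are indexed after the developers
    rows += [i, j]
    cols += [j, i]
    vals += [commits, commits]  # mirror the edge to keep the matrix symmetric

n = ndevs + nrepos
adjacency = coo_matrix((vals, (rows, cols)), shape=(n, n), dtype=numpy.float32).tocsr()
print(repr(adjacency))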

I continue mining the October 2016 snapshot; at that time the matrix was about 23 million by 23 million and contained 48 million non-zero elements. Of course, these numbers are approximate - they depend on our identity matching. Identity matching is the process of merging several email addresses into a single personality, and it is not an easy task because we have to make assumptions. The public dataset contains hashes instead of email addresses, so it is impossible to perform identity matching on that data. You can download the graph here. It is a pickled scipy.sparse.csr_matrix, and the following Python code loads it:

import pickle

with open("graph_blog_post.pickle", "rb") as fin:
    graph, repos = pickle.load(fin)

# Developers occupy the first rows/columns of the matrix, repositories the rest.
ndevs = graph.shape[0] - len(repos)
print(repr(graph))
print("Number of developers:", ndevs)
print("Number of repositories:", len(repos))

# Map repository name -> row/column index.
inv_repos = {r: i + ndevs for i, r in enumerate(repos)}
name = "src-d/go-git"
print("Repository %s has %d contributors, %d commits" % (
    name, len(graph[inv_repos[name]].indices), graph[inv_repos[name]].sum()
))
<23223056x23223056 sparse matrix of type '<class 'numpy.float32'>'
	with 47866516 stored elements in Compressed Sparse Row format>
Number of developers: 6621684
Number of repositories: 16601372
Repository src-d/go-git has 14 contributors, 169 commits

We ship only the second part of the matrix index, the repository name → index mapping. Please note the following:

  1. We were unable to process some large repositories back in October - in particular, about 20% of the most popular ones. That is, unfortunately, there is no data for git/git, torvalds/linux, rails/rails, etc.
  2. Some secondary repositories were confused with the main ones. E.g. there is no golang/go but rather some random 4ad/go. That was a bug we’re now fixing.
  3. Initially we had 18M repos, but we filtered out 1.5M duplicates not marked as forks; read this blog post to learn how.

Let’s conduct two fun experiments with our graph:

The Handshake Theory

Big Bang Handshake

The six handshakes theory, or Six degrees of separation, states that any two people in the world are connected through a chain of acquaintances no longer than 6. Supposing that you know all the people who contributed to the same projects as you did, we can accept or reject this assumption on the GitHub contributions graph.

The plan will be as follows.

  1. Find all the connected components of the graph.

  2. Pick the core component, or simply “the core”. Calculate the size of a representative sample of node pairs.

  3. Calculate the distances between sampled node pairs.

  4. Plot the histogram, draw the conclusion.

We need to determine the connected components because there is no path between nodes lying in different components, and every shortest path algorithm would have to scan all the nodes before returning a negative result. We’ve got 23M nodes, remember.

Connected components

While our graph is directed, every edge has a corresponding backward edge of the same weight, so weak connectivity automatically means strong connectivity. In other words, we do not need to apply the complex algorithms designed to find strongly connected components; a series of plain graph traversals is enough. For example, the following code works:

def RFS(v_start):
    """Random-first search: collect every node reachable from v_start."""
    visited = set()
    pending = {v_start}
    while pending:
        v = pending.pop()  # set.pop() returns an arbitrary element
        visited.add(v)
        if len(visited) % 500000 == 0:
            print(len(visited))
        # Scan the CSR row of v to enumerate its neighbors.
        for i in range(graph.indptr[v], graph.indptr[v + 1]):
            nv = graph.indices[i]
            if nv not in visited:
                pending.add(nv)
    return visited

unvisited = set(range(graph.shape[0]))
components = []
while unvisited:
    v = unvisited.pop()
    c = RFS(v)
    components.append(c)
    if len(components) % 100000 == 0:
        print(len(components))
    unvisited -= c
print("Number of connected components:", len(components))
clens = list(sorted((len(c) for c in components), reverse=True))
print(clens[:10])

The traversal algorithm is neither depth-first nor breadth-first; it is random-first, as Python’s set is an unordered container. Random-first traversal is sufficient for our task of visiting every connected vertex once, and it is faster. It takes about 5 minutes to execute. The result is:

Number of connected components: 3430141
[10200912, 10229, 8456, 2917, 2910, 2614, 2612, 2604, 2576, 2139]

We’ve got a clear core with 10M nodes! How many developers are in it?

c0 = components[0]
core_devs = [m for m in c0 if m < ndevs]
print(len(core_devs))
2153559

That is, there are 2.2M “contributionally active” GitHub users - 33%, or ⅓, of the users we analysed. We analysed roughly 73% of all the users who made at least one commit, and the official number of GitHub users is reported to be 20M (including users without any commits), so the final ratio is about 11%.

Representative sample

We need to determine how many distances between random node pairs in the core must be evaluated to obtain a statistically significant distribution. Let’s find a rough estimate. The number of possible pairs of core developers is \(\frac{N_ {core} (N_ {core} - 1)}{2}\).

print("Number of possible pairs:", len(core_devs) * (len(core_devs) - 1) / 2)
Number of possible pairs: 2318907106461

Now we can use a sample size calculator to find out how large a representative sample out of 2.3 trillion must be. For example, checkmarket.com. We set the population size to 2318907106461, the margin of error to 1% and the confidence level to 95% and get roughly 10,000. Not bad! This is how to sample 10k random pairs of core developers:

import numpy

samples = numpy.random.choice(core_devs, 10000 * 2, replace=False).reshape((10000, 2))
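
For reference, the calculator’s figure can be reproduced with the standard sample size formula for estimating a proportion (Cochran’s formula plus a finite population correction); this check is mine, not part of the original analysis:

population = 2318907106461
z = 1.96  # 95% confidence level
e = 0.01  # 1% margin of error
p = 0.5   # worst-case variance

n0 = z ** 2 * p * (1 - p) / e ** 2    # ~9604 for an infinite population
n = n0 / (1 + (n0 - 1) / population)  # the finite population correction changes almost nothing here
print(round(n))                       # ~9604, i.e. roughly 10,000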

Distances calculation

There are many shortest path algorithms out there. I can recall the following three off the top of my head:

  1. Floyd–Warshall calculates the distances from everybody to everybody but has a tiny drawback - it runs in \(O(V^3)\) time and eats \(O(V^2)\) memory. We cannot afford to store 23M × 23M = 5e14 elements (2 PB) and wait several years.

  2. Good ol’ Dijkstra finds all the shortest distances from one fixed node to all the rest in \(O(|E| + |V|\log|V|)\) time and \(O(|V|)\) space. It works, but it is not efficient on our graph. The problem is the “generational explosion”: each successive generation of nodes is exponentially larger than the previous one, so Dijkstra’s algorithm ends up inspecting almost all the nodes every time.

  3. Bidirectional search spreads simultaneous waves from both nodes and suits our task best, since it is less sensitive to the “generational explosion”.

Here is the code I used. It leverages the multiprocessing package to calculate many shortest paths in parallel.

def path_length(first, second):
    """Bidirectional BFS: the shortest distance between two developers, in handshakes."""
    visited_first_set = set()
    visited_second_set = set()
    visited_first_paths = {}
    visited_second_paths = {}
    pending_first = [(first, 0)]    
    pending_second = [(second, 0)]
    pending_first_set = {first}
    pending_second_set = {second}
    pmax = graph.shape[0]
    path = pmax

    # Breadth-first search single step
    def step(visited_set, visited_paths, pending, pending_set):
        v, p = pending.pop(0)
        pending_set.remove(v)
        visited_set.add(v)
        visited_paths[v] = p
        for i in range(graph.indptr[v], graph.indptr[v + 1]):
            nv = graph.indices[i]
            if nv not in visited_set and nv not in pending_set:
                pending.append((nv, p + 1))
                pending_set.add(nv)

    while (path == pmax) and (pending_first or pending_second):
        if pending_first:
            step(visited_first_set, visited_first_paths, pending_first, pending_first_set)
        if pending_second:
            step(visited_second_set, visited_second_paths, pending_second, pending_second_set)
        isect = visited_first_set.intersection(visited_second_set)
        for v in isect:
            p = visited_first_paths[v] + visited_second_paths[v]
            path = min(path, p)
    return path / 2  # repository hops do not count as handshakes

from multiprocessing import Pool
with Pool() as p:
    paths = p.starmap(path_length, samples)

The script took less than 10 minutes to finish on our 32-core machine with 256 gigs of RAM. Each process takes less than 4GB of RAM which means 32 processes need less than 128GB.

We divide each path length by 2 because the repository nodes should not be counted in the length of a handshake chain.

The histogram

Histogram

85% of the core is less than or equal to 6 handshakes from each other. The average is 4.9, the maximum is 18. The average is close to what Facebook had in 2011. ∎
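
For reference, a sketch of how the summary numbers and the histogram can be computed from paths with matplotlib (the original figure’s exact styling is unknown):

import numpy
from matplotlib import pyplot

paths = numpy.array(paths)
print("within 6 handshakes:", (paths <= 6).mean())  # ~0.85
print("average:", paths.mean())                     # ~4.9
print("maximum:", paths.max())                      # 18

pyplot.hist(paths, bins=numpy.arange(1, paths.max() + 2) - 0.5)
pyplot.xlabel("handshakes")
pyplot.ylabel("sampled pairs")
pyplot.show()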

GitHub PageRank

Now that we have the contributions graph, we want to find out whose GitHub is bigger. The natural way of doing this is to calculate some centrality measure. We decided to study eigenvector centrality, which makes a node important if it is referenced by other important nodes.

PageRank graph

Armin Ronacher’s contributions, simplified PageRank version.

$$ x_ {AR} = \frac{w_ 1 x_ 1 + w_ 2 x_ 2 + w_ 3 x_ 3 + …}{\lambda} = \frac{1}{\lambda} \sum w_ i x_ i $$

\(w_ i\) depends on the impact on the project or on the share of the developer’s efforts. If our adjacency matrix contained strictly positive elements (everybody connected to everybody), then by the Perron–Frobenius theorem \(\vec{x}\) is the eigenvector of this matrix which corresponds to the greatest eigenvalue. Moreover, if the matrix is stochastic, that is, the sum of the values in every column equals 1, then \(\lambda=1\). The greatest eigenvalue and the corresponding eigenvector can be found by various efficient methods such as power iteration or Arnoldi iteration.

Unfortunately, this is not directly applicable to our case: we’ve got a sparse matrix with a lot of zeros. Practically speaking, the greatest eigenvalue corresponds to an eigenvector with negative elements. We need to do something. Luckily, we are not the only ones who hit this problem. So did Larry and Sergey back in 1998. They studied a similar adjacency matrix of WWW pages, where each element \(C_ {ij}\) is 1 if page i links to page j and 0 otherwise. Larry and Sergey found a nice trick to make the adjacency matrix well-conditioned and suitable for power iteration, which they called PageRank.

There is a good explanation of PageRank in Stanford’s CS246. I am not going to repeat it but rather briefly describe how the trick applies to our centrality problem using the terms from CS246.

  1. We normalize each column to 1: \(\displaystyle \sum_ i{C_ {ij}} = 1\). The matrix is no longer symmetric, \(C_ {ij} \neq C_ {ji}\); however, it becomes stochastic.
  2. We multiply all the elements by \(\beta\) and add \(\frac{1-\beta}{N} E\), where \(E\) is the N by N matrix filled with ones (not the identity matrix!). \(\beta\) is the so-called damping or random walk factor and can be set to 0.85. The matrix becomes irreducible and aperiodic, aka the Google matrix.
  3. We run power iteration and find PageRank values for every node.

Normalization

The following code performs the L1 normalization.

import numpy
from scipy.sparse import diags

# The matrix is symmetric, so row sums equal column sums.
norms = 1 / numpy.asarray(graph.sum(axis=1)).ravel()
graph_normed = H = graph.dot(diags(norms, format="csc"))  # scale column j by norms[j]

Here is what happens: $$ norms_ i = \frac{1}{\sum\limits_ j C_ {ij}} $$
$$ diags = \left( \begin{array}{cccc} norms_ 0 & 0 & … & 0 \\
0 & norms_ 1 & … & 0 \\
… & … & \ddots & … \\
0 & 0 & … & norms_ {N-1} \end{array} \right) \\
$$
$$ H=\left( \begin{array}{cccc} C_ {0,0} & C_ {0,1} & … & C_ {0,N-1} \\
C_ {1,0} & C_ {1,1} & … & C_ {1,N-1} \\
… & … & \ddots & … \\
C_ {N-1,0} & C_ {N-1,1} & … & C_ {N-1,N-1} \end{array} \right)\times diags= $$ $$ =\left( \begin{array}{cccc} C_ {0,0}\, norms_ 0 & C_ {0,1}\, norms_ 1 & … & C_ {0,N-1}\, norms_ {N-1} \\
C_ {1,0}\, norms_ 0 & C_ {1,1}\, norms_ 1 & … & C_ {1,N-1}\, norms_ {N-1} \\
… & … & \ddots & … \\
C_ {N-1,0}\, norms_ 0 & C_ {N-1,1}\, norms_ 1 & … & C_ {N-1,N-1}\, norms_ {N-1} \end{array} \right) $$ Since \(C\) is symmetric, its row sums equal its column sums, so multiplying column \(j\) by \(norms_ j\) divides that column by its own sum. With the scipy.sparse API, multiplication by a diagonal matrix is the simplest way to normalize the columns.
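
A quick sanity check, continuing the session above: after this multiplication, every column of graph_normed should sum to 1 (up to float rounding).

col_sums = numpy.asarray(graph_normed.sum(axis=0)).ravel()
print(col_sums.min(), col_sums.max())  # both values should be very close to 1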

The Google matrix

$$ G = \beta H + \frac{1-\beta}{N}\left( \begin{array}{cccc} 1 & 1 & … & 1 \\
1 & 1 & … & 1 \\
… & … & \ddots & … \\
1 & 1 & … & 1 \end{array} \right) $$ We are never going to calculate it explicitly because that would require storing N × N elements! Instead, we implement a custom power iteration.

Power iteration

The idea of the power iteration method is dead simple: if we want to find the vector \(\vec{x}\) such that \(G\cdot \vec{x} = \vec{x}\) then we repeat the same matrix multiplication until the convergence: $$ x_ {i+1} = G\cdot x_ i $$ Provided that the matrix is well-conditioned, the convergence is guaranteed. Here is the code which calculates \(\vec{x}\).

def page_rank(m, beta=0.85, niter=80):
    N = m.shape[0]
    # Start from the uniform distribution.
    x = numpy.ones(N, dtype=numpy.float32) / N
    for i in range(niter):
        # Multiply by the Google matrix without materializing it:
        # beta * H . x plus the teleportation term (1 - beta) / N.
        x_next = m.dot(x) * beta
        x_next += (1 - beta) / N  # ***
        xdiff = numpy.linalg.norm(x - x_next, ord=1)
        x = x_next
        print("iter #%d: %f" % (i + 1, xdiff))
        if i % 10 == 0:
            print(x)
    return x

page_rank(graph_normed)

# *** every edge has the opposite one, so there are no “dead ends”. Otherwise, we would have to replace beta with x_next.sum().

page_rank() executes fast and the result is returned within a minute.
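
For example, this is roughly how the rankings below can be read off the result; it is a sketch that assumes repos is the sequence of repository names loaded earlier and that the returned vector is kept:

pr = page_rank(graph_normed)  # keep the returned vector this time

# The first ndevs entries of pr correspond to developers, the rest to repositories.
repo_names = list(repos)
top_repos = numpy.argsort(pr[ndevs:])[::-1][:200]
for rank, idx in enumerate(top_repos, 1):
    print("%3d. %s" % (rank, repo_names[idx]))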

Who is the most important on GitHub?

Ryan Baumann aka ryanfb! Incidentally, he is a good way to study how GitHub’s cache works (spoiler: it is invalidated after some time).

Unicorn

github.com/ryanfb takes several seconds to be generated.

ryanfb's contributions

Ryan knows what it means to be productive.

I am joking, of course. Although he is at the top, he is actually a living illustration of the weaknesses of the PageRank algorithm. There were days when Ryan created 1,000 repositories, and of course he is not a superhuman - they were generated automatically. Yet he got many incoming links, and PageRank rated him the highest.

Pepperidge Farm Remembers

The essential move is to ignore repositories with a single contributor, and that definitely helps, though other amusing artifacts still show through. After all, only the ratios of PageRank values are meaningful. For example, after filtering out single-contributor repositories, Armin Ronacher scores 8.0e-6, our CTO Maximo Cuadros 2.6e-6 and I score 1.6e-6.
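
One possible way to implement that single-contributor filter, sketched under the assumption that graph is the CSR matrix from above (columns that become empty afterwards need to be skipped during re-normalization):

import numpy
from scipy.sparse import diags

# A repository's contributor count is the number of non-zeros in its row;
# repository rows start at index ndevs in the symmetric CSR matrix.
contributors = numpy.diff(graph.indptr)[ndevs:]

keep = numpy.ones(graph.shape[0], dtype=numpy.float32)
keep[ndevs:][contributors < 2] = 0  # drop repositories with a single contributor

mask = diags(keep, format="csr")
graph_filtered = mask.dot(graph).dot(mask)  # zero out their rows and columns
# Re-normalize the columns and re-run page_rank() on graph_filtered as before.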

What is the most important on GitHub?

Here is the top 200, in descending order of PageRank:

  1. bmorganatlas2/firstrepo
  2. nguyendtu/patchwork
  3. bmorganatlas/fusiontest5
  4. coeligena/homebrew-customized-copy
  5. KenStanley/reflections
  6. athurg/linux_kernel
  7. enkidevs/commit
  8. gentoo/wikiclone
  9. karmi/wikipedia_metal_umlaut
  10. Homebrew/homebrew-core
  11. renoirb/test
  12. SopraConsulting/CocoapodsSpecs
  13. openSUSE/salt
  14. caskroom/homebrew-cask
  15. borisyankov/DefinitelyTyped
  16. neuros/linux-davinci-2.6
  17. openstack/openstack
  18. TheOdinProject/curriculum
  19. NixOS/nixpkgs-channels
  20. levaidaniel/sbo
  21. RobertCNelson/device-tree-rebasing
  22. markphip/testing
  23. mutoso-mirrors/linux-historical
  24. Xilinx/u-boot-xlnx
  25. laijs/linux-kernel-ancient-history
  26. wikimedia/mediawiki-extensions
  27. Ningxiaobao/learngit
  28. FFmpeg/FFmpeg
  29. rust-lang/rust
  30. xwstrom/ffmpeg-3.0.2
  31. tieto/pidgin
  32. mesa3d/mesa
  33. drewgreenwell/playscript-mono
  34. ThomasGagne/sage-rijndael-gf
  35. chapuni/llvm-project-submodule
  36. lwhsu/freebsd-doc_old
  37. palmzeed/git
  38. yasee/cocos2d-x-custom
  39. illumos/illumos-gate
  40. wbond/package_control_channel
  41. WeichenXu123/wchen-spark
  42. khavnu/VLCLIB
  43. OpenDMM/linux
  44. larryhastings/gilectomy
  45. KDE/kdelibs
  46. EFForg/https-everywhere
  47. PrestaShop/PrestaShop
  48. John-NY/overo-oe
  49. webapproot/metasploit
  50. yiichina/yii2
  51. zmatsh/Docker
  52. jcenteno1973/sicafam
  53. Digital-Peak-Incubator/tpl_tauristar
  54. shibaniahegde/OpenStack
  55. aabudari/nova
  56. futuresimple/ansible-project
  57. instructure/canvas-lms
  58. Shinogasa/django
  59. codecombat/codecombat
  60. wireshark/wireshark
  61. dmgerman/git-test-decl
  62. koding/global.hackathon
  63. zurb/foundation
  64. JCBarahona/edX
  65. paladox/testmw
  66. woothemes/woocommerce
  67. ohmnam/spreee
  68. LCTT/TranslateProject
  69. patchew-project/qemu
  70. 0xbzho/asciinema.org-2015-02
  71. tikiorg/sf.net-users
  72. gentoo/gentoo
  73. fanwenyi0529/qemu-fvm
  74. weissets/happy-navi-osmand
  75. KDE/kdepim
  76. 0xbzho/asciinema.org-2015-04
  77. duckduckgo/zeroclickinfo-goodies
  78. julian-gehring/julia
  79. woodsts/buildroot
  80. citation-style-language/styles
  81. aospSX/platform_kernel_msm7x30
  82. mjudsp/Tsallis
  83. NICHO1212/laravel
  84. voidlinux/void-packages
  85. skillcrush/skillcrush-104
  86. jruby/jruby
  87. rdunning0823/tophat
  88. ctekhub/atom-t
  89. Azure/azure-quickstart-templates
  90. linzhangru/ROS
  91. maurossi/llvm
  92. forcedotcom/aura
  93. buckett/sakai-gitflow
  94. Tower-KevinLi/Designers-Learn-Git
  95. silverstripe/silverstripe-framework
  96. PyAr/wiki
  97. tonydamage/opendaylight-wiki
  98. android-source/platform_frameworks_native
  99. lolli42/TYPO3.CMS-Catharsis
  100. Roll20/roll20-character-sheets
  101. disigma/android_native
  102. gpcorser/cis255
  103. facebook/hhvm
  104. CSMByWater/koha
  105. robfig/plovr
  106. gitclienttester/gitclienttest
  107. github-book/first-pr
  108. chusopr/kdepim-ktimetracker-akonadi
  109. pydata/pandas
  110. servo/servo
  111. michaKFromParis/sparkslab
  112. qtproject/qt-creator
  113. soulteary/Get-D2-2014-Ticket
  114. Kitware/VTK
  115. facebook/react-native
  116. aarontc/kde-workspace
  117. jredondo/diaspora-murachi
  118. Distrotech/evolution
  119. mono/monodevelop
  120. dolphin-emu/dolphin
  121. unrealengine47/UnrealEngine4
  122. openSUSE/systemd
  123. alainamedeus/percona-server
  124. allyssonsantos/be-mean-modulo-mongodb
  125. alpinelinux/aports
  126. Mittineague/discourse-hacked
  127. jeremygurr/dcssca
  128. mozillazg/pypy
  129. ngoquang2708/android_vendor_sprd_open-source
  130. dart-lang/sdk
  131. Rockbox-Chinese-Community/Rockbox-RCC
  132. sudhamisha/vmw-kube
  133. matplotlib/matplotlib
  134. 4ad/go
  135. intel/theano
  136. LucHermitte/ITK
  137. kontulai/fdsaember
  138. jonlabroad/OpenHab-1.8.3-InsteonRestApi
  139. sriram1991/anjularJS_usefull
  140. Azure/azure-powershell
  141. adambard/learnxinyminutes-docs
  142. erikbuck/CS2350
  143. TheThingsNetwork/wiki
  144. cbeasley92/hack-summit-hackathon
  145. 2947721120/adamant-waffle
  146. hashicorp/terraform
  147. librenms/librenms
  148. Lyude/gtk-
  149. RR1007/Jenkins2
  150. Kitware/ParaView
  151. tgm4883/lr-rpi2
  152. Kasual666/WebGl
  153. jlouiss/freeCodeCamp
  154. danoli3/openFrameworksBFG2
  155. mozilla/spidernode
  156. ansible/ansible-modules-core
  157. scardinius/civicrm-core-api-mailing
  158. akka/akka
  159. mozilla/releases-comm-central
  160. stewartsmith/bzr
  161. KrauseFx/fastlane
  162. lpathy/hhvm-armjit
  163. krt16s/fdroiddata
  164. mongodb/mongo
  165. Yiutto/D3
  166. rg3/youtube-dl
  167. KDE/amarok
  168. xiaobing007/tachyon
  169. HydAu/Camel
  170. almania69/Gimp
  171. TEAMMATES/repo
  172. piwik/piwik
  173. pfsense/pfsense
  174. nadimtuhin/platform4
  175. highfidelity/hifi
  176. DarkstarProject/darkstar
  177. nishithshah2211/libvirt
  178. chscodecamp/github
  179. naraa/mangos
  180. mangosR2/mangos3
  181. mtahmed/poky-clang
  182. JuliaLang/METADATA.jl
  183. idano/home
  184. alfathony/codeigniter-restclient
  185. duckduckgo/zeroclickinfo-spice
  186. google/closure-compiler
  187. karthikjaps/elasticsearch
  188. dwinurhadia/openstack-doc
  189. sciell/ArduPilot
  190. 0xbzho/asciinema.org-2015-01
  191. wxWidgets/wxWidgets
  192. frugalware/kde5
  193. ManageIQ/manageiq
  194. dyon/skcore
  195. tmcguire/qt-mobility
  196. KSP-CKAN/NetKAN
  197. ShaneDelmore/scala
  198. freeminer/freeminer
  199. bartekrychlicki/bisect-kata
  200. cakephp/docs

It’s a pity we were unable to process some important repos like torvalds/linux back in October, as I noted in the beginning, so they are absent. bmorganatlas2/firstrepo got to the top due to a failure of the identity matching: we’ve got a clique, which is PageRank’s Achilles’ heel. This is not really our fault: the guy created 30,000 commits, each under a different, obviously fake, author. How were we supposed to detect that without using the GitHub API? The same failure happened with bmorganatlas/fusiontest5 (besides, it has 10,000+ branches!). The other repositories seem to be legit and typically have thousands of contributors, hence the high PageRank. We are currently building a large data pipeline that will increase the quality of our data and expand it beyond GitHub to almost all Git repositories on the web.

Summary

This post studies the contributions graph which source{d} collected from GitHub repositories. The October 2016 dataset is open on data.world and can be used to build the graph; it can also be downloaded fully baked from our Google Drive. We had fun proving the theory of 6 handshakes between GitHub core developers and measuring community members’ importance with the PageRank algorithm.

Update

This is what you get if you visualize the Google matrix with LargeVis:

LargeVis GitHub

I used hexbin from matplotlib:

from matplotlib.pyplot import hexbin

hexbin(graph[:, 0], graph[:, 1],  # graph here: presumably the 2D LargeVis embedding, one (x, y) row per node
       gridsize=2048, bins="log", cmap="inferno")
