Recently I started to collect all the available metadata (name, number of stars, forks, watchers, etc.) from the most popular GitHub repositories. I chose the “number of stargazers” as the measure of popularity. This metric is by no means perfect, but it should at least correlate strongly and positively with real popularity.
I quickly grabbed all repositories with ≥50 stars (over 120k) using a crappy script.
This is the log-log histogram of the repository-star relation:
    import pickle

    import numpy
    import powerlaw

    # Load the repositories metadata from GitHubStars
    with open("repos.pickle", "rb") as fin:
        repos = pickle.load(fin)

    # Extract the series of star numbers
    stars = numpy.array([r.stargazers_count for r in repos])

    # Fit into all possible discrete distributions with a single awesome line
    # There is no data before 50; the distribution becomes unstable after 1000
    fit = powerlaw.Fit(stars, xmin=50, xmax=1000, discrete=True)

    # Plot the projected and fitted probability distributions
    fit.plot_pdf()
    fit.power_law.plot_pdf(linestyle="--")
We see that the fit is good. Its parameters are: μ=-15.31, σ=5.23. We crop the observed interval at 1000 stars: it contains 93% of all the analysed repositories and excludes the very highly rated, noisy samples (as seen on the histogram or on the full PDF). Those noisy samples are distributed erratically and cannot be fitted. Let’s compare the log-normal hypothesis with the power-law and exponential ones.
    >>> fit.distribution_compare("lognormal", "exponential", normalized_ratio=True)
    (74.347790532408624, 0.0)
    >>> fit.distribution_compare("lognormal", "power_law", normalized_ratio=True)
    (1.8897939959930001, 0.058785516870108641)
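These are the handy log-likelihood ratio tests built into powerlaw (see its documentation). Each call returns the normalized log-likelihood ratio and the p-value; a positive ratio favours the first distribution. The log-normal fit is therefore better than the exponential one with practically 100% confidence (p ≈ 0) and better than the power-law one with about 94% confidence (p ≈ 0.059).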
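As a sanity check, the p-values above can be recovered from the normalized ratios. The sketch below assumes that the p-value is the two-sided normal tail of the normalized ratio R, that is p = erfc(|R|/√2). This relation is an assumption about how powerlaw derives the p-value, but it does reproduce both numbers printed above. The helper name `p_value_from_normalized_ratio` is made up for illustration.

```python
import math


def p_value_from_normalized_ratio(ratio):
    """Two-sided p-value for a normalized log-likelihood ratio.

    Assumes the ratio is approximately standard normal under the null
    hypothesis that both candidate distributions fit equally well.
    """
    return math.erfc(abs(ratio) / math.sqrt(2))


# Normalized ratios copied from the distribution_compare() output above
p_exponential = p_value_from_normalized_ratio(74.347790532408624)
p_power_law = p_value_from_normalized_ratio(1.8897939959930001)

print(p_exponential)  # underflows to 0.0
print(p_power_law)    # about 0.0588
```

The "confidence" quoted in the text is simply 1 - p under this reading.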
All right, what about the number of forks? Every registered GitHub user can fork a repository to his or her personal account, which increments the corresponding counter of the origin, and the GitHub API reports those counters’ values. Their distribution appears quite different:
However, it fits a log-normal distribution well within the interval [30, 5000] forks:
Please note: we are plotting in the log-log domain, so the imaginary plateau on the histogram is not the “real” one.
What about the number of open issues?
Fit in [10, 600]:
Thus the majority of the top rated repositories have a small number of open issues; in particular, 80% have fewer than 18.
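A share such as the 80% above can be read straight from the raw counters. The following is a minimal sketch on a made-up list of open issue counts (not the real data set); `share_below` is a hypothetical helper, not part of any library.

```python
def share_below(values, threshold):
    """Return the fraction of values strictly below the threshold."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v < threshold) / len(values)


# Hypothetical open issue counts for ten repositories
open_issues = [0, 1, 2, 3, 5, 8, 12, 17, 40, 250]

print(share_below(open_issues, 18))  # 0.8
```

With the real data, the list would be built as `[r.open_issues_count for r in repos]`.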
The three mentioned metrics appear to follow the same kind of distribution. Are they actually correlated? A quick look with pandas reveals it:
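    import numpy
    import pandas

    # Log-transform the counters; +1 avoids log(0) for repositories without forks or open issues
    dataset = numpy.empty((len(repos), 3), dtype=numpy.float32)
    dataset[:, 0] = [numpy.log(r.stargazers_count) for r in repos]
    dataset[:, 1] = [numpy.log(r.forks_count + 1) for r in repos]
    dataset[:, 2] = [numpy.log(r.open_issues_count + 1) for r in repos]

    df = pandas.DataFrame(dataset, columns=["Stars", "Forks", "Open issues"])
    # scatter_matrix lives in pandas.plotting in current pandas releases
    axes = pandas.plotting.scatter_matrix(df, alpha=0.2)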
So the answer is yes, all three are positively correlated. To be precise, here is the correlation matrix:
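With pandas, `df.corr()` gives the Pearson correlation matrix of the log-transformed columns. To show what is being computed, here is a minimal standard-library sketch on made-up counters (not the real data set); the `pearson` helper and the sample numbers are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (stars, forks, open issues) counters for five repositories
samples = [(50, 4, 0), (120, 30, 3), (400, 55, 10), (900, 210, 8), (5000, 800, 90)]

# Same transform as in the article: log(stars), log(forks + 1), log(issues + 1)
columns = {
    "Stars": [math.log(s) for s, _, _ in samples],
    "Forks": [math.log(f + 1) for _, f, _ in samples],
    "Open issues": [math.log(i + 1) for _, _, i in samples],
}

names = list(columns)
matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]

# Print one row per metric
for name, row in zip(names, matrix):
    print(f"{name:12s}", " ".join(f"{value:6.3f}" for value in row))
```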
It turned out that some highly rated repositories have become ghosts, that is, they are empty.