GitHub repositories' statistics

It’s always fun to play with a dataset few people have ever played with. source{d} knows a lot about each GitHub developer, including the number of bytes written in each programming language, all commit metadata, etc. So an obvious first step is to use this information to better understand the industry.

First of all, software developers are humans (so far), and some of them make open source contributions, which is a natural process. Therefore, one might expect the distribution of the total number of bytes written by each GitHub user to be log-normal. Well, it is not:

overall

There are many more developers who wrote less code than average than those who wrote more. Yet if we look at each language individually, the picture becomes log-normal:

C

Java

Python
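
One rough way to check how close a per-language distribution is to log-normal is to take the logarithm of each developer's byte count and look at the moments of the result: for a log-normal sample the logs are normally distributed, so their skewness should be near zero and their mean marks the peak. The sketch below runs on synthetic data. The parameters 9.5 and 2.0, the use of the natural logarithm, and the function name `log_moments` are illustrative assumptions, not values or code from the source{d} dataset.

```python
import math
import random
import statistics


def log_moments(byte_counts):
    """Return (mean, stdev, skewness) of the natural logs of the positive values."""
    logs = [math.log(b) for b in byte_counts if b > 0]
    mean = statistics.fmean(logs)
    stdev = statistics.pstdev(logs)
    # Third standardized moment: about 0 for a symmetric (e.g. normal) sample.
    skew = sum(((x - mean) / stdev) ** 3 for x in logs) / len(logs)
    return mean, stdev, skew


# Synthetic stand-in for one language; the parameters are made up.
rng = random.Random(0)
sample = [rng.lognormvariate(9.5, 2.0) for _ in range(20000)]

mean, stdev, skew = log_moments(sample)
print(f"peak ~ {mean:.2f}, spread ~ {stdev:.2f}, skewness ~ {skew:.2f}")
```

A clearly positive skewness of the logs would correspond to a histogram with a steep left slope and a long, flat right one.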

Apparently, the more code is written in a language, the further the peak shifts to the right. For example, Go’s peak is at 9.22 while C’s is at 9.81.

Go

If a language falls out of the mainstream, the left slope becomes steep and the right one flat:

Cobol

Pascal

Interestingly, some common languages are irregular:

JavaScript

Ruby

It turns out that the density of JavaScript developers stays roughly the same over a broad interval, from 400 to 400,000 bytes. The gap between the numbers of casual and productive Rubyists is as large as 2x.

If we look at repository sizes, they are log-normal too:

Java

Python

These distributions demonstrate that Python is less verbose than Java: the mean repository size on the linear scale is 30% smaller for the former.
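
To connect the log-scale plots with a mean taken on the linear scale, it may help to recall the standard formula for the mean of a log-normal variable. The symbols μ and σ (mean and standard deviation of the log of the repository size) are notation introduced here for illustration, and the relation holds only to the extent that both distributions really are log-normal. The value 0.7 is just the "30% smaller" figure restated as a ratio.

```latex
\[
X \sim \mathrm{Lognormal}(\mu, \sigma^2)
\quad\Longrightarrow\quad
\mathbb{E}[X] = e^{\mu + \sigma^2/2}
\]
\[
\frac{\mathbb{E}[X_{\text{Python}}]}{\mathbb{E}[X_{\text{Java}}]}
= \exp\!\Big( (\mu_{\text{Python}} - \mu_{\text{Java}})
  + \tfrac{1}{2}\big(\sigma_{\text{Python}}^2 - \sigma_{\text{Java}}^2\big) \Big)
\approx 0.7
\]
```

So a difference in linear means can come from a shift of the peak, a change in the spread, or both.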

Let’s look at the number of contributors per repository:

Contributors - log(number of repos)

Clearly, most GitHub repos are used solely by their owners. But the picture changes if we consider the total number of bytes instead of the number of repositories:

Contributors - log(bytes)

Thus, most of the code is written in repositories with 2 contributors.
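
The two views differ only in what is accumulated per contributor count: the number of repositories or the bytes they contain. Here is a minimal sketch of both aggregations. The `repos` records and the function names are hypothetical, made up for illustration, and are not taken from the actual dataset.

```python
from collections import Counter, defaultdict

# Hypothetical records: (number of contributors, repository size in bytes).
repos = [
    (1, 2_000), (1, 5_000), (1, 1_000),
    (2, 40_000), (2, 60_000),
    (3, 30_000),
]


def repos_per_contributor_count(records):
    """Count how many repositories have each number of contributors."""
    return Counter(n for n, _ in records)


def bytes_per_contributor_count(records):
    """Sum repository sizes for each number of contributors."""
    totals = defaultdict(int)
    for n, size in records:
        totals[n] += size
    return dict(totals)


by_repos = repos_per_contributor_count(repos)
by_bytes = bytes_per_contributor_count(repos)

# The contributor count that dominates each view.
top_by_repos = max(by_repos, key=by_repos.get)
top_by_bytes = max(by_bytes, key=by_bytes.get)
print(top_by_repos, top_by_bytes)
```

In this toy data, single-owner repositories are the most numerous, yet the two-contributor ones hold the most bytes.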

Finally, let’s look at commit stats. The distribution of the number of commits made by each developer is not log-normal; it decreases polynomially. Here are the first 10 values:

Commits - 10

And the rest on the log scale:

Commits - log

So there is no such thing as a most common number of commits, apart from 0 and 1. Besides, it appears that the number of commits correlates poorly with the amount of code written; otherwise we would get a log-normal distribution.
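
A polynomial decrease shows up as a straight line when both axes are logarithmic, so a quick diagnostic is to fit a line to log(count) against log(commits) and read the exponent off the slope. The sketch below does this with ordinary least squares on a synthetic histogram. The exponent 2 and the scale 1e6 are made-up values, `fit_power_law` is a hypothetical helper, and a least-squares fit is only a rough check, not a rigorous estimator.

```python
import math


def fit_power_law(histogram):
    """Least-squares fit of log(count) = slope * log(commits) + intercept."""
    points = [
        (math.log(k), math.log(c))
        for k, c in histogram.items()
        if k > 0 and c > 0  # the logarithm needs positive values
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic histogram: developers with k commits = 1e6 * k^-2 (made-up numbers).
histogram = {k: 1e6 * k ** -2.0 for k in range(1, 1001)}

slope, intercept = fit_power_law(histogram)
print(f"estimated exponent: {-slope:.2f}")
```

On real data the tail is noisy, so it is usually worth binning or truncating it before fitting.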

While all this analysis is fun, it’s even more fun to repeat it after some time, e.g. in a year. Watching how the picture evolves will allow us to predict trends in software development and the open source community.

The data for this post was obtained in April 2016. We used src-d/go-git to fetch all GitHub repositories.
