To give you a bit of background, at source{d} we analyse every GitHub repository and run our own version of git blame (go-git) across 900 million commits. This gives us unique insight into the code published by over 6 million developers. It’s important to note that this post is based only on GitHub repositories; over time we hope to expand to all Git projects hosted on the web.

There have been a lot of posts about gender in the development community, and we realised that we were in a great position to contribute with data. It started with a question: how do we determine gender across 6 million developers? We took the approach of classifying names based on their statistical likelihood of being either male or female. First, we cleaned up the names used in commits, separating first names, last names and usernames. Second, we looked for data sources for name genderization, which led us to several census databases (UK & USA) and several APIs (primarily to cover non-Latin names). After a lot of cleaning, we had compiled a database of over 144,000 names, each with its gender, # of occurrences and our statistical likelihood of being correct. We are still improving and adding different methods to determine gender to make our study more accurate before we release it.
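To make the idea concrete, here is a minimal sketch of name-based classification. It is not our production pipeline: the `classify_name` function, the `min_likelihood` threshold and the counts in `HYPOTHETICAL_COUNTS` are all made up for illustration.

```python
from collections import namedtuple

# One guess per name: the gender, how often the name was seen, and how
# confident the guess is.
GenderGuess = namedtuple("GenderGuess", ["gender", "occurrences", "likelihood"])

# Made-up (female_count, male_count) pairs per first name. Real counts would
# come from census data and name APIs; these numbers are purely illustrative.
HYPOTHETICAL_COUNTS = {
    "maria": (990, 10),
    "john": (5, 995),
    "alex": (400, 600),
}


def classify_name(first_name, counts, min_likelihood=0.9):
    """Return a GenderGuess for first_name, or None if unknown or too ambiguous.

    The 0.9 threshold is an assumption chosen for this sketch.
    """
    key = first_name.strip().lower()
    female, male = counts.get(key, (0, 0))
    total = female + male
    if total == 0:
        return None  # name not present in the data set
    if female >= male:
        gender, likelihood = "female", female / total
    else:
        gender, likelihood = "male", male / total
    if likelihood < min_likelihood:
        return None  # too ambiguous to classify with confidence
    return GenderGuess(gender, total, likelihood)


if __name__ == "__main__":
    for name in ("Maria", "John", "Alex", "Unknown"):
        print(name, classify_name(name, HYPOTHETICAL_COUNTS))
```

Returning nothing for unknown or ambiguous names is one reason a name-based approach can only cover part of a population of developers.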

Of the 6 million developers who have publicly contributed to GitHub, we were able to determine gender based on name for approx. 2 million. In the coming month, we’ll be releasing different posts and analyses on this data set.

Before we post about percentages, country-specific differences and trends over time, we wanted to take a moment to highlight some of the women who have been active in the open-source community. This list is in no way exhaustive, and we have tried our best to keep it as objective as possible. To do this, we decided to first slice our data set on two variables:

  1. Total # of commits
  2. Our own version of PageRank*

*We look at every developer as a node in a graph and every project they contributed to as an edge; we then weight each edge based on the ratio of # of bytes contributed to that project. Once we have this graph, we apply the PageRank algorithm. Please note that this is a reputation metric and is hence greatly influenced by the co-contributors across repositories.
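As a rough illustration of the footnote above, the sketch below builds a small developer graph and runs a plain power-iteration PageRank over it. Several details are assumptions made for the example and not a description of our actual implementation: two developers are linked when they share a project, the edge weight is the product of their byte shares in that project, and the damping factor is the commonly used 0.85. The names `build_graph` and `pagerank` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(contributions):
    """Turn {project: {developer: bytes}} into a weighted, undirected graph.

    Assumption for this sketch: the weight between two co-contributors is the
    product of their shares of the bytes in the shared project.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for devs in contributions.values():
        total = sum(devs.values())
        if total == 0:
            continue
        for dev in devs:
            graph[dev]  # make sure solo contributors still appear as nodes
        for a, b in combinations(sorted(devs), 2):
            weight = (devs[a] / total) * (devs[b] / total)
            graph[a][b] += weight
            graph[b][a] += weight
    return graph


def pagerank(graph, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration; returns {node: rank}."""
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing weight is spread evenly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_weight[v] == 0:
                continue
            for u, w in graph[v].items():
                new_rank[u] += damping * rank[v] * w / out_weight[v]
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical data: bytes contributed per developer per project.
    sample = {
        "project-a": {"ann": 50, "bea": 50},
        "project-b": {"ann": 50, "cat": 50},
    }
    ranks = pagerank(build_graph(sample))
    for dev, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(dev, round(score, 4))
```

In this toy graph the developer who shares projects with both of the others ends up with the highest rank, which is exactly the co-contributor effect the footnote warns about.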

We identified every developer whose name we had classified as female and whose total # of commits was above 1,000, and ordered them by PageRank. We then manually reviewed over 1,000 GitHub profiles to ensure that the contributions were to open-source projects. We decided to allow projects from any field (besides computer science you’ll find bioinformatics and astrophysics represented here), and if there was no license but the project was clearly open-source we allowed it (it’s debatable whether we should be more strict here).
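The automated part of that selection boils down to a filter followed by a sort. A minimal sketch, assuming a hypothetical `Developer` record with made-up field names and values (the manual profile review is not something code can replace):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Developer:
    # Hypothetical record; the field names are assumptions for this sketch.
    login: str
    gender: Optional[str]  # "female", "male", or None when undetermined
    commits: int
    pagerank: float


def shortlist(developers: List[Developer], min_commits: int = 1000) -> List[Developer]:
    """Keep developers classified as female with more than min_commits commits,
    ordered by PageRank from highest to lowest."""
    selected = [
        d for d in developers
        if d.gender == "female" and d.commits > min_commits
    ]
    return sorted(selected, key=lambda d: d.pagerank, reverse=True)


if __name__ == "__main__":
    # Made-up sample data.
    sample = [
        Developer("dev-a", "female", 2500, 0.002),
        Developer("dev-b", "female", 900, 0.010),
        Developer("dev-c", None, 4000, 0.020),
        Developer("dev-d", "female", 1200, 0.005),
    ]
    for dev in shortlist(sample):
        print(dev.login, dev.commits, dev.pagerank)
```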

The data set used to arrive at the commit counts is an analysis of the raw pack files of every public Git repository on GitHub up to 21st April 2016. The commit counts will often differ from those shown on GitHub profiles, which seem to use a different counting method.

We would like to make it very clear that metrics such as PageRank, stars and followers on GitHub are pure vanity metrics. We used PageRank because we like that it shows the community aspect of whom you’ve worked with and on what projects, but over time we’ll be releasing better metrics that focus on quality & impact. As a developer today you should not be coding to get stars on GitHub.

We have tried to reach out to every person on this list to ask if they had any objection to being included or if they wanted any of their information changed (we’ve heard back from 72 and unfortunately had some emails bounce). If you’d still like to be excluded or have your information updated, please contact me (eiso@sourced.tech). We are certain that we have missed some amazing developers, and we hope it’s clear that this list highlights just a small sample of incredible developers.

Any suggestions or improvements for posts can be requested on GitHub.

We would like the discussion for this post to happen at YC HN because we believe the community is great at self-moderating (I’ve personally been a user since 2008).

You can also follow the tweets of this incredible list of developers here: Twitter list.