Given our interest in Machine Learning on Code at source{d}, and with TensorFlow World happening this week, we thought it would be fun to analyze all the TensorFlow git repositories with source{d} Enterprise Edition (EE) to extract interesting insights for the TensorFlow community. source{d} EE not only saves us time through higher query performance but also lets us showcase advanced metrics that are not (yet) available out of the box in source{d} Community Edition (CE).

Follow this link to view a read-only dashboard of the entire Tensorflow Project analysis. Keep on reading for a summary of our key findings.

Release Cadence

Looking at the project release timeline below, the release cycles do not seem as consistent as those of other open-source projects such as Kubernetes. Although the release velocity looked quite high in 2018, with 8 releases, it slowed down significantly in 2019. Note that version 1.13 had to be skipped due to a bug during the release process.

In fact, as we can see in the chart below, most of the 2019 releases are tied to other repositories in the Tensorflow Organization.

Files and Lines of Code by Programming Language

With more than 17K files and 4M lines of code, it is safe to say that the breadth and depth of the TensorFlow project have grown significantly over the past few years. After 4 years of consistent growth and fast development, the growth in both the number of files and the lines of code seems to be slowing down, with significant refactoring efforts as part of the 2.0 release last month.

The growing number of programming languages used in the TensorFlow codebase, which has reached 42 since the beginning of 2019, also confirms the growing scope and complexity of the project.

In the chart about programming languages below, we can see that Python and C++ are by far the dominant languages, both in the number of files and in the lines of code per language. The analysis also shows that JavaScript, TypeScript and CSS have been extracted to separate GitHub projects, while others such as Go and Java have been gaining momentum.

Taking a closer look at the evolution of programming languages in the TensorFlow codebase, we can see that a lot of Python code was removed right before the 2.0.0 release, while the number of C++ lines remained the same. The low-level logic (such as multiplication) is written in C++, which explains why it wasn’t affected, while the high-level logic in Python was significantly refactored.

Commit Activity and Contributions

The relatively small number of repositories (83), in contrast with the large number of contributors (5,920), shows a strong focus on the core repositories: tensorflow, models and docs. The chart below, which tracks the number of contributors per repository over time, shows that the project now has between 500 and 600 unique contributors per month.

Knowing that Tensorflow is a project that was open-sourced by Google, it is not surprising to see that and Google are the biggest contributors by the number of commits. The number of contributions by individuals (those with and emails) is fairly high, a sign of a healthy open source project and community. Nvidia and Intel seem to be the only other organizations that made significant enough contributions to appear on this chart.

Excluding contributions from tensorflow-gardener, a bot that automates repo maintenance, and Copybara, a tool for transforming and moving code between repositories, we can clearly see that the biggest individual contributors are also from Google. Big shout out to Shanqing Cai, Nikhil Thorat, Daniel Smilkov, and Yong Tang, who clearly stand out as the top contributors.
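As a rough illustration of the filtering step described above, here is a minimal sketch that drops automation accounts before ranking authors by commit count. The author strings, the `BOT_AUTHORS` set and the `top_human_contributors` helper are assumptions made for this example; the actual analysis was run with source{d} EE and may identify bots differently.

```python
from collections import Counter

# Assumed author names for the automation accounts mentioned in the text.
BOT_AUTHORS = {"tensorflow-gardener", "Copybara"}


def top_human_contributors(commit_authors, bots=BOT_AUTHORS, limit=5):
    """Rank authors by number of commits, ignoring known bot accounts."""
    counts = Counter(author for author in commit_authors if author not in bots)
    return counts.most_common(limit)


# Hypothetical list with one author name per commit (illustrative only).
authors = [
    "tensorflow-gardener", "alice", "bob", "alice", "Copybara",
    "tensorflow-gardener", "alice", "bob", "carol",
]
ranking = top_human_contributors(authors, limit=2)
print(ranking)
```

Without the filter, the bot accounts would dominate the ranking and hide the individual contributors.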

The commits authored by tensorflow-gardener, which reflect internal contributions from Google, can also be plotted over time. We see that the heavy changes tend to happen during or right after releases.

Number of commits and size of changes over time authored by tensorflow-gardener.

Nonetheless, the evolution of commits and lines of code authored by tensorflow-gardener vs others indicates that the project is also attracting more and more external contributors.

With a range of 8,000 to 10,000 commits per month, the total number of commits remains quite high, a healthy sign of a very active open source project. Although that number looks high, we assume that Google has a lot of internal contributors who do not use GitHub directly and whose commits are merged into one, so these numbers could be even higher. The fact that the commit velocity continues to grow confirms that trend as well as the popularity of the project. It’s worth pointing out the big surge in the number of commits right before the TensorFlow 2.0 release.

If we look at the nature of these commits, we can see that contributions to the core tensorflow repository now represent just over half of the total, with most of the new contributions directed to side projects such as tfjs (a WebGL-accelerated JavaScript library for training and deploying ML models), the official documentation and the TensorBoard visualization toolkit. The evolution of commits reveals that the core project is reaching maturity while Google and the community are focusing on the overall user experience, both in terms of onboarding and actual deployments.

Number of commits over time for the top repositories in the organization, detailed

The following chart plots the correlation between the commit time series of each project, revealing how changes are coupled over time. Note that the brightness indicates the relative commit frequency. Commits to tensorflow, tfjs, models and docs often appear at roughly the same time. tfjs-layers, tfjs-converter and hub are also related. The plot also points to abandoned projects, e.g. fold, playground and embedding projector.

Aligned commit time series in the organization
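To make the idea of temporal coupling more concrete, the sketch below bins commit dates per month and computes the Pearson correlation between two repositories' series. It assumes Pearson correlation on monthly counts, which is not necessarily the measure behind the chart; the repository names and dates are hypothetical.

```python
from collections import Counter
from datetime import datetime
from math import sqrt


def monthly_counts(timestamps, months):
    """Count commits per calendar month, in the order given by `months`."""
    counts = Counter(ts.strftime("%Y-%m") for ts in timestamps)
    return [counts.get(month, 0) for month in months]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical commit dates for three repositories (illustrative only).
months = ["2019-01", "2019-02", "2019-03", "2019-04"]
commits = {
    "repo_a": [datetime(2019, 1, 5), datetime(2019, 1, 20), datetime(2019, 3, 2),
               datetime(2019, 3, 9), datetime(2019, 3, 15)],
    "repo_b": [datetime(2019, 1, 7), datetime(2019, 3, 3), datetime(2019, 3, 11)],
    "repo_c": [datetime(2019, 2, 1), datetime(2019, 2, 14), datetime(2019, 4, 1)],
}
series = {name: monthly_counts(ts, months) for name, ts in commits.items()}

# repo_a and repo_b are busy in the same months; repo_c is busy in the others.
print(pearson(series["repo_a"], series["repo_b"]))
print(pearson(series["repo_a"], series["repo_c"]))
```

A value close to 1 means two repositories tend to receive commits in the same periods, while a series that stays at zero for recent months would correspond to an abandoned project.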

Pull Request and Issue Activity

The chart below highlights the most active repositories based on the number of pull requests (PRs) over the past 4 years. Not surprisingly, most of the action happens on the tensorflow repository itself, with tensorflow/tfx (a Google-production-scale machine learning platform based on TensorFlow) and tensorflow/docs as the second and third most active repositories. The nature of these PRs and issues confirms the project's focus on user experience and enterprise adoption.

The table below highlights the top pending pull requests as of September 2019, including information about their age in days, the number of comments, as well as the number of lines of code modified. Excluding the Keras Resnet PR which is not intended to be merged, the oldest open PRs in the TensorFlow organization seem to be related to the swift-APIs and a community plea to make Tensorflow more modular by creating “more focused TensorFlow modules [that] can be created, maintained and released separately.”

The following chart confirms that the TensorFlow maintainers are super responsive, with an average “Time to Merge” of around 4 days and understandable productivity dips in late December and early January, when most contributors take time off for the holidays. Surprisingly, the upward trend seen in the summer of 2018 did not repeat over the summer of 2019, which suggests a clear sense of urgency in shipping the 1.14 release.
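For readers who want to reproduce a similar metric, here is a minimal sketch of an average “Time to Merge” computation, assuming it is the mean of the time between a pull request's creation and its merge. The `created_at` and `merged_at` field names and the sample data are assumptions for this example, and pull requests that were never merged are skipped.

```python
from datetime import datetime
from statistics import mean


def time_to_merge_days(created_at, merged_at):
    """Elapsed time between PR creation and merge, in days."""
    return (merged_at - created_at).total_seconds() / 86400


def average_time_to_merge(pull_requests):
    """Mean time to merge in days over merged PRs; None if nothing was merged."""
    durations = [
        time_to_merge_days(pr["created_at"], pr["merged_at"])
        for pr in pull_requests
        if pr["merged_at"] is not None
    ]
    return mean(durations) if durations else None


# Hypothetical pull requests (illustrative only).
prs = [
    {"created_at": datetime(2019, 6, 1), "merged_at": datetime(2019, 6, 3)},
    {"created_at": datetime(2019, 6, 10), "merged_at": datetime(2019, 6, 16)},
    {"created_at": datetime(2019, 6, 20), "merged_at": None},  # still open
]
print(average_time_to_merge(prs))
```

Grouping the pull requests by merge month before averaging would give a time series comparable to the one in the chart.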

The TensorFlow project throughput is another interesting metric to look at to further understand the velocity of an open source project. Throughput can be defined as the number of features added or bugs fixed within a given period. In this case, we’re measuring the number of closed GitHub issues. The following chart shows that the total number of closed issues has been growing consistently over the years, with major drops right before and after major releases such as TensorFlow 2.0 in September 2019.
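The throughput definition above can be sketched as a simple count of closed issues per calendar month. The `monthly_throughput` helper and the closing dates are hypothetical and only illustrate the bucketing; the chart itself was produced with source{d} EE.

```python
from collections import Counter
from datetime import date


def monthly_throughput(closed_dates):
    """Number of issues closed in each (year, month), in chronological order."""
    counts = Counter((d.year, d.month) for d in closed_dates)
    return dict(sorted(counts.items()))


# Hypothetical issue closing dates (illustrative only).
closed = [
    date(2019, 8, 3), date(2019, 8, 17), date(2019, 8, 29),
    date(2019, 9, 12),
    date(2019, 10, 1), date(2019, 10, 20),
]
throughput = monthly_throughput(closed)
print(throughput)
```

Months with no closed issues are simply absent from the result, so a plotting step would need to fill them with zeros.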

A big thank you to Edd Wilder-James and Martin Wicke for reviewing this analysis and providing some feedback.

Learn More: