Topic modeling is the machine learning subdomain devoted to extracting abstract “topics” from a collection of “documents”. Each “document” is represented by a bag-of-words model, that is, a set of occurring words and their frequencies. Since I am Russian, I was introduced to topic modeling through the awesome lectures by Dr. Vorontsov at Yandex’s School of Data Analysis (PDF). There exist different models for topic modeling, the most famous (but not the best) being Latent Dirichlet Allocation (LDA). Dr. Vorontsov managed to generalize all possible bag-of-words-based topic models into the Additive Regularization of Topic Models (ARTM) methodology. Thus LDA becomes a special case of ARTM. The really cool thing about ARTM is that Dr. Vorontsov’s PhD students developed the proof-of-concept software and it is open source: bigartm/bigartm [1, 2].

Normally, topic modeling is applied to ordinary text documents. We at source{d} work with huge amounts of data originating from GitHub repositories (and soon we’ll handle every open source repository in the world). A natural idea is to treat every repository as a bag-of-words and apply Dr. Vorontsov’s powerful engine for topic modeling. For example, we could extract names from the source code and consider them “words”. This article elaborates on this idea, presents the results, and explains how to reproduce them.
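To make the “repository as a bag-of-words” idea concrete, here is a minimal sketch of how such a bag could be built. It is my own illustration, not the actual source{d} pipeline: it only scans Python files, uses a naive identifier regex, and splits camelCase/snake_case names, whereas the real preprocessing is language-aware, applies stemming and is far more thorough.

import collections
import glob
import re

def bag_of_words(repo_path):
    # Split identifiers on camelCase / snake_case boundaries and count the lowercased sub-tokens
    # (the real pipeline also stems them).
    splitter = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")
    bag = collections.Counter()
    for source in glob.glob(repo_path + "/**/*.py", recursive=True):
        with open(source, encoding="utf-8", errors="ignore") as fin:
            for token in re.findall(r"[A-Za-z_][A-Za-z_0-9]*", fin.read()):
                bag.update(word.lower() for word in splitter.findall(token))
    return bag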

TL;DR:

docker run srcd/github_topics apache/spark

(replace apache/spark with any GitHub repository you wish to analyze).

OpenDocument spreadsheet with the extracted topics.

JSON with the extracted topics.

Trained model - 40M, gzipped pickle for Python 3.4+, pandas 0.18+.

Dataset on data.world.

After getting the ARTM result, I had to manually label the topics according to the key words and the related repositories. I set the number of topics to 200, but it turned out that I should have chosen a greater number, since the variety of topics on GitHub is really big.

I did the initial preprocessing on Dataproc and the final steps locally on a powerful workstation. The resulting sparse matrix was around 20 GB in size and had to be converted into the simple text-based Vowpal Wabbit format so that the BigARTM CLI tool could work with it (an illustration of that format follows the training log below). The dataset was crunched rather fast, in less than 2 hours on a single computer:

bigartm -c dataset_vowpal_wabbit.txt -t 200 -p 10 --threads 10 --write-model-readable bigartm.txt --regularizer "0.05 SparsePhi" "0.05 SparseTheta"
Parsing text collection... OK.  
Gathering dictionary from batches... OK.  
Initializing random model from dictionary... OK.  
Number of tokens in the model: 604989
================= Processing started.
Perplexity      = 586350
SparsityPhi     = 0.00214434
SparsityTheta   = 0.422496
================= Iteration 1 took 00:11:57.116
Perplexity      = 107901
SparsityPhi     = 0.00613982
SparsityTheta   = 0.552418
================= Iteration 2 took 00:12:03.001
Perplexity      = 60701.5
SparsityPhi     = 0.102947
SparsityTheta   = 0.768934
================= Iteration 3 took 00:11:55.172
Perplexity      = 20993.5
SparsityPhi     = 0.458439
SparsityTheta   = 0.902972
================= Iteration 4 took 00:11:56.804
...
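For reference, each line of the Vowpal Wabbit file passed via -c describes one document: a document name followed by space-separated token:count pairs. The repository lines below are made up purely for illustration, and the small writer snippet assumes a bags dictionary like the one produced by the earlier sketch:

apache_spark rdd:847 sparkcontext:213 dataframe:458 shuffle:190 executor:321
torvalds_linux kmalloc:1021 mutex:774 inode:689 irq:512

with open("dataset_vowpal_wabbit.txt", "w") as fout:
    for repo, bag in bags.items():  # bags: {"owner/name": Counter}
        pairs = " ".join("%s:%d" % (word, count) for word, count in bag.items())
        # Document names must not contain spaces; "/" is replaced as a precaution.
        fout.write("%s %s\n" % (repo.replace("/", "_"), pairs))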

-p sets the number of iterations. I wasn’t sure which regularizers I should use, so I activated only the “sparsity” ones. I hope the project will improve the documentation and add examples about them. It is important to note that the amount of RAM needed for the operation was rather low, no more than 30 gigs, and this is a huge success if you come from a gensim and, oh god please no, sklearn background.
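The same training can also be driven through BigARTM’s Python API; a rough sketch is below. The way the CLI regularizer coefficients map to tau values (including their signs) is my assumption, so treat it as a starting point rather than an exact reproduction of the command above.

import artm

# Convert the Vowpal Wabbit file into BigARTM batches (done once).
bv = artm.BatchVectorizer(data_path="dataset_vowpal_wabbit.txt",
                          data_format="vowpal_wabbit",
                          target_folder="batches")
model = artm.ARTM(num_topics=200, dictionary=bv.dictionary, num_processors=10)
model.scores.add(artm.PerplexityScore(name="perplexity", dictionary=bv.dictionary))
# Negative tau sparsifies; assumed to correspond to "0.05 SparsePhi" / "0.05 SparseTheta".
model.regularizers.add(artm.SmoothSparsePhiRegularizer(name="SparsePhi", tau=-0.05))
model.regularizers.add(artm.SmoothSparseThetaRegularizer(name="SparseTheta", tau=-0.05))
model.fit_offline(batch_vectorizer=bv, num_collection_passes=10)  # -p 10
print(model.score_tracker["perplexity"].last_value)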

Topics

The resulting 200 topics can be divided into the following groups:

  • Concepts - general, broad and abstract.
  • Human languages - it appeared that one can determine a programmer’s approximate native language by looking at their code, thanks to the stem bias.
  • Programming languages - not so exciting, since this is information we already have. Programming languages usually have a standard library of classes and functions which is imported/included into most programs, and the corresponding names are revealed by our topic modeling. Some topics are narrower than a programming language.
  • General IT - the topics which could have appeared in Concepts if they had an expressive list of key words, but do not. The repositories are associated through a unique set of names in the code without any special meaning.
  • Communities - devoted to some specific, potentially narrow technology or product.
  • Games - video games.
  • Bullshit - 2 topics whose meaning I really cannot figure out.

Concepts

The most interesting group in my opinion is Concepts, since we mined a lot of useful information from it:

  1. Pizza contains cheese and there are GitHub repositories about it.
  2. Mathematics, linear algebra, cryptography, machine learning, digital signal processing, genomics, particle physics terminology.
  3. Weekdays. Monday, Tuesday and so on.
  4. All possible facts and characters in RPG and other fantasy games.
  5. IRC has nicknames.
  6. A lot of design patterns (thanks to Java and PHP for them).
  7. Colors. Including some exotic ones.
  8. Emails have CC, BCC and are sent through SMTP and received through POP/IMAP.
  9. How to make a good datetime picker. Seems to be a pretty common GitHub project, eh.
  10. People work for money and spend it on houses and driving (apparently, from their houses to work and back).
  11. All sorts of computer hardware.
  12. Exhaustive list of HTTP, SSL, Internet, Bluetooth and WiFi terms.
  13. Everything you want to know about memory management.
  14. What to google if you want to create a custom Android firmware.
  15. Barcodes. Plenty of different types.
  16. Humans. There are men and women, they live and they have sex.
  17. Good list of text editors.
  18. Weather. Tons of common words.
  19. Free licenses. This was not supposed to happen because source code names and licenses do not intersect. From my experience with Pygments, some languages are supported much worse than the others and I guess some of the texts were recognized as sequences of names.
  20. Commerce. Shops have discounts and sell goods to customers.
  21. Bitcoins and blockchain.

Human languages

Human languages contain Spanish, Portuguese, French and Chinese.

Programming languages

An interesting find in Programming languages is the topic “Non-native English PHP”, which is associated with PHP projects written by non-native English speakers (or non-speakers at all). Besides, now we know that Java has JNI and bytecode.

General IT

General IT is not so fun. There are a lot of OS kernel repositories which are large and make the topics noisy, but we’ve still got some worth mentioning:

  • Comprehensive information about drones. They use Linux.
  • There are a lot of Ruby implementations. Some are hard forks (that is, not marked as a fork by GitHub) and some are extreme forks (that is, an uploaded codebase without any history).
  • onmouseup, onmousedown and onmousemove - these are the three giants holding the UI.
  • Plenty of Javascript buzz words and technologies.
  • Online education platforms. Especially Moodle. Tons, tons of Moodle.
  • All sorts of CMS.
  • Coursera Machine Learning gives a great list of repositories with homework for Coursera’s popular ML courses.

Communities

Communities is the largest group, making up almost half of all the topics. We’ve got all the web engines written in Python, Ruby, PHP, Java, Javascript, etc. A lot of repositories are actually private storages of text editor configurations, namely Vim and Emacs. Since Vim got only one topic while Emacs got two, I hope that this will be the ultimate end of the holy war.

The PHP engines which were discovered are WordPress, Joomla, Yii, VTiger, Drupal, Zend, Cake and Symfony with Doctrine. Python: Django, Flask, Google AppEngine. Ruby: Rails and only Rails. Raaails. All Java web engines collapsed into a single topic. And of course there is a topic belonging to Node.js.

Many projects appeared to use Tesseract, which is an open source OCR engine. Besides, many seem to use Caffe.

The Quake 3 / idTech 3 engine is so popular that it formed a separate topic. Unity3D has two, one being the umbrella for numerous student homework assignments and pet projects. Cocos2D seems popular too and has 2 topics as well. Finally, I got 3 OpenGL topics + WebGL. I guess the difference is in the way the API is used and in the satellite libraries.

It’s no surprise that the Chef configuration management tool appeared in the “cooking” topic with its recipes, kitchen, etc. However, I was surprised to see the dual topic about WinAPI and Pokemons. I think this is because stemming made the typical names from WinAPI look like Pokemon names…

Games

There is a topic devoted to SDL and another to Minecraft and RPG.

Released swagger

I’ve prepared a Docker image so that everybody can apply the trained topic model to any GitHub repository. Just execute

docker run srcd/github_topics apache/spark

and see the top 5 topics. The serialized matrix of topics and terms is inside; you can also download it separately: link. The format is pickle version 4 containing a tuple of length 2: the first element is a pandas 0.18+ SparseDataFrame object and the second is the IDF array. Besides, there is the OpenDocument spreadsheet with the topics and the JSON file with the topics.
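If you would rather explore the matrix yourself, loading it is straightforward. The file name below is hypothetical, and treating rows as topics and columns as terms is my assumption about the matrix orientation:

import gzip
import pickle

with gzip.open("github_topics.pickle.gz", "rb") as fin:  # hypothetical file name
    topics_terms, idf = pickle.load(fin)  # (SparseDataFrame of topics x terms, IDF array)

dense = topics_terms.to_dense()
for topic in dense.index[:5]:
    # Print the 10 highest-weighted terms for each of the first five topics.
    print(topic, dense.loc[topic].nlargest(10).index.tolist())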

Conclusions

As already noted before, 200 topics are too few, since many appeared to be dual or not expressed well. If I had set the number to 500 or 1000, I would have gotten better results, but I would never have completed the manual labelling. Understanding the numerous PHP ecosystems is a challenge if you are not in the topic (he-he). Nevertheless, we’ve still got something fun. Most of the communities formed a separate topic because they managed to develop a unique vocabulary for the related narratives. An exciting discovery is that we could extract general topics about humans, nature, sciences and even design patterns.

The next step would be to mine the readme files and other textual data contained in the repositories. They are expected to enrich the “Concepts” group.

This post was written by Vadim Markovtsev. Follow him on Twitter: @vadimlearning.