On June 3, 2017, source{d} dedicated its regular tech talks to Machine Learning, and we chose to host the event in Moscow, Russia. For this conference, we invited speakers from Russia and abroad and gathered about 80 neural network aficionados in a former industrial area of the city.

day’s programme

To begin with, everybody gathered for a hearty welcome breakfast at the KL10CH space in the city center.

Then, once everybody had slowly woken up, it was time for our CEO, Eiso Kant, to launch the main talk series. Each main talk lasted 45 minutes, with time for Q&A. In addition, two 15-minute lightning talks took place between the main ones to address smaller, more specific topics.

main talks

Statistical Analysis of Computer Program Text, Charles Sutton

“Source code is a means of human communication.” With this opening statement, Charles Sutton, professor at the University of Edinburgh, could not have started the day better. He then laid out his statistical approach to analyzing source code. To extract from scripts what he called implicit knowledge, he introduced three innovative software engineering tools inspired by machine learning and natural language processing (NLP) techniques:

  • Naturalize, a probabilistic language model for source code that learns local coding conventions. It suggests renaming or reformatting changes so that your code becomes more consistent.
  • HAGGIS, a system for mining code idioms that learns locally recurring syntactic patterns, called idioms, using a nonparametric Bayesian tree substitution grammar (TSG).
  • Probabilistic API Miner (PAM), a nearly parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences. It resolves fundamental statistical pathologies such as the formation of redundant or spurious sequences.
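All three tools rest on the same premise: code written by a community is repetitive enough that a language model can learn its conventions. As a minimal sketch of that premise (not Sutton's actual models, which are far more sophisticated), a bigram model over code tokens can already suggest the most conventional continuation:

```python
from collections import Counter, defaultdict

def train_bigram_model(token_lists):
    """Count, for each token, the frequency of the tokens that follow it."""
    counts = defaultdict(Counter)
    for tokens in token_lists:
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def suggest(counts, prev_token, k=3):
    """Return the k most conventional tokens to follow prev_token."""
    return [tok for tok, _ in counts[prev_token].most_common(k)]

# A toy "corpus" of tokenized Python snippets.
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "j", "in", "range", "(", "m", ")", ":"],
    ["for", "i", "in", "items", ":"],
]
model = train_bigram_model(corpus)
print(suggest(model, "for"))  # "i" ranks first: it is the local convention
```

A real convention-inference tool scores whole renamings or reformattings against such a model rather than single next tokens, but the statistical core is the same.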

Similarity of GitHub repositories by source code identifiers, Vadim Markovtsev

Vadim, our machine learning lead, went va banque and disclosed all the recent work he had done. The talk was a teaser for source{d}'s upcoming ML Python stack: he presented the technical details of how it is possible to find similar GitHub repositories by their contents. In particular, Vadim found a way to embed source code identifiers (previously used in topic modeling, see the paper) very similar to word2vec. Those embeddings can be trained at scale using Swivel, a better alternative to GloVe, and src-d/swivel-spark-prep. Finally, similar repositories are searched for using src-d/wmd-relax, an optimized calculator of Word Mover’s Distance.
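To give a flavor of the approach, here is a toy sketch with hypothetical two-dimensional vectors — not the actual Swivel embeddings, and using simple mean pooling with cosine similarity instead of the full Word Mover's Distance computation:

```python
import numpy as np

# Hypothetical pre-trained identifier embeddings; in the real stack these
# would be high-dimensional vectors trained at scale with Swivel.
embeddings = {
    "request":  np.array([1.0, 0.0]),
    "response": np.array([0.9, 0.1]),
    "tensor":   np.array([0.0, 1.0]),
    "gradient": np.array([0.1, 0.9]),
}

def repo_vector(identifiers):
    """Represent a repository as the mean of its identifier embeddings."""
    return np.mean([embeddings[i] for i in identifiers], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

http_repo  = repo_vector(["request", "response"])
ml_repo    = repo_vector(["tensor", "gradient"])
other_http = repo_vector(["request"])

# The HTTP-flavored repositories end up closer to each other than to the
# ML-flavored one.
print(cosine(http_repo, other_http) > cosine(http_repo, ml_repo))  # True
```

Word Mover's Distance refines this by matching individual identifiers between the two repositories instead of collapsing each repository to a single averaged vector.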

Probabilistic Programming for Mere Mortals, Vitaly Khudobakhshov

In his talk, Vitaly presented a review of an emerging topic at the juncture between cognitive sciences and Artificial General Intelligence (AGI). It was the great controversy over which language is the most efficient for solving a particular problem that raised Vitaly’s interest in Probabilistic Programming (PP). Put simply, a Probabilistic Programming Language (PPL) is an ordinary programming language considered as a set of tools to help us understand a program’s statistical behavior. This field of research has been particularly useful in designing programs like cognitive architectures, which use a wide range of programming techniques, or in smaller problems like pattern matching and knowledge representation. Vitaly believed that PP with partial evaluation might be effectively applied to AGI problems.

Although PPL programs are close to ordinary software implementations, whose goal is to run the program and get some kind of output, the goal of PP is analysis rather than execution. The main obstacle to using PP on large problems is the efficient implementation of inference. Techniques like genetic programming and simulated annealing have yielded good results here.

Finally, as an example of a satisfying PPL, Vitaly gave us insights into Church, a derivative of the programming language Scheme with probabilistic semantics, whose syntax is simple and extensible.
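Church itself is Scheme-based, but the core idea — treating a program's runs as samples from a distribution, conditioning on observations, and analyzing what survives — can be sketched in Python with rejection sampling, the simplest inference method (heavier machinery like simulated annealing is what makes larger problems tractable):

```python
import random

def flip(p=0.5):
    """A primitive random choice: the basic building block of a PPL."""
    return random.random() < p

def model():
    """A tiny generative program: two independent fair coin flips."""
    a = flip()
    b = flip()
    return a, b

def infer(n=100_000):
    """Rejection sampling: run the program many times, keep only the runs
    consistent with the observation, and inspect the surviving samples."""
    accepted = []
    for _ in range(n):
        a, b = model()
        if a or b:               # observation: at least one flip was heads
            accepted.append(a)   # query: was the first flip heads?
    return sum(accepted) / len(accepted)

random.seed(0)
estimate = infer()
# Analytically, P(a | a or b) = 2/3; the estimate converges to that value.
```

The point of the sketch is the inversion of roles: the program is never "run for its output" — it is run thousands of times so its statistical behavior can be analyzed.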

Sequence Learning and modern RNNs, Grigory Sapunov

Grigory started his talk with a brief but far from superfluous introduction to RNNs, LSTMs and GRUs, along with their bidirectional and n-directional generalizations. Next, Grigory presented two interesting LSTM generalizations: Tree-LSTM and Grid LSTM. The first, tree-structured model outperforms previous systems at predicting the semantic relatedness of two sentences and at sentiment classification, while the second, a grid of LSTM cells, provides a unified way of using LSTM for both deep and sequential computation.

Relying on these preliminary notions, he tackled issues in representation learning. The first idea was to find a model that pays attention to word ordering, unlike word2vec, which is based on the “bag of words” model. Secondly, he showed us how to match different modalities simultaneously thanks to multi-modal learning, illustrated with striking examples.
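The limitation of bag-of-words representations that motivates order-aware models is easy to demonstrate: two sentences with different meanings can become indistinguishable once word order is discarded.

```python
from collections import Counter

def bag_of_words(sentence):
    """Reduce a sentence to its word counts, discarding order."""
    return Counter(sentence.lower().split())

s1 = "dog bites man"
s2 = "man bites dog"
print(bag_of_words(s1) == bag_of_words(s2))  # True: the order is lost
```

Sequence models such as RNNs avoid this collapse because they consume tokens one at a time, so the representation depends on the order in which words arrive.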

In the last part, Grigory covered the Connectionist Temporal Classification (CTC) technique, as well as the Encoder-Decoder architecture used to train sequence-to-sequence neural network models.
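CTC training involves a dynamic-programming loss, but its central trick — the many-to-one mapping that collapses frame-level paths into label sequences — fits in a few lines. This is a sketch of the collapse rule only, not of the training objective:

```python
def ctc_collapse(path, blank="-"):
    """Apply CTC's many-to-one mapping: merge repeated labels, drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)

print(ctc_collapse("hh-e-ll-lo"))  # -> "hello"
```

The blank symbol is what lets the model emit the same label twice in a row (as in "ll" above): a blank between two identical labels prevents them from being merged.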

Neural Complete project, Pascal Van Kooten

Pascal ended our “AI on code” day with a look at auto-completion. He shared with us his project, Neural Complete, which aims at completing our source code with not only word but whole-line suggestions.

This tool, based on a generative LSTM neural network, is trained on Python code to generate Python code. The main result is thus a neural network trained to help write neural network code. Finally, after giving us a demonstration of how it works, he invited people to train the model on their own code so that its suggestions become more relevant.
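Neural Complete trains a generative LSTM; as a stand-in small enough to show here, a hypothetical character-level Markov model illustrates the same loop of training on code and then extending a prompt character by character:

```python
from collections import Counter, defaultdict

# A toy "training set": the real tool trains on a large corpus of Python.
training_code = "def forward(self, x):\n    return self.layer(x)\n" * 3

def train(text, order=4):
    """Count which character follows each `order`-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def complete(model, prompt, length=20, order=4):
    """Greedily extend the prompt one character at a time."""
    out = prompt
    for _ in range(length):
        options = model.get(out[-order:])
        if not options:
            break
        out += options.most_common(1)[0][0]
    return out

print(complete(train(training_code), "def forw"))  # completes "def forward(..."
```

An LSTM replaces the fixed-size character context with a learned hidden state, which is what makes whole-line suggestions across long contexts feasible.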

lightning talks

Embedding the GitHub contribution graph, Egor Bulychev

Egor is a senior ML engineer at source{d}. He disclosed an unusual approach to embedding GitHub social graph nodes, compared it to node2vec and applied it to finding similar GitHub repositories. Since the nature of the similarity is completely different from Vadim’s content analysis, the examples showed alternative results. One of Egor’s funniest findings was evidence that system administrators like to drink beer more than coders do, and tend to contribute to repositories related to beer.

Hercules and His Labours, Vadim Markovtsev

Vadim went on stage for the second time and demonstrated the supremacy of src-d/hercules, a super fast command line tool to mine the development history of Git repositories. Hercules uses src-d/go-git, our advanced and nearly feature-complete Git client and server implementation in pure Go. Provided that the whole repository is stored in memory, and thanks to an original incremental blame algorithm, Hercules processed the whole Linux kernel repository in just two hours. We encourage everybody to try Hercules on their own projects!


At the end of the talks, we spent a pleasant time eating and drinking beer together. It was time to share our impressions of the day. The speakers were also available to expand on their talks and answer more questions.

The Moscow source{d} tech talks ended here. Now the team is already preparing our next Frontend talks in Madrid on the 24th of June 2017. You can get your free tickets on Eventbrite.


source{d} would like to thank the speakers and the attendees for sharing our passion for Machine Learning on Code and for their kind feedback on our post-event survey. If you are interested in any of our projects, do not hesitate to join our source{d} community slack. You can also take a look at our job opportunities; source{d} is always looking for new talent.

To conclude on a more personal side, I want to sincerely express my gratitude to all people at source{d} who made a contribution of any kind in the success of this event in such a beautiful city.

This post was written by Waren Long. Follow him on Twitter: @warenlg.