This article is the first episode of our MSR Interview blog series. This week, we’re publishing an interview with Abram Hindle, a professor in the Computing Science department at the University of Alberta, Edmonton, AB, and a member of the program committee for the MSR’18 technical papers. Thanks to Waren Long, Vadim Markovtsev, and Francesc Campoy for conducting the interview.
Check out some of the publications on mining repositories that Abram has authored over the last few years:
- On the naturalness of software, ICSE’12
- GreenMiner: A hardware based mining software repositories software energy consumption framework, MSR’14
- Analyzing The Effects of Test Driven Development In GitHub, EMSE, 2018
Can you please introduce yourself and tell us about your relationship with MSR? How many times have you attended?
My name is Abram Hindle, and I am an associate professor in Computing Science at the University of Alberta. I have been attending MSR for a very long time: I have been Program Chair, I have run the challenges a couple of times, and I am on the Steering Committee. I am typically involved in MSR, whether through publishing or service. I would say I have attended MSR at least 12 times.
You wrote a paper about code naturalness. How did you come up with this idea? What was your inspiration?
That was related to the work of Prem Devanbu. I started by using Markov models and n-gram models to generate speech, basically training an AI to speak absolute nonsense. I took text from Kant and text from primitivism and created something that generated slightly coherent but nonsensical sentences.
Those sentences look right from a distance, but upon reading, it is clear that they are nonsense. I had that kind of NLP experience before, along with some experience with Latent Dirichlet Allocation (LDA). Prem saw that I had the required experience and knowledge. I would attribute the idea of doing NLP on source code mostly to him, but I think we collaboratively came up with the idea of n-grams on source code.
I was definitely thinking about n-grams on the source code; what Prem definitely added was that we should use smoothing functions. Without smoothing, the calculations were unusable because of all the unknowns. I had done a lot of n-gram and Naive Bayes models in the past, and in those models you have to deal with the unknowns well. Working together we both had ideas, but I will attribute more to him.
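To illustrate the smoothing point, here is a minimal sketch of a bigram model over code tokens with add-one (Laplace) smoothing; the token stream is invented for illustration. Without smoothing, any unseen bigram would get probability zero and wreck the whole calculation.

```python
# Minimal sketch: a bigram model over code tokens with add-one (Laplace)
# smoothing, so unseen continuations get a small nonzero probability
# instead of zero. The token stream is made up for illustration.
from collections import Counter

tokens = ["if", "(", "x", ")", "{", "return", "x", ";", "}"]
vocab = set(tokens) | {"<UNK>"}          # reserve a slot for unknown tokens
V = len(vocab)

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def prob(prev, tok):
    """P(tok | prev) with add-one smoothing over the vocabulary."""
    return (bigrams[(prev, tok)] + 1) / (unigrams[prev] + V)

p_seen = prob("return", "x")      # bigram observed in the stream: 2/10
p_unseen = prob("return", "y")    # never observed, but still nonzero: 1/10
```

An unsmoothed model would assign `p_unseen = 0`, and any sequence containing that bigram would score zero overall; smoothing keeps every sequence scorable.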
Do you think those smoothing functions are needed if you have a moderate or a small amount of data?
Supposedly, if you get a big enough dataset, you do not need to smooth it. Supposedly, in English and other natural languages, you do not need smoothing: you get a limited number of terms. There is a fundamental issue with software. Let us take two projects with approximately 80,000 unique terms each. The intersection of those two projects will have 40,000 terms, which leaves 40,000 unique terms on each side. If you do not smooth those terms, you will have massive problems.
Those do not need to be rare words; even after filtering out terms that appear in fewer than 10 documents, we still have lots of identifiers. You can try stemming and splitting them, but you are still going to get a big vocabulary that is unique. From that, we can see that every project has its own language, and that is a real, fundamental issue.
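The vocabulary-overlap problem described here can be made concrete with a toy sketch: two projects' identifier sets overlap only partially, so a model trained on one project faces a large out-of-vocabulary (OOV) rate on the other. The identifiers below are invented.

```python
# Toy sketch of the cross-project vocabulary problem: identifiers shared by
# two projects are only a fraction of each project's vocabulary, so a model
# trained on project A sees many unknown terms on project B.
project_a = {"parse_config", "retry_count", "flush_cache", "log", "main"}
project_b = {"render_frame", "vertex_buffer", "flush_cache", "log", "main"}

shared = project_a & project_b
oov_for_b = project_b - project_a          # unseen when trained on project A

oov_rate = len(oov_for_b) / len(project_b)
print(f"shared={len(shared)}, OOV rate on B: {oov_rate:.0%}")  # 40% here
```

At the scale in the interview (80,000 terms per project, 40,000 shared), the same arithmetic gives a 50% OOV rate, which is why unsmoothed models break down.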
If you want to apply classical NLP techniques or modern deep learning techniques to software, you are going to run into a vocabulary problem. It is especially clear if you want to train a network and then deploy it on other projects.
Actually, in our recent research, we found 1 million unique identifiers before getting to the long tail of the distribution. The tail of the distribution is still important because it is local to a couple of files and will be relevant to that topic and to those files. Once you leave those files it does not matter, but if you are doing code completion or static analysis it is still relevant. I think smoothing is required, but you get it for free with deep learning and neural nets.
You mentioned deep learning previously. Have you tried training any deep learning models on source code?
My student Eddie Antonio Santos built on Joshua Campbell’s results showing that n-gram models can pick out syntax errors: if you train a model on good source code, the n-gram model will flag syntax errors as unlikely (Syntax and Sensibility: Using language models to detect and correct syntax errors).
The same can be done with LSTM neural networks. Eddie used LSTM networks to detect those syntax errors and suggest fixes. Instead of searching a giant tree with 3 different edit modes (insert, substitute, and delete) over an 80,000-term vocabulary, he collapsed the vocabulary to around 100 terms. His model would reduce the search space to around 15 suggestions, which is very impressive. Most search-based software engineering does hundreds or thousands of tries, lots of evaluations.
The language model, on the other hand, can point you in the right direction very quickly. His paper is impressive in my opinion. You can also extend this model so that when it suggests an identifier, another model suggests a specific identifier. You can even combine it with an easy-to-update, low-resource n-gram model for suggesting the identifier itself. Getting embeddings updated is much more costly, and there are also problems with things unseen by the model.
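A simplified sketch of the detection idea: the published work uses LSTMs, but a plain bigram model (trained on good code, as in the n-gram result mentioned above) is enough to show the mechanism. Score each transition in a buggy token stream and flag the least likely one as the probable error location. The token streams are invented.

```python
# Simplified sketch: locate a likely syntax error by finding the least
# probable token transition under a language model trained on good code.
# (The actual paper uses LSTMs; a smoothed bigram model stands in here.)
from collections import Counter

good_code = ["if", "(", "x", ")", "{", "x", "=", "1", ";", "}"]
bigrams = Counter(zip(good_code, good_code[1:]))
unigrams = Counter(good_code)
V = len(set(good_code)) + 1                      # +1 slot for unknowns

def prob(prev, tok):
    # Add-one smoothing keeps unseen transitions scorable.
    return (bigrams[(prev, tok)] + 1) / (unigrams[prev] + V)

# A buggy stream: ')' is missing, so '{' follows 'x' unexpectedly.
buggy = ["if", "(", "x", "{", "x", "=", "1", ";", "}"]
scores = [prob(p, t) for p, t in zip(buggy, buggy[1:])]
suspect = min(range(len(scores)), key=scores.__getitem__)
print("least likely transition:", buggy[suspect], "->", buggy[suspect + 1])
```

The lowest-scoring transition is `x -> {`, exactly where the `)` was dropped; a repair model then only has to search fixes around that position.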
Talking about the applications, you already mentioned that we can use deep learning for syntax error detection. Can you think of any other applications of deep learning?
There is source code translation: we tried using statistical machine translation to translate between Python 2 and Python 3, and we got BLEU scores of 0.99, which are only achievable for things like English-to-English translation. There are some slight differences between those two versions, e.g. print statements. It is interesting that you can pick up on those slight differences with statistical and deep learning models. There is also program repair, program editing, and code completion.
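A back-of-the-envelope illustration of why Python 2 to Python 3 translation scores near-perfect BLEU: the two dialects share most of their tokens, so n-gram precision is very high even before any translation happens. This is a toy modified-unigram-precision calculation with invented token streams, not a full BLEU implementation and not their actual experiment.

```python
# Toy modified unigram precision (one ingredient of BLEU) between an
# untranslated Python 2 line pair and its Python 3 reference. Most tokens
# already match, which is why translation BLEU scores end up so high.
from collections import Counter

reference = ["print", "(", "x", ")", "\n", "y", "=", "x", "//", "2"]  # Py3
candidate = ["print", "x", "\n", "y", "=", "x", "/", "2"]             # Py2

ref_counts = Counter(reference)
cand_counts = Counter(candidate)
clipped = sum(min(c, ref_counts[t]) for t, c in cand_counts.items())
precision = clipped / len(candidate)
print(f"unigram precision: {precision:.2f}")   # 0.88 before translating at all
```

Only the `print` parentheses and the integer-division operator differ, so even the identity "translation" scores 0.88; a model only needs to learn those few systematic edits to approach 0.99.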
All the problems that already exist and are addressed with NLP models can be approached with deep learning models as well. People are already using CNNs to pick up source code from videos, and there are source code readability models, but no one is using CNNs to compute source code readability. I think that rendering source code and applying a CNN on top of it might give a good estimate of readability.
All the analysis you have done was at the token level. Have you done anything with trees or graphs?
We have not worked on trees and graphs. What we have found is that you need to be smart about tokens and tokenizing. We do a lot of information retrieval, and the secret to being successful is good tokenization. In the information retrieval realm, deep learning is not very strong. In this realm you use deep learning by taking something simple like tf-idf to get a hundred candidates and then using a learner to rerank those candidates.
There is also the learning-to-rank approach, but typically you do not apply learning to rank to more than a hundred entities. We care a lot about the information retrieval aspect because we do things like stack trace deduplication and stack trace clustering. We also have a product that we deploy to companies that can handle around 2 million crashes a day; it is an online clustering algorithm. If I wanted to use deep learning representations, I would probably use them on top of the results. It depends on what you want to do, but the representation aspect of deep learning is very interesting.
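The retrieve-then-rerank pattern described here can be sketched in a few lines: a cheap lexical score (plain term overlap standing in for tf-idf) narrows a corpus down to a handful of candidates, and only those reach the expensive "learned" reranker. The corpus and the reranking heuristic are invented for illustration.

```python
# Sketch of retrieve-then-rerank: cheap lexical retrieval narrows the
# corpus, an expensive model reranks only the survivors.
def cheap_score(query, doc):
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)           # Jaccard overlap as a tf-idf stand-in

def expensive_rerank(query, doc):
    # Placeholder for a learned ranker; here it just rewards word order.
    return sum(1 for a, b in zip(query.split(), doc.split()) if a == b)

corpus = [
    "null pointer dereference in parser",
    "segfault in json parser on null input",
    "timeout waiting for database lock",
    "null pointer dereference in json parser",
]
query = "null pointer dereference in json parser"

# Stage 1: keep only the top few candidates by the cheap score.
candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:3]
# Stage 2: rerank just those candidates with the expensive model.
best = max(candidates, key=lambda d: expensive_rerank(query, d))
print(best)
```

The key design point is that the reranker only ever sees a hundred or so candidates, so it can afford to be arbitrarily expensive per pair.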
The other thing you can do is use generative approaches to make things up. You need to remember that it is not enough to generate sequences; they must seem natural. If you do not have enough data, you can use bootstrapping and sampling. We do that for audio recognition tasks like recognizing a piano or a guitar: we synthesize songs and audio, put them into the training loop, and it trains well.
How would you reliably generate code?
It depends on what you want to generate code for. Asking a code completer to finish code, getting or modifying regions of code, asking a deep learning model what a good or a bad mutation is: those things should be doable. You might not be able to generate a ton, but you should be able to generate something.
If you have a bag of words, you might represent code as a feature vector, and you could generate such vectors. That might be useful beyond deep learning. We have problems in terms of stress testing crash report systems, so we synthesize crash reports. This is how we show that we can do clustering on 2 million crash reports a day; we do not generally have access to that many real crash reports. It is mainly for validation, but you can still train on it and have confidence in what you are getting.
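The crash-report synthesis idea can be sketched very simply: sample plausible stack traces from a pool of frames to stress-test a clustering pipeline at volumes you could never collect for real. The frame names and the generation scheme below are invented for illustration; a real generator would model frame co-occurrence rather than sample uniformly.

```python
# Toy sketch: synthesize stack traces for stress testing a crash-report
# pipeline. Frame names and the sampling scheme are invented.
import random

frames = ["main", "parse_args", "load_config", "read_file", "json_decode"]

def synth_trace(rng, depth=4):
    """Sample a stack trace of `depth` frames (uniform, for illustration)."""
    return [rng.choice(frames) for _ in range(depth)]

rng = random.Random(0)                 # seeded so runs are reproducible
reports = [synth_trace(rng) for _ in range(1000)]
print(len(reports), "synthetic crash reports generated")
```

Because the generator controls the ground truth, you can check that the clustering recovers the structure you put in, which is the validation use described above.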
How do you think machine learning on source code will change the way we write code in 5 or 10 years?
I think it will help with exposing APIs from libraries. I also expect better, more concise representations that I can send through a net to get an analysis. I think we will also see tools that improve the quality of the source code, that improve the style. If you take a look at the Scratch ecosystem, where you can drag and drop blocks, everything there is syntactically correct.
If we start applying those rules to general source code, tools are going to start saying, “What you are doing is not quite right. I know what you are trying to achieve, but I am not going to accept the commit until I feel it is going to work.” In a lot of dynamic languages, you want to ask whether the code you wrote is valid. One of the benefits of a language model is that you can ask whether the code is like code the model has previously seen. I think that we will see much more AI in the space of dynamic languages.
What was your favorite talk at MSR 2018?
I think it was the one that got the award, the one about basic antipatterns like not putting curly braces around if statements; I really liked that one. One of the reasons is that I was recently in an argument with one of my students about code quality, and now I have a dataset proving my case. What I liked about it is that it shows that needlessly complex syntax is confusing and causes problems. There is value in having octal code, but prefer having a macro to emphasize that it is octal.
Learn More about MSR 2019 and MLonCode: