This article is the second episode of our MSR Interview blog series. In case you missed it, check out the interview with Abram Hindle. This week, we’re publishing the interview with Georgios Gousios, an assistant professor of software engineering in the Software Engineering Research Group at TU Delft. Thanks to Waren Long, Vadim Markovtsev and Francesc Campoy for conducting the interview.

Georgios Gousios

Below you’ll find the links to Georgios’ publications on the topics of Machine Learning on Code:

Could you please introduce yourself?

I’m Georgios Gousios, I’m an assistant professor at TU Delft, in the Netherlands. The general thing I’m working on is Software Analytics (SA). I’m leading the SA lab which is part of the Software Engineering (SE) group in Delft, and what we’re trying to do here is basically what the lab title says, that is extract value from data that is being generated while programmers are working or while programs are running. Thus, both static and dynamic analytics.

What is your involvement with MSR?

I have been following MSR since 2008, so for 10 years already. It’s my community; I did my Ph.D. on SA data and data processing. This conference is like my home, it’s close to what I am doing. I have had various roles at MSR throughout the years, and this year I’ve been Data Showcase co-chair. In 2020 I’m going to be Program Committee chair at MSR, thus organizing the program, reviews and so on.

Even though MSR is a pretty niche conference, there are a lot of topics that are covered, do you have a favorite one?

It sounds pretty niche but actually it is not. If you were at the keynote today at ICSE, you might have noticed the organizers said that Empirical SE and mining software repositories are topics number 1 and 3 in the number of submissions they got, so it is not that niche. As for the topics tackled at MSR, I would pick a couple. These days I’m into mining data in real time. A lot of data is being generated by tools like CI, GitHub, developer IDEs and so on, and the goal is to get this information as streams and use a stream processor to join them, filter them, reshape them, and provide them back in a form that is more digestible, but in real time. That’s really interesting to me. The other thing is dependencies. We are doing lots of work on fine-grained dependency analysis, using MSR techniques: we download repositories from GitHub, extract information from the dependency configuration files and then try to build code graphs to understand those.

How are you doing that? Are you building software from scratch to analyze all of this?

Mostly yes, in fact, the main problem I see at MSR is that we don’t have infrastructures. I tried to create one when I was doing my Ph.D. but it was rather naive at that time, so it did not catch on. The thing that binds us together as a community is data, not so much certain coding infrastructures. We could improve a lot on that, I think, but it’s not easy to write one tool that everybody would like. It’s extremely difficult actually.

What kind of tools could help the whole community, and not only specific use cases?

The kind of stuff we are lacking would be, for example, a tool that would download all of GitHub somewhere and then, using some kind of platform, be it Spark or a custom platform, would allow you to process specific parts of the data. For example, in my dependency analysis tool it would be very interesting, if I had all of GitHub, to filter it to keep only Python packages that have a dependency file, and from those files actually extract the contents and just keep those. This is what I care about, I don’t care about downloading. Curating the dataset is also painful. This is why I was trying to use the source{d} engine.
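To make the filtering step described above concrete, here is a minimal sketch. It assumes the repositories have already been cloned into one local directory, and the list of dependency file names is an illustrative assumption, not something taken from the interview. The function name `python_dependency_files` is hypothetical.

```python
from pathlib import Path

# Assumed, illustrative list of file names that declare Python dependencies.
DEPENDENCY_FILES = ("requirements.txt", "setup.py", "pyproject.toml", "Pipfile")


def python_dependency_files(repos_root):
    """Yield (repo_name, file_name, contents) for every repository directly
    under repos_root that has a known dependency file at its top level."""
    for repo in sorted(Path(repos_root).iterdir()):
        if not repo.is_dir():
            continue
        for name in DEPENDENCY_FILES:
            candidate = repo / name
            if candidate.is_file():
                # Keep only the contents of the dependency file, nothing else.
                text = candidate.read_text(encoding="utf-8", errors="replace")
                yield repo.name, name, text
```

Repositories without any of the listed files are simply skipped, so what comes out is only the part of the dataset the analysis cares about.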

You keep on mentioning GitHub, what has been your experience mining GitHub so far?

I have done quite a bit of work on mining GitHub, but what I was mining was mostly development metadata. Perhaps you know the GHTorrent project, I wrote the software at the beginning. What we do here is that we extract all things that we can retrieve from the GitHub API, but this is mostly developer metadata like pull requests, comments on code reviews and things like that. What we’re missing is the actual contents of the repositories. You can perhaps select a couple of repositories using GHTorrent, but it’s a bit difficult to map the metadata back to the repositories without any manual steps. It’s not like it’s nuclear physics, but this is what I’m missing.

What do you think of GitHub as a platform, do you think it will be somehow replaced because it is not fast enough? How do you expect GitHub will change or evolve?

GitHub has been evolving way slower than I would expect, I don’t know why (sigh). I mean, given the richness of information that they store, they should have an MSR team of 30–40 people, actually trying to extract some juice out of their data. They have everything, they are the hub, so if they get some competitor they can easily outcompete them, just because they have the data. Data is the new oil, as you say, or the new gold. I’ve been discussing this with them since 2013. Even if they took the work that is being done at MSR, let’s say one paper per year, and actually implemented that, they would be in a much better state. But for some reason, they keep iterating on the things that they know best.

What would be the ideal stream of data to understand how people code? You were mentioning logs and events from the IDEs. Which one have you used with most success, or is the most useful in your view?

It’s a very difficult question, it depends on the use case. After implementing streams from both GitHub with GHTorrent, and then developer IDEs with the TestRoots project, we are now building an infrastructure to integrate everything, called CodeFeedr. There was a presentation about it today. If you really want to see how developers are coding, I would say something from within the IDE will give you a microscopic view, things like “I added this or did this in order to refactor this” and so on. But the macroscopic view, that is, how programs evolve, you cannot get from the IDE; you need another data source. I think you cannot exclude any data source if you want to answer this question. You need everything and you need to ask the right questions, that’s the most important thing.

When you say you are integrating everything, you mentioned logs and stuff like that, how are you doing that, are you showing the logs on the IDEs?

No, we are not doing that kind of thing, we are building an infrastructure that will allow anybody to do that. The typical example I give about logs is, let’s say you deploy a new version: this generates an event, actually multiple events. You have a commit that tags the new version, an event that the deployment succeeded, and then you start receiving logs for the new version that are annotated with the version number. But let’s say you see that the exception rate starts to explode because something went wrong. If you follow that in real time, you can analyze the exceptions and send an email back to the developer that introduced the bug or made the deployment. What you could do is a full feedback loop, using just one query that would integrate data from commits, deployments, logs … with no time skip. Developers would love to have this information as fast as possible.
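The feedback loop described above can be sketched in a few lines. This is not CodeFeedr code: the event shape (dictionaries with `type`, `version`, `author` and `level` keys), the function name `feedback_loop` and the fixed exception-count threshold are all assumptions made for illustration. A real system would use a stream processor and an exception rate over a time window rather than a plain counter.

```python
from collections import defaultdict


def feedback_loop(events, max_exceptions=3):
    """Consume events in arrival order and return (author, version) alerts.

    An alert is raised once per deployed version, as soon as the number of
    exception logs for that version exceeds max_exceptions.
    """
    author_of = {}                    # version -> author of the tagging commit
    deployed = set()                  # versions whose deployment succeeded
    exceptions = defaultdict(int)     # version -> exception logs seen so far
    alerted = set()
    alerts = []
    for event in events:
        kind, version = event["type"], event["version"]
        if kind == "commit":
            author_of[version] = event["author"]
        elif kind == "deployment":
            deployed.add(version)
        elif kind == "log" and event.get("level") == "exception":
            if version not in deployed:
                continue              # ignore logs for versions never deployed
            exceptions[version] += 1
            if exceptions[version] > max_exceptions and version not in alerted:
                alerted.add(version)
                alerts.append((author_of.get(version, "unknown"), version))
    return alerts
```

Joining the three event kinds on the version number is what lets the alert go back to the developer who made the change, instead of to whoever happens to be watching the logs.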

You are talking about fetching all this data to understand how developers work. Once we understand that, what is the goal? Do we want to improve their life? Do you have ideas as to how we can do that?

I don’t know what the goal of everyone is, mine would be to make software more reliable by making developers’ lives easier. I mean, if you do a code review, for example, you don’t want to correct for style, this should happen automatically, we are almost in 2020. Another example would be monitoring dependencies for security vulnerabilities. What we have seen, or rather what research has shown, is that developers are not willing to update dependencies because they are afraid that their code might die or will break. We have developed a technique that, given a program, can see whether there is a direct call path from somewhere within your package to a detected vulnerable piece of code. This is how we can make developers’ lives easier, here by allowing them to make a decision based on whether they are using the corrected functionality or not. This is the dependency project I was mentioning earlier, it is called Prazi. We implemented it for Rust and we are working with the Rust community to have it integrated in their package manager.
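The call-path check mentioned above boils down to reachability in a call graph. The sketch below is not Prazi’s implementation (Prazi targets Rust); it is a generic breadth-first search over a call graph given as a plain dictionary, and the function name `find_call_path` and the example function names are hypothetical.

```python
from collections import deque


def find_call_path(call_graph, entry_points, vulnerable):
    """Return one shortest call path from any entry point to `vulnerable`,
    or None if the vulnerable function is not reachable.

    call_graph maps a function name to the names of the functions it calls.
    entry_points is a list of function names inside your own package.
    """
    queue = deque([fn] for fn in entry_points)
    visited = set(entry_points)
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == vulnerable:
            return path
        for callee in call_graph.get(current, ()):
            if callee not in visited:
                visited.add(callee)
                queue.append(path + [callee])
    return None


graph = {
    "app.main": ["app.parse", "dep.safe"],
    "app.parse": ["dep.decode"],
    "dep.decode": ["dep.unsafe_copy"],
    "dep.safe": [],
}
path = find_call_path(graph, ["app.main"], "dep.unsafe_copy")
```

If the search returns `None`, the package depends on the vulnerable library but never reaches the vulnerable function, which is exactly the information a developer needs before deciding whether an update is urgent.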

You also mentioned something we are actively working on at source{d}, which is automating code review. At some point we think we need to add machine learning (ML). What do you think about people doing that, specifically deep learning (DL), on source code? Have you tried it yourself?

Yes, I have tried quite a bit actually, my first ICSE paper was using ML to predict whether a PR would be merged or not at the moment it was submitted. It was marginally successful I would say, because you lack lots of information, e.g. some people just submit a PR to try out an idea, they then receive some feedback and improve things. That’s one application, but in general, I think there is tons of potential. Another thing that I have done is to use ML to prioritize PRs. So if you have 20 PRs that are open, it could be useful to present to the developer each day the 3 he should have his attention on because they will receive some action in the next time window. This is what we are trying to predict, and we actually had an accuracy of about 88%, so it was very successful. In the end, we try to say to the developer “what you should do tomorrow is pay some attention to those 3 PRs”.

Are you using the information on the code itself to do that or the metadata on the PR?

I think we did not use anything in terms of code, mostly metadata on the PR, and also did some in-memory merging of PRs. If there are conflicts, the PRs are not going to be accepted, so we did the pairwise merging of all the PRs, and that was one of the features we used. It was actually one of the most important ones and in fact, that was also what developers were saying.

Have you used ML on the source code itself, and have you done it based on characters, tokens, trees? What kind of data structure were you taking into account?

What I have done with my students is that we were trying to identify duplicate PRs. For that, we decided to use the modern thing, so DL: we first removed the comments and then we tokenized, and that was it. The main problem with duplicate detection is that you don’t have many examples to train on. What we have done is that we created pairs of PRs that we found were duplicates, and pairs that were not, and with a 50/50 distribution of those we had very high accuracy. The problem was that when we ran this on a real system, the accuracy dropped significantly, down to 60–61%, so not particularly impressive. Detecting duplicates is hard because it is outlier detection: our research suggests duplicates are around 4% of PRs. I think it is a bit like comparing faces. Two pieces of code might look quite different when they are actually the same, like two pictures of the same person might look different, but be of the same person.
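The preprocessing step described above (drop the comments, then tokenize) can be illustrated with Python’s standard `tokenize` module. This is only a sketch for Python source: the interview does not say which languages or which tokenizer were used, and the function name `code_tokens` is hypothetical.

```python
import io
import tokenize

# Token types that carry comments or layout rather than code.
SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def code_tokens(source):
    """Return the token strings of Python `source`, without comments
    and without layout-only tokens."""
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in SKIPPED
    ]
```

With this, two snippets that differ only in their comments produce the same token sequence, which is the point of stripping comments before feeding code to a model.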

Have you ever considered taking that approach, or do you know people who have?

I’ve never thought of it like that, but now that you mention it, if you think about face detection, to the best of my knowledge people are trying to find points in the face, and measure the analogies between those points, to see if a given picture is a face, or a cat, etc. I think you can do exactly the same with the code graphs. Perhaps you don’t even need to use DL, at the beginning you could just do some rough similarity kind of thing.
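As an example of the kind of rough similarity measure that needs no DL, here is a sketch of the Jaccard index over two collections of items. Applying it to tokens or to nodes of a code graph is our assumption for illustration, not something proposed in the interview, and the name `jaccard_similarity` is hypothetical.

```python
def jaccard_similarity(items_a, items_b):
    """Jaccard index of two collections: size of the intersection
    divided by size of the union, treating each collection as a set."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0  # two empty collections are considered identical
    return len(a & b) / len(a | b)
```

The score is 1.0 for identical sets and 0.0 for disjoint ones, so a threshold on it gives a cheap first filter before any heavier comparison.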

What was your favorite paper/dataset at MSR this year?

I only saw the presentations on the first day; on the next one I was busy giving some tutorials. What impressed me the most was the paper about atoms of confusion, Prevalence of Confusing Code in Software Projects: Atoms of Confusion in the Wild, in which they identified some particularly complex things for developers to understand. One of the examples that they gave was this “go-to” thing: an if statement with no braces and then two “go-to” statements, and of course the second one was always executed. This is the thing I like the most, using very small patterns to see how that affects the readability of source code. I found it very impressive that those things that are very common in source code are actually shown to be a crappy way of doing things.

As someone that was managing one of the tracks, what do you expect to see on your track, or in the rest of MSR, next year? What are the topics you think people are going to be bringing more?

Well, that’s obviously driven by the stuff I’m doing, so dependencies. I think the topic is a bit overlooked and there is an opportunity to do stuff. Code review is another thing that is sort of missing. I mean, it’s not completely missing, but automation of code reviews is very important and actually doable. There are several papers that automate various aspects of evaluating PRs, but somebody needs to put everything together to create something that actually does all those things.

What do you think people will actually submit?

It’s hard to say. The recent trend has been understandability. Actually, how people comprehend code and things like that. MSR used to be very technical and now we see some qualitative work, mostly things that have to do with developers, and surveys of developers, how they perceive stuff happening in software repositories. I don’t know what else.

Learn More about MSR 2019 and MLonCode: