This article is the fourth episode of our MSR Interview blog series. After Abram Hindle, Georgios Gousios, and Vasiliki Efstathiou, this week’s episode is an interview with Sarah Nadi, who is an assistant professor at the University of Alberta. Thanks to Waren Long, Vadim Markovtsev, and Francesc Campoy for conducting the interview.

Sarah Nadi

Below you can find some of Sarah’s publications on Mining Repositories:

Can you please introduce yourself?

I’m Sarah Nadi, and I’m currently an Assistant Professor at the University of Alberta. I did my Ph.D. and Master’s at the University of Waterloo and then did a postdoc at TU Darmstadt in Germany. I generally work on finding or developing techniques for supporting developers in their software maintenance and reuse activities. Some of the things I’ve worked on include helping developers use APIs correctly, that is, telling them if they’ve done something wrong, and helping them select a library for a specific task. I’ve also worked in the area called Software Product Lines (SPLs), where you have multiple versions of your system and you want to manage them, consolidate them, find inconsistencies in configurations, etc.

Did you tackle this as an automated thing? For instance, when you tell developers what version they should be using?

For APIs, we’ve built an API misuse detection tool. We basically mine a lot of code, create a code pattern out of this code, and then detect violations of the pattern. That’s an automated tool. We also did other things on decision support for library selection. Right now, we mine metrics from repositories and information about the libraries and then present it to you so that you can decide. Here’s the website we have for library selection support (but it needs some updating). So I do both: automated tools and more decision support.
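To make the mine-then-detect idea above concrete, here is a toy sketch (not the actual tool from the interview, and deliberately simplistic): mine API calls that frequently occur together across a corpus of usages, then flag a usage that contains part of a frequent pattern but omits the rest.

```python
from collections import Counter
from itertools import combinations

def mine_patterns(usages, min_support=3):
    """Count pairs of API calls that frequently co-occur in usages."""
    counts = Counter()
    for calls in usages:
        for pair in combinations(sorted(set(calls)), 2):
            counts[pair] += 1
    # Keep only pairs seen at least min_support times.
    return {pair for pair, n in counts.items() if n >= min_support}

def find_violations(calls, patterns):
    """Report pattern elements that are missing from a given usage."""
    present = set(calls)
    violations = []
    for a, b in sorted(patterns):
        if a in present and b not in present:
            violations.append(f"'{a}' used without '{b}'")
        elif b in present and a not in present:
            violations.append(f"'{b}' used without '{a}'")
    return violations

# Corpus of correct usages: files are opened, read, and closed.
corpus = [["open", "read", "close"]] * 3
patterns = mine_patterns(corpus)

# A usage that opens and reads but never closes triggers warnings.
print(find_violations(["open", "read"], patterns))
```

Real misuse detectors work on richer representations (call order, control flow, data flow) precisely to reduce the false positives discussed later in the interview; co-occurrence counting alone would be far too noisy in practice.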

How many times have you been at MSR so far?

I’ve attended MSR 5 times starting in 2013 in San Francisco. This year I’m co-chairing both the data and mining challenge tracks. I also have one of my Master’s students, Mehran Mahmoudi, presenting a paper this year.

Could you tell us a little bit more about the Mining Challenge?

Every year for the Mining Challenge, curated datasets are released that are supposed to be general, in the sense that there are many different kinds of research questions you can ask about them, and people are invited to do research on top of them. This year, our dataset is mainly the outcome of the Ph.D. of Sebastian Proksch, who I co-supervised at TU Darmstadt and who is now a postdoc at the University of Zurich. Sven Amann, who I also co-supervised as a Ph.D. student in Darmstadt, contributed a lot to building the tooling for curating this dataset. The dataset is mainly about developers’ IDE interactions, specifically in Visual Studio.

What are the biggest challenges that people are trying to solve using this data and which is the most interesting for you?

One of the biggest challenges with helping developers is that you need to take the context of what they are working on into account so that you avoid false positives. If you start telling developers that they are doing something wrong and it all turns out to be false positives, they are going to get frustrated really quickly. It’s hard to do data-driven approaches correctly, especially when a lot of the public data we mine may be incorrect in the first place. You also need to consider broader context, such as what application the developer is working on and whether the warning you are about to give makes sense in that context. This is hard to do.

Do you have any paper in mind dealing with this issue, trying to minimize those false positives among the prediction?

Not off the top of my head. There has been work by Gail Murphy that looks at what you do inside the IDE and creates profiles for you (Using task context to improve programmer productivity). I believe that a variation of that, reducing false positives based on developer profiles, could be one option.

Have you read the paper called Lessons from Building Static Analysis Tools at Google?

There are a lot of papers that qualitatively investigate the usage of static analysis tools (e.g., Johnson’s ICSE ’13 work), and they report that false positives are a big issue. What exactly can you do about it? There have been different approaches that add context or more precision to make the tools better, but it’s very hard: what is true for one tool may not necessarily be true for another.

How do you think data could help developers to write better code?

I think there is a lot to gain by looking at the data and finding trends that we can tell developers about. For example, if I know that a certain way of coding or a certain process causes more defects, then I can warn you. I can tell you this is a code smell and help you improve it, and that’s something you can get out of the data. In our work on misuse detection, we found the patterns by looking at the data. Obviously, there is also a risk there, because with data-driven approaches you have the rule of the majority, and the majority is not always right. For example, with security APIs, we figured out that the majority was wrong, which means using data-driven approaches for security becomes much harder. So there are trade-offs: you could learn things, but you could also be learning things that are incorrect. If the thing you’re studying is new and everybody is doing it wrong, then you’re also learning something wrong, which is one thing we are actually struggling with. One option could be to triangulate many sources. For example, if you do pattern mining on code, maybe correlate it with what people are saying on Stack Overflow or technical blogs and see whether it matches. By triangulating from many sources, you reduce the risk of blindly trusting the majority. Obviously, it’s still not precise, but at least you reduce the potential errors.

What do you think about the concept of doing Machine Learning on top of source code?

The idea of applying ML to source code is gaining momentum, and it has a lot of potential. However, there are also a lot of risks, in the sense that code has a lot of context and semantics, and by just treating code blindly as text, we might be losing a lot of that. Now, it depends on what the application is. For something like predicting the next token (e.g., Abram Hindle’s work), it does work, but if I’m going to learn an exact code snippet to recommend to you, given code you’ve already written, or help you fix pieces of code, then there are more semantics to consider because there are also alternative coding patterns.
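The next-token prediction mentioned above can be sketched in a few lines. This is a hypothetical, minimal bigram model in the spirit of language models over code; real work uses higher-order n-grams or neural models, and the corpus here is invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(token_streams):
    """Count which token follows which across a corpus of token streams."""
    model = defaultdict(Counter)
    for tokens in token_streams:
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, prev_token):
    """Return the most frequent successor of prev_token, or None."""
    successors = model.get(prev_token)
    if not successors:
        return None
    return successors.most_common(1)[0][0]

# Tiny made-up "corpus" of tokenized statements.
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "x", "in", "items", ":"],
    ["for", "i", "in", "range", "(", "m", ")", ":"],
]
model = train_bigrams(corpus)
print(predict_next(model, "in"))  # "range" follows "in" most often here
```

The interview’s caveat shows up even in this toy: the model treats code purely as a token sequence, so it has no notion of scope, types, or alternative but equivalent coding patterns.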

Do you have any straightforward application of ML on Code?

Most of the straightforward ideas have already been implemented, I would say. Associating events, commit patterns, things like evolution patterns: mining these kinds of trends and sequences has been done. As usual with ML, it depends on the features you extract; they have to make sense for the context of the code you are studying. Nowadays, we also have additional data, such as conversations in pull requests, which can then be associated with the code in the commits. All of this is valuable information. Again, we come back to this idea of triangulating different types of information and looking at them.

What is the most interesting idea you’ve seen at MSR so far?

Given that I was involved in the data track, one thing that caught my attention this year at MSR is the security vulnerability dataset Vulinoss: A Dataset of Security Vulnerabilities in Open-source Systems. I think security is very important and is increasingly gaining attention, so in that context, this dataset looks promising.

Learn More about MSR 2019 and MLonCode: