This article is the seventh episode of our MSR Interview blog series. You can find our previous interviews with MSR researchers on the source{d} blog. This week’s episode is an interview with Massimiliano Di Penta, an associate professor at the Department of Engineering of the University of Sannio in Benevento, Italy. Below you can find some of Massimiliano’s publications on source code analysis and Machine Learning:

1. Can you please introduce yourself and tell us about your relationship with MSR?

My name is Massimiliano Di Penta, and I am an associate professor at the Department of Engineering of the University of Sannio in Benevento, Italy. I won the best paper award at MSR 2007, I was program co-chair of MSR in 2012 and 2013, and I was general chair in 2015. I have also been on the steering committee since 2012.

2. What was your paper about when you won this award in 2007?

The paper was about Ldiff, an improved line-based diff tool that overcomes a limitation of regular diff tools: they may provide an erroneous estimate of the amount of code that actually changed. It has been a very useful tool, as it has been applied in many studies analyzing software histories. It is purely text-based, though; no AST involved.
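To make the limitation concrete, here is a toy sketch (not Ldiff itself) using Python's standard difflib: a one-token change plus an inserted comment makes a plain line-based diff report three changed lines, even though only a threshold value was really modified.

```python
import difflib

# Two made-up versions of the same function: only the threshold value
# actually changed, but a comment line was also inserted.
old = """def check(x):
    if x > 10:
        return True
    return False
""".splitlines()

new = """def check(x):
    # guard clause
    if x > 20:
        return True
    return False
""".splitlines()

diff = list(difflib.unified_diff(old, new, lineterm=""))
# Count added/removed lines, skipping the "---"/"+++" file headers.
changed = [l for l in diff
           if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
print(len(changed))  # 3 changed lines for what is really a one-token edit
```

A tool that matches lines by similarity rather than exact equality would pair `if x > 10:` with `if x > 20:` as one modified line, giving a more faithful estimate of the change.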

3. One of your latest papers is on different representations of code to detect code clones. Can you tell us a little bit more about it and your conclusions? Which representations fit source code analysis best?

Source code can be represented at different levels of abstraction: identifiers, Abstract Syntax Trees, Control Flow Graphs, bytecode. We found that, for code similarity tasks, all those representations are highly complementary. If you want a truly comprehensive view of code similarities, you have to take into account different views of the same code fragment; this enables more reliable detection of similarities in code.
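As a toy illustration of this complementarity (not the paper's actual pipeline), the sketch below compares two Python fragments under two views built with the standard ast module: the set of identifiers, and the set of AST node types. For a clone with renamed variables, the identifier view sees nothing in common while the structural view sees a perfect match.

```python
import ast

def identifiers(src):
    """Identifier-level view: the set of variable names used in the fragment."""
    return {n.id for n in ast.walk(ast.parse(src)) if isinstance(n, ast.Name)}

def shape(src):
    """AST-level view: the set of node types, ignoring all names."""
    return {type(n).__name__ for n in ast.walk(ast.parse(src))}

def jaccard(a, b):
    """Set overlap in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Same algorithm, different identifiers: a renamed clone.
f1 = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
f2 = "def acc(vals):\n    r = 0\n    for v in vals:\n        r += v\n    return r\n"

print(jaccard(identifiers(f1), identifiers(f2)))  # 0.0: no names in common
print(jaccard(shape(f1), shape(f2)))              # 1.0: identical structure
```

Conversely, two unrelated fragments that happen to share domain vocabulary would score high on identifiers and low on structure, which is why combining views gives a more reliable picture.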

4. Do you think Deep Learning has a key role in combining those representations?

Deep Learning is a way of learning over deep representations. If you have a multi-level representation, it is harder to determine which features contribute to the final outcome. That's why, in my opinion, DL is a good fit for feature engineering.

5. Have you reached similar conclusions working on other Software Engineering tasks?

Some years ago, we did some studies on refactoring; the paper was published at ICSE in 2013. We found that, once again, different representations of code, e.g. static, dynamic, or semantic, carry different information. So if you want to recommend refactorings, you should take into account not only syntactic but also semantic information about the code. However, we also found that, depending on the task, one type of code representation might carry more information than the others. For example, for the refactoring task, we saw that the semantic representation, which focuses on the textual side of the code, better captures the relationships between the different code artifacts.

6. In your work, you seem to focus mainly on Java. Why? Do you have any plans to work with other languages? Getting those different representations of code can be very time-consuming depending on the programming language you consider.

This is a very important problem. Java is no longer the most popular language. I'm starting to see people developing analyzers for Python and JavaScript, and this is the way the community should go: developing tools to analyze different programming languages. We cannot stick to Java anymore. Of course, in the case of program repair, for example, if you work at the token level, you don't care about the programming language, as you only work with the raw source code. However, if you want to analyze diffs of source code and you don't want to stay at the line level, you need a language-dependent tool. For instance, in some of our work we are using GumTree, a diff tool that works at the abstract-syntax-tree granularity, but this is only for Java. That's why we need tools that are able to work with other languages as well. In short, tools that support multiple languages are very important today. To develop relevant research, you have to analyze languages well beyond Java; JavaScript is the most popular language right now.
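The language dependence is easy to see in miniature: an AST-level diff needs a parser for the target language. Below is a crude Python-only sketch (using the standard ast module, not GumTree) showing one payoff of working above the line level: comments do not exist in the AST, so a comment-only edit produces no AST-level change while a line diff would report an added line.

```python
import ast

def flatten(src):
    # Pre-order list of (node type, salient label) pairs: a crude stand-in
    # for the tree nodes an AST-level diff tool would try to match.
    out = []
    for n in ast.walk(ast.parse(src)):
        label = getattr(n, "id", getattr(n, "name", None))
        out.append((type(n).__name__, label))
    return out

before = "def f(x):\n    return x + 1\n"
after  = "def f(x):\n    # comment only\n    return x + 1\n"

# The comment never reaches the parse tree, so both versions flatten
# to the same node sequence: no change at AST granularity.
print(flatten(before) == flatten(after))  # True
```

Every language needs its own `ast.parse` equivalent, which is exactly why multi-language tooling is the bottleneck the answer describes.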

7. You talked about dynamic analysis. Does it bring real additional value? Is it worth considering this source of information, knowing how expensive it can be to compile code?

Regarding runtime information, these days it has become quite easy to compile: most code comes with build scripts. Still, there is information that you cannot capture without dynamic analysis. For sure, dynamic analysis is expensive; however, today most systems are equipped with test cases and build tools, so in the end it is not that expensive. In the paper about code similarities, we consider four different representations of code, one of which is dynamic. Dynamic analysis is important for two reasons: first, it is able to capture dependencies; second, it allows you to detect logical couplings between code artifacts.

8. Do you think at some point DL can replace manual feature engineering?

Very interesting question; this was discussed at a workshop in Montreal this year on the potential of AI in SE. On the one hand, DL is very powerful at learning features; on the other hand, as researchers, we should be aware of the risks: you may obtain a model trained on a large amount of data that works very well, but you don't know why, and because of that it might be difficult to convince software engineers to adopt this model. The lack of explainability might limit the adoption of such models. We need transparency in the way models produce their decisions. In general, I would say that DL is particularly useful where feature engineering is difficult. But for problems where we have a clear set of features, we are better off using regular ML algorithms, and we can obtain results that outperform what we would have obtained using Deep Learning.

9. What are you working on right now?

Currently, with my students, I'm mostly working on Continuous Integration and on the relationship between CI and static analysis, and, in general, on anything that could help developers during software development.

10. At source{d}, we are developing ML solutions to help developers with automated code reviews. Have you heard of anybody working on such a topic? Do you think DL could yield good results?

I remember Alberto Bacchelli and his group have published several papers around this topic, about the triage of code reviews: learning from past code reviews to determine who the most relevant reviewers might be. As for DL, it depends on what kind of recommendations you want to make. For example, you may be able to detect style inconsistencies, and DL may allow you to learn properties and patterns of a particular codebase and then automatically fix those style problems.

11. In your opinion, which research areas are going to attract attention in the coming years?

I believe program repair, in general, is going to be a hot topic in the future. I expect more papers about it at upcoming conferences like MSR. Another, more practical topic that, in my opinion, is going to become more important is code search: being able to give developers more sophisticated recommendations based on the availability of huge amounts of code, discussions on Stack Overflow, and so on.