This article is the third episode of our MSR Interview blog series. Here are the links to the first two episodes featuring Abram Hindle and Georgios Gousios. This week, we’re publishing the interview with Vasiliki Efstathiou, a researcher at Athens University of Economics and Business. She has authored two papers presented at MSR’18 and ICSE’18. Thanks to Waren Long, Vadim Markovtsev, and Francesc Campoy for conducting the interview.

Vasiliki Efstathiou, Athens University of Economics and Business

Vasiliki Efstathiou has two major publications on Mining Repositories:

Can you please introduce yourself? Why do you visit MSR and ICSE?

My name is Vasiliki Efstathiou. I am a postdoctoral researcher at the Athens University of Economics and Business. My main focus is on natural language processing and software engineering artifacts. I presented a short paper here at MSR and another paper in the ICSE New Ideas and Emerging Results track. I’m relatively new to software engineering.

You mentioned two papers you presented here. Can you tell us about the MSR paper?

The MSR one was a data track paper, Word Embeddings for the Software Engineering Domain. It was about a pre-trained model that people can use for performing natural language processing tasks with software engineering artifacts. The idea is that you need domain-specific knowledge in order to treat such natural language artifacts, and you can use this model as background knowledge to disambiguate costly conflicts with notions that have completely different meanings in the domain.
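To make the disambiguation point concrete, here is a minimal sketch of how the nearest neighbour of the same word can differ between a general-purpose and a domain-specific embedding space. The vectors, the vocabulary, and the helper names `cosine` and `nearest` are invented for illustration; they are not taken from the paper or its released model.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# A real embedding model has hundreds of dimensions learned from text.
GENERAL = {
    "cookie": [0.9, 0.1, 0.0],
    "biscuit": [0.8, 0.2, 0.1],
    "session": [0.1, 0.9, 0.2],
}
DOMAIN = {
    "cookie": [0.1, 0.9, 0.3],
    "biscuit": [0.9, 0.1, 0.0],
    "session": [0.2, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))


print(nearest("cookie", GENERAL))  # neighbour in the toy general-purpose space
print(nearest("cookie", DOMAIN))   # neighbour in the toy software-domain space
```

In this toy setup the general-purpose space places "cookie" next to "biscuit", while the software-domain space places it next to "session", which is the kind of domain-specific sense a downstream task would want to pick up.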

Could you tell us more about the second paper as well?

It reflects on past approaches to how software engineers judge what is useful in code review and tries to map them onto linguistic features and theories. That could help in grounding the features of useful code review, according to the empirical evidence, in actual concrete words in the text of the review comments.

Your papers mention the textual footprint of repositories: what features can you use to define those footprints?

There is source code, which is an artificial language. In terms of natural language, descriptive data or communicative data, you can have source code metadata and specifications. Each of these can point to various directions of research. For example, effective communication, sentiment analysis, topic modeling. There is already a lot of research in sentiment analysis possibly due to the existence of readily available tools. It is good to have practical tools that practitioners from different fields can easily use in order to process text.

As you said, the source code itself can be used as a textual footprint. Do you like the idea of taking any textual information from the source code itself, such as comments or identifiers, and training embeddings on top of them? Do you think it would help?

Yes, definitely. Identifiers have their own semantic information; they carry mostly higher-level semantics. There are patterns in the way we write code and the way we name code artifacts. I think you can do an analysis as you do in text. The source code itself is an artificial language, so you can probably find analogies with natural language and tune accordingly. Impressive results are already coming out of the software engineering literature. Along these lines, training meaningful models for identifier embeddings seems plausible.
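As an illustration of the naming patterns mentioned here, one common preprocessing step before training identifier embeddings is to split identifiers into lowercase subtokens. The sketch below is an assumption about how that could be done, not a method described in the interview; `split_identifier` is a hypothetical helper.

```python
import re

# Matches, in order: an acronym followed by a capitalized word ("HTTP" in
# "HTTPResponse"), a capitalized or lowercase word, a trailing acronym,
# or a run of digits.
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    tokens = []
    # First cut on underscores and any non-word characters.
    for part in re.split(r"[_\W]+", name):
        tokens.extend(_SUBTOKEN.findall(part))
    return [token.lower() for token in tokens]


print(split_identifier("parseHTTPResponse"))
print(split_identifier("max_line_count"))
```

The resulting subtoken sequences can then be treated like sentences of words by whichever embedding training tool is used.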

As you said, the domain on which we train our embeddings is important and we just cannot transfer general purpose embeddings to the software domain. Do you think that if we go very precise and very selective in training our model, the embeddings can overfit and be too specific?

Yes, this can happen. For the model we trained for MSR, we selected the resources which, we believe, contain enough technical and abstract information. We show that it also captures developer slang, it doesn’t only capture very technical terms. In summary, I think you need to select your context very carefully — in some cases, you may need some very task-specific resources ad-hoc for something very specific.

Do you know how to avoid bias in the data? For example, in the field of computer vision, if you train your face recognition model on white people, it might have trouble detecting people of other races, and the same may happen to software engineering models.

I don’t know if interdisciplinarity could help here. Normally we are very focused on the problem we are facing and we do not focus on the bigger picture.

Another example of bias could be: we know that the majority of software engineers are men, so the model can overfit to the way men communicate in code review. Did you try to analyze such cases?

I leave it for future research. Sometimes the bias comes naturally. In the example you gave, the model will learn from comments expressed mainly by male software engineers and, if any peculiarities apply, it will be representative of this reality.

Do you know any other research groups or people that are trying to apply machine learning research to the source code?

I have seen many papers where people try to detect patterns, personalize, propose better autocompletion suggestions. Yesterday I saw a presentation about identifying coders based on the code they write.

Just to come back to the topic of the bias in the data: we can fall into pitfalls when training task-specific models — do you think that data provided on GitHub may not be representative for software engineering as a whole?

It is possible, but it strongly depends. Sometimes you want the bias when you train domain-specific models; in that case, you are chasing the bias. GitHub is vast. I could not really tell whether it is representative of software engineering as a whole, but at least when it comes to open source software, I would assume that it reflects reality to an extent.

MSR started with a keynote about joining the ICPC and MSR communities together. Do you think there are any other issues that should be addressed next year?

As a newcomer attending MSR for the first time, and coming from a not very technical field, I’d like to see more diversity in terms of topics and participants. I really appreciated the keynote that stressed the point of interdisciplinarity; that needs to be addressed. I do not mean diversity for the sake of diversity; I think it is essential. When we mine repositories, we talk about different kinds of data and what information they can express. Without cross-disciplinary research, we might miss aspects that are important and obvious to other professionals. For example, in natural language there is an interplay of grammar and syntax; we may not be aware of how important those are in conveying some message, so consulting a linguist or a computational linguist may help in building better solutions.

Did you see the presentation about classifying comment sentiment into 6 categories?

Yes, I did. Again, the example the authors gave, “I am afraid”, which very often expresses doubt rather than the sentiment of fear, is another case where having an experimental psychologist or a linguist on the team can give you an additional perspective on the data.

What is your next project? Do you plan to work on code review or embeddings?

The paper that we have on code review is about new ideas; there are few results. The next step for us is to put those ideas into practice and measure the results. The paper is about finding out what makes a review explanatory and justified by looking at specific textual features proposed by linguistic theories.

You mentioned that we should go to the developers and ask them for opinions on the sentiment analysis. Did you get any feedback on your studies from developers?

No, we have not done any interviews. For our latest studies, we used evidence from existing empirical studies, coming mainly from organizations that have access to developers, and data from open source projects. Developer opinions are valuable but not always easy to obtain.

What did you like most about this MSR?

I liked the keynote, and I liked the discussion after the paper about small confusing elements in code and how they propagate into the source code.

What do you think will be the level of automation in software development?

I expect that the level of automation will increase significantly, considering the thriving progress in machine learning research and the growing availability of software data in online open source repositories.
