From Import2vec - Learning Embeddings for Software Libraries by Bart Theeten, Frederik Vandeputte, Tom Van Cutsem, MSR 2019. Reproduced with permission.

Introduction

In this day and age, a simple `create-react-app` pulls in over 1,500 dependencies: it’s safe to say that analyzing those dependencies has become crucial for most software companies, whether to manage complexity, reduce technical debt, ensure compliance, address security issues, or tackle many other critical tasks.

Bart Theeten, Frederik Vandeputte and Tom Van Cutsem, at Nokia Bell Labs in Antwerp, Belgium, recently published a paper at MSR 2019 to help solve this problem: they model dependencies based on their usage context, much like we model words in Natural Language Processing.

Model

Import2vec uses the widespread word2vec model to produce embeddings. If you’re not sure what word2vec is, check out our blog post on building embeddings for source code identifiers, or this TensorFlow tutorial for an in-depth explanation.

Here, the context used to model libraries is the set of external library imports, either at the file level or at the project level. The contextual imports are filtered for relevance: only imports that are sufficiently used at the global scale (e.g. across an ecosystem) are kept, and the most popular imports are discarded, as they are likely core language libraries that convey little meaningful information.
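To make this concrete, here is a minimal sketch of the idea (our own illustration with gensim, not the authors’ code; the corpus and thresholds are toy values): each file’s external imports form one “sentence”, imports that are too rare or too popular are filtered out, and a skip-gram model is trained over what remains.

```python
from collections import Counter
from gensim.models import Word2Vec  # assumes gensim >= 4.0

# One "sentence" per file: the external libraries it imports.
corpus = [
    ["os", "numpy", "pandas"],
    ["os", "numpy", "matplotlib"],
    ["os", "torch", "numpy"],
    ["os", "flask", "requests"],
    # ... thousands of files mined from an ecosystem
]

# Keep only imports that are used often enough globally, and drop the
# most popular ones (likely core libraries carrying little signal).
counts = Counter(lib for imports in corpus for lib in imports)
min_global_count, top_k_to_drop = 1, 1  # toy thresholds
too_popular = {lib for lib, _ in counts.most_common(top_k_to_drop)}
filtered = [
    [lib for lib in imports
     if counts[lib] >= min_global_count and lib not in too_popular]
    for imports in corpus
]

# Skip-gram (sg=1) over the filtered import lists yields one embedding
# per library, exactly as word2vec does for words in text.
model = Word2Vec(filtered, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv.most_similar("numpy"))  # meaningless on a toy corpus
```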

Playing with the model

The authors of Import2vec at Nokia Bell Labs did a great job of sharing their research: they built Code Compass, a library navigation tool with VS Code integration, and shared their trained models and example code to explore the embeddings on Zenodo.

Screenshot of Code Compass in action in VS Code, with recommended libraries in the Machine Learning category for some PyTorch code. The presented alternatives are indeed very relevant in the industry.

You can download all the trained embeddings and use them directly for a specific task or quickly explore them in a notebook. It is a great example of the evolving standards in the scientific community regarding reproducibility.
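As a sketch of what such an exploration could look like (the file name and format below are assumptions for illustration, not the actual Zenodo layout), suppose the embeddings ship in the standard textual word2vec format:

```python
from gensim.models import KeyedVectors

# "python_embeddings.txt" is a hypothetical file name in the common
# word2vec text format ("<name> <v1> <v2> ...").
vectors = KeyedVectors.load_word2vec_format("python_embeddings.txt")

# Nearest neighbours in embedding space suggest related libraries.
for name, score in vectors.most_similar("flask", topn=5):
    print(f"{name}\t{score:.3f}")
```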

Next steps

A new challenger is approaching

One of the major objectives of Import2vec is to provide insight into the tangled space of a software ecosystem. Applying word2vec to library usage data gets problematic for newcomers: new libraries might not have appeared in open source projects yet (especially in already mined datasets), so it is hard to build good embeddings for them. It will be interesting to see how future papers on the subject address this challenging problem. One interesting option is to additionally consider the contents of the files where libraries are imported, since that information is available even for newcomers.

Contextualizing

In Natural Language Processing, we have witnessed a rapid shift from “regular” embeddings like word2vec, GloVe or Swivel to contextualized embeddings like ELMo, BERT, or XLNet. The major objective has been to improve the quality of embeddings for words that have multiple meanings.

Consider the word set in English, for example. It has its own entry in the Guinness Book of World Records, reaching almost 500 documented meanings depending on which dictionary we consider. A tennis set is very different from a mathematical set, or a tea set. Those nouns are also very different from the many verb senses of set. If you train a single embedding for set, it will have to accommodate all those meanings at the same time, and it will do so poorly. It is therefore extremely interesting to have embeddings that depend on the context of usage: set can then have as many different embeddings as required.
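As a quick illustration (this sketch assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint; it is not tied to the paper), we can check that a contextual model gives set a different vector in different sentences:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return the contextual vector of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    position = inputs.input_ids[0].tolist().index(
        tokenizer.convert_tokens_to_ids(word))
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[position]

tennis = embedding_of("she won the first set of the match", "set")
maths = embedding_of("the empty set has no elements", "set")
# Well below 1.0: the two occurrences of "set" get distinct embeddings.
print(torch.cosine_similarity(tennis, maths, dim=0).item())
```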

We could apply the same disambiguation to libraries: a given library may play different roles depending on the programming context. Take pandas: it is capable of data wrangling, statistical analysis, plotting, etc. In a particular project, however, it might be used for a single purpose. Contextualization is the key to reflecting that in the library embeddings.

Defining libraries

In the paper, a library is coarsely defined as essentially a direct mapping of imports. However, a single library typically comprises plenty of importable modules. Figuring out how to automatically group imports into libraries is a tougher problem than it seems.
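For Python, for instance, a naive heuristic (our own illustration, not from the paper) is to group import paths by their top-level package name. It immediately breaks when a distribution’s install name differs from its import name, as with Pillow, which is imported as PIL:

```python
# Naive grouping of import paths by top-level package name. It handles
# sklearn.* and numpy.* fine, but cannot tell that "PIL" belongs to the
# "Pillow" distribution, or that one project may ship several packages.
imports = ["sklearn.linear_model", "sklearn.cluster", "numpy.linalg", "PIL.Image"]

by_library = {}
for imp in imports:
    by_library.setdefault(imp.split(".")[0], []).append(imp)

print(by_library)
# {'sklearn': ['sklearn.linear_model', 'sklearn.cluster'],
#  'numpy': ['numpy.linalg'], 'PIL': ['PIL.Image']}
```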

Conclusion

Import2vec is a promising approach to analyze libraries at large scale. We are very excited by the tools that will be built on similar technologies: they will be critical to managing the ever-growing codebases in the world.

If you’d like to investigate Import2vec further, give this blog post by the paper’s authors a read: it provides great insights.

More about source{d} and MLonCode

This post was written by Hugo Mougard.