In this post, I explain the problems we had to solve in order to train neural networks with TensorFlow 2.0 on multiple local GPUs in our ML cluster.


The Research and Production Machine Learning teams at source{d} have a dedicated R&D cluster. Every node in that cluster has 4 GPUs, NVIDIA GeForce GTX 1080 Ti with 11 GB of memory each. The cluster is managed by Kubernetes and Terraform. Until recently, we trained models on each GPU separately for two reasons:

  1. The teams have historically used the cluster mostly for research.
  2. Our use-cases were driven by hyper-parameter optimization.

In short, hyper-parameter optimization determines the "meta" parameters of the model, for example, the number of neurons in each layer or the best learning rate. One has to conduct many independent experiments, a workload that perfectly matches the "one model - one GPU" schema. So why did we dig in another direction?
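The idea can be sketched as follows; the search space, the trial count, and the GPU assignment are all hypothetical, not our actual setup:

```python
import random

# A toy random search over "meta" parameters. The parameter names
# and ranges are illustrative, not our actual search space.
search_space = {
    "layer_size": [64, 128, 256],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

random.seed(0)
trials = [
    {name: random.choice(values) for name, values in search_space.items()}
    for _ in range(4)
]

# Each trial is an independent experiment, so it maps naturally onto the
# "one model - one GPU" schema: e.g., pin trial i to GPU i % 4 by setting
# CUDA_VISIBLE_DEVICES before launching it.
for i, params in enumerate(trials):
    print(f"GPU {i % 4}: {params}")
```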

There is one important advantage to being able to train models on several GPUs at once: the models can be bigger. Before going deeper into this, let's consider the current pain points of applied deep learning research.

Long training time is not a problem unless we are comparing numbers that differ by several orders of magnitude, e.g. hours versus weeks. If we were discussing large-scale production deployments, then of course every second would matter, but we are transitioning from 0 to 1, not from 1 to 10. That's why the raw performance of generic deep learning frameworks like PyTorch or TensorFlow has never been a distinguishing feature.

There are many things which are more important than the fastest training speed:

  • Community: responsive experts, good documentation, educational resources. Without a community, a framework is dead because nobody is using it.
  • Smooth learning curve. I assume that one of the reasons why PyTorch has become so popular is its "pythonic" API that feels familiar to many. The similar changes in TensorFlow 2.0 confirm that.
  • Full stack. Designing the NN architecture is the smallest part of the work. The way you pipe the data in is the typical bottleneck that many newcomers underestimate. Good data loading, preprocessing, and augmentation have to be implemented, and the framework should help with them.
  • Pretrained models. Nowadays, training SOTA models for CV and NLP from scratch has become too expensive for small players, and of little benefit thanks to transfer learning.
  • Memory consumption.

The last point in the list is the answer to why multi-GPU matters. Multi-GPU allows increasing the batch size proportionally to the number of GPUs used. At their optimal batch sizes, the majority of current SOTA deep learning models cannot be squeezed into the ~12 GB of a GPU that does not cost a fortune. Instead of spending thousands of dollars per month on renting a TPU with 32 GB of bfloat16 memory, you can purchase several consumer-grade GPUs and spread the load uniformly. They will likely pay off within the first few months, ignoring the power costs. Not being vendor-locked is a very nice bonus.
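As a back-of-the-envelope illustration of the scaling (the per-GPU batch size here is made up, not a measurement):

```python
# With data parallelism, each GPU holds a full model replica and
# processes its own slice of the batch, so the effective (global)
# batch grows linearly with the number of GPUs.
per_gpu_batch = 32   # what fits into one 11 GB card (illustrative)
num_gpus = 4
global_batch = per_gpu_batch * num_gpus
print(global_batch)  # 128
```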

Returning to the beginning: we considered multi-GPU training because the Gated Graph Neural Networks, the most promising NN architecture for dealing with source code, consume a lot of memory. We've struggled with training GGNNs on a single GPU, and the only way out, apart from non-trivial optimizations, is to split and distribute the batch.

The rest of the blog post describes our issues with enabling multi-GPU training with Tensorflow.


Each node in our cluster is a typical SuperMicro tower with 8 PCIe slots. The slots are colored black and blue, and that really matters: peer-to-peer memory access works only within the same color. One GPU occupies two slots.

When we had only 2 GPUs per tower, we naturally plugged them into slots of different colors and later had to reassemble. Now there are 4, and the peer-to-peer topology is always the same: 2+2.

The cluster is located in Madrid, which is famous for its +40°C August heat. We placed an A/C unit inside the room and allowed the air to flow in and out. There has been no overheating yet.


Kubernetes has recently gained support for scheduling GPU resources, but it is not enough for us. We need to share the GPUs between Jupyter instances running on the same node. The ML team has more members than the cluster has nodes, so competition for GPUs is unavoidable. Given our MLonCode tasks, the GPUs are not always utilized, so we want to compete holistically, not artificially.

Maartje Eyskens of the Infrastructure team solved that problem. She added overcommitment support to the official nvidia-gpu-device-plugin. Thanks to this contribution, we always schedule 4 GPUs per Jupyter instance.

If we were to use this cluster for production, we would stick to the vanilla plugin and would not have to hack anything.


TensorFlow 2.0 was released a few days ago, but the nightly builds have been available for quite a long time. It offers distribution strategies to seamlessly spread computations across workers. Since we target local multi-GPU at this time, our choice is MirroredStrategy. There was a bug with batch normalization under that strategy which we helped to resolve.
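A minimal sketch of how MirroredStrategy is used; the toy Dense model, input shape, and batch size are illustrative, not our GGNN training code:

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# all-reduces the gradients between replicas (via NCCL on Linux).
# Without GPUs, it falls back to a single CPU replica.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Scale the global batch with the number of replicas: each replica
# then processes a slice of 64 samples.
global_batch = 64 * strategy.num_replicas_in_sync

# Everything that creates variables must run under the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Keras fit() transparently splits every batch across the replicas.
x = tf.random.normal((512, 32))
y = tf.random.normal((512, 1))
model.fit(x, y, batch_size=global_batch, epochs=1, verbose=0)
```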


We had to investigate a nasty hang which occurred deep inside native TensorFlow code, with no clue about what went wrong. Google's engineers were helpful and responsive; they suggested that the hang was related to NCCL, NVIDIA's library for inter-GPU communication. We filed another issue in NCCL and spent a lot of time debugging after that. We finally identified the root cause: normal multi-GPU operation is incompatible with Intel's IOMMU driver, so the driver should be disabled by booting the kernel with the intel_iommu=off parameter.
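On a typical Debian/Ubuntu node this amounts to editing the GRUB kernel command line; the file path and the exact option list may differ on other distributions:

```shell
# In /etc/default/grub, append intel_iommu=off to the kernel options:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"

# Regenerate the GRUB configuration and reboot:
sudo update-grub
sudo reboot

# After rebooting, verify that the parameter took effect:
grep intel_iommu /proc/cmdline
```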

We have learned several new things from our investigation:

  • If something hangs on the native side while CUDA is being used, always read dmesg first.
  • Disabling virtualization in the BIOS is not enough to fix multi-GPU CUDA: you also have to disable the corresponding Linux driver explicitly.
  • NVIDIA's CUDA samples contain p2pBandwidthLatencyTest, which benchmarks peer-to-peer communication. For example, here is our output:
P2P Connectivity Matrix
D\D     0     1     2     3
0       1     1     0     0
1       1     1     0     0
2       0     0     1     1
3       0     0     1     1

Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D     0      1      2      3
0  353.83   5.98  11.22  11.15
1   11.10 152.05  10.77  10.89
2   11.14  10.76 355.12  11.37
3   11.17  10.76  11.32 353.83
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D     0      1      2      3
0  354.15  10.28  11.18  11.16
1   10.27 152.29  11.16  11.11
2   11.15  10.85 355.44  10.28
3   11.15  10.94  10.28 354.15
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D     0      1      2      3
0  356.35  19.31  19.64  19.86
1   18.48 245.06  18.73  18.59
2   19.82  19.18 357.36  18.23
3   19.64  20.07  17.72 355.87
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D     0      1      2      3
0  355.76  19.29  19.54  19.30
1   19.30 383.34  20.06  20.02
2   19.97  19.71 355.30  19.30
3   19.67  19.80  19.29 354.31
P2P=Disabled Latency Matrix (us)
GPU     0      1      2      3
0    1.25  11.90  10.32  10.33
1   12.95   1.37  16.49  15.06
2   11.29  11.45   1.27  12.95
3   11.65  10.75  11.12   1.26
CPU     0      1      2      3
0    3.96   9.96   9.93  10.93
1    9.40   4.33   9.26  10.34
2    9.71   9.60   4.06   9.82
3   10.95  10.42  10.49   3.99
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU     0      1      2      3
0    1.25   0.99  12.84  13.39
1    1.01   1.37  11.21  10.36
2   12.39  12.33   1.28   1.07
3   10.39  10.86   1.04   1.27
CPU     0      1      2      3
0    4.14   3.00  10.40  10.54
1    2.79   4.32   9.66  10.71
2    9.72  10.03   4.08   2.75
3   10.64  11.25   3.08   4.12

Interestingly, in our case peer-to-peer GPU communication does not increase the bandwidth, but it does decrease the latency tenfold.

Next steps

There are a number of pending tasks on our to-do list:

  • Set up distributed training across several nodes using MultiWorkerMirroredStrategy.
  • Compare the performance to Horovod.
  • Figure out how to distribute training in PyTorch. Please reach out to us if you have experience with this!

More about source{d} and MLonCode