NVLINK on RTX 2080: TensorFlow and Peer-to-Peer Performance with Linux



NVLINK is one of the more interesting features of NVIDIA’s new RTX GPUs. In this post I’ll take a look at the performance of NVLINK between two RTX 2080 GPUs, along with a comparison against the single-GPU testing I’ve recently done. The testing will be a simple look at the raw peer-to-peer data transfer performance and a couple of TensorFlow job runs with and without NVLINK.

For most people outside of the HPC world NVLINK is unfamiliar. The GeForce RTX Turing cards are the first to have this high-performance connection on a “consumer” GPU. NVLINK is the high-performance GPU-to-GPU interconnect fabric that was first used on motherboards for server gear built around NVIDIA’s SXM GPU modules. The Quadro GP100, GV100 and upcoming Quadro RTX cards also have NVLINK. The GeForce cards use a bridge connector similar to an SLI bridge; in fact, the NVLINK bridge on the RTX 20xx series is what provides SLI capability. The NVLINK implementation on the RTX 2080 and 2080 Ti is a full NVLINK-2 implementation but is limited to one “link” (a.k.a. “brick”) on the RTX 2080. It looks like there are two “links” on the RTX 2080 Ti, but I haven’t confirmed that they are aggregated yet (still waiting on a second card). The server Tesla SXM modules have six NVLINK-2 “links”.

Note: on the IBM POWER8 and POWER9 architectures NVLINK is also used as a high-performance interconnect from GPU to CPU. That is the hardware used in the Oak Ridge National Laboratory Summit supercomputer, the fastest computer in the world right now.

My colleague William George has done some testing with NVLINK on Windows 10, and at this point it doesn’t appear to be fully functional on that platform. You might want to check out his post NVIDIA GeForce RTX 2080 & 2080 Ti Do NOT Support Full NVLink in Windows 10. I think you can expect that to change soon; it should be fully functional on Windows 10 after a round of updates.

NVLINK is fully functional with the RTX 2080 on Ubuntu 18.04 with driver version 410.

I will be doing testing similar to what I did in the post NVIDIA RTX 2080 Ti vs 2080 vs 1080 Ti vs Titan V, TensorFlow Performance with CUDA 10.0. I’ll include some results from that post for comparison.


Test system

Hardware

  • Puget Systems Peak Single
  • Intel Xeon-W 2175 14-core
  • 128GB Memory
  • 1TB Samsung NVMe M.2
  • GPUs tested:
    • GTX 1080 Ti
    • RTX 2080 (2)
    • RTX 2080 Ti
    • Titan V

Software

Two TensorFlow builds were used, since the latest version of the TensorFlow docker image on NGC does not support multi-GPU for the CNN ResNet-50 training job I like to use. For that CNN job I used an older container with TensorFlow 1.4 linked with CUDA 9.0 (details below). For the “Big LSTM billion word” model training I used the latest container, with TensorFlow 1.10 linked with CUDA 10.0. Both of the test programs are from “nvidia-examples” in the container instances.

For details on how I have Docker/NVIDIA-Docker configured on my workstation, have a look at the following post along with the links it contains to the rest of that series: How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Part 5 Docker Performance and Resource Tuning.


NVLINK Peer-to-Peer Performance

The first thing to look at is what the NVLINK connection itself reports. The listing below is the per-link capability report for one of the RTX 2080 cards (this is the sort of output you can get from nvidia-smi’s NVLINK query, nvidia-smi nvlink -c). There is one link available:

  • Link 0, P2P is supported: true
  • Link 0, Access to system memory supported: true
  • Link 0, P2P atomics supported: true
  • Link 0, System memory atomics supported: true
  • Link 0, SLI is supported: true
  • Link 0, Link is supported: false

I believe that the “Link is supported: false” line refers to the CPU-to-GPU NVLINK connection on the IBM POWER architecture. I’m not completely sure about that since I cannot find any information about it.

The next test (the “simpleP2P” program from the CUDA samples) provides some additional information along with the performance of a CUDA memory copy from GPU to GPU. (I’m listing only the additional information from the output, in the form of questions and answers; a minimal code sketch of what this kind of test does follows the output.)

Does NVIDIA GeForce RTX NVLINK support Peer-To-Peer memory access?

Checking GPU(s) for support of peer to peer memory access…

  • Peer access from GeForce RTX 2080 (GPU0) -> GeForce RTX 2080 (GPU1) : Yes
  • Peer access from GeForce RTX 2080 (GPU1) -> GeForce RTX 2080 (GPU0) : Yes
    Enabling peer access between GPU0 and GPU1…

Does NVIDIA GeForce RTX NVLINK support Unified Virtual Addressing (UVA)?

Checking GPU0 and GPU1 for UVA capabilities…

  • GeForce RTX 2080 (GPU0) supports UVA: Yes
  • GeForce RTX 2080 (GPU1) supports UVA: Yes
    Both GPUs can support UVA, enabling…

How Fast is NVIDIA GeForce RTX NVLINK for CUDA Memory Copy?

  • cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 22.53GB/s
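
To make that more concrete, here is a minimal sketch (my own simplified illustration, not the NVIDIA sample code) of the CUDA runtime calls behind this kind of test: check peer access with cudaDeviceCanAccessPeer, enable it with cudaDeviceEnablePeerAccess, then time cudaMemcpyPeer transfers with CUDA events. The buffer size, repetition count and lack of error checking are my own simplifications.

// p2p_bw_sketch.cu -- build with: nvcc -o p2p_bw_sketch p2p_bw_sketch.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 * 1024 * 1024;   // 64MB transfer buffer
    const int reps = 100;                     // number of timed copies

    // Can each GPU directly address the other's memory?
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("Peer access GPU0->GPU1: %s, GPU1->GPU0: %s\n",
           can01 ? "Yes" : "No", can10 ? "Yes" : "No");
    if (!can01 || !can10) return 1;

    // Enable peer access in both directions and allocate a buffer on each GPU.
    float *buf0, *buf1;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&buf1, bytes);

    // Time repeated GPU0 -> GPU1 copies with CUDA events.
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; i++)
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbs = (double)reps * bytes / (ms / 1000.0) / 1e9;
    printf("cudaMemcpyPeer GPU0 -> GPU1: %.2f GB/s\n", gbs);

    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}

With peer access enabled over the NVLINK bridge, a loop like this is what produces numbers in the ~24 GB/s range; with peer access disabled the copies are staged through host memory over PCIe and the bandwidth drops accordingly.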

The following output shows detailed information on bandwidth and latency between the two RTX 2080 GPUs (this is the “p2pBandwidthLatencyTest” program from the CUDA samples).

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]


P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 389.09   5.82
     1   5.82 389.35
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 386.63  24.23
     1  24.23 389.76
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 386.41  11.59
     1  11.57 391.01
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 382.58  48.37
     1  47.95 390.62
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.67  20.55
     1  11.36   1.64

   CPU     0      1
     0   4.01   8.29
     1   8.37   3.65
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.67   0.92
     1   0.92   1.64

   CPU     0      1
     0   3.70   2.79
     1   2.95   3.68

For 2 RTX 2080 GPUs with NVLINK we see:

  • Unidirectional bandwidth: ~24 GB/s
  • Bidirectional bandwidth: ~48 GB/s
  • Latency (peer-to-peer disabled): 11-20 microseconds GPU-to-GPU
  • Latency (peer-to-peer enabled): ~1 microsecond GPU-to-GPU

Those bandwidth numbers are essentially at the theoretical peak of a single NVLINK-2 link, which is 25 GB/s in each direction (50 GB/s bidirectional).

Now on to something a bit more “real-world”.

The convolutional neural network (CNN) and LSTM problems I’ll test will not expose much of the benefit of using NVLINK. This is because their multi-GPU algorithms achieve parallelism mostly by distributing data as independent batches of images or words across the two GPUs. There is little use of GPU-to-GPU communication. Algorithms with finer-grained parallelism that need more direct data and instruction access across the GPUs would benefit more. The sketch below illustrates the pattern.
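
As a rough illustration of why that is, here is a small, hypothetical two-GPU CUDA sketch (not code from either benchmark) of the data-parallel pattern: each GPU runs its compute kernel on its own batch independently, and the only inter-GPU traffic is one buffer of partial results (“gradients”) copied with cudaMemcpyPeer and summed at the end of the step.

// data_parallel_sketch.cu -- build with: nvcc -o data_parallel_sketch data_parallel_sketch.cu
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for a real forward/backward pass: just fills a "gradient" buffer.
__global__ void fake_grad(float *g, int n, float val) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) g[i] = val;
}

// Crude reduction step: dst += src, done on GPU0 after the peer copy.
__global__ void add_into(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] += src[i];
}

int main() {
    const int n = 1 << 20;              // 1M "gradient" elements per GPU
    const int blocks = (n + 255) / 256;
    float *g0, *g1, *g1_on_gpu0;

    // GPU0: enable peer access to GPU1 and work on its own batch.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&g0, n * sizeof(float));
    cudaMalloc(&g1_on_gpu0, n * sizeof(float));
    fake_grad<<<blocks, 256>>>(g0, n, 1.0f);

    // GPU1: enable peer access to GPU0 and work on its own batch independently.
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&g1, n * sizeof(float));
    fake_grad<<<blocks, 256>>>(g1, n, 2.0f);
    cudaDeviceSynchronize();

    // The only GPU-to-GPU communication in the whole "training step":
    // copy GPU1's gradients to GPU0 (over NVLINK when P2P is enabled) and sum.
    cudaMemcpyPeer(g1_on_gpu0, 0, g1, 1, n * sizeof(float));
    cudaSetDevice(0);
    add_into<<<blocks, 256>>>(g0, g1_on_gpu0, n);
    cudaDeviceSynchronize();

    printf("Combined gradients from 2 GPUs (%d elements transferred once)\n", n);
    return 0;
}

Because the compute kernels dominate the runtime and the peer copy is only a small fraction of each step, speeding up that copy with NVLINK moves the needle only a little, which is what the benchmark numbers below reflect.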

One of the interesting questions this part of the testing will address is “is it better to get two RTX 2080s than one more expensive card?”. Let’s find out.

I am mostly testing with the benchmarks that I used in the recent post “NVIDIA RTX 2080 Ti vs 2080 vs 1080 Ti vs Titan V, TensorFlow Performance with CUDA 10.0“. However, for the CNN I am using an older NGC docker image with TensorFlow 1.4 linked with CUDA 9.0 and NCCL. I’m using this in order to have multi-GPU support via the NCCL communication library for the CNN code; the most recent version of that code does not support it. The LSTM “Billion Word” benchmark is run with the newer version, TensorFlow 1.10 linked with CUDA 10.0.
I’ll give the command-line input and some of the output for reference.

TensorFlow CNN: ResNet-50

Docker container image tensorflow:18.03-py2 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.03-py2

Example job command line and truncated startup output (with NVLINK bridge):

NGC/tensorflow/nvidia-examples/cnn# python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=2 --fp16

TensorFlow:  1.4.0
This script: nvcnn.py v1.4
Cmd line args:
  --model=resnet50
  --batch_size=64
  --num_gpus=2
  --fp16
Num images:  Synthetic
Model:       resnet50
Batch size:  128 global
             64 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Data type:   fp16
Have NCCL:   True
Using NCCL:  True

...
2018-10-11 01:01:05.405568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-10-11 01:01:05.405598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1
2018-10-11 01:01:05.405604: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y Y
2018-10-11 01:01:05.405609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   Y Y
...

Note that --fp16 means “use Tensor-cores”.

I ran that job at FP32 and FP16 (Tensor-cores), both with and without NVLINK, on the two RTX 2080 GPUs.

GPU                       FP32 Images/sec    FP16 (Tensor-cores) Images/sec
GTX 1080 Ti               207                N/A
RTX 2080                  207                332
RTX 2080 Ti               280                437
Titan V                   299                547
2 x RTX 2080              364                552
2 x RTX 2080 + NVLINK     373                566

[Chart: ResNet-50 with RTX NVLINK]


TensorFlow LSTM: Big-LSTM 1 Billion Word Dataset

Docker container image tensorflow:18.09-py3 from NGC,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.09-py3

Example job command line and truncated startup output (no NVLINK bridge):

/NGC/tensorflow/nvidia-examples/big_lstm# python single_lm_train.py --mode=train --logdir=./logs --num_gpus=2 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=90,num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=256

...
2018-10-10 22:43:54.139543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-10 22:43:54.139577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1
2018-10-10 22:43:54.139583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N N
2018-10-10 22:43:54.139603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N N
...
GPU                       FP32 Words/sec
GTX 1080 Ti               6460
RTX 2080 (Note 1)         5071
RTX 2080 Ti               8945
Titan V (Note 2)          7066
Titan V (Note 3)          8373
2 x RTX 2080              8882
2 x RTX 2080 + NVLINK     9711

[Chart: LSTM with RTX NVLINK]

  • Note 1: With only 8GB of memory on the RTX 2080 I had to drop the batch size down to 256 to keep from getting “out of memory” errors. That typically has a big (downward) influence on performance.
  • Note 2: For whatever reason this result for the Titan V is worse than expected. This is TensorFlow 1.10 linked with CUDA 10.0 running NVIDIA’s code for the LSTM model. The RTX 2080 Ti performance was very good!
  • Note 3: I re-ran the “big-LSTM” job on the Titan V using TensorFlow 1.4 linked with CUDA 9.0 and got results consistent with what I have seen in the past. I have no explanation for the slowdown with the newer version of “big-LSTM”.

So, is it better to get two RTX 2080s or one more expensive card? This might not be an easy decision. I haven’t done very much testing yet, and I’m still waiting on cards to do multi-GPU testing with the RTX 2080 Ti. I suspect that a multi-GPU configuration with RTX 2080 Ti cards will become the new standard workstation setup for folks doing ML/AI work. I’m a little reluctant to recommend the RTX 2080 because of its 8GB memory limitation. It is, however, obviously a great card! If that is what your budget allows for, you would still be getting a solid-performing card.

The NVLINK bridge is a nice option for use with two cards, but just two cards. There will be use cases where the NVLINK bridge has a significant impact, but a lot of GPU code is optimized to minimize communication. This is especially true for ML workloads, where parallelism is often achieved by distributing data across devices. Still, it is a very nice option to have! I will be looking for usages that highlight the advantages it offers in future “real-world” testing.

We have a lot of great GPUs to do computational work with right now. These RTX cards give great compute performance for the cost. And then there is the Titan V for when you need the wonderful double-precision (FP64) performance of the Volta architecture.

It shouldn’t be too much longer before I get a chance to look at performance with multiple RTX 2080 Ti cards. I’m really looking forward to that!

Happy computing! –dbk