Multiple GPUs for graphics and deep learning
For a long time I have been using a good old nvidia GeForce GTX 1050 for my display and deep learning needs. I reported a few times how to get Tensorflow running on Debian/Sid, see here and here. Later on I switched to an AMD GPU in the hope that an open source approach to both the GPU driver and deep learning (ROCm) would improve the general experience. Unfortunately it turned out that AMD GPUs are generally not ready for deep learning usage.
The problems with AMD and ROCm are far and wide. First of all, it seems that for anything more complicated than simple stuff, AMD’s flagship RX 5700(XT) and all GFX10 (Navi) based cards are not(!!!) supported in ROCm. Yes, you read that correctly … AMD does not support the 5700(XT) cards in the ROCm stack. Some simple stuff works, but nothing for real computations.
Then, even IF they were supported, ROCm as distributed is currently a huge pain in the butt. The source code is a huge mess, and building usable packages from it is probably possible, but quite painful (I am a member of the ROCm packaging team in Debian, and have spent many hours trying). And the packages provided by AMD are not installable on Debian/sid due to library incompatibilities.
So that left me with a bit of a problem: for work I need to train quite a few neural networks, do model selection, etc. Doing this on a CPU is a bit of a burden. So in the end I decided to put the nVidia card back into the computer (well, after moving it to a bigger case, but that is a different story to tell). Here are the steps I took to get both cards working for their respective target: the AMD GPU driving the console and X (and games!), and the nVidia card doing the deep learning stuff (tensorflow using the GPU).
Starting point
The starting point was a working AMD GPU installation. The AMD GPU is also the first GPU (top slot) and thus the one used by the BIOS and the Linux console. If you want the video output on the second card you need to resort to tricks, and probably lose the console output, etc. etc. So not a solution for me.
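To double-check which card sits in which slot and which kernel driver is bound to each, something along these lines should work (the exact output of course depends on your hardware):
# list all VGA/3D controllers; the AMD card should appear first (top slot)
lspci -nn | grep -Ei 'vga|3d'
# show the kernel driver in use for each graphics device
lspci -k | grep -EA3 'VGA|3D'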
Installing libcuda1 and the nvidia kernel drivers
The next step was installing the libcuda1 package:
apt install libcuda1
This installs a lot of stuff, including the nvidia drivers, GLX libraries, the alternatives setup, and the update-glx tool and package.
The kernel module should be built and installed automatically for your kernel.
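To make sure the DKMS build really succeeded for the running kernel, a quick check along these lines should do (assuming the usual Debian DKMS-based nvidia driver packages):
# the module should be listed as built/installed for the current kernel
dkms status | grep -i nvidia
# load it and verify it is actually present
sudo modprobe nvidia
lsmod | grep '^nvidia'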
Installing CUDA
Follow more or less the instructions here and do
wget -O- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo tee /etc/apt/trusted.gpg.d/nvidia-cuda.asc
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" | sudo tee /etc/apt/sources.list.d/nvidia-cuda.list
sudo apt-get update
sudo apt-get install cuda-libraries-10-1
Warning! At the moment Tensorflow packages require CUDA 10.1, so don’t install the 10.0 version. This might change in the future!
This will install lots of libs into /usr/local/cuda-10.1 and add the respective directory to the ld.so path by creating the file /etc/ld.so.conf.d/cuda-10-1.conf.
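A quick sanity check that the dynamic linker really picks up the new directory (paths as created by the packages above):
# the conf file should contain the CUDA library directory
cat /etc/ld.so.conf.d/cuda-10-1.conf
# refresh the linker cache and confirm the CUDA runtime is visible
sudo ldconfig
ldconfig -p | grep libcudart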
Installing CUDA CuDNN
One difficult-to-satisfy dependency is the CuDNN libraries. In our case we need the version 7 library for CUDA 10.1. To download these files one needs to have an NVIDIA developer account, which is quick and painless to set up. After that go to the CuDNN page where one needs to select Archived releases, then Download cuDNN v7.N.N (xxxx NN, YYYY), for CUDA 10.1, and then cuDNN Runtime Library for Ubuntu18.04 (Deb).
At the moment (as of today) this will download the file libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb, which needs to be installed with dpkg -i libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb.
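To verify the installation, something like the following should show both the package and the library being picked up by the dynamic linker:
# the cuDNN package should be listed as installed
dpkg -l | grep libcudnn7
# and the library should be visible to the dynamic linker
ldconfig -p | grep libcudnn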
Updating the GLX setting
Here now comes the very interesting part: one needs to set up the GLX libraries. Reading the output of update-glx --help and then the output of update-glx --list glx:
$ update-glx --help
update-glx is a wrapper around update-alternatives supporting only configuration
of the 'glx' and 'nvidia' alternatives. After updating the alternatives, it
takes care to trigger any follow-up actions that may be required to complete
the switch.
It can be used to switch between the main NVIDIA driver version and the legacy
drivers (eg: the 304 series, the 340 series, etc).
For users with Optimus-type laptops it can be used to enable running the discrete
GPU via bumblebee.
Usage: update-glx <command>
Commands:
  --auto <name>         switch the master link <name> to automatic mode.
  --display <name>      display information about the <name> group.
  --query <name>        machine parseable version of --display <name>.
  --list <name>         display all targets of the <name> group.
  --config <name>       show alternatives for the <name> group and ask the
                        user to select which one to use.
  --set <name> <path>   set <path> as alternative for <name>.
<name> is the master name for this link group.
  Only 'nvidia' and 'glx' are supported.
<path> is the location of one of the alternative target files.
  (e.g. /usr/lib/nvidia)
$ update-glx --list glx
/usr/lib/mesa-diverted
/usr/lib/nvidia
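As a side note, based on the help output above the currently selected alternative can be inspected with --display, which is handy before and after switching:
# show the currently active glx alternative and its links
update-glx --display glx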
I was tempted to use
update-glx --config glx /usr/lib/mesa-diverted
because in the end the Mesa GLX libraries should be used to drive the display via the AMD GPU.
Unfortunately, with this the nvidia kernel module was not loaded, nvidia-persistenced couldn’t run because the library libnvidia-cfg1 wasn’t found (not sure it was needed at all…), and with that there was also no way to run tensorflow on the GPU.
So what I did was try
update-glx --auto glx
(which is the same as update-glx --config glx /usr/lib/nvidia), rebooted, and decided to check afterwards what was broken.
To my big surprise, the AMD GPU still worked out of the box, including direct rendering, and the games I tried (Overload, Supraland via Wine) all worked without a hitch.
Not that I really understand why the GLX libraries that are seemingly now in use come from nvidia yet everything works the same (if anyone has an explanation, that would be great!), but since I haven’t had any problems so far, I am content.
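For anyone wanting to double-check such a split setup, the following should show which driver actually renders the display on the one hand, and the nvidia card with its loaded driver on the other (glxinfo comes from the mesa-utils package):
# which OpenGL vendor/renderer actually drives the display
glxinfo | grep -E 'OpenGL (vendor|renderer)'
# the nvidia card and driver as seen by the CUDA side
nvidia-smi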
Checking GPU usage in tensorflow
Make sure that you remove tensorflow-rocm and reinstall tensorflow with GPU support:
pip3 uninstall tensorflow-rocm
pip3 install --upgrade tensorflow-gpu
After that a simple
$ python3 -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
....(lots of output)
2020-09-02 11:57:04.673096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3581 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
tf.Tensor(1093.4915, shape=(), dtype=float32)
$
should indicate that the GPU is used by tensorflow!
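As an additional quick check, recent TensorFlow versions can list the devices they see directly; this should print exactly one GPU entry for the GeForce card:
# list the physical GPUs visible to TensorFlow (TF 2.x API)
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"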
The R Keras package should also work out of the box and pick up the system-wide tensorflow, which in turn picks up the GPU; see this post for example code to run as a test.
Conclusion
All in all it was easier than expected, despite the dances one has to do for nvidia to get the correct libraries. What still puzzles me is the selection option in update-glx, which might need better support for secondary nvidia GPU cards.