Issue with NVIDIA driver on basic_gpu_devices
- This topic has 20 replies, 3 voices, and was last updated 1 year, 6 months ago by Ilya Baldin.
May 31, 2023 at 2:03 pm #4343
While running through the basic_gpu_devices.ipynb notebook, I’m having trouble installing the NVIDIA drivers. When I run the given cells, I most recently received the error “Curl error (7): Couldn’t connect to server for https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo []”. When the curl error is not present, I have instead received the notification “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running”.
Should there be an updated command or link to install the most current driver, and is that something I can do on my end? Or is this something specific to the site I’m using? The commands currently used to install the NVIDIA driver and CUDA are:
'sudo dnf install -q -y epel-release',
'sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
'sudo dnf install -q -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda'
May 31, 2023 at 2:07 pm #4344
Can you please let us know which site(s) you are using and where you received this error?
May 31, 2023 at 2:13 pm #4345
It’s set up to select a random site that has rtx6000 available, but the one I most recently used was with MICH.
May 31, 2023 at 2:30 pm #4349
It seems that developer.download.nvidia.com resolves only to IPv4 addresses. We will find out if there is a workaround (and follow up on this thread), but the short-term solution is to use one of the following racks for your experiments:
MAX, TACC, MASS, UCSD, GPN.
These racks provide IPv4 addresses for the VM Management Network.
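To see which address families a hostname offers before picking a site, something like the following Python sketch could be run on a node. The function names are made up for illustration and are not part of FABlib or the notebook. Note that on a node already configured with DNS64, the resolver may return synthesized IPv6 addresses, so the result reflects what that node's resolver sees.

```python
import ipaddress
import socket


def address_families(addresses):
    """Return the set of IP versions (4 and/or 6) found in a list of address strings."""
    return {ipaddress.ip_address(a).version for a in addresses}


def reachable_from_ipv6_only(addresses):
    """A host can be reached directly from an IPv6-only network only if it has an IPv6 address."""
    return 6 in address_families(addresses)


def resolve(hostname, port=443):
    """Resolve a hostname with the system resolver and return its unique addresses."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


# Example use on a node (needs working DNS):
#   addrs = resolve("developer.download.nvidia.com")
#   print(addrs, reachable_from_ipv6_only(addrs))
```

If `reachable_from_ipv6_only` comes back false for a host, a VM whose management network is IPv6-only would need either a different site or a translation mechanism to reach it.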
May 31, 2023 at 2:31 pm #4350
Sarah,
Can you post the output of the ‘sudo lspci’ command?
May 31, 2023 at 2:37 pm #4352
… and also
sudo lsmod
May 31, 2023 at 2:54 pm #4355
Sure, this is at the TACC site.
May 31, 2023 at 3:27 pm #4367
So it looks like the card is attached, but I don’t see the nvidia driver loaded in the kernel modules, so it probably didn’t get installed properly.
Are you still getting an error from trying to download the repo index (reaching https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo)?
Let’s focus on using your VM at TACC for now. As Mert indicated above, some of the earlier errors on other sites may have been due to the fact that you were ending up on a site with IPv6 management network and NVidia site unfortunately has limited to no IPv6 presence. There are two solutions to that situation:
1. Use a different site, which is what you did – so we will stick with that
2. Update your VM DNS configuration to use something called NAT64 which allows IPv4 and IPv6 networks to communicate and resolve names. There is a notebook that describes how to deal with it called ‘Access non-IPv6 services (i.e. GitHub) from IPv6 FABRIC nodes’ that is listed from the ‘Start Here’ notebook. <- this is just so you know in the future
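As background on option 2: with NAT64, a DNS64 resolver hands an IPv6-only client a synthesized IPv6 address that embeds the IPv4 address of the target. The sketch below illustrates that mapping using the well-known prefix 64:ff9b::/96 from RFC 6052. It is only an illustration under that assumption; the resolvers set up by the FABRIC notebook may use a different, network-specific prefix, and the function names here are hypothetical.

```python
import ipaddress

# Assumption: the RFC 6052 well-known prefix. A given NAT64 deployment
# may use its own network-specific /96 prefix instead.
WELL_KNOWN_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_nat64(ipv4, prefix=WELL_KNOWN_PREFIX):
    """Embed an IPv4 address in the low 32 bits of a /96 NAT64 prefix."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


def extract_ipv4(ipv6, prefix=WELL_KNOWN_PREFIX):
    """Recover the embedded IPv4 address from a synthesized /96 NAT64 address."""
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v6 = ipaddress.IPv6Address(ipv6)
    if v6 not in prefix:
        raise ValueError("address is not inside the NAT64 prefix")
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)


print(synthesize_nat64("192.0.2.33"))     # 64:ff9b::c000:221
print(extract_ipv4("64:ff9b::c000:221"))  # 192.0.2.33
```

The client then connects to the synthesized IPv6 address, and the NAT64 gateway translates the traffic to the embedded IPv4 destination.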
For now can you get on the console of your TACC VM (via SSH) and run this command to see if it can reach NVidia site:
$ curl https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
If curl isn’t available, you can try wget.
May 31, 2023 at 4:06 pm #4391
When I download from the repo index using the code provided, I don’t receive an error, and it looks like both the NVIDIA and CUDA drivers installed properly. But when I run node.execute("nvidia-smi"), it still gives the error about failing to communicate with the NVIDIA driver.
I ran the curl command (via SSH) and it looks like it ran properly. I’ll attach an image of the output. I ran the same nvidia-smi command (in the notebook) after the curl command in the terminal and it still gave the communication failure.
- This reply was modified 1 year, 7 months ago by Sarah Maxwell.
May 31, 2023 at 4:09 pm #4394
OK, so we are past the communication issues with the NVidia website. Have you rebooted the VM to make sure the driver is installed properly?
May 31, 2023 at 4:14 pm #4395
This is what a working configuration looks like (I’m using a VM at Clemson, but it should be no different):
ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lspci | grep -i nvidia
00:07.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
and the driver is listed:
ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lsmod | grep nv
nvidia_uvm 1265664 0
nvidia_drm 65536 2
nvidia_modeset 1273856 2 nvidia_drm
nvidia 55701504 106 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 4 cirrus,nvidia_drm
drm 495616 8 drm_kms_helper,nvidia,cirrus,nvidia_drm
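The same check can be scripted against captured lsmod output, for example the text returned when the command is run from the notebook. This is a minimal sketch: the helper names are hypothetical, and it assumes the usual lsmod layout of module name, size, and use count on each line.

```python
def loaded_modules(lsmod_output):
    """Return the set of module names listed in lsmod-style output."""
    names = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        # Skip blank or malformed lines and the "Module Size Used by" header.
        if len(fields) < 3 or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def nvidia_driver_loaded(lsmod_output):
    """True if the core 'nvidia' kernel module appears in the lsmod output."""
    return "nvidia" in loaded_modules(lsmod_output)


# Two lines taken from the working configuration shown above.
sample = """nvidia_uvm 1265664 0
nvidia 55701504 106 nvidia_uvm,nvidia_modeset"""
print(nvidia_driver_loaded(sample))  # True
print(nvidia_driver_loaded(""))      # False
```

Empty output, as in the failing case discussed in this thread, means the module is not loaded and nvidia-smi has nothing to talk to.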
May 31, 2023 at 4:15 pm #4397
Oh! I haven’t since I ran the curl command, but I did when I installed it through the notebook.
Ok, I just ran the ‘sudo reboot’ command followed by the nvidia-smi command, and it still reports the failure to communicate with the NVIDIA driver.
May 31, 2023 at 4:19 pm #4399
Do you see the nvidia driver actually installed? As I said, I did not see it in the list you sent earlier. Just above, in my previous message, I showed what correct output should look like (modulo that I was using Ubuntu 20; I am not sure what you are using).
May 31, 2023 at 4:19 pm #4400
Ok, your lspci configuration looks the same as mine, but when I run the lsmod it doesn’t output anything.
May 31, 2023 at 4:20 pm #4401
So that means the driver isn’t installed. I suggest you re-run the driver installation by hand (since you are already on the console) and see if you can see any errors.
nvidia-smi fails because no NVidia driver is installed.
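One way to re-run the installation step by step and stop at the first error is sketched below. It reuses the three commands quoted at the top of this thread, with two changes that are assumptions of this sketch rather than anything from the notebook: the `-q` flag is dropped so that dnf messages stay visible, and the helper names are invented for illustration.

```python
import subprocess

# Commands from the first post, without -q so errors are not suppressed.
INSTALL_STEPS = [
    "sudo dnf install -y epel-release",
    "sudo dnf config-manager --add-repo "
    "https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo",
    "sudo dnf install -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda",
]


def run_shell(command):
    """Run one shell command and return (exit code, combined stdout and stderr)."""
    result = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout


def first_failure(steps, runner=run_shell):
    """Run steps in order; return (command, code, output) for the first failure, else None."""
    for command in steps:
        code, output = runner(command)
        if code != 0:
            return command, code, output
    return None


# Example use on the VM console:
#   failure = first_failure(INSTALL_STEPS)
#   if failure:
#       print("Failed step:", failure[0])
#       print(failure[2])
```

Stopping at the first non-zero exit code keeps a later step from hiding the message that explains why the driver never got installed.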
- The topic ‘Issue with NVIDIA driver on basic_gpu_devices’ is closed to new replies.