Issue with NVIDIA driver on basic_gpu_devices
This topic has 20 replies, 3 voices, and was last updated 1 year, 4 months ago by Ilya Baldin.
May 31, 2023 at 2:03 pm #4343
When running through the basic_gpu_devices.ipynb notebook, I'm having trouble installing the NVIDIA drivers. When I run the given cells, I most recently received the error "Curl error (7): Couldn't connect to server for https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo []", but I have also received the message "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running" when the curl error is not present.
Should there be an updated command/link to install the most current driver, and is that something I can do on my end? Or is this something specific to the site I'm using? The current commands being used to install the NVIDIA driver/CUDA are:
'sudo dnf install -q -y epel-release',
'sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
'sudo dnf install -q -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda'
May 31, 2023 at 2:07 pm #4344
Can you please let us know which site(s) you are using when you receive this error?
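(For reference, a minimal sketch of how the install commands from the first post are typically driven from the GPU notebook with FABlib's node.execute; the slice and node names here are placeholders, not the notebook's exact cells.)

    # Sketch only: run the CUDA/driver install commands on the GPU node via FABlib.
    from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

    fablib = fablib_manager()
    slice = fablib.get_slice(name="MySlice")    # hypothetical slice name
    node = slice.get_node(name="node1")         # hypothetical node name

    commands = [
        'sudo dnf install -q -y epel-release',
        'sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
        'sudo dnf install -q -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda',
    ]
    for command in commands:
        stdout, stderr = node.execute(command)  # execute returns (stdout, stderr)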
May 31, 2023 at 2:13 pm #4345
It's set up to select a random site that has an RTX6000 available, but the one I most recently used was MICH.
May 31, 2023 at 2:30 pm #4349
It seems that developer.download.nvidia.com resolves only over IPv4. We will find out if there is a workaround (and follow up on this thread), but the short-term solution is to use one of the following racks for your experiments:
MAX, TACC, MASS, UCSD, GPN.
These racks provide IPv4 addresses for the VM management network.
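A minimal sketch of restricting the notebook's random site choice to these IPv4-capable racks, assuming standard FABlib calls; the slice, node, and component names are placeholders:

    # Sketch only: pick a site from the racks with IPv4 management addresses,
    # then request a VM with an RTX6000 there.
    import random
    from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

    fablib = fablib_manager()

    ipv4_sites = ["MAX", "TACC", "MASS", "UCSD", "GPN"]
    site = random.choice(ipv4_sites)

    slice = fablib.new_slice(name="MySlice-GPU")          # hypothetical slice name
    node = slice.add_node(name="node1", site=site)        # hypothetical node name
    node.add_component(model="GPU_RTX6000", name="gpu1")  # RTX6000 GPU component
    slice.submit()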
May 31, 2023 at 2:31 pm #4350
Sarah,
Can you post the output of the 'sudo lspci' command?
May 31, 2023 at 2:37 pm #4352
… and also
sudo lsmod
May 31, 2023 at 2:54 pm #4355
Sure, this is at the TACC site.
May 31, 2023 at 3:27 pm #4367
So it looks like the card is attached, but I don't see the NVIDIA driver loaded in the kernel modules, so it probably didn't get installed properly.
Are you still getting an error when trying to download the repo index (reaching https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo)?
Let's focus on using your VM at TACC for now. As Mert indicated above, some of the earlier errors on other sites may have been due to the fact that you were ending up on a site with an IPv6 management network, and the NVIDIA site unfortunately has limited to no IPv6 presence. There are two solutions to that situation:
1. Use a different site, which is what you did, so we will stick with that.
2. Update your VM's DNS configuration to use something called NAT64, which allows IPv4 and IPv6 networks to communicate and resolve names. There is a notebook that describes how to deal with this, called 'Access non-IPv6 services (i.e. GitHub) from IPv6 FABRIC nodes', linked from the 'Start Here' notebook (a sketch follows at the end of this post). <- this is just so you know for the future
For now, can you get on the console of your TACC VM (via SSH) and run this command to see if it can reach the NVIDIA site:
$ curl https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
If curl isn't available, you can try wget.
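A minimal sketch of the NAT64/DNS64 workaround from point 2 above, run through node.execute; the resolver address is a placeholder, so take the actual addresses from the 'Access non-IPv6 services (i.e. GitHub) from IPv6 FABRIC nodes' notebook:

    # Sketch only: point the VM's resolver at a DNS64/NAT64 service so IPv4-only
    # hosts such as developer.download.nvidia.com become reachable from an
    # IPv6-only management network. Replace <DNS64_RESOLVER_ADDRESS> with the
    # address(es) listed in the NAT64 notebook.
    nat64_setup = (
        "sudo sed -i.bak 's/^nameserver/#nameserver/' /etc/resolv.conf && "
        "echo 'nameserver <DNS64_RESOLVER_ADDRESS>' | sudo tee -a /etc/resolv.conf"
    )
    stdout, stderr = node.execute(nat64_setup)

    # Re-test reachability of the CUDA repo after the change.
    stdout, stderr = node.execute(
        "curl -sI https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo"
    )
    print(stdout)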
May 31, 2023 at 4:06 pm #4391
When I download from the repo index using the code provided I don't receive an error, and it looks like both the NVIDIA and CUDA drivers installed properly. But when I run node.execute("nvidia-smi") it still gives the error about failing to communicate with the NVIDIA driver.
I ran the curl command (via SSH) and it looks like it ran properly. I’ll attach an image of the output. I ran the same nvidia-smi command (in the notebook) after the curl command in the terminal and it still gave the communication failure.
May 31, 2023 at 4:09 pm #4394
OK, so we are past the communication issues with the NVIDIA website. Have you rebooted the VM to make sure the driver is installed properly?
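A sketch of a reboot-and-recheck sequence from the notebook, assuming node.execute and a plain fixed wait rather than any particular FABlib reconnect helper:

    # Sketch only: reboot the VM, give it time to come back up, then re-check the driver.
    import time

    node.execute("sudo reboot")   # the SSH connection drops here; that is expected
    time.sleep(120)               # crude wait for the VM to boot; adjust as needed

    stdout, stderr = node.execute("nvidia-smi")
    print(stdout, stderr)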
May 31, 2023 at 4:14 pm #4395
This is what a working configuration looks like (I'm using a VM at Clemson, but it should be no different):
ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lspci | grep -i nvidia
00:07.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
and the driver is listed:
ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lsmod | grep nv
nvidia_uvm           1265664  0
nvidia_drm             65536  2
nvidia_modeset       1273856  2 nvidia_drm
nvidia              55701504  106 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  4 cirrus,nvidia_drm
drm                   495616  8 drm_kms_helper,nvidia,cirrus,nvidia_drm
May 31, 2023 at 4:15 pm #4397
Oh! I haven't since I ran the curl command, but I did when I installed it through the notebook.
OK, I just ran the 'sudo reboot' command followed by the nvidia-smi command, and it still reports the "failed to communicate with the NVIDIA driver" error.
May 31, 2023 at 4:19 pm #4399
Do you see the NVIDIA driver actually installed? As I said, I did not see it in the list you sent earlier. Just above, in my previous message, I showed what the correct output should look like (modulo that I was using Ubuntu 20; not sure what you are using).
May 31, 2023 at 4:19 pm #4400
OK, your lspci output looks the same as mine, but when I run lsmod it doesn't output anything.
May 31, 2023 at 4:20 pm #4401
So that means the driver isn't installed. I suggest you re-run the driver installation by hand (since you are already on the console) and see whether any errors appear. nvidia-smi fails because no NVIDIA driver is installed.
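A sketch of the by-hand reinstall and check sequence on the VM console, using the same packages as the notebook but with -q dropped so dnf prints any errors; the modprobe check is an extra suggestion, not part of the notebook:

    # Sketch only: rerun the install verbosely and watch for errors.
    sudo dnf install -y epel-release
    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo dnf install -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda

    # Check whether the nvidia kernel module is present / loadable.
    lsmod | grep nvidia || sudo modprobe nvidia

    # Reboot, then verify once the VM is back.
    sudo reboot
    # ...after logging back in:
    nvidia-smi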
The topic 'Issue with NVIDIA driver on basic_gpu_devices' is closed to new replies.