Issue with NVIDIA driver on basic_gpu_devices

#4343
Sarah Maxwell
Participant

When running through the basic_gpu_devices.ipynb notebook, I’m having trouble installing the NVIDIA drivers. When I run the given cells, I have most recently received the error “Curl error (7): Couldn’t connect to server for https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo []”, but have also received the notification that “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running” when the curl error is not present.

Should there be an updated command/link to install the most current driver, and is that something I can do on my end? Or is this something specific to the site I’m using? The current commands being used to install the NVIDIA driver/CUDA are:

'sudo dnf install -q -y epel-release',
'sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
'sudo dnf install -q -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda'
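
For reference, a minimal sketch of how a notebook cell typically drives these commands through fablib, assuming node is the slice’s GPU node object (illustrative, not the exact notebook code):

commands = [
    'sudo dnf install -q -y epel-release',
    'sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
    'sudo dnf install -q -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda',
]
for command in commands:
    # fablib's Node.execute() runs the command on the VM over SSH and returns (stdout, stderr);
    # printing stderr makes repo/download failures like the curl error visible.
    stdout, stderr = node.execute(command)
    if stderr:
        print(stderr)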

       

#4344
Mert Cevik
Moderator

Can you please let us know which site(s) you were using when you received this error?

#4345
Sarah Maxwell
Participant

It’s set up to select a random site that has an rtx6000 available, but the one I most recently used was MICH.

#4349
Mert Cevik
Moderator

It seems that developer.download.nvidia.com resolves only over IPv4. We will find out if there is a workaround (and follow up on this thread), but the short-term solution is to use one of the following racks for your experiments:

MAX, TACC, MASS, UCSD, GPN.

These racks provide IPv4 addresses for the VM Management Network.
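
For anyone following along, a minimal sketch of pinning the slice to one of those racks instead of letting the notebook pick a random site (slice and node names here are illustrative; the calls follow the usual fablib pattern from the GPU notebooks):

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
slice = fablib.new_slice(name="gpu-slice")
# Pin the VM to an IPv4-capable rack such as TACC rather than a random RTX6000 site
node = slice.add_node(name="gpu-node", site="TACC")
node.add_component(model="GPU_RTX6000", name="gpu")
slice.submit()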

#4350
Ilya Baldin
Participant

Sarah,

Can you post the output of the ‘sudo lspci’ command?

#4352
Ilya Baldin
Participant

… and also sudo lsmod

#4355
Sarah Maxwell
Participant

Sure, this is at the TACC site.

#4367
Ilya Baldin
Participant

So it looks like the card is attached, but I don’t see the nvidia driver loaded in the kernel modules, so it probably didn’t get installed properly.

Are you still getting an error from trying to download the repo index (reaching https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo)?

Let’s focus on using your VM at TACC for now. As Mert indicated above, some of the earlier errors on other sites may have been due to the fact that you were ending up on a site with an IPv6 management network, and the NVidia site unfortunately has limited to no IPv6 presence. There are two solutions to that situation:

1. Use a different site, which is what you did – so we will stick with that

2. Update your VM DNS configuration to use something called NAT64, which allows IPv4 and IPv6 networks to communicate and resolve names. There is a notebook that describes how to deal with it, called ‘Access non-IPv6 services (i.e. GitHub) from IPv6 FABRIC nodes’, which is listed in the ‘Start Here’ notebook (this is just so you know for the future; a rough sketch of the idea follows below).
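
A rough sketch of that NAT64/DNS64 idea, run from the notebook against the VM. The resolver addresses below are only an example of a public DNS64 service, not necessarily what the FABRIC notebook uses – follow that notebook for the supported procedure:

# Point the VM at public DNS64 resolvers so IPv4-only hosts such as
# developer.download.nvidia.com become reachable through NAT64.
# Example public DNS64 addresses; the FABRIC notebook may use different ones.
dns64_servers = ["2a00:1098:2c::1", "2a01:4f8:c2c:123f::1"]
resolv_conf = "\n".join(f"nameserver {server}" for server in dns64_servers)
node.execute(f"echo '{resolv_conf}' | sudo tee /etc/resolv.conf")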

For now, can you get on the console of your TACC VM (via SSH) and run this command to see if it can reach the NVidia site:

$ curl https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

If curl isn’t available, you can try wget.

#4391
Sarah Maxwell
Participant

When I download from the repo index using the code provided, I don’t receive an error and it looks like both the NVIDIA and CUDA drivers installed properly. But when I run node.execute(“nvidia-smi”) it still gives the error about failing to communicate with the NVIDIA driver.

I ran the curl command (via SSH) and it looks like it ran properly. I’ll attach an image of the output. I ran the same nvidia-smi command (in the notebook) after the curl command in the terminal and it still gave the communication failure.

#4394
Ilya Baldin
Participant

OK, so we are past the communication issues with the NVidia website. Have you rebooted the VM to make sure the driver is installed properly?
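
A minimal sketch of that reboot-and-recheck step from the notebook, assuming node is the fablib Node (the same can be done directly over SSH):

# Reboot the VM so the freshly installed kernel module can load.
# The SSH connection will drop while the VM restarts; wait before retrying.
node.execute("sudo reboot")

# ...once the VM is back up, re-check the driver:
stdout, stderr = node.execute("nvidia-smi")
print(stdout, stderr)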

#4395
Ilya Baldin
Participant

This is what a working configuration looks like (I’m using a VM at Clemson, but it should be no different):

ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lspci | grep -i nvidia
00:07.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)

and the driver is listed:

ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lsmod | grep nv
nvidia_uvm 1265664 0
nvidia_drm 65536 2
nvidia_modeset 1273856 2 nvidia_drm
nvidia 55701504 106 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 4 cirrus,nvidia_drm
drm 495616 8 drm_kms_helper,nvidia,cirrus,nvidia_drm
                          
                          
#4397
Sarah Maxwell
Participant

Oh! I haven’t since I ran the curl command, but I did when I installed it through the notebook.

Ok, I just ran the ‘sudo reboot’ command followed by the nvidia-smi command, and it still reports the failure to communicate with the NVIDIA driver.

#4399
Ilya Baldin
Participant

Do you see the nvidia driver actually installed? As I said, I did not see it in the list you sent earlier. Just above, in my previous message, I showed what the correct output should look like (modulo that I was using Ubuntu 20 – not sure what you are using).

#4400
Sarah Maxwell
Participant

Ok, your lspci output looks the same as mine, but when I run lsmod it doesn’t output anything.

#4401
Ilya Baldin
Participant

So that means the driver isn’t installed. I suggest you re-run the driver installation by hand (since you are already on the console) and see whether any errors come up. nvidia-smi fails because no NVidia driver is installed.
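
A minimal sketch of that check, written here as notebook node.execute() calls, though the same commands can be pasted directly on the SSH console as Ilya suggests. Dropping the -q flag so dnf errors are not suppressed is an assumption, not part of the original notebook:

# Re-run the driver install without -q so dnf errors are not hidden.
stdout, stderr = node.execute(
    "sudo dnf install -y kernel-devel kernel-headers nvidia-driver nvidia-settings cuda-driver cuda"
)
print(stderr)  # look for package conflicts or kernel-module build errors

# Confirm the kernel module is actually loaded before retrying nvidia-smi.
stdout, stderr = node.execute("lsmod | grep nvidia")
print(stdout or "nvidia module not loaded")
node.execute("nvidia-smi")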

The topic ‘Issue with NVIDIA driver on basic_gpu_devices’ is closed to new replies.