Issue with NVIDIA driver on basic_gpu_devices


Viewing 6 posts - 16 through 21 (of 21 total)
  • #4403
    Sarah Maxwell
    Participant

      There are no errors when I run the three install commands from the notebook, but the last command doesn't output anything, and the lsmod command still produces no output either. Does this mean I'm missing a command for the install, or using an out-of-date command?
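For anyone scripting this check, here is a minimal sketch of testing whether any NVIDIA kernel modules appear in lsmod output. The helper name `loaded_nvidia_modules` and the sample text are hypothetical and only for illustration; the function works on the text that lsmod prints, however you capture it.

```python
def loaded_nvidia_modules(lsmod_output: str) -> list[str]:
    """Return the names of loaded kernel modules that start with 'nvidia'."""
    modules = []
    for line in lsmod_output.splitlines():
        fields = line.split()
        # The first column of lsmod output is the module name; the header
        # line ("Module Size Used by") is skipped by the prefix check.
        if fields and fields[0].startswith("nvidia"):
            modules.append(fields[0])
    return modules


# Illustrative sample only, not output captured from a real node.
sample = """Module                  Size  Used by
nvidia_drm             73728  0
nvidia_modeset       1306624  1 nvidia_drm
nvidia              56537088  1 nvidia_modeset
xfs                  1556480  1
"""
print(loaded_nvidia_modules(sample))
```

In the notebook, the stdout returned by node.execute("lsmod") could be passed to this function. An empty list would match the symptom described here: the install ran without errors, but no driver module is loaded.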

      #4422
      Ilya Baldin
      Participant

        I just re-ran the notebook you are using and I'm seeing the same thing. Something is not quite right with the installation process: there are no errors, but the nvidia modules are not installed. I'll investigate.

        #4423
        Ilya Baldin
        Participant

          I think something changed in the NVIDIA install. I was able to load the NVIDIA drivers by running sudo dnf -y upgrade, which updates every package to the latest version. I did this after installing the NVIDIA packages. After a subsequent sudo /sbin/reboot, the nvidia drivers were already loaded and nvidia-smi worked:

          [rocky@2d7fd3c5-c433-4a2b-94c5-6b74d4ecc014-rtx ~]$ nvidia-smi
          Wed May 31 21:49:40 2023 
          +---------------------------------------------------------------------------------------+
          | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
          |-----------------------------------------+----------------------+----------------------+
          | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          |                                         |                      |               MIG M. |
          |=========================================+======================+======================|
          |   0  Quadro RTX 6000                 Off| 00000000:00:07.0 Off |                    0 |
          | N/A   26C    P0               23W / 250W|      0MiB / 23040MiB |      0%      Default |
          |                                         |                      |                  N/A |
          +-----------------------------------------+----------------------+----------------------+

          +---------------------------------------------------------------------------------------+
          | Processes:                                                                            |
          |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
          |        ID   ID                                                             Usage      |
          |=======================================================================================|
          |  No running processes found                                                           |
          +---------------------------------------------------------------------------------------+
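
If you want the notebook to confirm the driver programmatically rather than by eye, a rough sketch follows that pulls the driver and CUDA versions out of the nvidia-smi banner shown above. The function name `parse_nvidia_smi_header` is hypothetical, and the regular expression assumes the banner keeps the "Driver Version: ... CUDA Version: ..." wording seen in this output.

```python
import re


def parse_nvidia_smi_header(output: str) -> dict[str, str]:
    """Extract driver and CUDA versions from nvidia-smi's banner text.

    Returns an empty dict when the banner is not found, for example when
    nvidia-smi failed because the driver is not loaded.
    """
    match = re.search(r"Driver Version:\s*(\S+)\s+CUDA Version:\s*(\S+)", output)
    if match is None:
        return {}
    return {"driver": match.group(1), "cuda": match.group(2)}


# Banner line copied from the output in this post.
banner = "| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |"
print(parse_nvidia_smi_header(banner))
```

An empty result after the reboot would suggest the driver still is not loaded.
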
          #4424
          Ilya Baldin
          Participant

            I'll try a clean slice with sudo dnf -y upgrade added to that notebook.

            #4425
            Ilya Baldin
            Participant

              Yep so you can modify your notebook as follows:

              1. Before the GPU PCI Device add these two cells:

              command = "sudo dnf upgrade -q -y"
              stdout, stderr = node.execute(command)

              The first cell upgrades all packages. The second cell reboots the node (it is exactly the same as the reboot cell further down in the notebook):

              reboot = 'sudo reboot'
              
              print(reboot)
              node.execute(reboot)
              
              slice.wait_ssh(timeout=360,interval=10,progress=True)
              
              print("Now testing SSH abilities to reconnect...",end="")
              slice.update()
              slice.test_ssh()
              print("Reconnected!")

              2. I changed the commands in the ‘Install Nvidia Drivers’ section (although I am not sure that’s needed – this is just the latest ‘official’ NVidia workflow):

              commands = [
              'sudo dnf install -q -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm',
              'sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
              'sudo dnf clean expire-cache',
              'sudo dnf module install -q -y nvidia-driver:latest-dkms',
              'sudo dnf install -q -y cuda'
              ]

              These commands need to be executed in order, followed by a reboot. After that, things should work.
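
One way to run the list in order and keep the output of each step is sketched below. It assumes only what the cells above show, namely that node.execute(command) returns a (stdout, stderr) pair. The helper `run_in_order` is hypothetical and takes the execute function as an argument so it can be tried without a slice.

```python
from typing import Callable, Iterable


def run_in_order(
    execute: Callable[[str], tuple[str, str]],
    commands: Iterable[str],
) -> list[tuple[str, str, str]]:
    """Run each command sequentially and collect (command, stdout, stderr)."""
    results = []
    for command in commands:
        print(command)
        # Each call finishes before the next command starts, so the repo
        # setup steps complete before the driver and CUDA installs.
        stdout, stderr = execute(command)
        results.append((command, stdout, stderr))
    return results
```

In the notebook this would be called as run_in_order(node.execute, commands), followed by the same reboot cell as before. With -q, dnf prints little on success, so keeping stderr for each step makes it easier to see where an install went wrong.
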

              I will patch up the notebooks so this will appear in the next release.

               

              #4618
              Ilya Baldin
              Participant

                Just to close this thread: starting with jupyter examples 1.5.0, the notebooks include an updated GPU notebook. It is a single notebook for all GPU types that properly installs the drivers from the NVIDIA site and also deals with IPv6 sites.

              • The topic ‘Issue with NVIDIA driver on basic_gpu_devices’ is closed to new replies.