Ilya Baldin

Forum Replies Created

  • in reply to: Issue with NVIDIA driver on basic_gpu_devices #4401
    Ilya Baldin
    Participant

      So that means the driver isn’t installed; nvidia-smi fails because no NVIDIA driver is present. I suggest you re-run the driver installation by hand (since you are already on the console) and check whether it reports any errors.

      in reply to: Issue with NVIDIA driver on basic_gpu_devices #4399
      Ilya Baldin
      Participant

        Do you see the NVIDIA driver actually installed? As I said, I did not see it in the list you sent earlier. Just above, in my previous message, I showed what correct output should look like (modulo that I was using Ubuntu 20; I am not sure what you are using).

        in reply to: MRI example on Fabric-Testbed Configuration help. #4398
        Ilya Baldin
        Participant

          The real P4 switches may be more tolerant since they are built around the idea of modifying frame formats. I do not know the details of what you are trying to do, but invalid Ethernet frames will not pass through our dataplane switches.

          There’s a simple experiment you can try: pass the frames unmodified. If they go through but your modified frames do not, then our switches do not consider your modified frames valid and are dropping them.

          in reply to: Issue with NVIDIA driver on basic_gpu_devices #4395
          Ilya Baldin
          Participant

            This is what a working configuration looks like (I’m using a VM at Clemson, but it should be no different):

            ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lspci | grep -i nvidia
            00:07.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)

            and the driver is listed:

            ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lsmod | grep nv
            nvidia_uvm 1265664 0
            nvidia_drm 65536 2
            nvidia_modeset 1273856 2 nvidia_drm
            nvidia 55701504 106 nvidia_uvm,nvidia_modeset
            drm_kms_helper 184320 4 cirrus,nvidia_drm
            drm 495616 8 drm_kms_helper,nvidia,cirrus,nvidia_drm
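
The two checks above can be combined into a small script. This is only a minimal sketch, not an official FABRIC tool: the function names `nvidia_device_present`, `nvidia_module_loaded`, and `diagnose_gpu` are made up for illustration, and it assumes `lspci` and `lsmod` output in the same format as shown above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any FABRIC tooling).

# Succeeds if the lspci-style text on stdin mentions an NVIDIA device.
nvidia_device_present() {
    grep -qi 'nvidia'
}

# Succeeds if the lsmod-style text on stdin has a module named exactly "nvidia".
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Takes lspci output as $1 and lsmod output as $2 and prints a one-line verdict.
diagnose_gpu() {
    if ! printf '%s\n' "$1" | nvidia_device_present; then
        echo "no NVIDIA device attached"
    elif ! printf '%s\n' "$2" | nvidia_module_loaded; then
        echo "device attached, driver not loaded"
    else
        echo "device attached, driver loaded"
    fi
}
```

On a live VM this could be called as `diagnose_gpu "$(lspci)" "$(sudo lsmod)"`. A verdict of "device attached, driver not loaded" would suggest the driver installation did not complete.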
            
            
            in reply to: Issue with NVIDIA driver on basic_gpu_devices #4394
            Ilya Baldin
            Participant

              OK, so we are past the communication issues with the NVIDIA website. Have you rebooted the VM to make sure the driver is installed properly?

              in reply to: MRI example on Fabric-Testbed Configuration help. #4370
              Ilya Baldin
              Participant

                No, it does not. Mininet doesn’t necessarily generate valid Ethernet frames. The most likely issue is that our (FABRIC) dataplane switch that interconnects the ports is dropping the frames because they are not valid.

                in reply to: Issue with NVIDIA driver on basic_gpu_devices #4367
                Ilya Baldin
                Participant

                  So it looks like the card is attached, but I don’t see the nvidia driver among the loaded kernel modules, so it probably didn’t get installed properly.

                  Are you still getting an error from trying to download the repo index (reaching https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo)?

                  Let’s focus on using your VM at TACC for now. As Mert indicated above, some of the earlier errors on other sites may have been because you ended up on a site with an IPv6 management network, and the NVIDIA site unfortunately has limited to no IPv6 presence. There are two solutions to that situation:

                  1. Use a different site, which is what you did – so we will stick with that

                  2. Update your VM DNS configuration to use NAT64, which allows IPv4 and IPv6 networks to communicate and resolve names. The notebook ‘Access non-IPv6 services (i.e. GitHub) from IPv6 FABRIC nodes’, linked from the ‘Start Here’ notebook, describes how to set this up. This is just so you know for the future.

                  For now, can you get on the console of your TACC VM (via SSH) and run this command to see if it can reach the NVIDIA site:

                  $ curl https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

                  If curl isn’t available, you can try wget instead.
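
The curl-or-wget check can be wrapped so that it uses whichever tool is installed. This is a sketch under stated assumptions: `pick_downloader` and `check_url` are hypothetical names, and the 20-second timeout is an arbitrary choice rather than a recommended value.

```shell
#!/bin/sh
# Hypothetical helpers for a quick reachability check.

# Prints "curl" or "wget", whichever is found first; fails if neither exists.
pick_downloader() {
    if command -v curl >/dev/null 2>&1; then
        echo curl
    elif command -v wget >/dev/null 2>&1; then
        echo wget
    else
        return 1
    fi
}

# Succeeds if the URL in $1 can be fetched; the body is discarded.
check_url() {
    url=$1
    case $(pick_downloader) in
        curl) curl -fsS --max-time 20 -o /dev/null "$url" ;;
        wget) wget -q --tries=1 --timeout=20 -O /dev/null "$url" ;;
        *)    echo "neither curl nor wget is installed" >&2; return 2 ;;
    esac
}
```

For example, `check_url https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo && echo reachable` prints `reachable` only if the download succeeds from that VM.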

                  in reply to: Issue with NVIDIA driver on basic_gpu_devices #4352
                  Ilya Baldin
                  Participant

                    … and also sudo lsmod

                    in reply to: MRI example on Fabric-Testbed Configuration help. #4351
                    Ilya Baldin
                    Participant

                      All FABRIC network services are provided at the Ethernet layer and above, which requires that experiments generate valid Ethernet frames. If your experiment generates invalid Ethernet frames, our switches (which interconnect the servers) will silently drop them.

                      in reply to: Issue with NVIDIA driver on basic_gpu_devices #4350
                      Ilya Baldin
                      Participant

                        Sarah,

                        Can you post the output of the ‘sudo lspci’ command?

                        Ilya Baldin
                        Participant

                          The currently valid bastion keys are listed in the Portal under ‘User Profile’/‘SSH Keys’ (click on the ‘Bastion’ tab). If you don’t remember which key is which, you can always take the fingerprint of either the public or the private key file (both give the same fingerprint) and compare it to the fingerprints shown in the portal:

                          $ ssh-keygen -E md5 -lf ~/path/to/the/key/file

                          More info here: https://learn.fabric-testbed.net/knowledge-base/logging-into-fabric-vms/
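
The comparison itself can be scripted rather than done by eye. The sketch below assumes the portal shows an MD5 fingerprint as colon-separated hex, possibly with an `MD5:` label or different letter case; that format is an assumption, and `extract_fingerprint` and `fingerprints_match` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers for comparing key fingerprints.

# Reads `ssh-keygen -l` output on stdin and prints only the fingerprint
# (second field), without the "MD5:" label and in lower case.
extract_fingerprint() {
    awk 'NR == 1 { sub(/^MD5:/, "", $2); print tolower($2) }'
}

# Succeeds if two fingerprints are equal, ignoring case and an "MD5:" label.
fingerprints_match() {
    a=$(printf '%s\n' "$1" | sed 's/^MD5://' | tr 'A-F' 'a-f')
    b=$(printf '%s\n' "$2" | sed 's/^MD5://' | tr 'A-F' 'a-f')
    [ "$a" = "$b" ]
}
```

A possible use is to store the result of `ssh-keygen -E md5 -lf ~/path/to/the/key/file | extract_fingerprint` in a variable and pass it to `fingerprints_match` together with the value copied from the portal.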

                          • This reply was modified 1 year, 8 months ago by Ilya Baldin.
                          in reply to: Multiple problems in FABRIC [Partially Resolved] #4251
                          Ilya Baldin
                          Participant

                            Dear experimenters,

                            We believe we have gotten to the bottom of the Kafka issues, and the testbed is reopening. SRI and WASH will remain in maintenance. Workers at GATech (#3) and STAR (#6) will also be in maintenance. Note too that a number of planned outages for workers across multiple sites, needed to install FPGAs, were mentioned in announcements over the past few days. As that work gets completed, those workers will be taken off maintenance in due course.

                            Ilya Baldin
                            Participant

                              Hello,

                              Please make sure you are logging into Jupyter Hub using the same credentials as for the portal. If the problem persists, please use the ‘Contact Us’ link above or in the portal and report this as an account problem.

                              in reply to: lost management network connection #4221
                              Ilya Baldin
                              Participant

                                @Fengpin – our authority at individual sites w.r.t. management plane connections ends at our switch, so if the campus network blinks, we have no control over that. Is there a way for you to force PBR to pick the default via the management connection again?

                                Ilya Baldin
                                Participant

                                  You may want to try logging out and then logging back into the portal and see if that fixes the problem.
