Forum Replies Created
So that means the driver isn’t installed. I suggest you re-run the driver installation by hand (since you are already on the console) and see if you can see any errors.
nvidia-smi fails because no NVidia driver is installed. Do you see the nvidia driver actually installed? As I said, I did not see it on the list you sent earlier. Just above, in my previous message, I showed what correct output should look like (modulo that I was using Ubuntu 20; not sure what you are using).
The real P4 switches may be more tolerant since they are built around the idea of modifying frame formats. I do not know the details of what you are trying to do, but invalid Ethernet frames will not pass through our dataplane switches.
There’s a simple experiment you can try – try to pass the frames unmodified – if they go through, but your modified frames do not, your frames are not considered valid by our switches and are being dropped.
This is what a working configuration looks like (I’m using a VM at Clemson, but it should be no different):
ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lspci | grep -i nvidia
00:07.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
and the driver is listed:
ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lsmod | grep nv
nvidia_uvm           1265664  0
nvidia_drm             65536  2
nvidia_modeset       1273856  2 nvidia_drm
nvidia              55701504  106 nvidia_uvm,nvidia_modeset
drm_kms_helper        184320  4 cirrus,nvidia_drm
drm                   495616  8 drm_kms_helper,nvidia,cirrus,nvidia_drm
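If you want to script this check rather than eyeball the list, here is a minimal sketch. The function name has_nvidia_module is made up for illustration; it only looks for a module named exactly "nvidia" in lsmod-style text on standard input, and it does not prove the driver actually works.

```shell
# Hypothetical helper (not part of any FABRIC tooling): succeeds (exit 0) if a
# kernel module named exactly "nvidia" appears in lsmod-style input on stdin.
has_nvidia_module() {
    # The first column of lsmod output is the module name.
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}
```

On the VM you would run something like: lsmod | has_nvidia_module && echo "nvidia module loaded" || echo "nvidia module NOT loaded"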
OK, so we are past the communication issues with the NVidia website. Have you rebooted the VM to make sure the driver is installed properly?
No, it does not. Mininet doesn’t necessarily generate valid Ethernet frames. The most likely issue is that our (FABRIC) dataplane switch that interconnects the ports is dropping the frames because they are not valid.
So looks like the card is attached, but I don’t see the nvidia driver loaded in kernel modules, so it probably didn’t get installed properly.
Are you still getting an error from trying to download the repo index (reaching https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo)?
Let’s focus on using your VM at TACC for now. As Mert indicated above, some of the earlier errors on other sites may have been due to the fact that you were ending up on a site with IPv6 management network and NVidia site unfortunately has limited to no IPv6 presence. There are two solutions to that situation:
1. Use a different site, which is what you did – so we will stick with that
2. Update your VM DNS configuration to use something called NAT64 which allows IPv4 and IPv6 networks to communicate and resolve names. There is a notebook that describes how to deal with it called ‘Access non-IPv6 services (i.e. GitHub) from IPv6 FABRIC nodes’ that is listed from the ‘Start Here’ notebook. <- this is just so you know in the future
For now can you get on the console of your TACC VM (via SSH) and run this command to see if it can reach NVidia site:
$ curl https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
If curl isn’t available, you can try wget
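As a rough sketch of the same reachability test with the fallback built in (fetch_url is a made-up name, and the flags are just the usual quiet/fail options for each tool):

```shell
# Hypothetical helper: fetch a URL to stdout with curl, falling back to wget.
# Returns non-zero if the download fails or neither tool is installed.
fetch_url() {
    url="$1"
    if command -v curl >/dev/null 2>&1; then
        # -f: fail on HTTP errors, -sS: quiet but still show errors
        curl -fsS "$url"
    elif command -v wget >/dev/null 2>&1; then
        # -q: quiet, -O-: write the document to stdout
        wget -qO- "$url"
    else
        echo "neither curl nor wget is installed" >&2
        return 127
    fi
}
```

For example: fetch_url https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo >/dev/null && echo reachable || echo "NOT reachable"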
… and also
sudo lsmod
All FABRIC network services are provided at the Ethernet layer and above, which does require that experiments generate valid Ethernet frames. If your experiment generates invalid Ethernet frames, then our switches (that interconnect the servers together) will silently drop them.
Sarah,
Can you post the output of the ‘sudo lspci’ command?
May 26, 2023 at 3:16 pm in reply to: Get_slice and list_nodes execution takes more than 10 minutes. #4326
The currently valid bastion keys are listed in the Portal under ‘User Profile’/’SSH Keys’ (click on ‘Bastion’ tab). If you don’t remember which key is which, you can always take a fingerprint of either the public or private key (they are the same) and compare to the fingerprints shown in the portal:
$ ssh-keygen -E md5 -lf ~/path/to/the/key/file
More info here: https://learn.fabric-testbed.net/knowledge-base/logging-into-fabric-vms/
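If you have several candidate key files, a small wrapper around the same ssh-keygen call can print just the fingerprint for each one. This is only a sketch: key_fingerprint is a made-up name, and I am assuming the fingerprint is the second field of the ssh-keygen -l output (it is printed with an "MD5:" prefix, which the portal may or may not show).

```shell
# Hypothetical helper: print only the MD5 fingerprint of a key file.
# ssh-keygen -l prints "<bits> <fingerprint> <comment> (<type>)".
key_fingerprint() {
    ssh-keygen -E md5 -lf "$1" | awk '{ print $2 }'
}
```

For example: for f in ~/.ssh/*.pub; do echo "$f $(key_fingerprint "$f")"; done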
- This reply was modified 1 year, 8 months ago by Ilya Baldin.
Dear experimenters,
We believe we have gotten to the bottom of the Kafka issues and the testbed is reopening. SRI and WASH will remain in maintenance. Workers at GATech (#3) and STAR (#6) will also be in maintenance. Note that we have a number of planned outages for workers across multiple sites to install FPGAs (these were mentioned in announcements over the past few days). As the work gets completed, those workers will be taken off maintenance.
May 11, 2023 at 8:15 pm in reply to: getting error 403 : Forbidden while loggin into jupyterHub #4223
Hello,
Please make sure you are logging into Jupyter Hub using the same credentials as for the portal. If the problem persists, please visit Contact Us link above or in the portal and report this as an account problem.
@Fengpin – our authority at individual sites w.r.t. management plane connections ends at our switch, so if the campus network blinks, we have no control over that. Is there a way for you to force PBR to pick the default via the management connection again?
May 11, 2023 at 12:48 pm in reply to: Bastion login is undefined even though SSH keys are uploaded #4204
You may want to try logging out and then logging back into the portal and see if that fixes the problem.