Ilya Baldin

Forum Replies Created

Viewing 15 posts - 136 through 150 (of 285 total)


May 31, 2023 at 4:20 pm in reply to: Issue with NVIDIA driver on basic_gpu_devices #4401
Ilya Baldin
Participant
So that means the driver isn’t installed. I suggest you re-run the driver installation by hand (since you are already on the console) and see if you can see any errors. nvidia-smi fails because no NVidia driver is installed.
May 31, 2023 at 4:19 pm in reply to: Issue with NVIDIA driver on basic_gpu_devices #4399
Ilya Baldin
Participant
Do you see the nvidia driver actually installed? As I said, I did not see it on the list you sent earlier. Just above, in my previous message, I showed what correct output should look like (modulo that I was using ubuntu 20 – not sure what you are using).
May 31, 2023 at 4:17 pm in reply to: MRI example on Fabric-Testbed Configuration help. #4398
Ilya Baldin
Participant
The real P4 switches may be more tolerant since they are built around the idea of modifying frame formats. I do not know the details of what you are trying to do, but invalid Ethernet frames will not pass through our dataplane switches.

There’s a simple experiment you can try – try to pass the frames unmodified – if they go through, but your modified frames do not, your frames are not considered valid by our switches and are being dropped.
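As a rough aid for that experiment, here is a minimal Python sketch that runs a few generic sanity checks on a raw frame before you send it. It is not the validation logic of the FABRIC switches, which is not described in this thread. It only checks well-known Ethernet II properties (14-byte header, unicast source MAC, EtherType outside the undefined 1501-1535 range), and the function name `basic_frame_problems` is hypothetical.

```python
import struct

ETH_HEADER_LEN = 14  # 6-byte destination MAC + 6-byte source MAC + 2-byte EtherType


def basic_frame_problems(frame: bytes) -> list[str]:
    """Return a list of generic Ethernet II problems found in a raw frame (no FCS).

    Assumption: these are textbook checks only; a real switch may apply others.
    """
    if len(frame) < ETH_HEADER_LEN:
        return ["shorter than the 14-byte Ethernet header"]

    problems = []
    src_mac = frame[6:12]
    # The lowest bit of the first octet marks a group (multicast) address,
    # which is not allowed as a source address.
    if src_mac[0] & 0x01:
        problems.append("source MAC has the multicast bit set")

    (ether_type,) = struct.unpack("!H", frame[12:14])
    # Values up to 1500 are an 802.3 length, values from 1536 are an EtherType;
    # the range in between is undefined.
    if 1501 <= ether_type <= 1535:
        problems.append("EtherType/length field is in the undefined range")
    return problems


# Illustrative frames (made-up addresses): one plain IPv4 frame, one malformed.
good = bytes.fromhex("ffffffffffff" "020000000001" "0800") + bytes(46)
bad = bytes.fromhex("ffffffffffff" "010000000001" "05dd") + bytes(46)
print(basic_frame_problems(good))
print(basic_frame_problems(bad))
```

If an unmodified frame passes these checks and gets through while a modified one fails them, that is a hint about which field to look at first.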
May 31, 2023 at 4:14 pm in reply to: Issue with NVIDIA driver on basic_gpu_devices #4395
Ilya Baldin
Participant
This is what a working configuration looks like (I’m using a VM at Clemson, but it should be no different):
```
ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lspci | grep -i nvidia
00:07.0 3D controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
```
and the driver is listed:
```
ubuntu@e2266eb4-2ee5-48c7-854f-6dfcd18a0739-gpu-ml:~$ lsmod | grep nv
nvidia_uvm 1265664 0
nvidia_drm 65536 2
nvidia_modeset 1273856 2 nvidia_drm
nvidia 55701504 106 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 4 cirrus,nvidia_drm
drm 495616 8 drm_kms_helper,nvidia,cirrus,nvidia_drm
```
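If you want to check this from a script rather than by eye, a minimal Python sketch along these lines could work. It only parses text in the `lsmod` format shown above; the function name `loaded_nvidia_modules` is hypothetical, and the sample text is copied from the output in this post.

```python
def loaded_nvidia_modules(lsmod_output: str) -> list[str]:
    """Return the names of loaded kernel modules that start with 'nvidia'."""
    modules = []
    for line in lsmod_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header row.
        if not fields or fields[0] == "Module":
            continue
        if fields[0].startswith("nvidia"):
            modules.append(fields[0])
    return modules


sample = """\
nvidia_uvm 1265664 0
nvidia_drm 65536 2
nvidia_modeset 1273856 2 nvidia_drm
nvidia 55701504 106 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 4 cirrus,nvidia_drm
drm 495616 8 drm_kms_helper,nvidia,cirrus,nvidia_drm
"""
print(loaded_nvidia_modules(sample))
```

An empty list would mean that no nvidia module is loaded, which matches the situation where `nvidia-smi` fails.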
May 31, 2023 at 4:09 pm in reply to: Issue with NVIDIA driver on basic_gpu_devices #4394
Ilya Baldin
Participant
OK, so we are past the communications issues with the NVidia website. Have you rebooted the VM to make sure the driver is installed properly?
May 31, 2023 at 3:33 pm in reply to: MRI example on Fabric-Testbed Configuration help. #4370
Ilya Baldin
Participant
No, it does not. Mininet doesn’t necessarily generate valid Ethernet frames. The most likely issue is that our (FABRIC) dataplane switch that interconnects the ports is dropping the frames because they are not valid.
May 31, 2023 at 3:27 pm in reply to: Issue with NVIDIA driver on basic_gpu_devices #4367
Ilya Baldin
Participant
So it looks like the card is attached, but I don’t see the nvidia driver loaded in the kernel modules, so it probably didn’t get installed properly.

Are you still getting an error from trying to download the repo index (reaching https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo)?

Let’s focus on using your VM at TACC for now. As Mert indicated above, some of the earlier errors on other sites may have been due to the fact that you were ending up on a site with an IPv6 management network, and the NVidia site unfortunately has limited to no IPv6 presence. There are two solutions to that situation:

1. Use a different site, which is what you did – so we will stick with that

2. Update your VM DNS configuration to use something called NAT64 which allows IPv4 and IPv6 networks to communicate and resolve names. There is a notebook that describes how to deal with it called ‘Access non-IPv6 services (i.e. GitHub) from IPv6 FABRIC nodes’ that is listed from the ‘Start Here’ notebook. <- this is just so you know in the future

For now, can you get on the console of your TACC VM (via SSH) and run this command to see if it can reach the NVidia site:

$ curl https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

If curl isn’t available, you can try wget
May 31, 2023 at 2:37 pm in reply to: Issue with NVIDIA driver on basic_gpu_devices #4352
Ilya Baldin
Participant
… and also sudo lsmod
May 31, 2023 at 2:36 pm in reply to: MRI example on Fabric-Testbed Configuration help. #4351
Ilya Baldin
Participant
All FABRIC network services are provided at the Ethernet layer and above, which does require that experiments generate valid Ethernet frames. If your experiment generates invalid Ethernet frames, then our switches (that interconnect the servers together) will silently drop them.
May 31, 2023 at 2:31 pm in reply to: Issue with NVIDIA driver on basic_gpu_devices #4350
Ilya Baldin
Participant
Sarah,

Can you post the output of the ‘sudo lspci’ command?
May 26, 2023 at 3:16 pm in reply to: Get_slice and list_nodes execution takes more than 10 minutes. #4326
Ilya Baldin
Participant
The currently valid bastion keys are listed in the Portal under ‘User Profile’/’SSH Keys’ (click on ‘Bastion’ tab). If you don’t remember which key is which, you can always take a fingerprint of either the public or private key (they are the same) and compare to the fingerprints shown in the portal:

$ ssh-keygen -E md5 -lf ~/path/to/the/key/file

More info here: https://learn.fabric-testbed.net/knowledge-base/logging-into-fabric-vms/
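For reference, the MD5 fingerprint that `ssh-keygen -E md5 -l` prints is the MD5 digest of the base64-decoded key blob, shown as colon-separated hex pairs after an `MD5:` prefix. A minimal Python sketch of that computation for a public key line is below. The function name `md5_fingerprint` is hypothetical, the demo blob is not a real key, and the exact format in which the portal displays fingerprints is an assumption to verify against what you see there.

```python
import base64
import hashlib


def md5_fingerprint(public_key_line: str) -> str:
    """Compute the colon-separated MD5 fingerprint of an OpenSSH public key line.

    The line is expected to look like '<key-type> <base64-blob> [comment]'.
    """
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<key-type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1])
    digest = hashlib.md5(blob).hexdigest()
    # Group the 32 hex characters into 16 colon-separated pairs.
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


# Demo input only: the blob below is made up and is not a usable key.
demo_line = "ssh-ed25519 " + base64.b64encode(b"not-a-real-key").decode() + " demo@example"
print(md5_fingerprint(demo_line))
```

This sketch handles public key files only; for a private key file, use the `ssh-keygen` command above.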
May 16, 2023 at 1:57 pm in reply to: Multiple problems in FABRIC [Partially Resolved] #4251
Ilya Baldin
Participant
Dear experimenters,

We believe we have gotten to the bottom of the Kafka issues, and the testbed is reopening. SRI and WASH will remain in maintenance. Workers at GATech (#3) and STAR (#6) will also be in maintenance. Note that we have a number of planned outages for workers across multiple sites to install FPGAs (these were mentioned in previous announcements over the past few days). As the work gets completed, those workers will be taken off maintenance in due course.
May 11, 2023 at 8:15 pm in reply to: getting error 403 : Forbidden while loggin into jupyterHub #4223
Ilya Baldin
Participant
Hello,

Please make sure you are logging into Jupyter Hub using the same credentials as for the portal. If the problem persists, please visit the Contact Us link above or in the portal and report this as an account problem.
May 11, 2023 at 4:33 pm in reply to: lost management network connection #4221
Ilya Baldin
Participant
@Fengpin – our authority at individual sites w.r.t. management plane connections ends at our switch, so if the campus network blinks, we have no control over that. Is there a way for you to force PBR to pick the default via the management connection again?
May 11, 2023 at 12:48 pm in reply to: Bastion login is undefined even though SSH keys are uploaded #4204
Ilya Baldin
Participant
You may want to try logging out and then logging back into the portal and see if that fixes the problem.
