Ilya Baldin

Forum Replies Created

Viewing 15 posts - 121 through 135 (of 285 total)
  • in reply to: Testbed-wide maintenance July 17 10am-11am EDT #4703
    Ilya Baldin
    Participant

      Completed!

      Ilya Baldin
      Participant

        Fraida,

        I’ve added this to the list of issues to look at w.r.t. other behaviors of SharedNICs. We have a ticket open with NVidia/Mellanox to try to figure this out (the documentation seems to suggest it should work – we can’t tell whether it is a firmware bug or we are doing something wrong). Thanks for your patience.

        in reply to: Maintenance on PSC for testing 07/13/2023 #4674
        Ilya Baldin
        Participant

          PSC is back in service.

          in reply to: Spawn failed: pod did not start in 300 seconds #4636
          Ilya Baldin
          Participant

            I believe this has been resolved now. We ran into some scaling issues with the Kubernetes cluster hosting the Hub. Thank you for reporting it.

            in reply to: Issue with NVIDIA driver on basic_gpu_devices #4618
            Ilya Baldin
            Participant

              Just to close this thread, notebooks starting with jupyter examples 1.5.0 have an updated GPU notebook – a single one for all GPU types, that properly installs the drivers from the NVidia site and also deals with IPv6 sites.

              in reply to: Root Permission #4615
              Ilya Baldin
              Participant

                Please indicate which image you are using. Standard images in FABRIC do not require a root password to execute commands via ‘sudo’.

                in reply to: Multi-day FABRIC maintenance (June 12-June 16, 2023) #4456
                Ilya Baldin
                Participant

                  Dear experimenters,

                  We want to share an update regarding the maintenance previously mentioned. We wish to confirm that the maintenance will indeed occur between June 12 and June 16. We understand the importance of this process and its impact on your work, and we want to assure you that we have carefully planned the updates to minimize any disruption to your activities.

                  To ensure a smooth transition, we have divided the sites into two groups. The first group, primarily consisting of Phase 2 sites such as UCSD, GPN, FIU, CLEM, GATECH, LOSA, NEWY, ATLA, SEAT, INDI, and CERN, will be available by June 16. The second group, comprising Phase 1 sites, may require a bit more time. However, we are committed to reopening the facility on June 16, enabling you to resume your experiments with the available sites while we bring back the remaining sites.

                  Following the maintenance window, FABRIC will emerge with updated software, new capabilities and improved performance. These enhancements will provide you with an even more robust and performant research environment.

                  As a reminder, we request that you halt all experiments prior to June 12. Unfortunately, we will be unable to recover any experiments after the maintenance period. However, we want to reassure you that your SSH keys, data on persistent volumes (with some exceptions), and experiment notebooks will remain unaffected. We apologize for any inconvenience this may cause.

                  Thank you for your patience as we work diligently to optimize the FABRIC platform.

                  in reply to: FABRIC Tutorial Project cleanup #4446
                  Ilya Baldin
                  Participant

                    FABRIC Tutorials project membership cleanup has been completed.

                    Ilya Baldin
                    Participant

                      Hello,

                      Assuming you have gone through the Quick Start Guide pinned at the top of this forum (https://learn.fabric-testbed.net/forums/topic/quick-start-guide/), you can find additional notebooks in the Jupyter Hub that show examples of what is possible.

                      in reply to: Sharing JupyterHub directory #4438
                      Ilya Baldin
                      Participant

                        Hello,

                        That is not possible, at least not in the way you are trying. Each Jupyter Hub container is individual to the user and other users cannot access its contents. We are working on a capability for artifact sharing that will allow you to share your notebooks with others, expected to be deployed in beta form in a few weeks.

                        In the meantime you can:

                        • Try using git/GitHub to achieve what you are trying to do.
                        • Try using Google Colab (we are investigating this ourselves and do not have detailed instructions – part of the problem may be whether the version of Python it uses is compatible with what our libraries require).
                        in reply to: Issue with NVIDIA driver on basic_gpu_devices #4425
                        Ilya Baldin
                        Participant

                          Yep so you can modify your notebook as follows:

                          1. Before the GPU PCI Device add these two cells:

                          command = "sudo dnf upgrade -q -y"
                          stdout, stderr = node.execute(command)

                          The first cell upgrades all packages; the next one reboots the node (it is exactly the same as the reboot cell further down in the notebook):

                          # Reboot the node
                          reboot = 'sudo reboot'

                          print(reboot)
                          node.execute(reboot)

                          # Wait until the node accepts SSH connections again
                          slice.wait_ssh(timeout=360,interval=10,progress=True)

                          print("Now testing SSH abilities to reconnect...",end="")
                          slice.update()
                          slice.test_ssh()
                          print("Reconnected!")

                          2. I changed the commands in the ‘Install Nvidia Drivers’ section (although I am not sure that’s needed – this is just the latest ‘official’ NVidia workflow):

                          commands = [
                          'sudo dnf install -q -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm',
                          'sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo',
                          'sudo dnf clean expire-cache',
                          'sudo dnf module install -q -y nvidia-driver:latest-dkms',
                          'sudo dnf install -q -y cuda'
                          ]

                          Then of course these commands need to be executed in order, followed by a reboot. After that things should work.
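Running the commands in order could be sketched as a simple loop. This is a minimal sketch, assuming `node.execute(command)` returns a `(stdout, stderr)` pair as in the cells above; `run_in_order` is a hypothetical helper name, not part of any FABRIC library, and `FakeNode` is only a stand-in so the sketch runs outside a slice.

```python
def run_in_order(node, commands):
    """Run each command on the node, in list order, and collect the output.

    Assumes node.execute(command) returns a (stdout, stderr) pair.
    Hypothetical helper, not part of any library.
    """
    results = []
    for command in commands:
        print(f"Running: {command}")
        stdout, stderr = node.execute(command)
        results.append((command, stdout, stderr))
    return results


class FakeNode:
    """Stand-in for a real node so the sketch runs outside a slice."""

    def __init__(self):
        self.seen = []

    def execute(self, command):
        # Record the command instead of running it anywhere.
        self.seen.append(command)
        return ("", "")


fake_node = FakeNode()
results = run_in_order(fake_node, ["echo first", "echo second"])
```

In the notebook you would pass the real node and the `commands` list instead of the stand-ins, then run the reboot cell.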

                          I will patch up the notebooks so this will appear in the next release.

                           

                          in reply to: Issue with NVIDIA driver on basic_gpu_devices #4424
                          Ilya Baldin
                          Participant

                            I’ll try a clean slice, adding ‘sudo dnf -y upgrade’ as part of that notebook.

                            in reply to: Issue with NVIDIA driver on basic_gpu_devices #4423
                            Ilya Baldin
                            Participant

                              I think something changed in the NVidia install. I was able to load the NVidia drivers by running ‘sudo dnf -y upgrade’ to update everything to the latest versions; this was after I had installed the NVidia packages. After ‘sudo /sbin/reboot’ the nvidia drivers were already loaded and nvidia-smi worked:

                              [rocky@2d7fd3c5-c433-4a2b-94c5-6b74d4ecc014-rtx ~]$ nvidia-smi
                              Wed May 31 21:49:40 2023 
                              +---------------------------------------------------------------------------------------+
                              | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
                              |-----------------------------------------+----------------------+----------------------+
                              | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
                              | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
                              |                                         |                      |               MIG M. |
                              |=========================================+======================+======================|
                              |   0  Quadro RTX 6000                 Off| 00000000:00:07.0 Off |                    0 |
                              | N/A   26C    P0               23W / 250W|      0MiB / 23040MiB |      0%      Default |
                              |                                         |                      |                  N/A |
                              +-----------------------------------------+----------------------+----------------------+

                              +---------------------------------------------------------------------------------------+
                              | Processes:                                                                            |
                              |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
                              |        ID   ID                                                             Usage      |
                              |=======================================================================================|
                              |  No running processes found                                                           |
                              +---------------------------------------------------------------------------------------+
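To check the result from a notebook cell rather than by eye, one option is to pull the version fields out of the banner line. This is a minimal sketch, assuming the banner format shown above; `parse_nvidia_smi_banner` is a hypothetical helper, and the sample string is the banner line from the output above with its spacing collapsed.

```python
import re


def parse_nvidia_smi_banner(text):
    """Return (driver_version, cuda_version) from nvidia-smi output, or None.

    Hypothetical helper; assumes the banner contains 'Driver Version: X'
    and 'CUDA Version: Y' as in the output above.
    """
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    if driver is None or cuda is None:
        return None
    return driver.group(1), cuda.group(1)


sample = "| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |"
versions = parse_nvidia_smi_banner(sample)
print(versions)
```

With a real node, the text would come from something like `stdout, stderr = node.execute('nvidia-smi')`; a `None` result would suggest the banner was not printed, for example because the driver did not load.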
                              in reply to: Issue with NVIDIA driver on basic_gpu_devices #4422
                              Ilya Baldin
                              Participant

                                I just re-ran the notebook you are using – I’m seeing the same thing. Something is not quite right with the installation process: there are no errors, but the nvidia modules are not installed. I’ll investigate.

                                in reply to: MRI example on Fabric-Testbed Configuration help. #4421
                                Ilya Baldin
                                Participant

                                  Any header that is not valid will not pass. Without knowing more about what you are modifying and how, I cannot give a more specific answer. Valid packets should pass through without problems. You can use Wireshark or a similar tool to compare packet traces of unmodified and modified frames and see whether it flags anything.
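Alongside a packet analyzer, a byte-level diff can show exactly which offsets a modification touched. This is a minimal sketch, assuming you have exported the raw bytes of one unmodified and one modified frame; `diff_frames` is a hypothetical helper and the sample bytes are made up, not taken from a real capture.

```python
def diff_frames(original, modified):
    """Return (offset, original_byte, modified_byte) for each differing byte.

    Where one frame is longer than the other, the missing byte is None.
    """
    differences = []
    for offset in range(max(len(original), len(modified))):
        a = original[offset] if offset < len(original) else None
        b = modified[offset] if offset < len(modified) else None
        if a != b:
            differences.append((offset, a, b))
    return differences


# Made-up example bytes, not a real capture.
original = bytes.fromhex("45000054abcd")
modified = bytes.fromhex("45000054abce")
for offset, a, b in diff_frames(original, modified):
    print(offset, a, b)
```

The offsets it reports can then be matched against the header fields the analyzer displays for the same frame.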
