Mert Cevik

Forum Replies Created

Viewing 15 posts - 1 through 15 (of 198 total)
  • in reply to: DPU Shut Down, can’t bring it back up #9543
    Mert Cevik
    Moderator

      Hi Tanay,

      As a next step, we can try cold-rebooting the server that holds the DPU. However, this is not possible while other users have VM slivers running on it, so I need to make special arrangements for that.

      In our Development environment, we have a BlueField-2 DPU on which we can perform all kinds of trials. You pointed to the web page that describes the configuration steps, but it would be even better if you could provide a complete list of commands for this configuration so that we can test it on the Development site. If there are any differences between BlueField-2 and BlueField-3, it would be good to indicate them as well. I’m also currently preparing additional BlueField-3 integrations; I have just received BlueField-3 cards, so I can use one of them to test on the Development site with a BlueField-3 later.

      Lastly, on the web page, under the Hot-Plug Firmware Configuration section, there is a note that says “Hotplug is not guaranteed to work on AMD machines.” Servers on the FABRIC Testbed infrastructure are all AMD-based Dell R7525 servers. I’m not sure whether this is relevant to our issue.

      Best regards,
      Mert

       

       

      in reply to: DPU Shut Down, can’t bring it back up #9537
      Mert Cevik
      Moderator

        Hi Tanay,

        I performed a power reset for the DPU. Can you please check if that worked well for the firmware configuration change?


        ubuntu@localhost:~$ uname -a
        Linux localhost.localdomain 5.15.0-1065-bluefield #67-Ubuntu SMP Tue Apr 22 11:10:15 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
        ubuntu@localhost:~$ uptime
        16:19:21 up 1 min, 1 user, load average: 6.83, 2.15, 0.75

        I will describe the details of how I performed this later. In short, I had included the BMC bindings in the DPU integration and used that path. However, I’m not very sure about the terminology or specifics; these were just intuitive actions so far. I’m also in touch with the FABRIC team about this item, so your input on the progress will be helpful for our further enhancements.

        in reply to: DPU Shut Down, can’t bring it back up #9531
        Mert Cevik
        Moderator

          The DPU on the SEAT node has been recovered and can be used for experiments.

          For the firmware configuration, I need to read the documentation. I have no prior experience with these cards.

          in reply to: DPU Shut Down, can’t bring it back up #9529
          Mert Cevik
          Moderator

            Hello Tanay,

            Can you share the state of your slice and slivers from your point of view? All slivers of the slice seem to be deleted.

            Best regards,
            Mert

            in reply to: SSH connection to slice nodes, failed #9501
            Mert Cevik
            Moderator

              Since you’re able to log in to this problematic VM from other sources, you can check and make sure the right SSH key is inside the VM. I just placed my SSH key in it, and I could log in properly. Please let us know the status after your SSH key check, and I will take a further look.

              in reply to: SSH connection to slice nodes, failed #9499
              Mert Cevik
              Moderator

                If I understand the problem from the description correctly (“manual connect”), you’re trying to connect to the VM(s) from a terminal on your computer/laptop and getting the error. If that’s the case, you need to set up your SSH client configuration file and SSH keys properly (on your computer/laptop) and then connect. This page can be helpful -> https://learn.fabric-testbed.net/knowledge-base/logging-into-fabric-vms/
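As a rough illustration of what such a client configuration can look like, here is a minimal Python sketch that only builds the text of an OpenSSH config file with a jump through the bastion host. The login name and key paths are hypothetical placeholders, and the exact settings FABRIC recommends may differ, so please treat the page linked above as the authoritative source.

```python
def build_ssh_config(bastion_user: str, bastion_key: str, sliver_key: str,
                     bastion_host: str = "bastion.fabric-testbed.net") -> str:
    """Return OpenSSH client config text that reaches VMs through a bastion host.

    All argument values are supplied by the caller; the ones used in the
    example below are placeholders, not real accounts or key files.
    """
    lines = [
        # Settings used when connecting to the bastion host itself.
        f"Host {bastion_host}",
        f"    User {bastion_user}",
        f"    IdentityFile {bastion_key}",
        "    IdentitiesOnly yes",
        "",
        # Every other host is reached by jumping through the bastion host.
        f"Host * !{bastion_host}",
        f"    ProxyJump {bastion_user}@{bastion_host}:22",
        f"    IdentityFile {sliver_key}",
        "    IdentitiesOnly yes",
        "",
    ]
    return "\n".join(lines)


# Hypothetical values for illustration only.
config_text = build_ssh_config(
    bastion_user="your_bastion_login",
    bastion_key="~/.ssh/fabric_bastion_key",
    sliver_key="~/.ssh/fabric_sliver_key",
)
print(config_text)
```

The printed text could be saved to a file and passed to the SSH client with its `-F` option, together with the VM's login name and management IP.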

                If my understanding is wrong and the problem is something different, please disregard the info above.

                 

                in reply to: Lost SSH login to a node #9498
                Mert Cevik
                Moderator

                  Thank you Komal for the information.

                  Khawar, can you please describe the directory where your “critical data” resides on the VM?

                  in reply to: Lost SSH login to a node #9494
                  Mert Cevik
                  Moderator

                    I checked your VM and found it in a crashed state. Without digging into the logs, I’m not sure why, when, or how it crashed or was rebooted, but the worker node it’s running on (star-w2) is fully occupied with VMs, and we will look into possible out-of-memory issues on the hypervisor. It would be good to re-create this VM on another worker node on STAR, or to use a smaller flavor if you run a VM on star-w2.

                    in reply to: Lost SSH login to a node #9492
                    Mert Cevik
                    Moderator

                      Hello Khawar,

                      Your VM was shut down by the hypervisor, and I have now started it. Please let us know if you have any other issues. We will investigate the main cause of this shutdown internally.

                      Best regards,
                      Mert

                      in reply to: Trouble creating a slice #9440
                      Mert Cevik
                      Moderator

                        The issue with the bastion host traffic is resolved. You can try creating your slices with the standard bastion host setting (bastion.fabric-testbed.net).

                        in reply to: Trouble creating a slice #9439
                        Mert Cevik
                        Moderator

                          There is a problem with upstream connectivity affecting one of the bastion hosts, causing intermittent interruptions. We are working on the issue. In the meantime, you can set a specific bastion host in your fabric_rc file (e.g., FABRIC_BASTION_HOST=bastion-renc-1.fabric-testbed.net). We will post an update on the status of the underlying issue.
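If you prefer to script that change, the following Python sketch rewrites the FABRIC_BASTION_HOST line in the text of a fabric_rc file. It assumes the file uses `export NAME=value` lines (plain `NAME=value` is also handled); the function name and the sample content, including `SOME_OTHER_SETTING`, are hypothetical and only for illustration.

```python
import re

# Matches an optional "export " prefix followed by the variable name.
_BASTION_LINE = re.compile(
    r"^([ \t]*(?:export[ \t]+)?FABRIC_BASTION_HOST=).*$", re.MULTILINE
)


def set_bastion_host(rc_text: str, host: str) -> str:
    """Return rc_text with FABRIC_BASTION_HOST set to host.

    An existing assignment is replaced in place; if none exists,
    a new export line is appended at the end.
    """
    if _BASTION_LINE.search(rc_text):
        return _BASTION_LINE.sub(lambda m: m.group(1) + host, rc_text)
    separator = "" if (not rc_text or rc_text.endswith("\n")) else "\n"
    return f"{rc_text}{separator}export FABRIC_BASTION_HOST={host}\n"


# Hypothetical file content for illustration.
sample = (
    "export SOME_OTHER_SETTING=value\n"
    "export FABRIC_BASTION_HOST=bastion.fabric-testbed.net\n"
)
updated = set_bastion_host(sample, "bastion-renc-1.fabric-testbed.net")
print(updated)
```

Once the upstream issue is resolved, the same function can be called with bastion.fabric-testbed.net to restore the standard setting.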

                           

                          in reply to: Maintenance on UCSD on 12/30/25 #9343
                          Mert Cevik
                          Moderator

                            This maintenance is completed. UCSD is open for experiments.

                            in reply to: Maintenance on CLEM on 12/23/2025 #9324
                            Mert Cevik
                            Moderator

                              Maintenance is completed. CLEM node is available for experiments.

                              Mert Cevik
                              Moderator

                                Hello Danilo,

                                I checked both VMs (NS2 and NS4) and they are up and online. However, there seem to be some changes to the SSH key(s) that are injected by the system, and I cannot log in to the VMs. I’m not sure about the root cause of the exceptions that you posted. They indicate that the FABRIC orchestration system cannot perform actions on the VMs, and the issue with the SSH keys may be the main cause. We need to learn from you what might have changed on the VMs.

                                I’m posting the reservation info here. Both VM reservations are valid until 12/25.

                                 {
                                "sliver_id": "be7426fd-7ebe-4e6b-bc65-56e30d7e8e50",
                                "slice_id": "53cfa2bd-5110-420d-8bdb-053c1af45801",
                                "type": "VM",
                                "notices": "Reservation be7426fd-7ebe-4e6b-bc65-56e30d7e8e50 (Slice NS2(53cfa2bd-5110-420d-8bdb-053c1af45801) Graph Id:2aced1b1-3843-4809-b9a2-9ed9e2ff0317 Owner:daniloassis@utfpr.edu.br) is in state (Active,None_) ",
                                "start": "2025-08-02 12:44:10 +0000",
                                "end": "2025-12-25 18:11:58 +0000",
                                "requested_end": "2025-12-25 18:11:58 +0000",
                                "units": 1,
                                "state": 4,
                                "pending_state": 11,
                                "sliver": {
                                "Name": "NS2",
                                "Type": "VM",
                                "Capacities": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
                                "CapacityHints": "{\"instance_type\": \"fabric.c16.m16.d1000\"}",
                                "CapacityAllocations": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
                                "LabelAllocations": "{\"instance\": \"instance-000040ec\", \"instance_parent\": \"star-w1.fabric-testbed.net\"}",
                                "ReservationInfo": "{\"reservation_id\": \"be7426fd-7ebe-4e6b-bc65-56e30d7e8e50\", \"reservation_state\": \"Active\"}",
                                "NodeMap": "[\"e2b1e451-45b4-4691-9527-aae18cad3b19\", \"BBQSH63\"]",
                                "StitchNode": "false",
                                "ImageRef": "default_ubuntu_22,qcow2",
                                "MgmtIp": "2001:400:a100:3030:f816:3eff:fed9:14ee",
                                "Site": "STAR"
                                }
                                }
                                
                                 {
                                "sliver_id": "527af509-fb62-41ae-a0c0-191e3c7f6525",
                                "slice_id": "059939f1-19be-49ee-ab5b-bf4504639c13",
                                "type": "VM",
                                "notices": "Reservation 527af509-fb62-41ae-a0c0-191e3c7f6525 (Slice NS4(059939f1-19be-49ee-ab5b-bf4504639c13) Graph Id:14a3d7f7-e797-4e75-bafb-00282ea63896 Owner:daniloassis@utfpr.edu.br) is in state (Active,None_) ",
                                "start": "2025-08-02 12:44:10 +0000",
                                "end": "2025-12-25 18:12:40 +0000",
                                "requested_end": "2025-12-25 18:12:40 +0000",
                                "units": 1,
                                "state": 4,
                                "pending_state": 11,
                                "sliver": {
                                "Name": "NS4",
                                "Type": "VM",
                                "Capacities": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
                                "CapacityHints": "{\"instance_type\": \"fabric.c16.m16.d1000\"}",
                                "CapacityAllocations": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
                                "LabelAllocations": "{\"instance\": \"instance-00000cf6\", \"instance_parent\": \"toky-w2.fabric-testbed.net\"}",
                                "ReservationInfo": "{\"reservation_id\": \"527af509-fb62-41ae-a0c0-191e3c7f6525\", \"reservation_state\": \"Active\"}",
                                "NodeMap": "[\"e2b1e451-45b4-4691-9527-aae18cad3b19\", \"FW696S3\"]",
                                "StitchNode": "false",
                                "ImageRef": "default_ubuntu_22,qcow2",
                                "MgmtIp": "133.69.160.21",
                                "Site": "TOKY"
                                }
                                }
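Several fields inside "sliver" (for example "Capacities" and "LabelAllocations") are themselves JSON-encoded strings, so they need a second decoding pass to be read programmatically. The sketch below shows one way to do that in Python on a shortened copy of the NS2 record above; the function name is hypothetical and this is not necessarily how the FABRIC tools process these records.

```python
import json
from datetime import datetime


def decode_sliver_record(record_text: str) -> dict:
    """Parse a reservation record and decode JSON strings nested in "sliver"."""
    record = json.loads(record_text)
    sliver = record.get("sliver", {})
    for key, value in sliver.items():
        # Only strings that look like JSON objects or arrays are decoded;
        # plain strings such as "false" or "STAR" are left untouched.
        if isinstance(value, str) and value.startswith(("{", "[")):
            try:
                sliver[key] = json.loads(value)
            except json.JSONDecodeError:
                pass  # keep the original string if it is not valid JSON
    return record


# Shortened copy of the NS2 record from this post (raw string keeps the \" escapes).
sample = r'''
{
  "sliver_id": "be7426fd-7ebe-4e6b-bc65-56e30d7e8e50",
  "end": "2025-12-25 18:11:58 +0000",
  "sliver": {
    "Name": "NS2",
    "Capacities": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
    "LabelAllocations": "{\"instance\": \"instance-000040ec\", \"instance_parent\": \"star-w1.fabric-testbed.net\"}",
    "StitchNode": "false",
    "Site": "STAR"
  }
}
'''

decoded = decode_sliver_record(sample)
# The "end" field uses a "YYYY-MM-DD HH:MM:SS +ZZZZ" layout in the records above.
end_time = datetime.strptime(decoded["end"], "%Y-%m-%d %H:%M:%S %z")

print(decoded["sliver"]["Capacities"]["core"])
print(decoded["sliver"]["LabelAllocations"]["instance_parent"])
print(end_time.isoformat())
```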
                                
                                
                                Mert Cevik
                                Moderator

                                  Maintenance is completed.
