Tanay Maheshwari

Forum Replies Created

Viewing 15 posts - 1 through 15 (of 16 total)
  • in reply to: Unable to SSH to one node in a 3-node slice #9569
    Tanay Maheshwari
    Participant

      closing.

      in reply to: Unable to SSH to one node in a 3-node slice #9567
      Tanay Maheshwari
      Participant

        Any help would be appreciated here! Thank you

        in reply to: Unable to SSH to one node in a 3-node slice #9566
        Tanay Maheshwari
        Participant

          I also tried the following:

          slice = fablib.get_slice(name="CEPH_DOCA_POC")
          slice.show()
          # slice.delete()
          DPU_NODE_NAME = "node3-dpu"

          node = slice.get_node(name=DPU_NODE_NAME)
          node.show()
          node.execute("ip addr")
          node.execute("sudo ip addr add 192.168.50.2/24 dev enp8s0")

          Fabric returned this error:

          File /opt/conda/lib/python3.11/site-packages/paramiko/transport.py:1130, in Transport.open_channel(self, kind, dest_addr, src_addr, window_size, max_packet_size, timeout)
             1128 if e is None:
             1129     e = SSHException("Unable to open channel.")
          -> 1130 raise e
          
          ChannelException: ChannelException(2, 'Connect failed')

          I believe the instance lost power and restarted, and somehow some of the important networking config was lost. Any help would be really appreciated!!
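In case the failure is transient (for example, the node is still coming back up after a restart), a retry wrapper around the call is one way to tell a temporary outage from lost networking config. This is only a minimal sketch: the `retry` helper and its parameters are hypothetical names, not part of fablib, and retrying will not help if the management interface is really gone.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T],
          retry_on: Tuple[Type[BaseException], ...] = (Exception,),
          attempts: int = 3,
          delay_s: float = 5.0) -> T:
    """Call action() until it succeeds or the attempts run out.

    Hypothetical helper for illustration; not a fablib or paramiko API.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            if attempt == attempts:
                # Out of attempts: re-raise the last error unchanged.
                raise
            print(f"attempt {attempt} failed: {exc!r}; retrying in {delay_s}s")
            time.sleep(delay_s)
    raise AssertionError("unreachable")


# Possible use with the node from the snippet above (needs a live slice):
#   from paramiko.ssh_exception import ChannelException
#   stdout, stderr = retry(lambda: node.execute("ip addr"),
#                          retry_on=(ChannelException,))
```

If every attempt fails with the same ChannelException, that points away from a timing issue and toward the node's network setup.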

           

          in reply to: DPU Shut Down, can’t bring it back up #9547
          Tanay Maheshwari
          Participant

            Hi Mert,
            Just a suggestion: it would be great to have a DOCA SNAP tutorial artifact (like the ones we have for P4 and compression). https://docs.nvidia.com/doca/archive/2-9-1/doca+snap-4+service+guide/index.html#src-3453016610_id-.DOCASNAP4ServiceGuidev2.9.1-Hot-plugFirmwareConfiguration

            in reply to: DPU Shut Down, can’t bring it back up #9544
            Tanay Maheshwari
            Participant

              I don’t think this is a BlueField problem; it is most likely a host problem.

              Steps:
              1. To view the current configuration: sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query
              2. To change a configuration value (here, PCI_SWITCH_EMULATION_ENABLE and NVME_EMULATION_ENABLE go from 0 to 1): sudo mlxconfig -d /dev/mst/mt41692_pciconf0 set PCI_SWITCH_EMULATION_ENABLE=1 NVME_EMULATION_ENABLE=1
              3. Per the DOCA documentation, perform a system reboot to apply the configuration changes: sudo mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r
              4. After the reboot, sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query should display the updated values.
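For anyone who wants to script these steps (e.g. to pass them to node.execute from a notebook), here is a rough sketch that only assembles the command lines. The helper function names are made up for illustration; the flags, device path, and PCI address are exactly the ones from the steps above and will differ on other devices.

```python
import shlex
from typing import Dict, List


def mlxconfig_query(device: str) -> List[str]:
    """Steps 1 and 4: show the configuration values."""
    return ["sudo", "mlxconfig", "-d", device, "-e", "query"]


def mlxconfig_set(device: str, params: Dict[str, int]) -> List[str]:
    """Step 2: set one or more firmware configuration parameters."""
    if not params:
        raise ValueError("at least one parameter is required")
    assignments = [f"{name}={value}" for name, value in params.items()]
    return ["sudo", "mlxconfig", "-d", device, "set", *assignments]


def mlxfwreset(pci_addr: str) -> List[str]:
    """Step 3: the reset command, with the same flags as in the steps."""
    return ["sudo", "mlxfwreset", "-d", pci_addr,
            "-y", "-l", "3", "--sync", "1", "r"]


# Values copied from the steps above; adjust for your own device.
DEVICE = "/dev/mst/mt41692_pciconf0"
steps = [
    mlxconfig_query(DEVICE),
    mlxconfig_set(DEVICE, {"PCI_SWITCH_EMULATION_ENABLE": 1,
                           "NVME_EMULATION_ENABLE": 1}),
    mlxfwreset("03:00.0"),
    mlxconfig_query(DEVICE),
]
for argv in steps:
    # shlex.join gives a shell-safe string for each command.
    print(shlex.join(argv))
```

Note that mlxconfig set normally asks for confirmation, so running it non-interactively may need extra handling.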

              “Hotplug is not guaranteed to work on AMD machines.” I did think that could be one of the reasons, but unfortunately I can’t find any relevant logs at all. I will continue troubleshooting and let you know. I will also post an issue on the DOCA devzone to see if NVIDIA has any clues about this.

              Thanks again Mert!

              EDIT: There should be no differences in the commands for Bluefield-2 or 3. ‘mt41692’ changes based on your device.
              Use ‘sudo mst start’ and ‘sudo mst status -v’ inside the DPU to find that out.

              in reply to: DPU Shut Down, can’t bring it back up #9540
              Tanay Maheshwari
              Participant

                Hi Mert,
                Unfortunately it didn’t update the firmware configurations. I am trying to figure out what the blocker is here.
                This is what I use to check whether the firmware configurations have been applied. They still remain the same.
                sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query (as seen in the screenshot, anything marked with an asterisk * is due to change on reboot, but it never does)
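To make the before/after comparison less error-prone than eyeballing a screenshot, the asterisk check could be scripted along these lines. This is a sketch under assumptions: `pending_lines` is a hypothetical helper, it assumes the asterisk is the first non-blank character of a marked row, and the sample text is placeholder text, not real mlxconfig output.

```python
from typing import List


def pending_lines(query_output: str) -> List[str]:
    """Return the rows of a query dump that carry the '*' marker.

    Assumes the marker is the first non-blank character of the row.
    """
    return [line.strip() for line in query_output.splitlines()
            if line.lstrip().startswith("*")]


# Placeholder text only, NOT real mlxconfig output; the row layout is assumed.
sample = """\
    SOME_UNCHANGED_PARAM           value
*   NVME_EMULATION_ENABLE          old -> new
*   PCI_SWITCH_EMULATION_ENABLE    old -> new
"""

for line in pending_lines(sample):
    print(line)
```

Running this on the query output before and after the reset would show whether the set of marked rows shrinks; if it stays the same, the changes were not applied.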

                In my local setup with a BlueField-2, either a simple reboot or the mlxfwreset command mentioned above is sufficient to apply the changes. A power cycle is not required.

                Thank you for taking the effort in helping me with this!

                in reply to: DPU Shut Down, can’t bring it back up #9536
                Tanay Maheshwari
                Participant

                  Hi Mert,
                  Is it possible to do a cold-reboot on the HAWI DPU to see if that applies firmware configurations?

                  in reply to: DPU Shut Down, can’t bring it back up #9530
                  Tanay Maheshwari
                  Participant

                    Hi Mert,
                    Apologies, but I had to delete the slice since I couldn’t get any stuff to work there anymore. Also, I was unable to create a DPU slice in SEAT (seems like the DPU is still shut down)

                    I created a new DPU slice on HAWI, and this command worked there with no timeout.
                    sudo mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r

                    However, the firmware configuration refuses to update, even after running that command and doing a manual reboot.
                    Slice Details:

                    Slice

                    ID: f761a02e-dae0-4122-b0a1-40b6cffc84e6
                    Name: CEPH_DOCA_POC
                    Lease Expiration (UTC): 2026-03-02 01:00:29 +0000
                    Lease Start (UTC): 2026-02-25 00:53:21 +0000
                    Project ID: 42b3494b-982f-4fe8-b160-26f28c3e33c0
                    State: StableOK
                    Email: mahesh88@purdue.edu
                    UserId: 14e40626-117b-43fe-a9dd-89b0063d126d

                    Would love some guidance here.

                    Thanks,
                    Tanay

                    in reply to: Availability of DPU-powered SmartNICs #9137
                    Tanay Maheshwari
                    Participant

                      Hi Komal, are the BF3s available for testing now?
                      I used output_table = fablib.list_sites() to list all the resources, but I couldn’t find any.

                      Thanks!

                      in reply to: Bluefield NICs | FABRIC Webinar — November 11 at 3 PM ET #9132
                      Tanay Maheshwari
                      Participant

                        Are the DPUs available for use already?

                        in reply to: Availability of DPU-powered SmartNICs #8913
                        Tanay Maheshwari
                        Participant

                          Hi Komal,
                          Any updates on the BlueField integration? I checked the Fall updates and they don’t mention any BlueFields!

                          Thanks,
                          Tanay

                          in reply to: Availability of DPU-powered SmartNICs #8600
                          Tanay Maheshwari
                          Participant

                            Hi Komal,
                            Any updates on the DPU integration?

                            in reply to: Hardware Steering – Connectx6 #8266
                            Tanay Maheshwari
                            Participant

                              Hi Komal,
                              Just wanted to check if there is a timeline for integrating the BlueField DPUs.

                              Thanks,
                              Tanay

                              in reply to: Hardware Steering – Connectx6 #8133
                              Tanay Maheshwari
                              Participant

                                Hi Komal,
                                It’s still unresolved. Do we have any updates?

                                Thanks,
                                Tanay

                                in reply to: FABRIC-CERN in maintenance mode #7869
                                Tanay Maheshwari
                                Participant

                                  Hello, any ETA on this?

                                  Thanks,
                                  Tanay
