1. DPU Shut Down, can’t bring it back up

  • #9527
    Tanay Maheshwari
    Participant

      Hello,
      I am trying to enable the SNAP Service on my Bluefield-3 DPU.
      Ref: https://docs.nvidia.com/doca/archive/2-9-1/doca+snap-4+service+guide/index.html#src-3453016610_id-.DOCASNAP4ServiceGuidev2.9.1-Hot-plugFirmwareConfiguration

      It asks to set the ‘PCI_SWITCH_EMULATION_NUM_PORT’ parameter using the mlxconfig tool.

      However, to apply that configuration, you need to reboot or reset the DPU.
      Ref: https://docs.nvidia.com/doca/sdk/nvidia-bluefield-reset-and-reboot-procedures/index.html
      The ‘mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r’ command timed out, so I followed the next option in the documentation: run ‘shutdown -h’ on the DPU, then bring it back up from the host (either by rebooting the host or by using mlxconfig).

      Rebooting the host didn’t work, and neither did the mlxconfig command (it doesn’t work on virtual machines). Now the BlueField-3 DPU stays shut down, and I have no idea how to bring it back up.

      Please help me out here! Also, any recommendations on how to perform configuration changes on the DPU?

      Slice Details:

      ID 1ecf4135-caae-405e-aa38-9470d757811d
      Name CEPH_DOCA_POC
      Lease Expiration (UTC) 2026-02-28 20:15:43 +0000
      Lease Start (UTC) 2026-02-20 22:03:25 +0000
      Project ID 42b3494b-982f-4fe8-b160-26f28c3e33c0
      State StableOK
      Email mahesh88@purdue.edu
      UserId 14e40626-117b-43fe-a9dd-89b0063d126d
      #9529
      Mert Cevik
      Moderator

        Hello Tanay,

        Can you share the state of your slice and slivers from your point of view? All slivers of the slice seem to be deleted.

        Best regards,
        Mert

        #9530
        Tanay Maheshwari
        Participant

          Hi Mert,
          Apologies, but I had to delete the slice since I couldn’t get anything to work there anymore. Also, I was unable to create a DPU slice on SEAT (it seems the DPU is still shut down).

          I created a new DPU slice on HAWI, and this command worked there with no timeout.
          sudo mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r

          However, the firmware configuration refuses to update, even after running that command and doing a manual reboot.
          Slice Details:

          ID f761a02e-dae0-4122-b0a1-40b6cffc84e6
          Name CEPH_DOCA_POC
          Lease Expiration (UTC) 2026-03-02 01:00:29 +0000
          Lease Start (UTC) 2026-02-25 00:53:21 +0000
          Project ID 42b3494b-982f-4fe8-b160-26f28c3e33c0
          State StableOK
          Email mahesh88@purdue.edu
          UserId 14e40626-117b-43fe-a9dd-89b0063d126d

          Would love some guidance here.

          Thanks,
          Tanay

          #9531
          Mert Cevik
          Moderator

            DPU on the SEAT node is recovered and it can be used for experiments.

            For the firmware configuration, I need to read the documentation. I have no prior experience with these cards.

            #9536
            Tanay Maheshwari
            Participant

              Hi Mert,
              Is it possible to do a cold-reboot on the HAWI DPU to see if that applies firmware configurations?

              #9537
              Mert Cevik
              Moderator

                Hi Tanay,

                I performed a power reset for the DPU. Can you please check if that worked well for the firmware configuration change?


                ubuntu@localhost:~$ uname -a
                Linux localhost.localdomain 5.15.0-1065-bluefield #67-Ubuntu SMP Tue Apr 22 11:10:15 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
                ubuntu@localhost:~$ uptime
                16:19:21 up 1 min, 1 user, load average: 6.83, 2.15, 0.75

                I will describe the details of how I performed this later. In short, I had included the BMC bindings in the DPU integration and used that path; however, I’m not very sure about the terminology or specifics, as I have only taken some intuitive actions so far. I’m also in touch with the FABRIC team about this item, so your input on the progress will be helpful for our further enhancements.

                • This reply was modified 6 days, 10 hours ago by Mert Cevik.
                #9540
                Tanay Maheshwari
                Participant

                  Hi Mert,
                  Unfortunately, it didn’t update the firmware configuration. I am trying to figure out what the blocker is.
                  This is what I use to check whether the firmware configuration has been applied. The values still remain the same.
                  sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query (as seen in the screenshot, anything marked with an asterisk * is supposed to change on reboot, but it never does)
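A minimal sketch for narrowing the query output to the parameters in question, rather than scanning the full list by eye. `show_params` is a hypothetical helper name, and the parameter names are the ones mentioned in this thread. It only filters lines by name and assumes nothing about the column layout of the output.

```shell
#!/usr/bin/env bash
# Print only the lines that mention one of the given parameter names.
# Reads `mlxconfig ... -e query` output on stdin.
show_params() {
    local pattern
    # Join the arguments with '|' to build an alternation for grep -E.
    pattern=$(IFS='|'; echo "$*")
    # `|| true` keeps a "no match" result from being treated as a failure.
    grep -E "$pattern" || true
}

# Usage on the DPU (device path taken from the post; yours may differ):
#   sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query \
#     | show_params PCI_SWITCH_EMULATION NVME_EMULATION_ENABLE
```

Running this before and after the reset makes it easy to compare the same few rows side by side.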

                  In my local setup with a BlueField-2, a simple reboot or the above-mentioned mlxfwreset command is sufficient to apply changes. A power cycle is not required.

                  Thank you for taking the effort in helping me with this!

                  #9543
                  Mert Cevik
                  Moderator

                    Hi Tanay,

                    As a next step, we can try cold-rebooting the server that holds the DPU; however, this is not possible while other users have VM slivers running on it. I need to make special arrangements for that.

                    On our Development environment, we have a BlueField-2 DPU and we can perform all kinds of trials on it. You pointed to the web page that describes the configuration steps, but it would be even better if you provided us with a complete list of commands for this configuration, so we can test it on the Development site. If there is any difference between BlueField-2 and BlueField-3, it would be good to point that out as well. I’m also currently preparing additional BlueField-3 integrations, so I have BlueField-3 cards that were just delivered, and I can use one of them to test on the Development site later.

                    Lastly, on the web page under the Hot-plug Firmware Configuration section, there is a note that says “Hotplug is not guaranteed to work on AMD machines.” Servers on the FABRIC Testbed infrastructure are all AMD-based Dell R7525 servers. I’m not sure whether this is relevant to our issue.

                    Best regards,
                    Mert

                    #9544
                    Tanay Maheshwari
                    Participant

                      I don’t think this is a BlueField problem; it is most likely a host problem.

                      Steps:
                      1. To view the current configuration: sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query
                      2. To change a configuration value (in this case, PCI_SWITCH_EMULATION_ENABLE and NVME_EMULATION_ENABLE from 0 to 1): sudo mlxconfig -d /dev/mst/mt41692_pciconf0 set PCI_SWITCH_EMULATION_ENABLE=1 NVME_EMULATION_ENABLE=1
                      3. To apply the configuration changes, perform a system reboot as described in the DOCA documentation: sudo mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r
                      4. After the reboot, sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query should display the updated values.

                      “Hotplug is not guaranteed to work on AMD machines.” I did think that could be one of the reasons, but unfortunately I can’t find any relevant logs at all. I will continue troubleshooting and let you know. I will also post an issue on the DOCA devzone to see if NVIDIA has any clues about this.

                      Thanks again Mert!

                      EDIT: There should be no differences in the commands between BlueField-2 and BlueField-3. The ‘mt41692’ part of the device path changes based on your device.
                      Use ‘sudo mst start’ and ‘sudo mst status -v’ inside the DPU to find it.
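The steps above can be collected into a small script. This is only a sketch: the device path and PCI address are the ones from this thread and will differ per device, and `run`, `apply_snap_fw_config`, `DEV`, `PCI`, and `RUN` are hypothetical names introduced for illustration. By default it only prints the commands; set `RUN=1` to execute them on a DPU with the MFT tools installed.

```shell
#!/usr/bin/env bash
# Sketch of the configuration steps from this thread.
# Default is a dry run that prints each command; RUN=1 executes them.

DEV="${DEV:-/dev/mst/mt41692_pciconf0}"  # from the thread; confirm with `sudo mst status -v`
PCI="${PCI:-03:00.0}"                    # PCI address used in the thread

# Echo the command, and execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

apply_snap_fw_config() {
    # Start the MST service so the device node exists.
    run sudo mst start
    # Step 1: view the current configuration.
    run sudo mlxconfig -d "$DEV" -e query
    # Step 2: change the two emulation parameters from 0 to 1.
    run sudo mlxconfig -d "$DEV" set PCI_SWITCH_EMULATION_ENABLE=1 NVME_EMULATION_ENABLE=1
    # Step 3: reset so the new values take effect.
    run sudo mlxfwreset -d "$PCI" -y -l 3 --sync 1 r
    # Step 4 is manual: once the DPU is back up, rerun the query from step 1.
}

apply_snap_fw_config
```

The final query is left out of the script on purpose, since the reset in step 3 is expected to interrupt the session on the DPU.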

                      #9547
                      Tanay Maheshwari
                      Participant

                        Hi Mert,
                        Just a suggestion: it would be great to have a DOCA SNAP tutorial artifact (like the ones we have for P4 and compression). https://docs.nvidia.com/doca/archive/2-9-1/doca+snap-4+service+guide/index.html#src-3453016610_id-.DOCASNAP4ServiceGuidev2.9.1-Hot-plugFirmwareConfiguration
