DPU Shut Down, can’t bring it back up
Tagged: Bluefield-3, DPU
- This topic has 9 replies, 2 voices, and was last updated 5 days, 2 hours ago by
Tanay Maheshwari.
February 24, 2026 at 4:33 pm #9527
Hello,
I am trying to enable the SNAP Service on my Bluefield-3 DPU.
Ref: https://docs.nvidia.com/doca/archive/2-9-1/doca+snap-4+service+guide/index.html#src-3453016610_id-.DOCASNAP4ServiceGuidev2.9.1-Hot-plugFirmwareConfiguration
It asks to enable the ‘PCI_SWITCH_EMULATION_NUM_PORT’ flag with the mlxconfig tool.
However, to apply that configuration, you need to reboot or reset the DPU.
Ref: https://docs.nvidia.com/doca/sdk/nvidia-bluefield-reset-and-reboot-procedures/index.html
However, the ‘mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r’ command timed out, so I followed the next option in the documentation: run ‘shutdown -h’ on the DPU, and then bring it back up from the host (either reboot the host or use mlxconfig).
Rebooting the host didn’t work, and neither did the mlxconfig command (it doesn’t work on virtual machines). Now the Bluefield-3 DPU stays shut down, and I have no idea how to bring it back up.
Please help me out here! Also, any recommendations on how to perform configuration changes on the DPU?
Slice Details:
ID: 1ecf4135-caae-405e-aa38-9470d757811d
Name: CEPH_DOCA_POC
Lease Expiration (UTC): 2026-02-28 20:15:43 +0000
Lease Start (UTC): 2026-02-20 22:03:25 +0000
Project ID: 42b3494b-982f-4fe8-b160-26f28c3e33c0
State: StableOK
Email: mahesh88@purdue.edu
UserId: 14e40626-117b-43fe-a9dd-89b0063d126d
This topic was modified 1 week ago by
Tanay Maheshwari.
February 25, 2026 at 12:18 am #9529
Hello Tanay,
Can you share the state of your slice and slivers from your point of view? All slivers of the slice seem to be deleted.
Best regards,
Mert
February 25, 2026 at 12:29 am #9530
Hi Mert,
Apologies, but I had to delete the slice since I couldn’t get anything to work there anymore. Also, I was unable to create a DPU slice on SEAT (it seems the DPU is still shut down).
I created a new DPU slice on HAWI, and this command worked there with no timeout:
sudo mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r
However, the firmware configuration refuses to update, even after running that command and doing a manual reboot.
Slice Details:
ID: f761a02e-dae0-4122-b0a1-40b6cffc84e6
Name: CEPH_DOCA_POC
Lease Expiration (UTC): 2026-03-02 01:00:29 +0000
Lease Start (UTC): 2026-02-25 00:53:21 +0000
Project ID: 42b3494b-982f-4fe8-b160-26f28c3e33c0
State: StableOK
Email: mahesh88@purdue.edu
UserId: 14e40626-117b-43fe-a9dd-89b0063d126d
Would love some guidance here.
Thanks,
Tanay
February 25, 2026 at 3:10 am #9531
The DPU on the SEAT node has been recovered and can be used for experiments.
For the firmware configuration, I need to read the documentation. I have no prior experience with these cards.
February 25, 2026 at 11:04 am #9536
Hi Mert,
Is it possible to do a cold reboot on the HAWI DPU to see if that applies the firmware configuration?
February 25, 2026 at 11:24 am #9537
Hi Tanay,
I performed a power reset for the DPU. Can you please check if that worked well for the firmware configuration change?
ubuntu@localhost:~$ uname -a
Linux localhost.localdomain 5.15.0-1065-bluefield #67-Ubuntu SMP Tue Apr 22 11:10:15 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@localhost:~$ uptime
16:19:21 up 1 min, 1 user, load average: 6.83, 2.15, 0.75
I can describe how I performed this in more detail later. In short, I had included the BMC bindings in the DPU integration, and I used that path. I’m not yet sure about the terminology or specifics; so far these have been intuitive actions. I’m also in touch with the FABRIC team about this item, so your input on the progress will be helpful for our further enhancements.
This reply was modified 6 days, 8 hours ago by
Mert Cevik.
February 25, 2026 at 11:33 am #9540
Hi Mert,
Unfortunately it didn’t update the firmware configuration. I am trying to figure out what the blocker is.
This is what I use to check whether the firmware configuration has been applied. The values still remain the same.
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query
(As seen in the screenshot, anything with an asterisk * is to be changed on reboot. It never does, though.)
In my local setup with a Bluefield-2, a simple reboot or the above-mentioned mlxfwreset command is sufficient to apply changes. A power cycle is not required.
Thank you for taking the effort in helping me with this!
This reply was modified 6 days, 8 hours ago by
Tanay Maheshwari.
February 25, 2026 at 12:17 pm #9543
Hi Tanay,
As a next step, we can try cold-rebooting the server that is holding the DPU, however this is not possible when other users have VM slivers running on it. I need to make special arrangements for that.
On our Development environment, we have a BlueField-2 DPU and we can perform all kinds of trials on it. You pointed to the web page that describes the configuration steps, but it would be even better if you could provide a complete list of commands for this configuration, so we can test it on the Development site. If there is any variance between BlueField-2 and BlueField-3, it would be good to indicate that as well. I’m also currently preparing for additional BlueField-3 integrations, so I have BlueField-3 cards that were just delivered, and I can later use one of them to test on the Development site.
And lastly, on the web page under the Hot-plug Firmware Configuration section, there is a note: “Hotplug is not guaranteed to work on AMD machines.” Servers on the FABRIC Testbed infrastructure are all AMD-based Dell R7525 servers. I’m not sure whether this is relevant to our issue.
Best regards,
Mert
February 25, 2026 at 12:31 pm #9544
I don’t think this is a Bluefield problem; it is most likely a host problem.
Steps:
1. To view the current configuration: sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query
2. To change a configuration value (in this case we change the values for PCI_SWITCH_EMULATION_ENABLE and NVME_EMULATION_ENABLE from 0 to 1): sudo mlxconfig -d /dev/mst/mt41692_pciconf0 set PCI_SWITCH_EMULATION_ENABLE=1 NVME_EMULATION_ENABLE=1
3. Based on the DOCA documentation, perform a system reboot to apply the configuration changes: sudo mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r
4. After the reboot, sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query should display the updated values.
“Hotplug is not guaranteed to work on AMD machines.” I did think that would be one of the reasons, but unfortunately I can’t find any relevant logs at all. I will continue my troubleshooting and let you know. I will also post an issue on the DOCA devzone to see if NVIDIA has any clues about this.
Thanks again Mert!
EDIT: There should be no differences in the commands for Bluefield-2 or 3. ‘mt41692’ changes based on your device.
Use ‘sudo mst start’ and ‘sudo mst status -v’ inside the DPU to find that out.
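The check in step 4 can be sketched as a small filter over the query output. This is only a minimal sketch, not an official tool: `pending_params` is a hypothetical helper name, and it assumes each parameter row of `mlxconfig -e query` has the layout `[marker] NAME DEFAULT CURRENT NEXT_BOOT` with no spaces inside the values. It prints the parameters whose Next Boot value still differs from the Current value, which is the symptom described in this thread.

```shell
#!/bin/sh
# pending_params: hypothetical helper name, not part of the NVIDIA tools.
# Reads `mlxconfig -e query` output on stdin and prints every parameter
# whose Next Boot value differs from its Current value.
# Assumed row layout: [marker] NAME DEFAULT CURRENT NEXT_BOOT
pending_params() {
    LC_ALL=C awk '
        # Accept rows with or without a leading marker such as "*".
        NF >= 4 && NF <= 5 &&
        $(NF-3) ~ /^[A-Z][A-Z0-9_]*$/ &&
        $(NF-1) != $NF {
            print $(NF-3), "current=" $(NF-1), "next=" $NF
        }'
}

# Usage on the DPU (device path as used in this thread):
#   sudo mlxconfig -d /dev/mst/mt41692_pciconf0 -e query | pending_params
```

If the list is empty after the reset, the pending values were applied; if parameters are still listed, the reset did not apply them.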
This reply was modified 6 days, 7 hours ago by
Tanay Maheshwari.
February 26, 2026 at 5:55 pm #9547
Hi Mert,
Just a suggestion: it would be great to have a DOCA SNAP tutorial (like the artifacts we have for P4 and compression).
https://docs.nvidia.com/doca/archive/2-9-1/doca+snap-4+service+guide/index.html#src-3453016610_id-.DOCASNAP4ServiceGuidev2.9.1-Hot-plugFirmwareConfiguration