Forum Replies Created
Hi Mert,
Unfortunately, I haven’t yet been able to get a ConnectX-7 at the UCSD site. I understand you probably cannot help, but I’m still asking whether there is anything that can be done so that we can continue our development and evaluation.
Thanks in advance!
Kind regards,
Plabon Dutta

Hi Mert,
That’s great news. Thank you so much for getting back to me on this.
I was trying to allocate at UCSD. I could see from nic_connectx_7_100_available that there were units available at that site. However, when I submit a slice, it says: Insufficient Resources.

Kind regards,
Plabon Dutta

But I still got:
L25GC_BF3.txt: Sorry, you are not allowed to upload this file type.
Here is the gdrive link: https://drive.google.com/file/d/1J9GI8u68OzHdJJD-0dZinobddDSXQot3/view?usp=sharing
Thank you, Mert.
I tried attaching my Jupyter notebook but could not do so, so I have uploaded it as a .txt file now.
Hi Mert,
Thanks a lot for your response.
About the steps from the PDF, I was doing more or less the same. There were two things missing in my case; once I changed those, it started working:
- I wasn’t doing node.config() from the notebook. Now, once the DPU comes up after bfb-install, I do a node.config(), and that brings back all the interfaces properly.
- After reboot, my host wasn’t coming back up, probably because of the sequence of the reboot or something else.
And because of these, I couldn’t see the actual representor devices on the DPU side and couldn’t start the comm channel properly.
Now, about the current issue: the DOCA doc for the accelerated UPF does say it was tested on FW version 32.43. However, the Flex Parser Profile issue remains, as we cannot do a cold reboot.
The FW update issue is there as well. It may not impact this example, but it is a concern that the firmware still reports 32.43 after we do the FW update via bfb-install.
Best regards,
Plabon
To add to my previous comment: my request would be for you to try running the DOCA Accelerated UPF application on FABRIC. That would make the capabilities and issues clear.
Link: https://docs.nvidia.com/doca/sdk/doca-accelerated-upf-reference-application-guide/index.html
Thank you!
Plabon
Hi Mert,
I have been testing BF3s on FABRIC for the past few days. As I said earlier, our intention is to offload match/action processing to hardware using the DOCA Flow API. Based on that, let me share some of our observations below:
- When we do the bfb-install from host, we get the following log after a successful installation:
Checking if local host has root access...
Checking if rshim driver is running locally...
Pushing bfb + cfg
Collecting BlueField booting status. Press Ctrl+C to stop…
INFO[PSC]: PSC BL1 START
INFO[BL2]: start
INFO[BL2]: boot mode (rshim)
INFO[BL2]: VDD_CPU: 783 mV
INFO[BL2]: VDDQ: 1118 mV
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle GA Secured
INFO[BL31]: runtime
INFO[BL31]: MB ping success
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: UPVS valid
INFO[UEFI]: PMI: updates started
INFO[UEFI]: PMI: total updates: 1
INFO[UEFI]: PMI: updates completed, status 0
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: UEFI Secure Boot (enabled)
INFO[UEFI]: Redfish enabled
INFO[UEFI]: exit Boot Service
INFO[MISC]: Found bf.cfg
INFO[MISC]: Erasing eMMC drive: /dev/mmcblk0
INFO[MISC]: Erasing NVME drive: /dev/nvme0n1
INFO[MISC]: Ubuntu installation started
INFO[MISC]: Installing OS image
INFO[MISC]: Running bfb_modify_os from bf.cfg
INFO[MISC]: Ubuntu installation completed
WARN[MISC]: Skipping BMC components upgrade.
INFO[MISC]: Updating NIC firmware...
INFO[MISC]: NIC firmware update done: 32.48.1000
INFO[MISC]: Installation finished

However, once the BF3 is available, if we do
mlxfwmanager --query from inside, we see:

Querying Mellanox devices firmware ...

Device #1:
----------
  Device Type:      BlueField3
  Part Number:      900-9D3B6-00CC-EA_Ax
  Description:      NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
  PSID:             MT_0000001115
  PCI Device Name:  /dev/mst/mt41692_pciconf0
  Base MAC:         e89e494efd50
  Versions:         Current          Available
     FW             32.43.2402       N/A
     PXE            3.7.0500         N/A
     UEFI           14.36.0021       N/A
     UEFI Virtio blk  22.4.0014      N/A
     UEFI Virtio net  21.4.0013      N/A
  Status:           No matching image found

You can see the NIC FW hasn’t updated to the latest version. 32.43 is much older and doesn’t support a few things (more on this later).
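Since bfb-install reports 32.48.1000 but mlxfwmanager still shows 32.43.2402, it can help to detect the mismatch programmatically when scripting these checks from the notebook. This is only an illustrative sketch: parse_fw_version and fw_is_at_least are hypothetical helper names, and the regex assumes the standard `mlxfwmanager --query` layout shown above.

```python
import re

def parse_fw_version(mlxfwmanager_output: str) -> str:
    """Extract the current FW version from `mlxfwmanager --query` output.

    Assumes the standard layout with a line like:
        FW             32.43.2402       N/A
    """
    m = re.search(r"^\s*FW\s+(\d+\.\d+\.\d+)", mlxfwmanager_output, re.MULTILINE)
    if m is None:
        raise ValueError("no FW version line found")
    return m.group(1)

def fw_is_at_least(current: str, expected: str) -> bool:
    """Compare dotted FW versions numerically, e.g. '32.48.1000' vs '32.43.2402'."""
    as_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return as_tuple(current) >= as_tuple(expected)

# Example against (a fragment of) the output shown above:
sample = """
  Versions:         Current          Available
     FW             32.43.2402       N/A
     PXE            3.7.0500         N/A
"""
print(parse_fw_version(sample))                                 # 32.43.2402
print(fw_is_at_least(parse_fw_version(sample), "32.48.1000"))   # False
```

Running this right after bfb-install would flag that the NIC firmware did not actually move to the version the installer claimed.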
- I compiled and ran our application, but it kept throwing errors like: cannot get resource(ARGUMENT_64B), cannot create mlx5dv hws action for type, and failed to create matcher, err -95.
- Then, I tried running the Sample UPF Accelerator Application from NVIDIA (https://docs.nvidia.com/doca/sdk/doca-accelerated-upf-reference-application-guide/index.html). I kept getting the same “failed to create matcher, err -95” errors no matter what:
ubuntu@localhost:/tmp/build/upf_accel$ sudo /tmp/build/upf_accel/doca_upf_accel -l 0-3 -- -a pci/03:00.0,dv_flow_en=2 -a pci/03:00.1,dv_flow_en=2 -f smf_policy.json
[2026-04-04 02:24:47:350196][1433225472][DOCA][INF][CORE][doca_log.cpp:900] DOCA version 3.3.0109
[2026-04-04 02:24:47:350404][1433225472][DOCA][INF][UPF_ACCEL][upf_accel.c:1800][main] Starting UPF Acceleration app pid 26302
EAL: Detected CPU lcores: 16
EAL: Detected NUMA nodes: 1
EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
[2026-04-04 02:24:47:499738][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:797][upf_accel_pdr_parse] Parsed PDR id=1 farId=1 first_urrid=1 first_qerid=1 PDI SI=0 QFI=0 teid_start=1073741824 teid_end=1073807359 IP=version 4, 101a8c0/255 UEIP=version 4, ac/0 SDF proto=0 from=IP version 4, 0/0:0-65535 to= IP version 4, 0/0:0-65535
[2026-04-04 02:24:47:499782][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:797][upf_accel_pdr_parse] Parsed PDR id=2 farId=2 first_urrid=2 first_qerid=2 PDI SI=2 QFI=0 teid_start=0 teid_end=0 IP=version 4, 0/0 UEIP=version 4, ac/0 SDF proto=0 from=IP version 4, 0/0:0-65535 to= IP version 4, 0/0:0-65535
[2026-04-04 02:24:47:499800][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:797][upf_accel_pdr_parse] Parsed PDR id=3 farId=1 first_urrid=1 first_qerid=1 PDI SI=0 QFI=1 teid_start=1073741824 teid_end=1073807359 IP=version 4, 101a8c0/255 UEIP=version 4, ac/0 SDF proto=0 from=IP version 4, 0/0:0-65535 to= IP version 4, 0/0:0-65535
[2026-04-04 02:24:47:499810][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:920][upf_accel_far_parse] Parsed FAR id=1: Outer Header ip=0/0
[2026-04-04 02:24:47:499817][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:920][upf_accel_far_parse] Parsed FAR id=2: Outer Header ip=101a8c0/255
[2026-04-04 02:24:47:499825][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:1000][upf_accel_urr_parse] Parsed URR id=1 volume_quota_total_volume=4000000000
[2026-04-04 02:24:47:499831][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:1000][upf_accel_urr_parse] Parsed URR id=2 volume_quota_total_volume=200000
[2026-04-04 02:24:47:499841][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:1084][upf_accel_qer_parse] Parsed QER id=1 qfi=20 MBR dl=2000000000 ul=2000000000
[2026-04-04 02:24:47:499850][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:1084][upf_accel_qer_parse] Parsed QER id=2 qfi=20 MBR dl=2000000000 ul=2000000000
[2026-04-04 02:24:47:500926][1433225472][DOCA][WRN][FLOW][engine_model.c:88] adapting queue depth to 128.
[2026-04-04 02:24:49:303964][1433225472][DOCA][ERR][FLOW::DRIVER::RUNTIME][nv_hws_wrappers.c:162] failed to create matcher, err -95
[2026-04-04 02:24:49:304007][1433225472][DOCA][ERR][FLOW][hws_matcher.c:1129] failed to create matcher reference for port 0
[2026-04-04 02:24:49:304026][1433225472][DOCA][ERR][FLOW][hws_pipe_core.c:226] failed creating matcher for pipe core - rc=-95
[2026-04-04 02:24:49:304031][1433225472][DOCA][ERR][FLOW][hws_pipe_core.c:282] failed pushing pipe core - matcher creation failed rc=-95
[2026-04-04 02:24:49:304034][1433225472][DOCA][ERR][FLOW][hws_pipe_core.c:597] failed building pipe core - matcher alloc rc=-95
[2026-04-04 02:24:49:304049][1433225472][DOCA][ERR][FLOW][engine_pipe_basic.c:292] Failed to create basic pipe core, rc = -95
[2026-04-04 02:24:49:304054][1433225472][DOCA][ERR][FLOW][engine_pipe.c:818] failed creating pipe - submit failed rc=(-95)
[2026-04-04 02:24:49:305747][1433225472][DOCA][ERR][FLOW][doca_flow.c:1862] engine pipe creation failed, rc = -95
[2026-04-04 02:24:49:305774][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel_pipeline.c:141][upf_accel_pipe_create] Failed to create UPF accel pipe: Operation not supported
[2026-04-04 02:24:49:305785][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel_pipeline.c:1260][upf_accel_pipe_7t_create] Failed to create 7t pipe: Operation not supported
[2026-04-04 02:24:49:305790][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel_pipeline.c:1865][upf_accel_pipeline_rx_create] Failed to create 7t inner IPv4 pipe: Operation not supported
[2026-04-04 02:24:49:305795][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel_pipeline.c:2138][upf_accel_pipeline_create] Failed to create rx pipeline in port 0: Operation not supported
[2026-04-04 02:24:49:305800][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel.c:1375][init_upf_accel] Failed to create pipeline: Operation not supported
[2026-04-04 02:24:51:343602][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel.c:1881][main] init_upf_accel() encountered an error: Operation not supported
[2026-04-04 02:24:51:383440][1433225472][DOCA][WRN][DPDK_BRIDGE][doca_dpdk.cpp:562] DPDK dev already detached: 0
[2026-04-04 02:24:51:384573][1433225472][DOCA][WRN][DPDK_BRIDGE][doca_dpdk.cpp:562] DPDK dev already detached: 1
[2026-04-04 02:24:51:385668][1433225472][DOCA][INF][UPF_ACCEL][upf_accel.c:1912][main] UPF Acceleration app finished with errors
Now, the above link explicitly says:
The Flex Parser Profile is a setting that enables flexible protocol parsing on NVIDIA NICs/DPUs. To enable GTP protocol support, set the Flex Parser Profile to 3 using mlxconfig. This configuration is mandatory and must be done manually in the system.
… Changing this configuration using mlxconfig requires a system (cold) reboot for the changes to take effect.
I tried running the command, rebooting the host, and rebooting the DPU; nothing worked. Here’s what’s happening:
When the code attempted to insert a rule matching a GTP tunnel ID (tun.gtp_teid), the DOCA driver failed with err -95 (Operation not supported) and could not create the matcher. That’s because the hardware parser (the physical silicon block) doesn’t understand GTP TEID by default. Since the hardware couldn’t parse the field, the Hardware Steering (HWS) engine rejected the matchers.
So, as per our hypothesis, the following things have now blocked our progress:
- FW version not upgrading (maybe because of the lack of a cold reboot).
- Cannot enable Flex Parser Profile 3, because we cannot do a cold reboot.
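To narrow this down, it may help to confirm from the notebook whether the Flex Parser Profile setting actually took effect after each reboot attempt. The sketch below is an assumption-laden illustration: the device path is taken from the mlxfwmanager output above, parse_flex_parser_profile is a hypothetical helper, and the text layout is assumed from mlxconfig's usual query output for FLEX_PARSER_PROFILE_ENABLE.

```python
import re
import subprocess

# Device path as reported by mlxfwmanager in the output above (may differ per node)
DEVICE = "/dev/mst/mt41692_pciconf0"

def parse_flex_parser_profile(mlxconfig_output: str) -> int:
    """Extract the profile number from a line like 'FLEX_PARSER_PROFILE_ENABLE   3'."""
    m = re.search(r"FLEX_PARSER_PROFILE_ENABLE\s+(\d+)", mlxconfig_output)
    if m is None:
        raise ValueError("FLEX_PARSER_PROFILE_ENABLE not reported")
    return int(m.group(1))

def query_flex_parser_profile(device: str = DEVICE) -> int:
    """Run `mlxconfig q` for the parameter and return its current value."""
    out = subprocess.run(
        ["mlxconfig", "-d", device, "q", "FLEX_PARSER_PROFILE_ENABLE"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_flex_parser_profile(out)

# Example on an assumed output fragment:
print(parse_flex_parser_profile("FLEX_PARSER_PROFILE_ENABLE        3"))  # 3
```

If this still reports 0 after `mlxconfig ... s FLEX_PARSER_PROFILE_ENABLE=3` plus the reboots that are possible on FABRIC, that would support the hypothesis that only a true cold reboot applies the setting.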
It would be great if we could work together on this. Our development, testing, and evaluation of the system are now stuck.
Thanks,
Plabon Dutta

Hi Mert,
Great news. I could make the DOCA Communication Channel run on SDK version 3.3.
Image: dpu_ubuntu_24
Site: MICH

The only remaining issue for now is the internet connectivity issue on the BF3, as NAT doesn’t work. I used tinyproxy to do apt install. I will prepare a runbook with the step-by-step process.

However, I would really appreciate it if you could clarify how you have been getting internet access on the BF3 and whether you are doing anything about the slow Mellanox repo.

Thanks,
Plabon
Off-topic: how are you enabling the internet connection on the BF3, Mert? Previously I was doing something like the below from the notebook:
# Host-side NAT so the DPU (on tmfifo_net0) can reach the internet via enp3s0
stdout, stderr = node1.execute('sudo iptables -t nat -A POSTROUTING -o enp3s0 -j MASQUERADE', quiet=True)
stdout, stderr = node1.execute('sudo iptables -A FORWARD -i enp3s0 -o tmfifo_net0 -m state --state RELATED,ESTABLISHED -j ACCEPT', quiet=True)
stdout, stderr = node1.execute('sudo iptables -A FORWARD -i tmfifo_net0 -o enp3s0 -j ACCEPT', quiet=True)
stdout, stderr = node1.execute('sudo sysctl -w net.ipv4.ip_forward=1', quiet=True)
# Point the DPU at working DNS servers
stdout, stderr = node1.execute(
    "ssh ubuntu@192.168.100.2 \"echo -e 'nameserver 8.8.8.8\nnameserver 192.168.100.1' | sudo tee /etc/resolv.conf > /dev/null\""
)
But that doesn’t seem to work anymore, probably because of IPv6 routing.
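For what it's worth, the host-side NAT rules above can be kept as a small helper that just returns the command list, which makes them easy to re-apply or audit per node. The interface names are the ones from the snippet above and are assumptions that may differ per site; note this does nothing for an IPv6-only uplink, in which case a proxy such as tinyproxy is still needed on the host.

```python
def dpu_nat_commands(wan_if: str = "enp3s0", dpu_if: str = "tmfifo_net0") -> list[str]:
    """Host-side commands to masquerade the DPU's tmfifo network out through
    the WAN interface. Interface names are illustrative defaults taken from
    the snippet above; adjust per node."""
    return [
        # Enable IPv4 forwarding first so the FORWARD rules take effect
        "sudo sysctl -w net.ipv4.ip_forward=1",
        f"sudo iptables -t nat -A POSTROUTING -o {wan_if} -j MASQUERADE",
        f"sudo iptables -A FORWARD -i {wan_if} -o {dpu_if} -m state --state RELATED,ESTABLISHED -j ACCEPT",
        f"sudo iptables -A FORWARD -i {dpu_if} -o {wan_if} -j ACCEPT",
    ]

# Usage from the notebook (node1 being the fablib node handle used above):
# for cmd in dpu_nat_commands():
#     node1.execute(cmd, quiet=True)
```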
Also, the Mellanox repo is sometimes notoriously slow on the host anyway.
I just checked the docs and saw you are using DOCA version 3.0.0058 on both the host and the BF3. That’s great, because it means it works with an updated DOCA version as well. So, I assume you did a custom BFB install.
Anyway, it would be great if you share what exactly you did throughout.
Thanks in advance, Mert.
That’s great, Mert. This means it can actually work. I checked your attached files as well. Everything looks perfect. I have some questions though:
- What’s the image you are using: “default_ubuntu_22” or “dpu_ubuntu_22”?
- Did you do anything special to flash the BF3, or just the .configure()?
- I’m assuming you were using the default DOCA version offered on FABRIC, which is 2.9, right? Would you mind trying with the updated 3.3 SDK please?
- Any other steps you took to make --list-devs work on the host, like a reboot or anything?
- What’s the site you are using?
Hi Mert,
Sorry for the delayed response.
So, we currently don’t have access to another BF3. Thus, FABRIC is our only option, and I have tried many different combinations of things. Every time, it’s one thing or another.
For the current experiment, I used the dpu_ubuntu_24 image on the host. The DOCA SDK version was 2.9. As for the DPU, I ran .configure() on the NIC from the notebook to bring it up with the default offered version. I have shared logs from that setup. The problem remains the same: “No DOCA Device Found” on the host.

Now, we have tried updating the SDK to 3.3 and pushing the matching BFB image onto the BF3. Everything went well, but again, not consistently. I ran a number of other commands, like:
sudo apt-get install -y linux-modules-extra-$(uname -r)
sudo modprobe macsec
sudo /etc/init.d/openibd restart
sudo mst start
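The recovery sequence above can be scripted so the commands run in order and stop on the first failure; this is just a convenience wrapper around the exact commands listed (run as root on the host), not anything beyond them.

```python
import subprocess

# Exactly the host-side recovery sequence from the post, in order.
RECOVERY_CMDS = [
    "apt-get install -y linux-modules-extra-$(uname -r)",  # extra kernel modules
    "modprobe macsec",                                     # load the macsec module
    "/etc/init.d/openibd restart",                         # restart the OFED stack
    "mst start",                                           # start Mellanox software tools
]

def run_recovery() -> None:
    """Run each command as root; check=True aborts on the first failure."""
    for cmd in RECOVERY_CMDS:
        subprocess.run(["bash", "-c", cmd], check=True)
```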
After that, I rebooted the host, and then I could see the DOCA devices on the host, like below:
ubuntu@node2:~$ sudo /opt/mellanox/doca/tools/doca_caps --list-devs
PCI: 0000:07:00.0
ibdev_name mlx5_0
iface_name enp7s0
iface_index 3
pci_func_type PF
uplink_ib_port 1
mac_addr 02:84:f9:28:9a:89
ipv4_addr 0.0.0.0
ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
gid_table_size 255
GID[0] fe80:0000:0000:0000:0084:f9ff:fe28:9a89
GID[1] fe80:0000:0000:0000:0084:f9ff:fe28:9a89
PCI: 0000:08:00.0
ibdev_name mlx5_1
iface_name enp8s0
iface_index 4
pci_func_type PF
uplink_ib_port 1
mac_addr 02:e4:3e:36:11:fd
ipv4_addr 0.0.0.0
ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
gid_table_size 255
GID[0] fe80:0000:0000:0000:00e4:3eff:fe36:11fd
GID[1] fe80:0000:0000:0000:00e4:3eff:fe36:11fd
PCI: 0000:09:00.0
ibdev_name mlx5_2
iface_name enp9s0
iface_index 5
pci_func_type PF
uplink_ib_port 1
mac_addr 0a:1b:3c:24:93:83
ipv4_addr 0.0.0.0
ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
gid_table_size 255
GID[0] fe80:0000:0000:0000:081b:3cff:fe24:9383
GID[1] fe80:0000:0000:0000:081b:3cff:fe24:9383
PCI: 0000:0a:00.0
ibdev_name mlx5_3
iface_name enp10s0np0
iface_index 6
pci_func_type PF
uplink_ib_port 1
mac_addr cc:40:f3:80:01:fc
ipv4_addr 0.0.0.0
ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
gid_table_size 255
GID[0] fe80:0000:0000:0000:ce40:f3ff:fe80:01fc
GID[1] fe80:0000:0000:0000:ce40:f3ff:fe80:01fc
PCI: 0000:0b:00.0
ibdev_name mlx5_4
iface_name enp11s0np1
iface_index 7
pci_func_type PF
uplink_ib_port 1
mac_addr cc:40:f3:80:01:fd
ipv4_addr 0.0.0.0
ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
gid_table_size 255
GID[0] fe80:0000:0000:0000:ce40:f3ff:fe80:01fd
GID[1] fe80:0000:0000:0000:ce40:f3ff:fe80:01fd

I thought that was it, and that I would now be able to run the DOCA communication channel sample application. But again, I got errors like:
[2026-03-28 23:58:02:734859][3853104960][DOCA][INF][CORE][doca_log.cpp:900] DOCA version 3.3.0109
[2026-03-28 23:58:03:068693][3853104960][DOCA][ERR][CORE][linux_devx_obj.cpp:115] Failed to create devx object with syndrome=0xe5300
[2026-03-28 23:58:03:069245][3853104960][DOCA][ERR][CORE][doca_dev.cpp:2699] Failed to create devx object: failed to allocate devx object wrapper with exception:
[2026-03-28 23:58:03:069322][3853104960][DOCA][ERR][CORE][doca_dev.cpp:2699] DOCA exception [DOCA_ERROR_DRIVER] with message Failed to create devx object
[2026-03-28 23:58:03:069349][3853104960][DOCA][ERR][COMCH][cc_devx_2.cpp:265] Failed to create channel connection object with error DOCA_ERROR_DRIVER
[2026-03-28 23:58:03:069368][3853104960][DOCA][ERR][COMCH][qp_channel_2.cpp:996] client registration failed for send side
[2026-03-28 23:58:03:069392][3853104960][DOCA][ERR][COMCH][doca_comm_channel_2.cpp:853] client registration failed for doca_comm_channel_2_ep_client_connect()
[2026-03-28 23:58:03:069410][3853104960][DOCA][ERR][COMCH][doca_comch_pe.cpp:413] failed to connect on client with error = DOCA_ERROR_CONNECTION_ABORTED
[2026-03-28 23:58:03:074705][3853104960][DOCA][ERR][CORE][doca_pe.cpp:1119] Progress engine 0x60bd204c1380: Failed to start context=0x60bd204c4bc0. err=DOCA_ERROR_CONNECTION_ABORTED
[2026-03-28 23:58:03:074732][3853104960][DOCA][ERR][COMCH_UTILS][comch_utils.c:535][comch_utils_fast_path_init] Failed to start comch client context: Connection aborted
About the OS, I tried with dpu_ubuntu_24 and default_ubuntu_24. For the default OSes, I thought there might be an OFED issue or something, and that’s why it might not work, so I later tried with the DPU images.

If your setup is working with Ubuntu 22, can you please try to run the DOCA Comm Channel sample application?
Thank you,
Plabon
Hi Komal,
Thank you for the update! That’s exciting news; we’ll be eagerly waiting.
Best regards,
Plabon