1. BlueField-3 host-DPU communication issue on FABRIC

  • #9612
    Plabon Dutta
    Participant

      We are working on a project that offloads UPF to the BlueField-3 DPU, and our design requires host-to-DPU communication using the NVIDIA DOCA communication channel API. In the FABRIC environment, this API does not work for us, even with the default DOCA 2.9 SDK provided by FABRIC. We are running the sample application provided by NVIDIA, so the issue is not with our application logic. [https://docs.nvidia.com/doca/sdk/doca-secure-channel-application-guide/index.html]

      The core issue is probably not the communication API itself; it appears to be related to driver or firmware synchronization.

      Tested Image: dpu_ubuntu_24

      Tested Site: FIU, DALL

      On Host:

      ubuntu@node2:~$ lspci | grep mellanox -i
      07:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
      08:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
      09:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function]
      0a:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
      0b:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
      0c:00.0 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
      ubuntu@node2:~$ sudo /opt/mellanox/doca/tools/doca_caps --list-devs
      No DOCA device was found
      

      This shouldn’t happen. In a healthy setup, this command would list the DOCA devices, and in this case one of them would definitely be 0a:00.0. We tried rebooting the host as well; the issue persists.

      In our setup, the PCIe device is visible from the host, but DOCA on the host does not enumerate usable DOCA devices correctly, and the communication-channel application fails to initialize.
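      For anyone reproducing this, a quick way to check whether the mlx5 drivers are actually bound to the BlueField function is something like the following (a sketch using standard Linux tools; the 0a:00.0 address is taken from the lspci output above and will differ per slice):

      ```shell
      # Show which kernel driver (if any) is bound to the BlueField-3 PF
      lspci -ks 0a:00.0

      # Confirm the mlx5 modules are loaded; DOCA needs mlx5_ib/DevX support
      lsmod | grep mlx5

      # Try loading them explicitly and look for errors in the kernel log
      sudo modprobe mlx5_core mlx5_ib
      sudo dmesg | grep -i mlx5 | tail -n 20
      ```

      If the PF shows no "Kernel driver in use" line, or mlx5_ib fails to load, doca_caps will report no devices even though lspci sees the hardware.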

      We tried updating the DOCA SDK to 3.3 and installing the matching BFB image as well. After the BFB install we rebooted the host and could then see the DOCA devices on the host. However, the moment we try to start the DOCA communication channel over PCIe, it fails. We observed host-side DevX object creation failures and connection-aborted errors while trying to run the DOCA secure channel client. We tried every possible PCIe address combination, so a wrong PCIe address is not the reason either.

      
      ubuntu@node2:/tmp/build/secure_channel$ sudo ./doca_secure_channel -s 256 -n 10 -p 0000:0a:00.0
      [2026-03-28 23:58:02:734859][3853104960][DOCA][INF][CORE][doca_log.cpp:900] DOCA version 3.3.0109
      [2026-03-28 23:58:03:068693][3853104960][DOCA][ERR][CORE][linux_devx_obj.cpp:115] Failed to create devx object with syndrome=0xe5300
      [2026-03-28 23:58:03:069245][3853104960][DOCA][ERR][CORE][doca_dev.cpp:2699] Failed to create devx object: failed to allocate devx object wrapper with exception:
      [2026-03-28 23:58:03:069322][3853104960][DOCA][ERR][CORE][doca_dev.cpp:2699] DOCA exception [DOCA_ERROR_DRIVER] with message Failed to create devx object
      [2026-03-28 23:58:03:069349][3853104960][DOCA][ERR][COMCH][cc_devx_2.cpp:265] Failed to create channel connection object with error DOCA_ERROR_DRIVER
      [2026-03-28 23:58:03:069368][3853104960][DOCA][ERR][COMCH][qp_channel_2.cpp:996] client registration failed for send side
      [2026-03-28 23:58:03:069392][3853104960][DOCA][ERR][COMCH][doca_comm_channel_2.cpp:853] client registration failed for doca_comm_channel_2_ep_client_connect()
      [2026-03-28 23:58:03:069410][3853104960][DOCA][ERR][COMCH][doca_comch_pe.cpp:413] failed to connect on client with error = DOCA_ERROR_CONNECTION_ABORTED
      [2026-03-28 23:58:03:074705][3853104960][DOCA][ERR][CORE][doca_pe.cpp:1119] Progress engine 0x60bd204c1380: Failed to start context=0x60bd204c4bc0. err=DOCA_ERROR_CONNECTION_ABORTED
      [2026-03-28 23:58:03:074732][3853104960][DOCA][ERR][COMCH_UTILS][comch_utils.c:535][comch_utils_fast_path_init] Failed to start comch client context: Connection aborted
      
      

      Because the expected host-side DOCA-visible interface/device path is missing or nonfunctional, we suspect there may be a low-level firmware, driver binding, or host-exposure issue in the current FABRIC setup. At this point, it seems possible that recovery may require a hard reboot of the physical node, but we only have access to the VM and not to the underlying physical machine.

      Could you please let us know:

      • Whether there is any way for us to request access to a physical bare-metal node for BlueField/DPU testing.
      • Whether the current FABRIC VM setup fully supports host-side BlueField DOCA communication-channel use cases.

      Thanks in advance. Any guidance on the way forward would be greatly appreciated; we have not been able to resolve this issue so far.

      #9616
      Mert Cevik
      Moderator

        Hello Plabon,

        Whether there is any way for us to request access to a physical bare-metal node for BlueField/DPU testing.

        FABRIC Testbed has only VM resources and does not provide physical bare-metal nodes for BlueField/DPU testing.

        Whether the current FABRIC VM setup fully supports host-side BlueField DOCA communication-channel use cases.

        We don’t have a specific statement that FABRIC Testbed fully supports host-side BlueField DOCA communication-channel use cases. However, as you already know, on FABRIC Testbed, you can create VMs and the network cards are attached via PCI passthrough.

        For the potential “low-level firmware, driver binding, or host-exposure issue in the current FABRIC setup”, we can work with you to identify the problem. We need some information from your “healthy setup”: the DOCA versions (host and DPU) and the firmware version from the DPU (specifically the output of flint -d <MST_DEVICE> q). It will also be helpful if you share the outputs for the following items:

        • lspci (both host and DPU)
        • doca_caps --list-devs (both host and DPU)
        • doca_caps --list-rep-devs (from the DPU)
        • mlxconfig -d <MST_DEVICE> q INTERNAL_CPU_OFFLOAD_ENGINE (both host and DPU)
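
        If it helps, the items above can be collected in one pass along these lines (a sketch; the MST device path below is an example and should be replaced with the one reported by mst status):

        ```shell
        # Start the Mellanox Software Tools service and list MST devices
        sudo mst start
        sudo mst status

        # Example MST device path; substitute the one from `mst status`
        DEV=/dev/mst/mt41692_pciconf0

        # Firmware version and PSID
        sudo flint -d "$DEV" q

        # DPU offload mode
        sudo mlxconfig -d "$DEV" q INTERNAL_CPU_OFFLOAD_ENGINE

        # DOCA device enumeration (run on both host and DPU;
        # add --list-rep-devs on the DPU side)
        sudo /opt/mellanox/doca/tools/doca_caps --list-devs
        ```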

         

        #9621
        Mert Cevik
        Moderator

          Hello Plabon,

          If you’re considering sharing the output from your system with us, can you also include the OS version?

          On my test setup on the FABRIC Testbed, Ubuntu 24 consistently gave me an error loading the mlx5_ib module; however, on an Ubuntu 22 “host” setup it worked well and I can see the DOCA devices. It would be helpful for us to get some information from a reference system, if you can share the details from your side.

          #9623
          Plabon Dutta
          Participant

            Hi Mert,

            Sorry for the delayed response.

            So, we currently don’t have access to another BF3. FABRIC is therefore our only option, and I have tried many different combinations; every time it’s one thing or another.

            For the current experiment, I used the dpu_ubuntu_24 image on the host. The DOCA SDK version was 2.9. As for the DPU, I ran .configure() on the NIC from the notebook to bring it up with the default offered version. I have shared logs from that setup. The problem remains the same: “No DOCA device was found” on the host.

            Now, we have tried updating the SDK to 3.3 and pushing the matching BFB image to the BF3. Everything went well, but again, not consistently. I ran several other commands, like:


            # Install the extra kernel modules for the running kernel (provides macsec, etc.)
            sudo apt-get install -y linux-modules-extra-$(uname -r)
            sudo modprobe macsec
            # Restart the OFED driver stack and start Mellanox Software Tools
            sudo /etc/init.d/openibd restart
            sudo mst start

            After that I rebooted the host, and then I could see the DOCA devices on the host, like below:

            ubuntu@node2:~$ sudo /opt/mellanox/doca/tools/doca_caps --list-devs
            PCI: 0000:07:00.0
            ibdev_name mlx5_0
            iface_name enp7s0
            iface_index 3
            pci_func_type PF
            uplink_ib_port 1
            mac_addr 02:84:f9:28:9a:89
            ipv4_addr 0.0.0.0
            ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
            gid_table_size 255
            GID[0] fe80:0000:0000:0000:0084:f9ff:fe28:9a89
            GID[1] fe80:0000:0000:0000:0084:f9ff:fe28:9a89
            PCI: 0000:08:00.0
            ibdev_name mlx5_1
            iface_name enp8s0
            iface_index 4
            pci_func_type PF
            uplink_ib_port 1
            mac_addr 02:e4:3e:36:11:fd
            ipv4_addr 0.0.0.0
            ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
            gid_table_size 255
            GID[0] fe80:0000:0000:0000:00e4:3eff:fe36:11fd
            GID[1] fe80:0000:0000:0000:00e4:3eff:fe36:11fd
            PCI: 0000:09:00.0
            ibdev_name mlx5_2
            iface_name enp9s0
            iface_index 5
            pci_func_type PF
            uplink_ib_port 1
            mac_addr 0a:1b:3c:24:93:83
            ipv4_addr 0.0.0.0
            ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
            gid_table_size 255
            GID[0] fe80:0000:0000:0000:081b:3cff:fe24:9383
            GID[1] fe80:0000:0000:0000:081b:3cff:fe24:9383
            PCI: 0000:0a:00.0
            ibdev_name mlx5_3
            iface_name enp10s0np0
            iface_index 6
            pci_func_type PF
            uplink_ib_port 1
            mac_addr cc:40:f3:80:01:fc
            ipv4_addr 0.0.0.0
            ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
            gid_table_size 255
            GID[0] fe80:0000:0000:0000:ce40:f3ff:fe80:01fc
            GID[1] fe80:0000:0000:0000:ce40:f3ff:fe80:01fc
            PCI: 0000:0b:00.0
            ibdev_name mlx5_4
            iface_name enp11s0np1
            iface_index 7
            pci_func_type PF
            uplink_ib_port 1
            mac_addr cc:40:f3:80:01:fd
            ipv4_addr 0.0.0.0
            ipv6_addr 0000:0000:0000:0000:0000:0000:0000:0000
            gid_table_size 255
            GID[0] fe80:0000:0000:0000:ce40:f3ff:fe80:01fd
            GID[1] fe80:0000:0000:0000:ce40:f3ff:fe80:01fd

            I thought that was it, and that I would now be able to run the DOCA communication channel sample application. But again, I got errors like:

            [2026-03-28 23:58:02:734859][3853104960][DOCA][INF][CORE][doca_log.cpp:900] DOCA version 3.3.0109
            [2026-03-28 23:58:03:068693][3853104960][DOCA][ERR][CORE][linux_devx_obj.cpp:115] Failed to create devx object with syndrome=0xe5300
            [2026-03-28 23:58:03:069245][3853104960][DOCA][ERR][CORE][doca_dev.cpp:2699] Failed to create devx object: failed to allocate devx object wrapper with exception:
            [2026-03-28 23:58:03:069322][3853104960][DOCA][ERR][CORE][doca_dev.cpp:2699] DOCA exception [DOCA_ERROR_DRIVER] with message Failed to create devx object
            [2026-03-28 23:58:03:069349][3853104960][DOCA][ERR][COMCH][cc_devx_2.cpp:265] Failed to create channel connection object with error DOCA_ERROR_DRIVER
            [2026-03-28 23:58:03:069368][3853104960][DOCA][ERR][COMCH][qp_channel_2.cpp:996] client registration failed for send side
            [2026-03-28 23:58:03:069392][3853104960][DOCA][ERR][COMCH][doca_comm_channel_2.cpp:853] client registration failed for doca_comm_channel_2_ep_client_connect()
            [2026-03-28 23:58:03:069410][3853104960][DOCA][ERR][COMCH][doca_comch_pe.cpp:413] failed to connect on client with error = DOCA_ERROR_CONNECTION_ABORTED
            [2026-03-28 23:58:03:074705][3853104960][DOCA][ERR][CORE][doca_pe.cpp:1119] Progress engine 0x60bd204c1380: Failed to start context=0x60bd204c4bc0. err=DOCA_ERROR_CONNECTION_ABORTED
            [2026-03-28 23:58:03:074732][3853104960][DOCA][ERR][COMCH_UTILS][comch_utils.c:535][comch_utils_fast_path_init] Failed to start comch client context: Connection aborted

            About the OS, I tried dpu_ubuntu_24 and default_ubuntu_24. For the default OS images, I thought there might be an OFED issue or something similar, which is why they might not work; that is why I later tried the DPU images.

            If your setup is working with Ubuntu 22, can you please try running the DOCA Comm Channel sample application?

             

            Thank you,

            Plabon

             

             

            #9626
            Mert Cevik
            Moderator

              Thank you, I can see the files that you uploaded. I will check them out.

              I’m posting the outputs from my slice below:

              1. Start the server process on the DPU
              (the server waits for a client; the connection is established once the client process is started on the host)

              ubuntu@localhost:~$ cd /tmp/build/secure_channel/
              ubuntu@localhost:/tmp/build/secure_channel$ sudo ./doca_secure_channel -s 256 -n 10 -p 03:00.0 -r 81:00.0
              [00:33:01:945954][509848][DOCA][INF][comch_utils.c:464][comch_utils_fast_path_init] Server waiting on a client to connect

               

              [00:33:26:620828][509848][DOCA][INF][comch_utils.c:472][comch_utils_fast_path_init] Server connection established
              [00:33:26:700884][509848][DOCA][INF][secure_channel_core.c:1012][sc_start] Producer sent 10 messages in approximately 0.0865 milliseconds
              [00:33:26:700914][509848][DOCA][INF][secure_channel_core.c:1015][sc_start] Consumer received 10 messages in approximately 0.0019 milliseconds
              ubuntu@localhost:/tmp/build/secure_channel$

              2. Start the client process on the Host

              ubuntu@Node1:/tmp/build/secure_channel$ sudo ./doca_secure_channel -s 256 -n 10 -p 07:00.0
              [00:33:55:727260][4244][DOCA][INF][secure_channel_core.c:1012][sc_start] Producer sent 10 messages in approximately 0.0094 milliseconds
              [00:33:55:727284][4244][DOCA][INF][secure_channel_core.c:1015][sc_start] Consumer received 10 messages in approximately 0.0038 milliseconds

              I’m also attaching two txt files that show the output (versions, devices, etc.) from the DPU and the host. (I realized that the attachments are only visible when you’re logged into the forum; without login, the page does not indicate any attachments.)

               

              #9629
              Plabon Dutta
              Participant

                That’s great, Mert. This means it can actually work. I checked your attached files as well; everything looks good. I have some questions though:

                1. What’s the image you are using, “default_ubuntu_22” or “dpu_ubuntu_22”?
                2. Did you do anything special to flash the BF3, or just the .configure()?
                3. I’m assuming you were using the default DOCA version offered on FABRIC, which is 2.9, right? Would you mind trying with the updated 3.3 SDK, please?
                4. Any other steps you took to make --list-devs work on the host, like a reboot?
                5. What’s the site you are using?

                 

                 

                #9630
                Plabon Dutta
                Participant

                  I just checked the docs and saw you are using DOCA version 3.0.0058 on both the host and the BF3. That’s great, because it means it works with an updated DOCA version as well. So, I assume you did a custom BFB install.

                  Anyway, it would be great if you share what exactly you did throughout.

                   

                  Thanks in advance, Mert.

                   

                   

                  #9631
                  Plabon Dutta
                  Participant

                     

                    #9633
                    Plabon Dutta
                    Participant

                      Off-topic: how are you enabling internet connectivity on the BF3, Mert? Previously I was doing something like the following from the notebook:

                      stdout, stderr = node1.execute(f'sudo iptables -t nat -A POSTROUTING -o enp3s0 -j MASQUERADE', quiet=True)
                      stdout, stderr = node1.execute(f'sudo iptables -A FORWARD -i enp3s0 -o tmfifo_net0 -m state --state RELATED,ESTABLISHED -j ACCEPT', quiet=True)
                      stdout, stderr = node1.execute(f'sudo iptables -A FORWARD -i tmfifo_net0 -o enp3s0 -j ACCEPT', quiet=True)
                      stdout, stderr = node1.execute(f'sudo sysctl -w net.ipv4.ip_forward=1', quiet=True)
                      stdout, stderr = node1.execute(
                          "ssh ubuntu@192.168.100.2 \"echo -e 'nameserver 8.8.8.8\nnameserver 192.168.100.1' | sudo tee /etc/resolv.conf > /dev/null\""
                      )

                      But that doesn’t seem to work anymore, probably because of IPv6 routing.

                      Also, the Mellanox repo is notoriously slow on the host anyway.

                      #9634
                      Plabon Dutta
                      Participant

                        Hi Mert,

                        Great news: I was able to get the DOCA communication channel running on SDK version 3.3.
                        Image: dpu_ubuntu_24
                        Site: MICH

                        The only remaining issue for now is internet connectivity on the BF3, as NAT doesn’t work. I used tinyproxy for apt installs. I will prepare a runbook with the step-by-step process.
                        However, I would really appreciate it if you could clarify how you have been getting internet access on the BF3, and whether you do anything about the slow Mellanox repo.
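
                        For anyone hitting the same NAT problem, the tinyproxy workaround I used looks roughly like this (a sketch; the 192.168.100.x addresses are the default tmfifo_net0 addressing, and the config path and port 8888 are tinyproxy's Ubuntu defaults):

                        ```shell
                        # On the host: install tinyproxy and allow the DPU's tmfifo subnet
                        sudo apt-get install -y tinyproxy
                        echo "Allow 192.168.100.0/24" | sudo tee -a /etc/tinyproxy/tinyproxy.conf
                        sudo systemctl restart tinyproxy   # listens on port 8888 by default

                        # On the DPU (reachable at 192.168.100.2): point apt at the host's proxy
                        echo 'Acquire::http::Proxy "http://192.168.100.1:8888/";' | \
                            sudo tee /etc/apt/apt.conf.d/95proxy
                        sudo apt-get update
                        ```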

                         

                        Thanks,
                        Plabon

                         

                         

                        #9636
                        Mert Cevik
                        Moderator

                          Hi Plabon,

                          I compiled a document from my notes and am attaching it as a PDF. It includes all the steps and versions that I used; you should be able to execute the commands as they are and reproduce the same setup.

                          I tried to verify the document, but due to time constraints my responses may be delayed, so I am sharing it right away. I will be following the thread for updates/errors.

                           

                          #9648
                          Plabon Dutta
                          Participant

                            Hi Mert,

                             I have been testing BF3s on FABRIC for the past few days. As I said earlier, our intention is to offload match/action processing to hardware using the DOCA Flow API. With that in mind, let me share some of our observations:

                             1. When we do the bfb-install from the host, we get the following log after a successful installation:
                              
                              Checking if local host has root access...
                              Checking if rshim driver is running locally...
                              Pushing bfb + cfg
                              Collecting BlueField booting status. Press Ctrl+C to stop…
                              INFO[PSC]: PSC BL1 START
                              INFO[BL2]: start
                              INFO[BL2]: boot mode (rshim)
                              INFO[BL2]: VDD_CPU: 783 mV
                              INFO[BL2]: VDDQ: 1118 mV
                              INFO[BL2]: DDR POST passed
                              INFO[BL2]: UEFI loaded
                              INFO[BL31]: start
                              INFO[BL31]: lifecycle GA Secured
                              INFO[BL31]: runtime
                              INFO[BL31]: MB ping success
                              INFO[UEFI]: eMMC init
                              INFO[UEFI]: eMMC probed
                              INFO[UEFI]: UPVS valid
                              INFO[UEFI]: PMI: updates started
                              INFO[UEFI]: PMI: total updates: 1
                              INFO[UEFI]: PMI: updates completed, status 0
                              INFO[UEFI]: PCIe enum start
                              INFO[UEFI]: PCIe enum end
                              INFO[UEFI]: UEFI Secure Boot (enabled)
                              INFO[UEFI]: Redfish enabled
                              INFO[UEFI]: exit Boot Service
                              INFO[MISC]: Found bf.cfg
                              INFO[MISC]: Erasing eMMC drive: /dev/mmcblk0
                              INFO[MISC]: Erasing NVME drive: /dev/nvme0n1
                              INFO[MISC]: Ubuntu installation started
                              INFO[MISC]: Installing OS image
                              INFO[MISC]: Running bfb_modify_os from bf.cfg
                              INFO[MISC]: Ubuntu installation completed
                              WARN[MISC]: Skipping BMC components upgrade.
                              INFO[MISC]: Updating NIC firmware...
                              INFO[MISC]: NIC firmware update done: 32.48.1000
                              INFO[MISC]: Installation finished
                              

                               However, once the BF3 is back up, if we run mlxfwmanager --query from inside it, we see:

                              
                              Querying Mellanox devices firmware ...Device #1:
                              ----------
                              
                              Device Type: BlueField3
                              Part Number: 900-9D3B6-00CC-EA_Ax
                              Description: NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
                              PSID: MT_0000001115
                              PCI Device Name: /dev/mst/mt41692_pciconf0
                              Base MAC: e89e494efd50
                              Versions: Current Available
                              FW 32.43.2402 N/A
                              PXE 3.7.0500 N/A
                              UEFI 14.36.0021 N/A
                              UEFI Virtio blk 22.4.0014 N/A
                              UEFI Virtio net 21.4.0013 N/A
                              
                              Status: No matching image found
                              
                              

                               You can see that the NIC firmware has not actually updated to the latest version. 32.43 is considerably older and does not support a few features. (more on this later)

                             2. I compiled and ran our application, but it kept throwing errors like cannot get resource(ARGUMENT_64B), cannot create mlx5dv hws action for type, and failed to create matcher, err -95.
                             3. Then I tried running the sample UPF Accelerator application from NVIDIA (https://docs.nvidia.com/doca/sdk/doca-accelerated-upf-reference-application-guide/index.html). I kept getting the same “failed to create matcher, err -95” errors no matter what:
                              
                              ubuntu@localhost:/tmp/build/upf_accel$ sudo /tmp/build/upf_accel/doca_upf_accel -l 0-3 -- -a pci/03:00.0,dv_flow_en=2 -a pci/03:00.1,dv_flow_en=2 -f smf_policy.json
                              [2026-04-04 02:24:47:350196][1433225472][DOCA][INF][CORE][doca_log.cpp:900] DOCA version 3.3.0109
                              [2026-04-04 02:24:47:350404][1433225472][DOCA][INF][UPF_ACCEL][upf_accel.c:1800][main] Starting UPF Acceleration app pid 26302
                              EAL: Detected CPU lcores: 16
                              EAL: Detected NUMA nodes: 1
                              EAL: Detected shared linkage of DPDK
                              EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
                              EAL: Selected IOVA mode 'VA'
                              [2026-04-04 02:24:47:499738][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:797][upf_accel_pdr_parse] Parsed PDR id=1
                              farId=1 first_urrid=1 first_qerid=1
                              PDI SI=0 QFI=0 teid_start=1073741824 teid_end=1073807359 IP=version 4, 101a8c0/255 UEIP=version 4, ac/0
                              SDF proto=0 from=IP version 4, 0/0:0-65535 to= IP version 4, 0/0:0-65535
                              [2026-04-04 02:24:47:499782][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:797][upf_accel_pdr_parse] Parsed PDR id=2
                              farId=2 first_urrid=2 first_qerid=2
                              PDI SI=2 QFI=0 teid_start=0 teid_end=0 IP=version 4, 0/0 UEIP=version 4, ac/0
                              SDF proto=0 from=IP version 4, 0/0:0-65535 to= IP version 4, 0/0:0-65535
                              [2026-04-04 02:24:47:499800][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:797][upf_accel_pdr_parse] Parsed PDR id=3
                              farId=1 first_urrid=1 first_qerid=1
                              PDI SI=0 QFI=1 teid_start=1073741824 teid_end=1073807359 IP=version 4, 101a8c0/255 UEIP=version 4, ac/0
                              SDF proto=0 from=IP version 4, 0/0:0-65535 to= IP version 4, 0/0:0-65535
                              [2026-04-04 02:24:47:499810][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:920][upf_accel_far_parse] Parsed FAR id=1:
                              Outer Header ip=0/0
                              [2026-04-04 02:24:47:499817][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:920][upf_accel_far_parse] Parsed FAR id=2:
                              Outer Header ip=101a8c0/255
                              [2026-04-04 02:24:47:499825][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:1000][upf_accel_urr_parse] Parsed URR id=1 volume_quota_total_volume=4000000000[2026-04-04 02:24:47:499831][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:1000][upf_accel_urr_parse] Parsed URR id=2 volume_quota_total_volume=200000
                              
                              [2026-04-04 02:24:47:499841][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:1084][upf_accel_qer_parse] Parsed QER id=1
                              qfi=20
                              MBR dl=2000000000 ul=2000000000
                              
                              [2026-04-04 02:24:47:499850][1433225472][DOCA][INF][UPF_ACCEL][upf_accel_json_parser.c:1084][upf_accel_qer_parse] Parsed QER id=2
                              qfi=20
                              MBR dl=2000000000 ul=2000000000
                              
                              [2026-04-04 02:24:47:500926][1433225472][DOCA][WRN][FLOW][engine_model.c:88] adapting queue depth to 128.
                              [2026-04-04 02:24:49:303964][1433225472][DOCA][ERR][FLOW::DRIVER::RUNTIME][nv_hws_wrappers.c:162] failed to create matcher, err -95
                              [2026-04-04 02:24:49:304007][1433225472][DOCA][ERR][FLOW][hws_matcher.c:1129] failed to create matcher reference for port 0
                              [2026-04-04 02:24:49:304026][1433225472][DOCA][ERR][FLOW][hws_pipe_core.c:226] failed creating matcher for pipe core - rc=-95
                              [2026-04-04 02:24:49:304031][1433225472][DOCA][ERR][FLOW][hws_pipe_core.c:282] failed pushing pipe core - matcher creation failed rc=-95
                              [2026-04-04 02:24:49:304034][1433225472][DOCA][ERR][FLOW][hws_pipe_core.c:597] failed building pipe core - matcher alloc rc=-95
                              [2026-04-04 02:24:49:304049][1433225472][DOCA][ERR][FLOW][engine_pipe_basic.c:292] Failed to create basic pipe core, rc = -95
                              [2026-04-04 02:24:49:304054][1433225472][DOCA][ERR][FLOW][engine_pipe.c:818] failed creating pipe - submit failed rc=(-95)
                              [2026-04-04 02:24:49:305747][1433225472][DOCA][ERR][FLOW][doca_flow.c:1862] engine pipe creation failed, rc = -95
                              [2026-04-04 02:24:49:305774][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel_pipeline.c:141][upf_accel_pipe_create] Failed to create UPF accel pipe: Operation not supported
                              [2026-04-04 02:24:49:305785][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel_pipeline.c:1260][upf_accel_pipe_7t_create] Failed to create 7t pipe: Operation not supported
                              [2026-04-04 02:24:49:305790][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel_pipeline.c:1865][upf_accel_pipeline_rx_create] Failed to create 7t inner IPv4 pipe: Operation not supported
                              [2026-04-04 02:24:49:305795][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel_pipeline.c:2138][upf_accel_pipeline_create] Failed to create rx pipeline in port 0: Operation not supported
                              [2026-04-04 02:24:49:305800][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel.c:1375][init_upf_accel] Failed to create pipeline: Operation not supported
                              [2026-04-04 02:24:51:343602][1433225472][DOCA][ERR][UPF_ACCEL][upf_accel.c:1881][main] init_upf_accel() encountered an error: Operation not supported
                              [2026-04-04 02:24:51:383440][1433225472][DOCA][WRN][DPDK_BRIDGE][doca_dpdk.cpp:562] DPDK dev already detached: 0
                              [2026-04-04 02:24:51:384573][1433225472][DOCA][WRN][DPDK_BRIDGE][doca_dpdk.cpp:562] DPDK dev already detached: 1
                              [2026-04-04 02:24:51:385668][1433225472][DOCA][INF][UPF_ACCEL][upf_accel.c:1912][main] UPF Acceleration app finished with errors
                              
                              

                             

                            Now, the above link explicitly says:

                            The Flex Parser Profile is a setting that enables flexible protocol parsing on NVIDIA NICs/DPUs. To enable GTP protocol support, set the Flex Parser Profile to 3 using mlxconfig. This configuration is mandatory and must be done manually in the system.

                            …. Changing this configuration using mlxconfig requires a system (cold) reboot for the changes to take effect.

                             I tried running the command, rebooted the host, rebooted the DPU; nothing worked. Here is what is happening:

                             When the code attempts to insert a rule matching a GTP tunnel ID (tun.gtp_teid), the DOCA driver fails with err -95 (Operation not supported) and cannot create the matcher. That is because the hardware parser, the physical silicon block, does not understand the GTP TEID field by default. Because the hardware cannot parse the field, the Hardware Steering (HWS) engine rejects the matchers.
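
                             For reference, the configuration the guide asks for is set with mlxconfig, roughly like this (a sketch; the MST device path is an example from our system, and per NVIDIA's documentation the change only takes effect after a cold reboot of the physical server, which is exactly the crux of the problem here):

                             ```shell
                             DEV=/dev/mst/mt41692_pciconf0

                             # Check the current profile (3 enables GTP parsing per the UPF guide)
                             sudo mlxconfig -d "$DEV" q FLEX_PARSER_PROFILE_ENABLE

                             # Set profile 3; mlxconfig stages the change in NV config,
                             # to be applied on the next cold (power-cycle) reboot
                             sudo mlxconfig -d "$DEV" s FLEX_PARSER_PROFILE_ENABLE=3
                             ```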

                             

                            So, as per our hypothesis, the following things have now blocked our progress:

                             1. The firmware version is not upgrading (maybe because of the lack of a cold reboot).
                             2. We cannot enable Flex Parser Profile 3 without a cold reboot.
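
                             On item 1, one way to check whether the new firmware was staged but never activated, and whether it can be activated without a cold reboot, is the MFT reset tooling (a sketch; the MST device path is taken from the mlxfwmanager output above, and whether mlxfwreset offers a usable reset level in this VM setup is exactly what is unclear):

                             ```shell
                             DEV=/dev/mst/mt41692_pciconf0

                             # flint distinguishes the firmware image on flash from the
                             # firmware actually running, so a staged-but-inactive update
                             # should be visible here
                             sudo flint -d "$DEV" q full

                             # Query the reset levels/methods this device supports; if a
                             # live firmware reset is supported, it may activate staged
                             # firmware without a cold reboot of the physical server
                             sudo mlxfwreset -d "$DEV" query
                             # sudo mlxfwreset -d "$DEV" reset   # only if query shows a supported level
                             ```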

                             It would be great if we could work together on this. Our development, testing, and evaluation of the system is now blocked.

                             

                            Thanks,
Plabon Dutta

                            #9649
                            Plabon Dutta
                            Participant

                               

To add to my previous comment: my request would be that you try running the DOCA Accelerated UPF application on FABRIC yourselves. That would make the platform’s capabilities and the issues clear.

                              Link: https://docs.nvidia.com/doca/sdk/doca-accelerated-upf-reference-application-guide/index.html

                              Thank you!

Plabon

                              #9656
                              Mert Cevik
                              Moderator

                                Hello Plabon,

Can you please let me know whether you were able to use the steps I shared on April 1st (the attached PDF file), and whether the issues you indicated last week are now resolved?

For the firmware update procedure, there seems to be some discrepancy between the documentation and the actual outcome from the DOCA framework: without a cold reboot of the server, new firmware cannot be activated. I will need to clarify this with NVIDIA.
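One avenue that may be worth checking (I have not verified this on BlueField-3 in FABRIC) is whether the device reports a supported firmware live-reset level via `mlxfwreset`, which on some Mellanox/NVIDIA NICs can activate pending firmware without a full host power cycle:

```shell
# Query which reset levels the device supports for firmware activation.
# 0000:0a:00.0 is the BlueField-3 PF from the lspci output earlier in
# this thread; substitute your own PCI address.
sudo mlxfwreset -d 0000:0a:00.0 query

# If a suitable reset level is reported as supported, this activates the
# pending firmware without a cold reboot (disruptive to traffic on the NIC):
# sudo mlxfwreset -d 0000:0a:00.0 reset
```

Whether this applies to the DPU mode of operation would still need confirmation from NVIDIA.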

I haven’t read the UPF Reference Application Guide yet, but from your descriptions I understand that you need newer firmware for some additional features (although under Test Environment and Setup the firmware is listed as 32.43.1014), and that you need to enable the Flex Parser Profile. Both items require a cold reboot of the server.

Rebooting servers on the FABRIC Testbed is not a straightforward task, as all resources are shared among users. I will see what I can do and let you know.

                                Best regards,
                                Mert

                                 

                                #9657
                                Plabon Dutta
                                Participant

                                  Hi Mert,

                                  Thanks a lot for your response.

About the steps from the PDF: I was doing more or less the same. There were two things missing in my case; once I changed those, it started working:

• I wasn’t running node.config() from the notebook. Now, once the DPU comes up after bfb-install, I run node.config() and that brings all the interfaces back up properly.
• After the reboot, my host wasn’t coming back up, probably because of the reboot sequence or something else.

Because of these, I couldn’t see the actual representor devices on the DPU side and couldn’t start the comm channel properly.
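The notebook side of that recovery sequence can be sketched with FABlib roughly as follows (the slice and node names are illustrative, not from this thread):

```python
# Sketch of the post-bfb-install recovery step from a FABRIC Jupyter
# notebook, using the FABlib API. Requires valid FABRIC credentials.
from fabrictestbed_extensions.fablib.fablib import FablibManager

fablib = FablibManager()

# Hypothetical slice/node names; adapt to your own slice.
slice = fablib.get_slice(name="bf3-upf-slice")
node = slice.get_node(name="node2")

# After the DPU finishes bfb-install and comes back up, re-apply the
# node's network configuration so the interfaces return in a proper state.
node.config()
```

This only re-applies the interface configuration on the host; the DPU-side representors then appear once the host and DPU are back in sync.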

Now, about the current issue: the DOCA documentation for the accelerated UPF does say it was tested on FW version 32.43. However, the Flex Parser Profile issue remains, as we cannot do a cold reboot.

The FW update issue is there as well. It probably should not affect this example, but when we do the FW update via bfb-install, the firmware still reports 32.43, which is a concern in itself.

                                   

                                  Best regards,

                                  Plabon

                                   

                                   

                                   
