Mert Cevik

Forum Replies Created

Viewing 15 posts - 1 through 15 (of 213 total)
  • in reply to: FABRIC SEAT – Outage on seat-w1 #9690
    Mert Cevik
    Moderator

      The problem on the server (seat-w1) was caused by the Nvidia BlueField-3 DPU card. The server is currently back online (active VM slivers have been recovered); however, we removed the DPU card for investigation. All other resources on the SEAT node are available for experiments.

      in reply to: Slices stuck at configuring…….state #9687
      Mert Cevik
      Moderator

        Hi Ajay,

        The problem is caused by a hardware failure on the head node of the MAX site. Work is in progress to recover the server; however, it is very likely to require some extra time. I wanted to let you know in case these are the slices for your demo: you may need to re-create them on other FABRIC nodes/sites.

        I will notify you if we are able to resolve the problem on MAX and your current slices can be recovered.

         

        Best regards,

        Mert

        in reply to: BlueField-3 host-DPU communication issue on FABRIC #9664
        Mert Cevik
        Moderator

          Hi Plabon,

          The BlueField-3 DPU on the UCSD node is the one on which you can test your work.

          I’m attaching some outputs; I confirmed that there is an improvement in the Accelerated UPF Reference Application runtime. Please let us know about your status. (Also, due to the upcoming KNIT12, other experimenters may specifically request the UCSD DPU resource. Please watch its availability and try it out as soon as possible.)

          ubuntu@localhost:~$ sudo mlxfwmanager --query
          Querying Mellanox devices firmware ...

          Device #1:
          ----------

          Device Type: BlueField3
          Part Number: 900-9D3B6-00CC-EA_Ax
          Description: NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
          PSID: MT_0000001115
          PCI Device Name: /dev/mst/mt41692_pciconf0
          Base MAC: cc40f38f0356
          Versions:          Current      Available
            FW               32.48.1000   N/A
            PXE              3.9.0101     N/A
            UEFI             14.41.0014   N/A
            UEFI Virtio blk  22.4.0014    N/A
            UEFI Virtio net  21.4.0013    N/A

          Status: No matching image found

          ubuntu@localhost:~$ sudo mlxconfig -d 03:00.0 q
          Device #1:
          ----------

          Device type: BlueField3
          Name: 900-9D3B6-00CC-EA_Ax
          Description: NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
          Device: 03:00.0

          Configurations: Next Boot

          . . .

          FLEX_PARSER_PROFILE_ENABLE 3

          . . .
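          If the Flex Parser profile ever needs to be changed again, a minimal sketch (assuming the same PCI address as in the query above; this is not part of the original post) would be:

          ```shell
          # Sketch: set Flex Parser profile 3 on the DPU's PCI function.
          # The PCI address (03:00.0) is taken from the mlxconfig query above;
          # adjust it for your own system.
          sudo mlxconfig -d 03:00.0 set FLEX_PARSER_PROFILE_ENABLE=3
          # The new value appears under "Next Boot" and is only applied after a
          # firmware reset / cold reboot of the server, as discussed in this thread.
          ```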

          in reply to: BlueField-3 host-DPU communication issue on FABRIC #9663
          Mert Cevik
          Moderator

            Sorry for the trouble, and thank you. The Google Drive link worked well.

            in reply to: BlueField-3 host-DPU communication issue on FABRIC #9660
            Mert Cevik
            Moderator

              Hi Plabon,

              Thank you for sharing the updates about my inquiry.

              I will follow up on the firmware and settings later today or tomorrow morning.

              For the attachment failure of the Jupyter notebook, the FABRIC team suggests the following:
              – rename the extension to .txt to upload the notebook.

              in reply to: BlueField-3 host-DPU communication issue on FABRIC #9656
              Mert Cevik
              Moderator

                Hello Plabon,

                Can you please let me know whether you were able to use the steps I shared on April 1st (the attached PDF file), and whether the issues you indicated last week were resolved?

                For the firmware update procedure, there seems to be a discrepancy between the documentation and the actual outcome from the DOCA framework. New firmware cannot be activated without a cold reboot of the server. I will need to clarify this with Nvidia.

                I haven’t read the UPF Reference Application Guide yet, but from your descriptions I understand that you need newer firmware versions for some additional features (although under Test Environment and Setup, the firmware is listed as 32.43.1014). You also need to enable the Flex Parser profile. Both items require a cold reboot of the server.

                Rebooting servers on the FABRIC Testbed is not straightforward, as all resources are shared among users. I will see what I can do and let you know.

                Best regards,
                Mert

                 

                in reply to: BlueField-3 host-DPU communication issue on FABRIC #9636
                Mert Cevik
                Moderator

                  Hi Plabon,

                  I compiled a document from my notes and I’m attaching it as a PDF file. It includes all steps and versions that I used. You should be able to execute all commands as they are and be able to repeat the same setup.

                  I tried to verify the document, but due to time constraints I may need to switch to other work and my responses may be delayed, so I’m sharing this right away. I will follow the thread for updates/errors.

                   

                  in reply to: BlueField-3 host-DPU communication issue on FABRIC #9626
                  Mert Cevik
                  Moderator

                    Thank you, I can see the files that you uploaded. I will check them out.

                    I’m posting the outputs from my slice below:

                    1. Start the server process on the DPU
                    (the server waits for a client; the connection is established once the client process is started on the host)

                    ubuntu@localhost:~$ cd /tmp/build/secure_channel/
                    ubuntu@localhost:/tmp/build/secure_channel$ sudo ./doca_secure_channel -s 256 -n 10 -p 03:00.0 -r 81:00.0
                    [00:33:01:945954][509848][DOCA][INF][comch_utils.c:464][comch_utils_fast_path_init] Server waiting on a client to connect

                     

                    [00:33:26:620828][509848][DOCA][INF][comch_utils.c:472][comch_utils_fast_path_init] Server connection established
                    [00:33:26:700884][509848][DOCA][INF][secure_channel_core.c:1012][sc_start] Producer sent 10 messages in approximately 0.0865 milliseconds
                    [00:33:26:700914][509848][DOCA][INF][secure_channel_core.c:1015][sc_start] Consumer received 10 messages in approximately 0.0019 milliseconds
                    ubuntu@localhost:/tmp/build/secure_channel$

                    2. Start the client process on the Host

                    ubuntu@Node1:/tmp/build/secure_channel$ sudo ./doca_secure_channel -s 256 -n 10 -p 07:00.0
                    [00:33:55:727260][4244][DOCA][INF][secure_channel_core.c:1012][sc_start] Producer sent 10 messages in approximately 0.0094 milliseconds
                    [00:33:55:727284][4244][DOCA][INF][secure_channel_core.c:1015][sc_start] Consumer received 10 messages in approximately 0.0038 milliseconds

                    I’m also attaching two txt files that show the output (versions, devices, etc.) from the DPU and Host. (I realized that the attachments are visible only when you’re logged into the forum; without login, the page does not show any attachments.)

                     

                    in reply to: BlueField-3 host-DPU communication issue on FABRIC #9621
                    Mert Cevik
                    Moderator

                      Hello Plabon,

                      If you’re considering sharing the output from your system with us, can you also include the OS version?

                      On my test setup on the FABRIC Testbed, Ubuntu 24 consistently gave me an error loading the mlx5_ib module; however, on an Ubuntu 22 “host” setup it worked well and I can see the DOCA devices. It would be helpful for us to have information from a reference system, if you can share it from your side.

                      in reply to: BlueField-3 host-DPU communication issue on FABRIC #9616
                      Mert Cevik
                      Moderator

                        Hello Plabon,

                        “Whether there is any way for us to request access to a physical bare-metal node for BlueField/DPU testing.”

                        FABRIC Testbed has only VM resources and does not provide physical bare-metal nodes for BlueField/DPU testing.

                        “Whether the current FABRIC VM setup fully supports host-side BlueField DOCA communication-channel use cases.”

                        We don’t have a specific statement that FABRIC Testbed fully supports host-side BlueField DOCA communication-channel use cases. However, as you already know, on FABRIC Testbed, you can create VMs and the network cards are attached via PCI passthrough.

                        For a potential “low-level firmware, driver binding, or host-exposure issue in the current FABRIC setup”, we can work with you to identify the problems. We need some information from your “healthy setup” with respect to the DOCA versions (host and DPU) and the firmware version from the DPU (specifically the output from flint -d <MST_DEVICE> q). It will also be helpful if you share the outputs for the following items:

                        • lspci (both host and DPU)
                        • doca_caps --list-devs (both host and DPU)
                        • doca_caps --list-rep-devs (from the DPU)
                        • mlxconfig -d <MST_DEVICE> q INTERNAL_CPU_OFFLOAD_ENGINE (both host and DPU)
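                        The items above could be collected with a short script; this is a sketch (not from the original post), assuming an MST device path like the one shown for this card in a later reply (take the real path from `mst status`):

                        ```shell
                        #!/bin/sh
                        # Sketch: collect the diagnostic outputs requested above.
                        # MST_DEVICE is an assumption; substitute the path reported by `mst status`.
                        MST_DEVICE=/dev/mst/mt41692_pciconf0

                        lspci | grep -i mellanox            # PCI view of the DPU (run on both host and DPU)
                        doca_caps --list-devs               # DOCA devices visible on this side (both host and DPU)
                        doca_caps --list-rep-devs           # representor devices (DPU side only)
                        mlxconfig -d "$MST_DEVICE" q INTERNAL_CPU_OFFLOAD_ENGINE
                        flint -d "$MST_DEVICE" q            # firmware version
                        ```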

                         

                        in reply to: L2Bridge not forwarding frames between NIC_ConnectX_6 ports #9610
                        Mert Cevik
                        Moderator

                          Hello Mounika,

                          I tried to find which slice this is and I’m guessing it’s Slice ID: c2a39f8b-8278-4bbd-a251-2eb42b1c5d65
                          (If not, please indicate your slice ID)

                          I want to point out a few items that can be useful.

                          First, the topology on the slice that I mentioned above
                          – two VMs running on the same host/worker (brist-w2), each with a dedicated 100G CX6 card and connected over an L2Bridge –
                          should work fine to pass traffic on the dataplane.

                          I tested a similar slice topology on the CLEM node and confirmed that traffic worked well, so there shouldn’t be a limitation when the VMs are placed on the same host. I deleted my test slice on CLEM to release the two dedicated 100G CX6 NICs; if you prefer, you can re-create your slice on CLEM and we can see how it works.

                          Alternatively, you can try Meshal’s suggestion and place the VMs on different hosts/workers. Specifically for the BRIST node, this is possible if you choose NIC_ConnectX_6 for one VM and NIC_ConnectX_5 for the other.

                          I want to point out this page https://learn.fabric-testbed.net/knowledge-base/fabric-site-hardware-configurations/
                          that includes information about the hardware configurations of the FABRIC sites/nodes. FastNet and SlowNet worker elements have the dedicated NICs on them (note CX6 and CX5 types). I also want to share that all sites/nodes (except CERN) have only one FastNet worker.

                           

                           

                          in reply to: Maintenance on UCSD on March 18th, 2026 #9599
                          Mert Cevik
                          Moderator

                            This maintenance is completed.

                            in reply to: Slice Unreachable Indefinitely #9583
                            Mert Cevik
                            Moderator

                              Hi Lorenzo,

                              Your VM crashed (out of memory). I rebooted it, and it should now be reachable for you. I’m attaching the console output as well, in case you find useful information in it. console-e3cfe65c-0a31-4d43-9ea4-526fa17ec7e6

                              in reply to: Unable to SSH to one node in a 3-node slice #9574
                              Mert Cevik
                              Moderator

                                Hello Tanay,

                                VM “node3-dpu” is showing a crash status. I’m attaching the section from the console – console-node3-dpu

                                 

                                Mert Cevik
                                Moderator

                                  Hi Meshal,

                                  The slice you indicated has slivers on the SALT node, and we had a power outage at 7am ET today. Currently, all slivers are recovered and online. We have been having repeated power outages at the SALT node specifically; our options there are very limited, but we are actively searching for remediation. If the other connectivity problems you mentioned involved slices on the SALT node, the previous power outages were likely the cause.

                                  Please let us know when you have such connectivity issues, and we will check and work with you promptly.
