Forum Replies Created
Hi Plabon,
I compiled a document from my notes and I’m attaching it as a PDF file. It includes all steps and versions that I used. You should be able to execute all commands as they are and be able to repeat the same setup.
I tried to verify the document, but because of time constraints I may have to switch back to other work, and my responses may be delayed, so I’m sharing it right away. I will follow the thread for updates and errors.
Thank you, I can see the files that you uploaded. I will check them out.
I’m posting the outputs from my slice below:
1. Start the server process on the DPU
(The server waits for the client; the connection is established after the client process is started on the host.)
ubuntu@localhost:~$ cd /tmp/build/secure_channel/
ubuntu@localhost:/tmp/build/secure_channel$ sudo ./doca_secure_channel -s 256 -n 10 -p 03:00.0 -r 81:00.0
[00:33:01:945954][509848][DOCA][INF][comch_utils.c:464][comch_utils_fast_path_init] Server waiting on a client to connect
[00:33:26:620828][509848][DOCA][INF][comch_utils.c:472][comch_utils_fast_path_init] Server connection established
[00:33:26:700884][509848][DOCA][INF][secure_channel_core.c:1012][sc_start] Producer sent 10 messages in approximately 0.0865 milliseconds
[00:33:26:700914][509848][DOCA][INF][secure_channel_core.c:1015][sc_start] Consumer received 10 messages in approximately 0.0019 milliseconds
ubuntu@localhost:/tmp/build/secure_channel$
2. Start the client process on the Host
ubuntu@Node1:/tmp/build/secure_channel$ sudo ./doca_secure_channel -s 256 -n 10 -p 07:00.0
[00:33:55:727260][4244][DOCA][INF][secure_channel_core.c:1012][sc_start] Producer sent 10 messages in approximately 0.0094 milliseconds
[00:33:55:727284][4244][DOCA][INF][secure_channel_core.c:1015][sc_start] Consumer received 10 messages in approximately 0.0038 milliseconds
I’m also attaching two txt files that show the output (versions, devices, etc.) from the DPU and Host. (I realized that attachments are visible only when you’re logged into the forum section; without a login, the page does not indicate any attachments.)
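To compare runs with different -n values, the timing summary lines in logs like the ones above can be reduced to a per-message figure. This is only a rough sketch: the function name per_message_us is made up for illustration, it assumes each log line is on its own line in the format shown above, and it simply divides the reported batch time by the message count.

```shell
# per_message_us: read DOCA secure channel log lines on stdin and print
# "<role> <count> <microseconds per message>" for each timing summary line.
# (Hypothetical helper; assumes the log format shown in this thread.)
per_message_us() {
    awk '/messages in approximately/ {
        # Fields counted from the end of the line:
        # <role> <verb> <count> messages in approximately <ms> milliseconds
        role = $(NF-7); count = $(NF-5); ms = $(NF-1)
        if (count > 0) printf "%s %d %.2f\n", role, count, ms * 1000 / count
    }'
}

# Example: per_message_us < server_output.txt
```

Lines that are not timing summaries (for example the "Server waiting" line) are ignored.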
Hello Plabon,
If you’re considering sharing the output from your system with us, can you also include the OS version?
On my test setup on the FABRIC Testbed, Ubuntu 24 consistently gave me an error when loading the mlx5_ib module; however, on an Ubuntu 22 “host” setup it worked well and I can see the DOCA devices. It will be helpful for us to get some information from a reference system, if you can share it from your side.
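A quick way to record whether mlx5_ib is actually loaded on a given system is to look it up in /proc/modules. This is a minimal sketch; the function name module_loaded is made up for illustration, and the optional second argument exists only so the check can be run against a saved copy of the module list.

```shell
# module_loaded NAME [MODULES_FILE]
# Succeeds if kernel module NAME is listed in MODULES_FILE
# (default: /proc/modules). Hypothetical helper for illustration.
module_loaded() {
    local name="$1" file="${2:-/proc/modules}"
    # The first field of each line in /proc/modules is the module name.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

# Example:
#   if module_loaded mlx5_ib; then echo "mlx5_ib: loaded"; else echo "mlx5_ib: not loaded"; fi
# If it is not loaded, the output of "sudo modprobe mlx5_ib" and
# "sudo dmesg | grep -i mlx5" is worth including along with the OS version.
```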
-
This reply was modified 2 months ago by
Mert Cevik.
Hello Plabon,
“Whether there is any way for us to request access to a physical bare-metal node for BlueField/DPU testing.”
FABRIC Testbed has only VM resources and does not provide physical bare-metal nodes for BlueField/DPU testing.
“Whether the current FABRIC VM setup fully supports host-side BlueField DOCA communication-channel use cases.”
We don’t have a specific statement that FABRIC Testbed fully supports host-side BlueField DOCA communication-channel use cases. However, as you already know, on FABRIC Testbed, you can create VMs and the network cards are attached via PCI passthrough.
For a potential “low-level firmware, driver binding, or host-exposure issue in the current FABRIC setup”, we can work with you to identify the problems. We need some information from your “healthy setup”: the DOCA versions (host and DPU) and the firmware version from the DPU (specifically the output from flint -d <MST_DEVICE> q). It will also be helpful if you share the outputs for the following items:
- lspci (both host and DPU)
- doca_caps --list-devs (both host and DPU)
- doca_caps --list-rep-devs (from the DPU)
- mlxconfig -d <MST_DEVICE> q INTERNAL_CPU_OFFLOAD_ENGINE (both host and DPU)
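The items above can be gathered in one pass on each side. The following is a minimal sketch, assuming the tools are on PATH and the script is run with sufficient privileges; the function names run_and_save and collect_diag and the output file names are made up for illustration, /etc/os-release is added only to capture the OS version, and the MST device is passed in by the caller because it differs per system.

```shell
# run_and_save OUTDIR NAME CMD [ARGS...]
# Runs CMD and writes its stdout and stderr to OUTDIR/NAME.txt.
# A missing tool is recorded in the file instead of aborting the run.
run_and_save() {
    local outdir="$1" name="$2"
    shift 2
    mkdir -p "$outdir"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$outdir/$name.txt" 2>&1 || echo "[exit status $?]" >>"$outdir/$name.txt"
    else
        echo "command not found: $1" >"$outdir/$name.txt"
    fi
}

# collect_diag OUTDIR MST_DEVICE
# Collects the outputs requested in this thread into OUTDIR.
collect_diag() {
    local outdir="$1" mst_dev="$2"
    run_and_save "$outdir" os_release    cat /etc/os-release
    run_and_save "$outdir" lspci         lspci
    run_and_save "$outdir" doca_devs     doca_caps --list-devs
    run_and_save "$outdir" doca_rep_devs doca_caps --list-rep-devs   # meaningful on the DPU only
    run_and_save "$outdir" flint         flint -d "$mst_dev" q
    run_and_save "$outdir" mlxconfig     mlxconfig -d "$mst_dev" q INTERNAL_CPU_OFFLOAD_ENGINE
}

# Example (replace <MST_DEVICE> with the device on your system):
#   collect_diag ./diag-host <MST_DEVICE>
```

Running it once on the host and once on the DPU, with different output directories, keeps the two sets of outputs separate.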
-
This reply was modified 2 months ago by
Mert Cevik.
March 26, 2026 at 11:31 am in reply to: L2Bridge not forwarding frames between NIC_ConnectX_6 ports #9610
Hello Mounika,
I tried to find which slice this is and I’m guessing it’s Slice ID: c2a39f8b-8278-4bbd-a251-2eb42b1c5d65
(If not, please indicate your slice ID.)
I want to point out a few items that can be useful.
First, the topology on the slice that I mentioned above (2x VMs running on the same host/worker brist-w2, each with a dedicated 100G CX6 card, connected over an L2Bridge) should work fine to pass traffic on the dataplane.
I tested a similar slice topology on the CLEM node and confirmed that traffic worked well, so there shouldn’t be a limitation when the VMs are placed on the same host. I deleted my test slice on CLEM to release the two dedicated 100G CX6 NICs; if you prefer, you can re-create your slice on CLEM and we can see how it works.
Alternatively, you can try Meshal’s suggestion and place the VMs on different hosts/workers. On the BRIST node specifically, this is possible if you choose NIC_ConnectX_6 for one VM and NIC_ConnectX_5 for the other VM.
I also want to point out this page, which includes information about the hardware configurations of the FABRIC sites/nodes: https://learn.fabric-testbed.net/knowledge-base/fabric-site-hardware-configurations/
FastNet and SlowNet worker elements have the dedicated NICs on them (note the CX6 and CX5 types). I also want to share that all sites/nodes (except CERN) have only one FastNet worker.
This maintenance is completed.
Hi Lorenzo,
Your VM crashed (out of memory). I rebooted it, and it should be reachable for you now. I’m attaching the console output as well, in case you find useful information in it. console-e3cfe65c-0a31-4d43-9ea4-526fa17ec7e6
Hello Tanay,
VM “node3-dpu” is showing a crash status. I’m attaching the section from the console – console-node3-dpu
March 10, 2026 at 1:02 pm in reply to: SSH error: channel 0: open failed: connect failed: No route to host #9571
Hi Meshal,
The slice you indicated has slivers on the SALT node, and we had a power outage there at 7am ET today. All slivers have now been recovered and are online. We have been having recurring power outages at the SALT node recently. Our options for that node are very limited, but we are actively looking for ways to remediate this. If the other connectivity problems that you mentioned involved slices on the SALT node, then it’s likely that the previous power outages were the cause.
Please let us know when you have such connectivity issues, and we will check and work with you promptly.
Hi Tanay,
As a next step, we can try cold-rebooting the server that is holding the DPU, however this is not possible when other users have VM slivers running on it. I need to make special arrangements for that.
On our Development environment, we have a BlueField-2 DPU and we can perform all kinds of trials on it. You pointed to the web page that describes the configuration steps, but it would be even better if you provided a complete list of commands for this configuration, so we can test it on the Development site. If there are any differences between BlueField-2 and BlueField-3, it would be good to indicate them as well. I’m also currently preparing for additional BlueField-3 integrations; BlueField-3 cards were just delivered, so I can use one of them to test on the Development site later.
And lastly, on the web page under the “Hot-Plug Firmware Configuration” section, there is a note: “Hotplug is not guaranteed to work on AMD machines.” Servers on the FABRIC Testbed infrastructure are all AMD-based Dell R7525 servers. I’m not sure whether this is relevant to our issue.
Best regards,
Mert
Hi Tanay,
I performed a power reset for the DPU. Can you please check if that worked well for the firmware configuration change?
ubuntu@localhost:~$ uname -a
Linux localhost.localdomain 5.15.0-1065-bluefield #67-Ubuntu SMP Tue Apr 22 11:10:15 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@localhost:~$ uptime
16:19:21 up 1 min, 1 user, load average: 6.83, 2.15, 0.75
I will describe the details of how I performed this later. Mainly, I had included the BMC bindings in the DPU integration and I used that path; however, I’m not very sure about the terminology or specifics, and I have taken only some intuitive actions so far. I’m also in touch with the FABRIC team about this item, so your input about the progress will be helpful for our further enhancements.
-
This reply was modified 3 months ago by
Mert Cevik.
DPU on the SEAT node is recovered and it can be used for experiments.
For the firmware configuration, I need to read the documentation. I have no prior experience with these cards.
Hello Tanay,
Can you share the state of your slice and slivers from your point of view? All slivers of the slice seem to be deleted.
Best regards,
Mert
Since you’re able to log in to this problematic VM from other sources, you can check and make sure the right SSH key is inside the VM. I just placed my SSH key in it, and I could log in properly. Please let us know the status after your SSH key check, and I will take a further look.
If I understand the problem from the description correctly (“manual connect”), you’re trying to connect to the VM(s) from a terminal on your computer/laptop and getting the error. If that’s the case, you need to set up your ssh client configuration file and ssh keys properly (in your computer/laptop) and connect. This page can be helpful -> https://learn.fabric-testbed.net/knowledge-base/logging-into-fabric-vms/
If my understanding is wrong and the problem is something different, please disregard the info above.