Forum Replies Created
Work is completed.
Same situation. Rebooted, devices attached.
I’m not sure what is causing this; the worker node is not heavily loaded, but inside the VMs there seem to be Mellanox driver issues. If you share some context about the actual experiment and the traffic generated/exchanged, we can try to understand it and find a way to make it run reliably. Otherwise, I don’t have any clues right now. You can reach out directly if you prefer.
Both VMs had crashed. I’m attaching the console outputs.
console.7b4c35dd-c7d1-4d29-9ca0-c71d21e6089e-r-2-1
console.c834417a-7393-4cae-bd62-722358b6451f-r-2-3

I restarted them; they are online. I also attached their PCI devices (IP addresses need to be re-assigned).
We are checking the status information for cern-w2 with respect to a potential mismatch caused by a reservation that is currently consuming the resource; the health of that reservation is not clear.
We will send updates.
An easy way that works for me is checking the portal for the specific worker node’s resources. At CERN, cern-w2 seems to match your needs. I will attach a screenshot from the portal, but I’m not sure how it will show up in this comment; you can go to portal.fabric-testbed.net, click the link that leads to the CERN page (either from the map or from the table), and see the available resources. (If these are already known to you, please disregard.)

To target a specific worker node that has the desired resources, there may be example functions in the example Jupyter notebooks that show how to filter the worker nodes and list their resources. The FabLib API documentation may also reveal some ways; I don’t know much about that part. I expect knowledgeable users from the community can share their methods.
For scheduling resources in advance, this resource may help -> https://artifacts.fabric-testbed.net/artifacts/32938b00-5036-4a1e-84b5-063283618669
There may be other ways to show resource availability, but I will leave that to more advanced users or the FABRIC team; they may have better pointers.
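As a rough illustration of the kind of filtering described above, here is a minimal, self-contained Python sketch. The host names and resource fields below are made up to imitate the per-worker information the portal shows; in practice the data would come from the portal or the FabLib API.

```python
# Hypothetical per-host resource data, imitating what the FABRIC portal
# lists for each worker node. Numbers here are invented for illustration.
hosts = [
    {"name": "cern-w1", "free_cores": 12, "connectx_6": 0, "gpu_a30": 0},
    {"name": "cern-w2", "free_cores": 60, "connectx_6": 2, "gpu_a30": 1},
    {"name": "cern-w3", "free_cores": 4,  "connectx_6": 1, "gpu_a30": 1},
]

def matching_hosts(hosts, min_cores=0, min_nics=0, min_gpus=0):
    """Return the names of hosts that satisfy all requested minimums."""
    return [h["name"] for h in hosts
            if h["free_cores"] >= min_cores
            and h["connectx_6"] >= min_nics
            and h["gpu_a30"] >= min_gpus]

# A node needing 32 cores, one ConnectX-6, and one A30 GPU:
print(matching_hosts(hosts, min_cores=32, min_nics=1, min_gpus=1))  # ['cern-w2']
```

Once a suitable worker is identified, FabLib lets you pin a node to a specific host when building the slice; see the FabLib documentation for the exact call, as I am not certain of the current API.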
You need to provide the slice IDs.
ConnectX-6 SmartNICs are located on the “FastNet Worker” servers.
GPUs are located on the “GPU Worker” and “SlowNet Worker” servers.

You can find this information on this page -> https://learn.fabric-testbed.net/knowledge-base/fabric-site-hardware-configurations/
So it will not be possible to have both a GPU and a ConnectX-6 on the same VM.
However, CERN is an exception: it has 3x “FastNet Worker” servers, each with 2x ConnectX-6 SmartNICs and 1x A30 GPU.

MAX is available for the experiments.
April 21, 2026 at 12:34 pm in reply to: Inquiry Regarding MAX Site Maintenance Completion Timeline #9697
Hi Ajay,
MAX is back online. The maintenance status has been released, but it may take some time for the portal to show the site as available. Regardless, it is available for experiments now.
Best regards,
Mert

The problem on the server (seat-w1) was caused by the NVIDIA BlueField-3 DPU card. The server is now back online (active VM slivers are recovered); however, we took out the DPU card for investigation. All other resources on the SEAT node are available for experiments.
Hi Ajay,
The problem is caused by a hardware failure on the head node of the MAX site. Work is in progress to recover the server, but it will very likely require some extra time. I wanted to let you know in case these are the slices for your demo; you may need to re-create them on other FABRIC nodes/sites.
I will let you know if we are able to resolve the problem on MAX so that your current slices can be recovered.
Best regards,
Mert
Hi Plabon,
The BlueField-3 DPU on the UCSD node is the one on which you can test your work.
I’m attaching some outputs, and I confirmed that there is an improvement in the Accelerated UPF Reference Application runtime. Please let us know about your status. (Also, due to the upcoming KNIT12, other experimenters may specifically request the UCSD DPU resource. Please keep an eye on its availability and try it out as soon as possible.)
ubuntu@localhost:~$ sudo mlxfwmanager --query
Querying Mellanox devices firmware ...

Device #1:
----------
Device Type: BlueField3
Part Number: 900-9D3B6-00CC-EA_Ax
Description: NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
PSID: MT_0000001115
PCI Device Name: /dev/mst/mt41692_pciconf0
Base MAC: cc40f38f0356
Versions: Current Available
FW 32.48.1000 N/A
PXE 3.9.0101 N/A
UEFI 14.41.0014 N/A
UEFI Virtio blk 22.4.0014 N/A
UEFI Virtio net 21.4.0013 N/A
Status: No matching image found
ubuntu@localhost:~$ sudo mlxconfig -d 03:00.0 q
Device #1:
----------
Device type: BlueField3
Name: 900-9D3B6-00CC-EA_Ax
Description: NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
Device: 03:00.0

Configurations: Next Boot
. . .
FLEX_PARSER_PROFILE_ENABLE 3
. . .
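For reference, here is a hypothetical sketch of how such a firmware configuration value is typically staged with NVIDIA's `mlxconfig` tool. The PCI address (03:00.0) and profile value (3) are taken from the query output above; the commands are guarded so the sketch is runnable on a machine without the MFT tools installed.

```shell
# Hypothetical sketch: stage the Flex Parser profile shown in the query above.
# Adjust DEV to the DPU's actual PCI address from lspci; 03:00.0 is an example.
DEV=03:00.0

if command -v mlxconfig >/dev/null 2>&1; then
    # Stage the new value non-interactively (-y). It appears under
    # "Next Boot" until the host is rebooted and the firmware reloads.
    sudo mlxconfig -y -d "$DEV" set FLEX_PARSER_PROFILE_ENABLE=3
    sudo mlxconfig -d "$DEV" q | grep FLEX_PARSER_PROFILE_ENABLE
else
    # Fallback for machines without the NVIDIA MFT package installed.
    echo "mlxconfig not found; would set FLEX_PARSER_PROFILE_ENABLE=3 on $DEV"
fi
```

As noted later in the thread, the staged value only becomes active after a cold reboot of the server, which on a shared testbed requires coordination with the operators.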
Sorry for the trouble, and thank you. The Google Drive link worked well.
Hi Plabon,
Thank you for sharing the updates about my inquiry.
Regarding the firmware and settings, I will notify you later today or tomorrow morning.
For the attachment failure of the Jupyter notebook, following is the suggestion from the FABRIC team.
– rename the extension to .txt to upload the notebook.

Hello Plabon,
Can you please let me know whether you were able to use the steps I shared on April 1st (the attached PDF file), and whether the issues you indicated last week were resolved?
For the firmware update procedure, there seems to be some discrepancy between the documentation and the actual outcome from the DOCA framework. Without a cold reboot of the server, the new firmware cannot be activated. I will need to clarify this with NVIDIA.
I haven’t read the UPF Reference Application Guide yet, but from your descriptions I understand that you need newer firmware versions for some additional features (although under Test Environment and Setup, the firmware is listed as 32.43.1014). You also need to enable the Flex Parser profile. Both items require a cold reboot of the server.
Rebooting servers on the FABRIC Testbed is not a straightforward task, as all resources are shared by users. I will see what I can do and let you know.
Best regards,
Mert