Mert Cevik

Forum Replies Created

Viewing 15 posts - 1 through 15 (of 222 total)
  • in reply to: Maintenance on SEAT node on May 8th #9774
    Mert Cevik
    Moderator

      Work is completed.

      in reply to: I cannot access some of my nodes #9741
      Mert Cevik
      Moderator

        Same situation. Rebooted, devices attached.

        I’m not sure what is causing this. The worker node is not heavily loaded, but inside the VMs there appear to be Mellanox driver issues. If you share some context about the actual experiment and the traffic generated/exchanged, we can try to understand it and find a way to keep it running reliably. Otherwise, I don’t have any clues right now. You can reach out directly if you prefer.

        in reply to: I cannot access some of my nodes #9739
        Mert Cevik
        Moderator

          Both VMs had crashed. I’m attaching the console outputs.
          console.7b4c35dd-c7d1-4d29-9ca0-c71d21e6089e-r-2-1
          console.c834417a-7393-4cae-bd62-722358b6451f-r-2-3

          I restarted them, and they are back online. I also attached their PCI devices (IP addresses need to be re-assigned).

          in reply to: Cannot allocate GPU + ConnectX-6 on same node #9724
          Mert Cevik
          Moderator

            We are checking the status information for cern-w2 with respect to a potential mismatch
            caused by a reservation that is currently consuming the resource, although the health of that reservation is not clear.
            We will send updates.

            in reply to: Cannot allocate GPU + ConnectX-6 on same node #9722
            Mert Cevik
            Moderator

              An easy way that works for me is checking the portal for the specific worker node’s resources. At CERN, cern-w2 seems to match your needs. I will attach a screenshot from the portal, but I’m not sure how it will show up in this comment. You can go to portal.fabric-testbed.net, click a link that leads to the CERN page (either from the map or from the table), and then see the available resources. (If this is already known to you, please disregard.)

              To target a specific worker node that has the desired resources, there may be some example functions within the example Jupyter notebooks that show how to filter the worker nodes and list their resources. The Fablib API documentation may also reveal some ways; I don’t know much about that part. I expect knowledgeable users from the community may share their methods.

              For scheduling resources in advance, this resource may reveal some ways -> https://artifacts.fabric-testbed.net/artifacts/32938b00-5036-4a1e-84b5-063283618669

              There may be other ways to show resource availability, but I will leave that to more advanced users or the FABRIC team, who may have better pointers.
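As a rough illustration of the filtering idea (this is not the actual Fablib API; the worker names, field names, and counts below are hypothetical stand-ins for what the portal listing shows), one could filter per-worker resource records like this:

```python
# Hypothetical per-worker records shaped loosely like the portal's resource
# listing; names and counts are illustrative, not real FABRIC data.
workers = [
    {"name": "cern-w1", "site": "CERN", "gpu_a30": 0, "nic_connectx6": 2},
    {"name": "cern-w2", "site": "CERN", "gpu_a30": 1, "nic_connectx6": 2},
    {"name": "max-w1",  "site": "MAX",  "gpu_a30": 1, "nic_connectx6": 0},
]

def find_workers(workers, **minimums):
    """Return workers that have at least the requested amount of each resource."""
    return [
        w for w in workers
        if all(w.get(resource, 0) >= count for resource, count in minimums.items())
    ]

matches = find_workers(workers, gpu_a30=1, nic_connectx6=1)
print([w["name"] for w in matches])  # → ['cern-w2']
```

A real slice request would then target the matching worker by name; consult the Fablib documentation for the supported way to do that.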


              in reply to: Issue Accessing Nodes Across My FABRIC Slices #9719
              Mert Cevik
              Moderator

                You need to provide the slice IDs.

                in reply to: Cannot allocate GPU + ConnectX-6 on same node #9718
                Mert Cevik
                Moderator

                  ConnectX-6 SmartNICs are located on the “FastNet Worker”
                  GPUs are located on “GPU Worker” and “SlowNet Worker”

                  You can find information on this page -> https://learn.fabric-testbed.net/knowledge-base/fabric-site-hardware-configurations/

                  So it will not be possible to have both a GPU and a ConnectX-6 on the same VM.
                  However, CERN is an exception: it has 3x “FastNet Worker” servers, each with 2x ConnectX-6 SmartNICs and 1x A30 GPU.
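The placement rule above can be captured as a small lookup. This is just a sketch: the worker-type component lists come from this reply, and everything else is illustrative.

```python
# Components per worker type, per the hardware-configuration rule above.
components = {
    "FastNet Worker": {"ConnectX-6"},
    "GPU Worker":     {"GPU"},
    "SlowNet Worker": {"GPU"},
}
# The stated exception: CERN's FastNet workers also carry one A30 GPU.
cern_fastnet = components["FastNet Worker"] | {"GPU"}

def can_colocate(worker_components, needed=frozenset({"GPU", "ConnectX-6"})):
    """True if a single worker carries every needed component."""
    return needed <= worker_components

print(any(can_colocate(c) for c in components.values()))  # → False (regular sites)
print(can_colocate(cern_fastnet))                         # → True (CERN exception)
```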

                  in reply to: Slices stuck at configuring…….state #9698
                  Mert Cevik
                  Moderator

                    MAX is available for the experiments.

                    in reply to: Inquiry Regarding MAX Site Maintenance Completion Timeline #9697
                    Mert Cevik
                    Moderator

                      Hi Ajay,

                      MAX is back online. The maintenance status is released, but it may take some time for the portal to show it as available. Regardless, it’s available for experiments now.

                      Best regards,
                      Mert

                      in reply to: FABRIC SEAT – Outage on seat-w1 #9690
                      Mert Cevik
                      Moderator

                        The problem on the server (seat-w1) was caused by the Nvidia BlueField-3 DPU card. The server is back online (active VM slivers are recovered); however, we took the DPU card out for investigation. All other resources on the SEAT node are available for experiments.

                        in reply to: Slices stuck at configuring…….state #9687
                        Mert Cevik
                        Moderator

                          Hi Ajay,

                          The problem is caused by a hardware failure on the head node of the MAX site. Work is in progress to recover the server; however, it is very likely to require some extra time. I wanted to let you know in case these are the slices for your demo; you may need to re-create them on other FABRIC nodes/sites.

                          I will notify you if we are able to resolve the problem on MAX and your current slices can be recovered.


                          Best regards,

                          Mert

                          in reply to: BlueField-3 host-DPU communication issue on FABRIC #9664
                          Mert Cevik
                          Moderator

                            Hi Plabon,

                            The BlueField-3 DPU on the UCSD node is the one on which you can test your work.

                            I’m attaching some outputs, and I confirmed that there is improvement in the Accelerated UPF Reference Application runtime. Please let us know about your status. (Also, due to the upcoming KNIT12, other experimenters may specifically request the UCSD DPU resource. Please watch the availability and try it out as soon as possible.)

                            ubuntu@localhost:~$ sudo mlxfwmanager --query
                            Querying Mellanox devices firmware ...

                            Device #1:
                            ----------

                            Device Type: BlueField3
                            Part Number: 900-9D3B6-00CC-EA_Ax
                            Description: NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
                            PSID: MT_0000001115
                            PCI Device Name: /dev/mst/mt41692_pciconf0
                            Base MAC: cc40f38f0356
                            Versions:            Current       Available
                               FW                32.48.1000    N/A
                               PXE               3.9.0101      N/A
                               UEFI              14.41.0014    N/A
                               UEFI Virtio blk   22.4.0014     N/A
                               UEFI Virtio net   21.4.0013     N/A

                            Status: No matching image found

                            ubuntu@localhost:~$ sudo mlxconfig -d 03:00.0 q
                            Device #1:
                            ----------

                            Device type: BlueField3
                            Name: 900-9D3B6-00CC-EA_Ax
                            Description: NVIDIA BlueField-3 B3210E E-Series FHHL DPU; 100GbE (default mode) / HDR100 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
                            Device: 03:00.0

                            Configurations: Next Boot

                            . . .

                            FLEX_PARSER_PROFILE_ENABLE 3

                            . . .

                            in reply to: BlueField-3 host-DPU communication issue on FABRIC #9663
                            Mert Cevik
                            Moderator

                              Sorry for the trouble, and thank you. The Google Drive link worked well.

                              in reply to: BlueField-3 host-DPU communication issue on FABRIC #9660
                              Mert Cevik
                              Moderator

                                Hi Plabon,

                                Thank you for sharing the updates on my inquiry.

                                For the firmware and settings, I will follow up later today or tomorrow morning.

                                For the attachment failure of the Jupyter notebook, the following is the suggestion from the FABRIC team:
                                – rename the extension to .txt to upload the notebook.
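For example (the notebook filename is hypothetical; this just makes a .txt copy to upload):

```shell
cd "$(mktemp -d)"                        # scratch directory for the demo
echo '{"cells": []}' > analysis.ipynb    # stand-in for your notebook
cp analysis.ipynb analysis.ipynb.txt     # upload this .txt copy instead
ls analysis.ipynb.txt
```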

                                in reply to: BlueField-3 host-DPU communication issue on FABRIC #9656
                                Mert Cevik
                                Moderator

                                  Hello Plabon,

                                  Can you please let me know whether you were able to use the steps I shared on April 1st (the attached PDF file), and whether the issues you indicated last week were resolved?

                                  For the firmware update procedure, there seems to be some discrepancy between the documentation and the actual outcome from the DOCA framework. Without a cold reboot of the server, new firmware cannot be activated. I will need to clarify this with Nvidia.

                                  I haven’t read the UPF Reference Application Guide yet, but from your descriptions I understand that you need newer firmware versions for some additional features (although under Test Environment and Setup, the firmware is listed as 32.43.1014). You also need to enable the Flex Parser Profile. Both items require a cold reboot of the server.

                                  Rebooting servers on the FABRIC Testbed is not a straightforward task, as all resources are shared by users. I will see what I can do and let you know.

                                  Best regards,
                                  Mert

