Forum Replies Created
Hi Fatih,
I see that the following three slivers are currently in a Closed state. Please note that a renewal is not an all-or-nothing operation.
When you renew a slice, it transitions into the Configuring state and reports which individual slivers were successfully extended and which were not. You can verify this in the Portal by viewing the slice topology, or—if you are renewing from JupyterHub—fablib will explicitly report which slivers failed to renew.
You can also check this programmatically:
```python
slice = fablib.get_slice(slice_name)
slice.list_slivers()
```

Here are the affected reservations:
- Reservation ID: 990127bd-aa06-4992-8847-c76654faf0e8
  State: Closed
  Reason: Insufficient resources: No path available with the requested QoS
- Reservation ID: 30dd426f-9ddc-424b-bec7-ca8631540ea4
  State: Closed
  Reason: Insufficient resources: No path available with the requested QoS
- Reservation ID: cb4372e4-fb05-454e-8662-f53e297689f8
  State: Closed
  Reason: Insufficient resources: No path available with the requested QoS
These slivers were not able to secure a viable path during renewal, which is why they are now in the Closed state.
To re-add these network services, you can modify the slice as follows:
- Fetch the current slice topology, remove the closed network services, and submit the slice.
- Fetch the updated topology, add the required network services again, and submit once more.
- You can refer to this example for guidance on modifying an existing slice (adding/removing resources):
`fabric_examples/fablib_api/modify_slice/modify-add-node-network.ipynb`
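The two-pass modify workflow above could be sketched in fablib roughly as follows. This is only a sketch: the slice name "MySlice", the service name "net1", and the choice of interfaces are placeholders to replace with your own values.

```python
# Sketch of the two-pass modify workflow (placeholder names: "MySlice", "net1").
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()

# Pass 1: fetch the topology, remove the closed network service, and submit.
slice = fablib.get_slice(name="MySlice")
slice.get_network(name="net1").delete()
slice.submit()

# Pass 2: re-fetch the updated topology, re-add the service, and submit again.
slice = fablib.get_slice(name="MySlice")
ifaces = [node.get_interfaces()[0] for node in slice.get_nodes()]
slice.add_l2network(name="net1", interfaces=ifaces)
slice.submit()
```

The key point is that each pass ends with a `submit()`: the removal must complete before the re-add is requested.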
Please let me know if you’d like help with the modify workflow or with re-submitting the network services.
Best,
Komal

Hi Fatih,
Thanks for reaching out.
I looked into your slice, and it appears that the two network services associated with VLAN 300 and VLAN 600 are currently in a Closed state. Both reservations show the same ticket update:
“Insufficient resources: No path available with the requested QoS.”
Here are the details:
- Reservation ID: 8a83db0f-03f1-44b0-843f-c6e0c2664cfe
  Slice ID: fdf2fd5b-b1b0-46ef-b51a-4d55e0fd5c47
  Resource Type: L2PTP
  State: Closed
  Reason: No path available with the requested QoS
- Reservation ID: 257fae2a-28ca-4430-bb85-77864b3d5c25
  Slice ID: fdf2fd5b-b1b0-46ef-b51a-4d55e0fd5c47
  Resource Type: L2PTP
  State: Closed
  Reason: No path available with the requested QoS

This indicates that the system was unable to allocate a viable path for these two tunnels during your most recent renewal window, which is why they are not active now.
If you would like, you can try the following:
- Re-declare or re-submit these two network services in your slice.
- Lower the QoS requirement temporarily to see if a path becomes available.
Please feel free to reach out if you need help updating the slice or if you would like us to investigate further.
Best regards,
Komal

November 20, 2025 at 11:08 am, in reply to: Cannot SSH into NS1 and NS5 nodes, need to preserve data (PhD simulations) #9189

Hi Danilo,
I found that the `authorized_keys` file on both NS1 and NS5 was empty, which is why SSH (whether through the admin key or the Control Framework) was failing, resulting in the POA/addKey failure. It seems this may have happened unintentionally as part of the experiment.

I've manually restored SSH access, so the Control Framework should now function properly, including POA. Could you please try adding your keys to these VMs again using POA? That should re-establish your SSH access.
Please be careful not to remove or overwrite the `authorized_keys` file in the process.

Best,
Komal
I tried running `docker pull` manually on DALL and SEAT, and it worked fine on both. The artifact also ran successfully on SEAT with the following changes. The issue appears to be related to the Docker installation via `docker.io`. I have also passed this on to the artifact author so they can make the required updates.
I made the following changes to get the artifact working:
- Changed the image to `docker_ubuntu_24`.
- Updated Step 34 to remove `docker.io` from the installation commands.
```python
stdout, stderr = node1.execute('sudo apt-get update', quiet=True)
stdout, stderr = node1.execute('sudo apt-get install -y build-essential python3-pip net-tools', quiet=True)
stdout, stderr = node2.execute('sudo apt-get update', quiet=True)
stdout, stderr = node2.execute('sudo apt-get install -y build-essential python3-pip net-tools', quiet=True)
stdout, stderr = node1.execute('sudo pip3 install meson ninja', quiet=True)
stdout, stderr = node2.execute('sudo apt install -y python3-scapy', quiet=True)
```

Best,
Komal

Hi Nishanth,
I tried on UTAH, MICH, and MASS, and `docker pull` seems to work.
Could you please run `nslookup nvcr.io` and then try the `docker pull` command? I will also check with Mert/Hussam to see if we have any known issues on SEAT and DALL.
Best,
Komal
Hi Nishanth,
Could you please share which site your slice is running at?
Best,
Komal
Hi Paresh,
Currently, FABRIC allows users to create VMs where GPUs or FPGAs can be attached via PCI passthrough. However, direct communication between FPGA and GPU over PCIe (such as peer-to-peer DMA or RDMA transfers) is not supported.
This is because for true PCIe peer-to-peer access, both devices need to be physically located on the same host and share the same PCIe root complex or switch. At present, none of the FABRIC nodes have both a GPU and an FPGA installed on the same host.
If you'd like to double-check the inventory yourself, you can list host capabilities with fablib:

```python
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fields = [
    'name',
    'fpga_sn1022_capacity',
    'fpga_u280_capacity',
    'rtx6000_capacity',
    'tesla_t4_capacity',
    'a30_capacity',
    'a40_capacity'
]

fablib = fablib_manager()
output_table = fablib.list_hosts(fields=fields)
```

You'll see per-host capacities for each device type. It will show that hosts with FPGA capacity don't also list GPU capacity (and vice versa), confirming that GPU+FPGA co-location isn't available.
Best regards,
Komal

November 5, 2025 at 3:06 pm, in reply to: Clarification on Multiple L2PTP Tunnels over a Single Physical Link #9153

Hi Fatih,
You should be able to create multiple tunnels on the same NIC by using VLAN-tagged sub-interfaces. Each sub-interface can be assigned to a different L2PTP tunnel, allowing multiple distinct connections over the same physical port.
Please check out the example notebook `fabric_examples/fablib_api/sub_interfaces/sub_interfaces.ipynb` for details on how to configure sub-interfaces.

Best regards,
Komal

November 4, 2025 at 3:30 pm, in reply to: Unable to allocate IP addresses to nodes – “No Management IP” #9147

Hi Geoff,
This appears to be a bug in fablib. As a workaround, could you please modify the call as follows?
```python
client_interface = client_node.get_interface(network_name="client-net", refresh=True)
```

This change should prevent the error from occurring. I'll work on fixing this issue in fablib.
Best,
Komal

Hi Tanay,
BlueField-3 nodes are now available on FABRIC, and we currently offer two variants:
- ConnectX-7-100 – 100 G
- ConnectX-7-400 – 400 G
To provision and use them, your project lead will need to request access through the Portal under Experiment → Project → Request Permissions.
Best,
Komal
November 4, 2025 at 3:10 pm, in reply to: Unable to allocate IP addresses to nodes – “No Management IP” #9144

Hi Geoff,
Just to confirm my understanding: your slice is in StableOK state, and the nodes display IP addresses as shown in your screenshot, but `node.execute` is failing with a "no management IP" error. Is that correct?

Could you please share your Slice ID here?
Thanks,
Komal

Thank you, @yoursunny, for sharing these observations and the detailed steps to reproduce them. This appears to be a bug. I'll work on addressing it and will update you once the patch is deployed.
Best,
Komal
Hi Jiri,
We’ve been investigating two issues related to your recent observation:
- Slice reaches StableOK, but management IPs don’t appear – This behavior seems to be caused by performance degradation in our backend graph database. We’re actively working to address and mitigate this issue.
- Slice stuck in “doing post Boot Config” – This issue was traced to one of the bastion hosts. A fix was applied earlier today.
If your slice is still active, could you please share the Slice ID where you observed this behavior? Additionally, if you encounter this issue again, it would be very helpful if you could send us the log file located at /tmp/fablib/fablib.log. This information will help us investigate and debug the issue more effectively.
Best regards,
Komal
Hi Fatih,
You are absolutely correct — in the FABRIC testbed, the term “host” refers to a single physical machine, not a group of blades or multiple servers.
Regarding your question about the core count: the host you mentioned (for example, `seat-w2`) reports 128 CPUs because the physical server has two AMD EPYC processors, each with 32 physical cores, with hyperthreading enabled. This means each physical core presents two logical CPUs (threads) to the operating system.

So, the breakdown is:
- 2 sockets × 32 physical cores per socket = 64 physical cores
- With hyperthreading (2 threads per core): 64 × 2 = 128 logical CPUs
Inside a VM, you’ll typically see the processor model name (e.g., AMD EPYC 7543 32-Core Processor), which corresponds to the physical CPU model installed in the host. The number of vCPUs visible in the VM depends on the resources allocated to it by the hypervisor, not the total physical core count of the host.
In summary:
- Host = one physical machine
- 128 cores = 64 physical cores × 2 threads (hyperthreading)
- VM CPU info = underlying processor model, showing only allocated vCPUs
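As a quick sanity check, the breakdown above can be reproduced in a few lines of Python (the socket and core counts are the seat-w2 values from this post; adjust them for other hosts):

```python
# CPU topology of seat-w2 as described above: two 32-core AMD EPYC
# sockets with hyperthreading (2 threads per core) enabled.
sockets = 2
cores_per_socket = 32
threads_per_core = 2

physical_cores = sockets * cores_per_socket        # 2 x 32 = 64
logical_cpus = physical_cores * threads_per_core   # 64 x 2 = 128
print(physical_cores, logical_cpus)  # -> 64 128
```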
You can find more details about the hardware configurations for a FABRIC site here:
https://learn.fabric-testbed.net/knowledge-base/fabric-site-hardware-configurations/

Best regards,
Komal

Hi Fatih,
When requesting a VM in a slice, specifying the `host` parameter (for example, `seat-w2.fabric-testbed.net`) ensures that the VM is provisioned on that particular physical host. If multiple VMs in the same slice specify the same host, they will all be co-located on that same physical machine.

This can be done as follows:

```python
slice.add_node(name="node1", host="seat-w2.fabric-testbed.net", ...)
slice.add_node(name="node2", host="seat-w2.fabric-testbed.net", ...)
```

If the `host` parameter is not specified, the FABRIC Orchestrator automatically places the VMs across available hosts based on resource availability, which may result in them being distributed across different physical machines.

If the requested host cannot accommodate the VMs (due to limited capacity or resource constraints), the system will return an "Insufficient resources" error.
Best regards,
Komal