Forum Replies Created
June 18, 2025 at 12:15 pm in reply to: Lost network interface after rebooting of vm3 in a cluster #8623
Hi Ajay,
Thanks for reaching out. Could you please share any details about what may have caused the VM to crash? This information will help us better understand the root cause.
It appears that the PCI devices were detached from your VM during the crash. I’ve gone ahead and restored the VM — you should now be able to access it and use the GPUs as expected.
Please let me know if you continue to face any issues.
Best,
Komal Thareja

As a quick follow-up:
In addition to SmartNIC reservations, FABRIC also supports CPU pinning and NUMA tuning options, which can help further minimize resource contention on shared hosts. While these do not fully isolate you from other users on the physical host, they can significantly reduce interference for CPU and memory-bound workloads.
You can find working examples demonstrating how to request CPU pinning and NUMA-optimized resources in the FABRIC Jupyter notebook examples repository.
These examples show how you can specify pinned CPUs and memory placement via FABlib when creating your slice.
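For reference, here is a minimal sketch along the lines of those notebooks. The helper calls pin_cpu, numa_tune, and os_reboot are based on the example notebooks and may differ in your FABlib version, and the slice, node, site, and NIC names below are placeholders.

    # Minimal sketch (please verify helper names against the current notebooks):
    # pin the VM's vCPUs and memory to the NUMA node hosting its SmartNIC.
    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()

    slice = fablib.new_slice(name="pinned-slice")              # placeholder name
    node = slice.add_node(name="node1", site="MASS",           # placeholder site
                          cores=8, ram=32, disk=100)
    node.add_component(model="NIC_ConnectX_6", name="nic1")
    slice.submit()

    # After the slice is active, pin vCPUs to the NIC's NUMA node and tune memory.
    node = slice.get_node(name="node1")
    node.pin_cpu(component_name="nic1")   # assumed helper from the notebooks
    node.numa_tune()                      # assumed helper from the notebooks
    node.os_reboot()                      # reboot so the new placement takes effect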
Please feel free to reach out if you’d like any assistance setting this up.
Best,
Komal
Dear Fatih,
Thank you for reaching out.
At present, there is no way to prevent other users from having VMs on the same physical host unless you are able to reserve the entire host (when host doesn’t have other allocations). However, one option you may consider is requesting SmartNIC-based resources (such as CX6 or CX5). When you reserve a SmartNIC, the NIC is dedicated exclusively to your slice, ensuring that only your experiment’s traffic passes through that NIC. While this does not isolate CPU or memory resources on the host, it can minimize potential interference on the data plane network traffic.
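If it helps, a dedicated SmartNIC can be requested through FABlib roughly as follows. This is a minimal sketch; the slice, node, NIC, and site names are placeholders, so please adapt them to your experiment.

    # Minimal sketch: two nodes, each with a dedicated ConnectX-6 SmartNIC,
    # connected by an L2 network. All names and sites are placeholders.
    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()

    slice = fablib.new_slice(name="smartnic-slice")
    node1 = slice.add_node(name="node1", site="SALT", cores=4, ram=16, disk=50)
    node2 = slice.add_node(name="node2", site="MASS", cores=4, ram=16, disk=50)

    # Each ConnectX-6 is dedicated to this slice; only your traffic uses it.
    nic1 = node1.add_component(model="NIC_ConnectX_6", name="nic1")   # or NIC_ConnectX_5
    nic2 = node2.add_component(model="NIC_ConnectX_6", name="nic2")

    # L2 network between one port of each SmartNIC.
    slice.add_l2network(name="net1",
                        interfaces=[nic1.get_interfaces()[0],
                                    nic2.get_interfaces()[0]])

    slice.submit()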
Please let us know if you would like assistance with requesting such resources or if you have any further questions.
Best regards,
Komal Thareja

Jupyter Hub is back up and accessible. You should be able to use JupyterHub containers. The GCP outage has been resolved.
Refer to https://learn.fabric-testbed.net/forums/topic/out/ for more details.
Thanks,
Komal
June 12, 2025 at 6:43 pm in reply to: Outage Jupyter Hub – Kubernetes PVC Attachment Errors Due to GCP Incident #8613
Update: JupyterHub Access Restored — GCP Incident Resolved
The earlier Google Cloud Platform service disruption that was affecting JupyterHub logins and volume attachments has now been fully resolved. As of now, users should be able to log in and start their JupyterHub environments normally.
The root cause of the issue was a Google Cloud Service Control incident that intermittently prevented volume attachments across multiple GCP services. Full details of the incident are available here:
🔗 GCP Incident Summary (June 12, 2025)
If you continue to encounter any issues starting your environment:
- Try restarting your server from the JupyterHub control panel.
- If the problem persists, please feel free to reach out to us.
Thank you for your patience while this upstream issue was being addressed.
Best regards,
Komal
Notice: Kubernetes PVC Attachment Errors Due to GCP Incident (June 12, 2025)
We are aware of an ongoing issue where some users may see errors when starting their JupyterHub environments. Affected users may encounter errors similar to:
AttachVolume.Attach failed for volume "pvc-..." : rpc error: code = Internal desc = Failed to getDisk: googleapi: Error 503: Policy checks are unavailable., backendError
Root cause:
This is due to a Google Cloud Platform (GCP) service disruption that is intermittently preventing Kubernetes from attaching persistent volumes. The issue is upstream of our environment and is being actively addressed by Google (see GCP Status).
What should you do:
- If you encounter this error when launching your JupyterHub environment, no action is needed on your part.
- In most cases, the issue is temporary and will resolve automatically as the underlying cloud services recover.
- We recommend waiting a few minutes and then retrying.
- Please avoid repeated restarts or resubmissions, as Kubernetes will continue to attempt recovery automatically.
We will continue to monitor the situation and will update as more information becomes available. Thank you for your patience.
Best regards,
Komal
Hi Tanay,
We are in the process of procuring them. While they may not be available for the Summer release, we are targeting an incremental release or including them in the Fall 2025 release.
Best,
Komal
Hi Rodrigo,
Could you please share your slice ID and let us know how you’re trying to access the VMs—whether through Jupyter Hub or from your local environment?
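In case it is useful, the slice ID can usually be pulled up from a notebook with FABlib along these lines (the slice name below is a placeholder):

    # Sketch: look up a slice ID with FABlib ("MySlice" is a placeholder).
    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()
    slice = fablib.get_slice(name="MySlice")
    print(slice.get_slice_id())

    # Or list all of your current slices:
    fablib.list_slices()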
Thanks,
Komal

June 6, 2025 at 11:18 am in reply to: Guaranteed Capacity and Traffic Prioritization across the Sites #8590
Hi Fatih,
Thank you for your email and detailed questions.
At this time, FABRIC does not support guaranteed capacity or QoS prioritization on L2P2P links. The service operates as best-effort by default, and DSCP/ToS or VLAN PCP markings are not enforced across the underlying infrastructure.
That said, we are actively working to support guaranteed QoS using Explicit Route Options (ERO) in the L2P2P service. This capability is planned for inclusion in our upcoming Release 1.9, targeted for deployment in late July/early August. It will provide a way to request L2P2P links with specified bandwidth guarantees and rate-limiting.
We will share more details and guidance on how to configure these options as part of the release.
Please feel free to reach out with any further questions in the meantime.
Best regards,
Komal Thareja

Hi Alexander,
Based on our investigation so far, the VMs from your slice that are not passing traffic were hosted on salt-w3.fabric-testbed.net. We’ve identified that none of the VMs on this host are able to pass traffic. As a result, we have placed this worker into Maintenance mode and are actively investigating the issue.
You should be able to create a new slice without encountering this problem, as salt-w3 is now in Maintenance and will not be used for any new slices on the SALT site.
Thanks,
Komal
Hi Sourya,
MASS is undergoing maintenance from June 2 to June 4, as noted [here].
Since your slice is set to expire on June 9, it will remain unaffected by the maintenance window. As mentioned in the announcement, your VM will be recovered, and your data will persist.
Thanks,
Komal

Thank you, Alexander, for sharing this. I have shared the details with the network team. We will keep you posted.
Thanks,
Komal
Hi Philips,
At the moment, we do not support guaranteed QoS. This feature will be available soon. In the meantime, you can use tools such as tc to manage bandwidth on the VMs.
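For instance, here is a rough sketch of applying a tc token-bucket rate limit to a dataplane interface from FABlib; the slice name, node name, interface name, and rate below are placeholders, not values from your slice.

    # Rough sketch: cap egress on a dataplane interface of a FABRIC VM with tc.
    # "MySlice", "node1", "eth1", and the 100 Mbit/s rate are placeholders.
    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()
    node = fablib.get_slice(name="MySlice").get_node(name="node1")

    # Token-bucket filter limiting egress on eth1 to roughly 100 Mbit/s.
    node.execute("sudo tc qdisc add dev eth1 root tbf "
                 "rate 100mbit burst 1mbit latency 50ms")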
Thanks,
Komal

Hi Nishant,
Please find my responses inline below:
Once a user has reserved a slice with an FPGA, that resource is locked and cannot be acquired or modified by other users until the slice is released.
You’re correct—if the FPGA has been flashed with a workflow other than the EsNet workflow, it may fail.
However, we cannot guarantee the validity or state of the bitstream that was previously flashed by another user before you acquired the slice. This may leave the FPGA in an inconsistent or unusable state. In our experience, reflashing the FPGA with a known good (golden) image typically restores it to a usable state.
We are planning to share this golden image along with the notebook with users soon, so they can perform the reflash themselves when needed. In the meantime, if you’re currently blocked, please let me know the specific site you’re working with—I’ll check whether we can assist with reflashing the FPGA for you.
Thanks,
Komal