Komal Thareja

Forum Replies Created

Viewing 15 posts - 16 through 30 (of 557 total)
  • in reply to: Cannot allocate GPU + ConnectX-6 on same node #9727
    Komal Thareja
    Moderator

      Portal view has been fixed too! Portal now shows the state of resources correctly.

      Best,

      Komal

      in reply to: Cannot allocate GPU + ConnectX-6 on same node #9726
      Komal Thareja
      Moderator

        Hi Bek,

        Just a heads-up — the resource status on the portal isn’t quite matching the actual state of the resources right now. I’m working to get that sorted, but in the meantime you can use the fablib API to check availability and find an open slot for your target slice.

        Here’s an artifact that should come in handy: https://artifacts.fabric-testbed.net/artifacts/e777ce3a-5b40-4e58-9666-7f31f655f03c

        Best,

        Komal

        Komal Thareja
        Moderator

          Hi Sree,

          I’m investigating the extend/renew of this slice. That said, I’d strongly recommend backing up your data in the meantime — that way, if the slice ever needs to be recreated, you’ll have everything you need on hand.

          Best,
          Komal

          Komal Thareja
          Moderator

            Hi Sree,

            Could you please share your slice ID?

            Best,

            Komal

            Komal Thareja
            Moderator

              Hi Yifan,

              When creating a slice through the Portal, the network configuration needs to be set up manually. However, if you create the slice via the JupyterHub interface (Portal → JupyterHub), the network configuration is handled automatically. You can follow the steps outlined here: https://learn.fabric-testbed.net/knowledge-base/creating-your-first-experiment-in-jupyter-hub/

              Best,
              Komal

              Komal Thareja
              Moderator

                Hi Yifan,

                I’m not sure how the VMs were originally provisioned—whether auto configuration or manual setup was used, or which JupyterHub container was involved.

                I checked your MASS VMs and noticed that IPv6 addresses were not assigned to the data plane interfaces and the required routes were missing. I manually configured both VMs by assigning IPv6 addresses and adding the appropriate routes:

                mass-0:

                sudo ip -6 addr add 2602:fcfb:7:1::2/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:7:1::1 dev enp7s0
                

                mass-1:

                sudo ip -6 addr add 2602:fcfb:7:1::3/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:7:1::1 dev enp7s0
                

                After applying these changes, connectivity between the MASS VMs is working as expected (verified via ping).

                I also attempted to access the UTAH and ATLA VMs, but I wasn’t able to SSH using the NOVA keys, so I couldn’t validate their configuration.

                Could you please run the following commands on the remaining VMs to configure the data plane interfaces?

                UTAH VMs

                ut-0:

                sudo ip -6 addr add 2602:fcfb:8:d1::2/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:8:d1::1 dev enp7s0
                

                ut-1:

                sudo ip -6 addr add 2602:fcfb:8:d1::3/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:8:d1::1 dev enp7s0
                

                ATLA VMs

                atl-0:

                sudo ip -6 addr add 2602:fcfb:15:1::2/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:15:1::1 dev enp7s0
                

                atl-1:

                sudo ip -6 addr add 2602:fcfb:15:1::3/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:15:1::1 dev enp7s0
                

                GATECH VMs

                gatech-0:

                sudo ip -6 addr add 2602:fcfb:11:2::3/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:11:2::1 dev enp7s0
                

                gatech-1:

                sudo ip -6 addr add 2602:fcfb:11:2::2/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:11:2::1 dev enp7s0
                

                WASH VMs

                wash-0:

                sudo ip -6 addr add 2602:fcfb:a:1::3/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:a:1::1 dev enp7s0
                

                wash-1:

                sudo ip -6 addr add 2602:fcfb:a:1::2/64 dev enp7s0
                sudo ip link set enp7s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:a:1::1 dev enp7s0
                

                LOSA VMs

                la-0 (uses enp6s0):

                sudo ip -6 addr add 2602:fcfb:12:c::3/64 dev enp6s0
                sudo ip link set enp6s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:12:c::1 dev enp6s0
                

                la-1 (uses enp6s0):

                sudo ip -6 addr add 2602:fcfb:12:c::2/64 dev enp6s0
                sudo ip link set enp6s0 up
                sudo ip -6 route add 2602:fcfb:00::/40 via 2602:fcfb:12:c::1 dev enp6s0
                

                Note: The LOSA VMs use enp6s0 instead of enp7s0 for the data plane interface.
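
The per-VM commands above all follow one pattern (assign the address, bring the link up, add the route), so they can be generated from a small table instead of being typed by hand. Below is a minimal Python sketch under that assumption. The helper name `dataplane_commands` and the `HOSTS` table are illustrative and not part of fablib; the addresses, gateways and interface names are copied from the list above. The helper also checks that each gateway sits in the same /64 as the address before it emits anything.

```python
import ipaddress

# Aggregate prefix routed via each site's gateway (taken from the commands above).
FABNET_V6 = "2602:fcfb:00::/40"


def dataplane_commands(addr, gateway, dev, prefixlen=64):
    """Return the three `ip` commands used above for one VM.

    Raises ValueError if the gateway is not in the same subnet as the address.
    """
    iface = ipaddress.ip_interface(f"{addr}/{prefixlen}")
    if ipaddress.ip_address(gateway) not in iface.network:
        raise ValueError(f"gateway {gateway} is outside {iface.network}")
    return [
        f"sudo ip -6 addr add {addr}/{prefixlen} dev {dev}",
        f"sudo ip link set {dev} up",
        f"sudo ip -6 route add {FABNET_V6} via {gateway} dev {dev}",
    ]


# Illustrative subset of the VMs listed above: (address, gateway, interface).
HOSTS = {
    "ut-0": ("2602:fcfb:8:d1::2", "2602:fcfb:8:d1::1", "enp7s0"),
    "la-0": ("2602:fcfb:12:c::3", "2602:fcfb:12:c::1", "enp6s0"),
}

if __name__ == "__main__":
    for name, (addr, gw, dev) in HOSTS.items():
        print(f"{name}:")
        for cmd in dataplane_commands(addr, gw, dev):
            print(f"  {cmd}")
```

The printed commands still need to be run on each VM; the sketch only builds the strings and does not touch any interface.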

                Please let me know if you need any help with this.

                Best,
                Komal

                Komal Thareja
                Moderator

                  Hi Yifan,

                  Could you please share your slice ID?

                  Best,

                  Komal

                  Komal Thareja
                  Moderator

                    You should be able to re-use the existing slice.

                    Just run the following in a cell:

                    slice = fablib.get_slice(slice_name)
                    slice.post_boot_config()
                    slice.list_nodes();
                    slice.list_interfaces();

                    Thanks,

                    Komal

                    Komal Thareja
                    Moderator

                      Hi Rasman,

                      I tried both your shared NICs example and the iperf3 (CX5) notebook, and I do see IPs being configured on the VMs.

                      Could you please run the following notebook:
                      jupyter-examples-*/configure_and_validate/configure_and_validate.ipynb?

                      It’s possible that your bastion keys have expired, which may be preventing fablib from properly configuring the nodes.

                      I’ve attached a snapshot of the output from my runs below for reference.

                      Best,
                      Komal

                      Komal Thareja
                      Moderator

                        Hi Rasman,

                        Which JH container are you using?

                        Best,

                        Komal

                        in reply to: pin_cpu & poa(operation="cpupin") #9620
                        Komal Thareja
                        Moderator

                          Thank you for sharing your observations, @yoursunny. This was indeed a bug, and it has now been fixed in the Beyond Bleeding Edge container.

                          I’ll be rolling out the fix to the Bleeding Edge container shortly as well.

                          Best,
                          Komal

                          Komal Thareja
                          Moderator

                            Hi Rasman,

                            Great question, and thanks for checking before running your experiments — we appreciate that!

                            As yoursunny mentioned, you’ll want to use FABNetv4Ext or FABNetv6Ext network services for your experiment rather than the management network. These provide dedicated public Internet connectivity for your slices and are designed for exactly this kind of bulk data transfer work. The management network is shared infrastructure and should not be used for high-volume traffic.

                            One important thing to note: FABNetv4Ext and FABNetv6Ext require additional project permissions that are not enabled by default. Your Project Lead will need to request the Net.FABNetv4Ext and/or Net.FABNetv6Ext permissions for your project through the FABRIC Portal (use the “Request additional project permissions” option under Experiments -> Projects).

                            Once you have those permissions, you should be all set to run sustained download experiments against NCBI/ENA without any issues on the FABRIC side.

                            Also, thanks yoursunny for jumping in with the helpful pointer!

                            Best,
                            Komal

                            in reply to: Slice Renewal Stuck in Configuring State #9602
                            Komal Thareja
                            Moderator

                              Hi Fatih,

                              I looked into your slice (698e8e21). During the renewal attempt, several VMs failed to renew due to insufficient resources on the target workers. These VMs were closed on the slice’s initial end date, 2026-03-16.

                              – 4 VMs failed due to insufficient RAM (on ncsa-w1 and other workers)
                              – 2 VMs failed due to insufficient cores (on mich-w2, mich-w3)

                              These VM failures caused a cascade: their dependent network services (L2Bridge, L2PTP) were also closed on expiry, since they cannot function without the underlying VMs. In total, 85 out of 129 reservations were closed and 3 additional network services were cleaned up.

                              The slice was stuck in Configuring because some network reservations were waiting indefinitely for their dead predecessor VMs. I have deployed a fix that now properly detects this condition and closes those stuck reservations, which is why the slice has transitioned out of the Configuring state.

                              Unfortunately, this slice cannot be recovered in its current state — too many VMs and their dependent network services have been closed. I recommend deleting this slice and creating a new one. To avoid resource contention, you may want to check site availability before submitting and consider spreading your VMs across sites with more available capacity, or using smaller VM flavors.
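
To illustrate the idea of spreading VMs across sites with spare capacity, here is a rough Python sketch of a greedy placement. Everything in it is hypothetical: the function name `spread_vms`, the site names and the capacity numbers are made up for the example, and real availability would have to come from fablib or the portal. It only shows the reasoning, not how the FABRIC orchestrator actually allocates resources.

```python
def spread_vms(vms, sites):
    """Greedy placement: put each VM on the fitting site with the most free cores.

    `vms` is a list of (name, cores, ram_gb) tuples.
    `sites` maps a site name to {"cores": free_cores, "ram": free_ram_gb}.
    Returns {vm_name: site}; raises RuntimeError if a VM fits nowhere.
    """
    free = {s: dict(c) for s, c in sites.items()}  # copy so the input is not mutated
    placement = {}
    # Place the largest VMs first so small ones do not use up the room they need.
    for name, cores, ram in sorted(vms, key=lambda v: (v[1], v[2]), reverse=True):
        fits = [s for s, c in free.items() if c["cores"] >= cores and c["ram"] >= ram]
        if not fits:
            raise RuntimeError(f"no site can host {name} ({cores} cores, {ram} GB)")
        site = max(fits, key=lambda s: free[s]["cores"])
        free[site]["cores"] -= cores
        free[site]["ram"] -= ram
        placement[name] = site
    return placement


if __name__ == "__main__":
    # Hypothetical free capacity per site and hypothetical VM sizes.
    sites = {"SITE-A": {"cores": 8, "ram": 32}, "SITE-B": {"cores": 4, "ram": 16}}
    vms = [("vm1", 4, 16), ("vm2", 4, 16), ("vm3", 4, 16)]
    print(spread_vms(vms, sites))
```

If the function raises, the request does not fit the capacity given to it, which is the point at which smaller VM flavors or additional sites are worth considering.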

                              Please let us know if you need any further assistance.

                              NOTE: With advanced reservations in play, renew/extend is not always guaranteed, as the resources may have been acquired by someone else.

                              Best regards,
                              Komal

                              in reply to: slice hungup on configuring #9588
                              Komal Thareja
                              Moderator

                                Hi Nirmala,

                                This looks like a bug. I am investigating it and will work to deploy a fix for this soon. Apologies for the inconvenience.

                                Best,

                                Komal

                                Komal Thareja
                                Moderator

                                  Hi Sree,

                                  VMs cannot communicate with each other over the private IPs assigned to interfaces connected to the management network. The interfaces with addresses in the 10.* range belong to this management network. Inter-VM communication should instead occur over the data plane network, which in your case is the L2Bridge network.

                                  I reviewed your slice and noticed that you have three VMs and two L2Bridge networks configured. However, the IP addresses on the VM interfaces are not set up correctly. Each network must use a different subnet, and the corresponding VM interfaces should be assigned IP addresses from those respective subnets.

                                  Please refer to the following example notebook, which demonstrates how to correctly configure the network:
                                  jupyter-examples-*/fabric_examples/fablib_api/create_l2network_basic/create_l2network_basic_auto.ipynb

                                  Make sure to use separate subnets for each network and assign the appropriate IPs to the VM interfaces so that communication works properly.
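
As a rough illustration of the addressing rule, the Python sketch below carves one subnet per network out of a private range and hands each attached VM an address from its network's subnet. The names (`plan_l2_addresses`, `net1`, `node1`, and so on), the 192.168.0.0/16 base range and which VM is attached to which network are all assumptions made for the example; they are not taken from your slice, and the function is not part of fablib.

```python
import ipaddress


def plan_l2_addresses(networks, base="192.168.0.0/16", new_prefix=24):
    """Give each L2 network its own subnet and each attached VM a host address.

    `networks` maps a network name to the list of VM names attached to it.
    Returns {network: (subnet, {vm: address})}.
    """
    subnets = ipaddress.ip_network(base).subnets(new_prefix=new_prefix)
    plan = {}
    for (net_name, vms), subnet in zip(networks.items(), subnets):
        hosts = iter(subnet.hosts())  # usable addresses, network/broadcast excluded
        plan[net_name] = (subnet, {vm: next(hosts) for vm in vms})
    return plan


# Hypothetical topology: three VMs, two L2Bridge networks, one VM on both.
topology = {"net1": ["node1", "node2"], "net2": ["node2", "node3"]}

if __name__ == "__main__":
    for net, (subnet, addrs) in plan_l2_addresses(topology).items():
        print(net, subnet)
        for vm, addr in addrs.items():
            print(f"  {vm}: {addr}/{subnet.prefixlen}")
```

A VM attached to both networks ends up with two addresses, one per interface, each in a different subnet.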

                                  Best,
                                  Komal
