Komal Thareja

Forum Replies Created

Viewing 15 posts - 76 through 90 (of 416 total)
  • in reply to: Hardware Steering – Connectx6 #7950
    Komal Thareja
    Participant

      Hi Tanay,

      I wanted to check if this issue is still unresolved. I haven’t had a chance to look into it yet, but I plan to review the documentation and experiment with a few approaches. I’ll share any updates or findings here after the holidays.

      Thanks,
      Komal

      in reply to: Unable to access VMs #7941
      Komal Thareja
      Participant

        Hi Rodrigo,

        It’s possible that your bastion keys have expired. Could you please check their expiration date on the Portal via Experiments -> Manage SSH Keys?

        Also, please try running the notebook jupyter-examples-rel1.7.1/configure_and_validate.ipynb.

        This notebook regenerates the bastion keys if they have expired. Please verify SSH access after that.

        Please let us know if you still see errors.

        Thanks,

        Komal

        in reply to: Setting up Kubernetes cluster on FABRIC #7928
        Komal Thareja
        Participant

          Thank you so much @Fraida! Could we ask you to consider uploading this to FABRIC Artifacts so that other FABRIC users can use it?

          Appreciate your help with this!

          Artifact Manager: https://artifacts.fabric-testbed.net/artifacts/

          in reply to: Insufficient resources error despite available resources #7924
          Komal Thareja
          Participant

            Hi Jestus,

            This error occurs when the host capable of provisioning the requested resource has run out of cores or RAM. The resource view provides cumulative information for the entire site, so checking availability at the host level gives a more precise picture. Host-level availability is shown on the portal in each site's resource view and can also be checked via the API, as shown by list_hosts in the example here.

            It’s possible that the combination of requested components (such as NICs or GPUs) maps to a host without sufficient cores or RAM, leading to the error you’ve encountered.

            We have an example notebook (Additional Options: Validate Slice) that lets you validate resource availability through the API before submitting a slice. Additionally, we’re working on changes to the allocation policy to distribute VMs more evenly across hosts. This will help ensure that CPUs, RAM and disk are not fully allocated on a single host that has SmartNICs and GPUs, minimizing such errors. These updates are planned for deployment in the January Release.
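To illustrate why a site can look free while a request still fails, here is a minimal sketch of a host-level check. It works on plain dictionaries rather than live testbed data. The keys `cores_available`, `ram_available` and `gpu_available`, the helper name `hosts_that_fit`, and the sample numbers are all assumptions made for illustration, so adapt them to whatever fields list_hosts returns in your environment.

```python
def hosts_that_fit(hosts, cores, ram, components=None):
    """Return names of hosts that can satisfy the whole request on their own.

    hosts: list of dicts, one per host (hypothetical keys, see the note above).
    components: optional dict mapping an availability key to a required count.
    """
    components = components or {}
    fitting = []
    for host in hosts:
        if host.get("cores_available", 0) < cores:
            continue
        if host.get("ram_available", 0) < ram:
            continue
        if any(host.get(key, 0) < count for key, count in components.items()):
            continue
        fitting.append(host["name"])
    return fitting


# Made-up numbers: the site has 40 free cores in total, but the only
# host with a free GPU has just 2 cores left.
sample_hosts = [
    {"name": "w1", "cores_available": 2, "ram_available": 8, "gpu_available": 1},
    {"name": "w2", "cores_available": 38, "ram_available": 256, "gpu_available": 0},
]

print(hosts_that_fit(sample_hosts, cores=8, ram=32))
print(hosts_that_fit(sample_hosts, cores=8, ram=32, components={"gpu_available": 1}))
```

In this made-up case the plain VM fits on `w2`, but adding a GPU to the same request leaves no host that fits, which is the situation described above.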

            Thanks,
            Komal

            Komal Thareja
            Participant

              Please refer to this example for removing interfaces from a network as well as a node.

              Thanks,

              Komal

               

               

              in reply to: Multiple GPUs on a node? #7906
              Komal Thareja
              Participant

                Possibly you changed the call from list_hosts to list_sites. Please see the snippet below.

                None of the hosts at FIU has more than 3 GPUs, and requesting 3 is itself subject to availability.

                The screenshot shows only the full capacity. You can also check this on the portal.

                FIU per host information can be seen here: https://portal.fabric-testbed.net/sites/FIU

                Thanks,

                Komal

                in reply to: Multiple GPUs on a node? #7902
                Komal Thareja
                Participant

                  Hi Abdulhadi,

                  The GPU count you are referring to represents the total number of GPUs available at a site.

                  No single host at a site has more than 3 GPUs. In fact, only a few hosts are equipped with 3 GPUs. To check the per-host resource details, you can use the notebook: jupyter-examples-main/fabric_examples/fablib_api/sites_and_resources/list_all_resources.ipynb.

                  For convenience, the following code snippet can also be used:


                  from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

                  # Create the manager and print the active configuration
                  fablib = fablib_manager()
                  fablib.show_config()

                  # List per-host GPU capacity rather than site-wide totals
                  fields = ['name', 'tesla_t4_capacity', 'rtx6000_capacity', 'a30_capacity', 'a40_capacity']
                  output_table = fablib.list_hosts(fields=fields)
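The difference between the site-wide GPU count and what one node can hold can be sketched as a sum versus a maximum over hosts. The example below uses plain dictionaries with the `rtx6000_capacity` field name from the snippet above. The host names and counts are made up for illustration, and `site_total_and_host_max` is a hypothetical helper, not part of fablib.

```python
def site_total_and_host_max(hosts, field):
    """Return (sum over all hosts, largest value on any single host)."""
    counts = [host.get(field, 0) for host in hosts]
    return sum(counts), max(counts, default=0)


# Made-up per-host rows; real values come from the host listing.
sample_hosts = [
    {"name": "host-a", "rtx6000_capacity": 3},
    {"name": "host-b", "rtx6000_capacity": 2},
    {"name": "host-c", "rtx6000_capacity": 0},
]

total, per_host_max = site_total_and_host_max(sample_hosts, "rtx6000_capacity")
print(f"site total: {total}, largest single host: {per_host_max}")
```

A node lives on one host, so the second number is the upper bound for a single node, not the first.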

                  Thanks,

                  Komal

                  in reply to: Slice stuck in ‘Configuring’ on extend #7900
                  Komal Thareja
                  Participant

                    Hi Ilya,

                    I looked into your slice and found that it was partially renewed, with the VM on STAR not renewing completely.

                    This appears to be a side effect of the Kafka maintenance we conducted yesterday, which impacted STAR. During this time, renewal messages were not processed because the Kafka consumer had stopped. I’ve resolved the issue, and future renewals should now work as expected.

                    Thank you for bringing this to our attention and helping us identify and fix the problem.

                    Best regards,
                    Komal

                    P.S: Another user also ran into this: https://learn.fabric-testbed.net/forums/topic/not-able-to-renew-the-slice/

                    in reply to: Not able to renew the slice #7899
                    Komal Thareja
                    Participant

                      Hi Sankalpa,

                      Both your slices were partially renewed. Each slice included a VM on STAR, where the renewal process was stuck.

                      We use a Kafka messaging bus, and there was a brief maintenance yesterday that impacted STAR. As a result, renewal messages were not processed because the Kafka consumer had stopped. I have resolved this issue, and all the slivers in your slices have been successfully renewed. Your slice is now in the StableOK state.

                      Thank you for reporting this and helping us identify and address the problem.

                      Best regards,
                      Komal

                      in reply to: Not able to renew the slice #7897
                      Komal Thareja
                      Participant

                        Please share the slice ID. It can be copied from the Portal as well as from JupyterHub (JH).

                        Portal -> Experiments -> My Slices -> Copy the Slice ID.

                        Also, how are you renewing the slices – Portal or JH?

                        Thanks,

                        Komal

                        in reply to: Not able to renew the slice #7895
                        Komal Thareja
                        Participant

                          Hi,

                          Could you please share your slice ID?

                          Thanks,

                          Komal

                          in reply to: Slice stuck in ‘Configuring’ on extend #7893
                          Komal Thareja
                          Participant

                            Hi Ilya,

                            Thank you for reporting this issue. It seems to be a bug, and I’m in the process of debugging it. In the meantime, I’ve closed your slice, so it should no longer show up as “Configuring.”

                            Best regards,
                            Komal

                            in reply to: Permission denied for in-slice port mirroring #7889
                            Komal Thareja
                            Participant

                              Hi Vaneshi,

                              The permission update will be rolled out with Release 1.8 in January.

                              Thanks,

                              Komal

                              in reply to: Cant Access ‘classifier’ node in Slice #7873
                              Komal Thareja
                              Participant

                                Hi Sourya,

                                It looks like the authorized_keys file is not correct. I am not even able to log in using the Nova SSH keys.

                                Could you please confirm whether you see a key that ends with Generated-by-Nova in /home/ubuntu/.ssh/authorized_keys?

                                Also, please share the output of the command ls -ltr /home/ubuntu/.ssh/
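If it is easier than reading the file by eye, the check for the Nova key can be sketched in a few lines of Python. This is only an illustration: `has_nova_key` is a hypothetical helper, the sample key material is a placeholder, and the check simply looks for a non-comment line ending in Generated-by-Nova.

```python
def has_nova_key(authorized_keys_text, marker="Generated-by-Nova"):
    """Return True if any key line in the text ends with the given marker."""
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith("#"):
            continue
        if line.endswith(marker):
            return True
    return False


# Placeholder content standing in for a real authorized_keys file
sample = (
    "ssh-rsa AAAAB3placeholder Generated-by-Nova\n"
    "ssh-ed25519 AAAAC3placeholder user@laptop\n"
)
print(has_nova_key(sample))
```

On the VM, the text would come from reading /home/ubuntu/.ssh/authorized_keys, for example with `pathlib.Path(...).read_text()`.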

                                Thanks,

                                Komal

                                in reply to: GPU + Connectx6 SmartNIC node #7866
                                Komal Thareja
                                Participant

                                  Correction: Both the CERN and CIEN racks have ConnectX-6 and GPU available on the same host. However, the CIEN rack is currently under maintenance as it is being transported back from SC.

                                  You can proceed with your experiment on the CERN rack, subject to its availability.

                                  Additionally, here’s a Fablib code snippet to help you check for specific resources on hosts:


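                                  from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager
                                  fablib = fablib_manager()
                                  fields = ['name', 'nic_connectx_6_capacity', 'nic_connectx_5_capacity', 'tesla_t4_capacity', 'rtx6000_capacity', 'a30_capacity', 'a40_capacity']
                                  output_table = fablib.list_hosts(fields=fields)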

                                  Thanks,
                                  Komal
