Komal Thareja

Forum Replies Created

Viewing 15 posts - 226 through 240 (of 455 total)
  • in reply to: TACC always failing with insufficient resources:Disk# #7058
    Komal Thareja
    Participant

      Hi Nishanth,

      The error Insufficient resources : ['disk'] means that there is not enough disk available on the host on which your VM is being placed. Looking at your slice, the following VM requesting a ConnectX-5 is being rejected because it maps to tacc-w4, and there is not enough disk available on tacc-w4 to accommodate your VM, hence the failure.


      Reservation ID: 478b2a91-5a02-4cf0-9bcd-de04c3b873ea Slice ID: 30f9fb42-37be-420f-899e-082a41bfb735
      Resource Type: VM Notices: Reservation 478b2a91-5a02-4cf0-9bcd-de04c3b873ea (Slice Traffic Listening Demo TACC(30f9fb42-37be-420f-899e-082a41bfb735) Graph Id:a58b7bc7-55d6-42e9-b457-5a8a32ebebc9 Owner:nshyamkumar@iit.edu) is in state (Closed,None_) (Last ticket update: Insufficient resources : ['disk'])
      Start: 2024-06-05 17:55:24 +0000 End: 2024-06-06 17:55:23 +0000 Requested End: 2024-06-06 17:55:23 +0000
      Units: 1 State: Closed Pending State: None_
      Predecessors
      Sliver: {'node_id': '9a579143-79b2-44fb-bacb-e6a5db4da3bf', 'capacities': '{ core: 2 , ram: 8 G, disk: 1 G}', 'capacity_hints': '{ instance_type: fabric.c2.m8.d10}', 'image_ref': 'default_ubuntu_20', 'image_type': 'qcow2', 'name': 'TACC_node4', 'reservation_info': '{"reservation_id": "478b2a91-5a02-4cf0-9bcd-de04c3b873ea", "reservation_state": "Closed"}', 'site': 'TACC', 'type': 'VM', 'user_data': '{"fablib_data": {"instantiated": "False", "run_update_commands": "False", "post_boot_commands": [], "post_update_commands": []}}'}
      Component: {'node_id': '670d117f-19ac-477b-bff7-36ac4e90107a', 'details': 'Mellanox ConnectX-5 Dual Port 10/25GbE', 'model': 'ConnectX-5', 'name': 'TACC_node4-pmnic_2', 'type': 'SmartNIC', 'user_data': '{}'}
      NS: {'node_id': 'adeede90-a808-45a6-8e1e-8c8de7a4ee6e', 'layer': 'L2', 'name': 'TACC_node4-TACC_node4-pmnic_2-l2ovs', 'site': 'TACC', 'type': 'OVS'}
      IFS: {'node_id': '2f9a52b4-3108-48f2-b0f9-e0ccd7716cdc', 'capacities': '{ bw: 25 Gbps, unit: 1 }', 'labels': '{ local_name: p1}', 'name': 'TACC_node4-pmnic_2-p1', 'type': 'DedicatedPort', 'user_data': '{"fablib_data": {"mode": "config"}}'}
      IFS: {'node_id': 'b6c42c3e-a570-4ed1-b633-607e90777f34', 'capacities': '{ bw: 25 Gbps, unit: 1 }', 'labels': '{ local_name: p2}', 'name': 'TACC_node4-pmnic_2-p2', 'type': 'DedicatedPort', 'user_data': '{"fablib_data": {"mode": "config"}}'}
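
      If helpful, here is a minimal fablib sketch for steering this VM to a different TACC worker instead of letting the scheduler pick one; the worker name tacc-w2.fabric-testbed.net is illustrative, so please substitute a host with free disk:

      from fabrictestbed_extensions.fablib.fablib import FablibManager

      fablib = FablibManager()
      slice = fablib.new_slice(name="Traffic Listening Demo TACC")
      # Pin the VM to a specific worker (hypothetical host name; pick one with free disk)
      node = slice.add_node(name="TACC_node4", site="TACC",
                            host="tacc-w2.fabric-testbed.net",
                            cores=2, ram=8, disk=10)
      node.add_component(model="NIC_ConnectX_5", name="pmnic_2")
      slice.submit()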

      Thanks,

      Komal

      in reply to: How to use long-lived tokens in experiments #7057
      Komal Thareja
      Participant

        Hi Nishanth,

        This issue has been fixed for a while now, but the fix is currently only available in the Beyond Bleeding Edge container.

        Could you please use that? It should be available on PyPI with the next release.
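
        In the meantime, a minimal sketch of pointing fablib at a long-lived token file from inside the container (the path is illustrative):

        import os
        from fabrictestbed_extensions.fablib.fablib import FablibManager

        # Tell fablib where the long-lived token lives (illustrative path)
        os.environ["FABRIC_TOKEN_LOCATION"] = "/home/fabric/work/tokens/long_lived_token.json"

        fablib = FablibManager()
        fablib.show_config()  # confirm the token location was picked up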

        Thanks,

        Komal

        in reply to: Do we have UEFI firmware boot mode option for nodes? #7039
        Komal Thareja
        Participant

          Hi Acheme,

          We investigated the possibility of enabling UEFI mode for users but encountered issues where GPUs do not function in that mode. Consequently, we have opted to maintain updated firmware to mitigate these errors for users. Could you please rerun your experiment and inform us if the error persists? I am available to collaborate with you on upgrading the firmware and addressing the issue.

          Thanks,

          Komal

          Komal Thareja
          Participant

            The updated network model has been deployed and the maintenance is complete.

            in reply to: How can we restore our files from deleted nodes #7023
            Komal Thareja
            Participant

              Hi Emmanuel,

              It is not possible to recover a deleted slice, so unfortunately we may not be able to recover your data. However, you should be able to request renewal of an expired project.

              Thanks,

              Komal

              Komal Thareja
              Participant

                To clarify, requesting two VMs is acceptable. However, requesting VMs with GPUs and SmartNICs in the mentioned slice is invalid because no single host has both a SmartNIC and a GPU available.
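
                For illustration, a minimal fablib sketch of a split that fits this constraint by giving each VM only one special component (node names are illustrative, and whether a host has a free dedicated CX-6 still has to be checked):

                from fabrictestbed_extensions.fablib.fablib import FablibManager

                fablib = FablibManager()
                slice = fablib.new_slice(name="utah-gpu-nic-split")

                # GPU-only VM: any host with a free RTX-6000 can satisfy this
                n1 = slice.add_node(name="n1", site="UTAH")
                n1.add_component(model="GPU_RTX6000", name="gpu1")

                # SmartNIC-only VM: scheduled independently of the GPU VM
                n2 = slice.add_node(name="n2", site="UTAH")
                n2.add_component(model="NIC_ConnectX_6", name="nic1")

                slice.submit()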

                Komal Thareja
                Participant

                  Hello Khawar,

                  Your slice is requesting 2 VMs in a configuration that cannot be supported: on UTAH we have two hosts, each with 3 GPUs, but neither of them has a dedicated CX-6, so the slice configuration as requested cannot be satisfied.

                  Also, I checked: all 6 RTX-6000 GPUs are currently in use. Please note that the resource usage displayed on the portal may be outdated by up to 30 minutes.

                  For reference, your slice requests:

                  • n1 – with an RTX-6000 GPU and a dedicated CX-6 NIC
                  • n2 – with two RTX-6000 GPUs

                  We have ongoing work to let users identify such invalid slice configurations via the fablib API; this should be available with the upcoming Release 1.7. We also plan to provide host-level resource usage details to users in 1.7, which may help here as well. Hope this helps!
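
                  Until then, a rough way to eyeball per-site availability with the current API (a sketch; the exact columns vary by fablib version):

                  from fabrictestbed_extensions.fablib.fablib import FablibManager

                  fablib = FablibManager()
                  # Prints a per-site resource table, including GPU and SmartNIC counts
                  fablib.list_sites()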

                  Thanks,

                  Komal

                  in reply to: Assigned addresses lost in reserved slices #7008
                  Komal Thareja
                  Participant

                    @Nirmala – Maintenance has been completed.

                    in reply to: Assigned addresses lost in reserved slices #7006
                    Komal Thareja
                    Participant

                      Hello Nirmala,

                      Apologies for the inconvenience. Maintenance is currently in progress, hence the error on the portal.

                      We will keep you posted as soon as the maintenance is complete.

                      Maintenance on the testbed – May 9 – 8am-12pm EST

                      Thanks,

                      Komal

                      in reply to: Assigned addresses lost in reserved slices #7001
                      Komal Thareja
                      Participant

                        Hi Nirmala,

                        Could you please share your Slice ID or, if possible, your notebook? I can help tailor it to handle this scenario.

                        Thanks,

                        Komal

                        in reply to: Assigned addresses lost in reserved slices #6990
                        Komal Thareja
                        Participant

                          Hello Nirmala,

                          Over the weekend, we encountered memory failures on the Wash workers, necessitating their reboot. Unfortunately, this led to the loss of IP addresses of your VMs. Rest assured, we are actively addressing the memory failure issue to prevent further worker reboots.

                          In the meantime, you can utilize the following block in a notebook to restore your IP configuration without having to delete your slice. We apologize for any inconvenience this may have caused.

                          
                          try:
                              slice = fablib.get_slice(name=slice_name)
                              for node in slice.get_nodes():
                                  print(f"{node}")
                                  # Re-apply the saved network configuration; this restores the lost IP addresses
                                  node.config()
                          except Exception as e:
                              print(f"Exception: {e}")
                          

                          Thank you for your understanding,

                          Komal

                          in reply to: login to server failure #6988
                          Komal Thareja
                          Participant

                            @Vaiden, @Nirmala,

                            The issue has been resolved. Jupyter Hub is accessible now. Please let us know if you still run into any issues.

                            Thanks,

                            Komal

                            in reply to: Outage at FABRIC Jupyter Hub #6987
                            Komal Thareja
                            Participant

                              This issue has been resolved and Jupyter Hub is accessible again.

                              Thanks,

                              Komal

                              in reply to: login to server failure #6985
                              Komal Thareja
                              Participant

                                Hi Nirmala,

                                Thank you for reporting this. It looks like our K8s cluster hosting the Jupyter Hub is down. We are working to resolve this and will keep you posted.

                                Thanks,

                                Komal

                                in reply to: How to reach Nginx being hosted via IPv4 #6974
                                Komal Thareja
                                Participant

                                  Hi Jacob,

                                  I used nslookup to determine the FQDN for your server and can confirm that I can ping your host, as shown below.
                                  SALT is an IPv6-only site. I will check whether the FABRIC NAT server config needs changes to enable IPv4 reachability, but reachability via the FQDN/hostname is working.


                                  root@TransferNode:~# nslookup 129.114.108.207
                                  207.108.114.129.in-addr.arpa name = chi-dyn-129-114-108-207.tacc.chameleoncloud.org.


                                  root@TransferNode:~#
                                  root@TransferNode:~# ping chi-dyn-129-114-108-207.tacc.chameleoncloud.org
                                  PING chi-dyn-129-114-108-207.tacc.chameleoncloud.org(chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf)) 56 data bytes
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=1 ttl=35 time=113 ms
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=2 ttl=35 time=113 ms
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=3 ttl=35 time=113 ms
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=4 ttl=35 time=113 ms
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=5 ttl=35 time=113 ms
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=6 ttl=35 time=113 ms
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=7 ttl=35 time=113 ms
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=8 ttl=35 time=113 ms
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=9 ttl=35 time=113 ms
                                  64 bytes from chi-dyn-129-114-108-207.tacc.chameleoncloud.org (2600:2701:5000:5001::8172:6ccf): icmp_seq=10 ttl=35 time=113 ms

                                  Thanks,
                                  Komal
