1. Slice submit via Jupyter gets stuck


  • #8195
    Justas Balcas
    Participant

      Hi,

      I submitted a slice via Jupyter Notebook and it got stuck; see the images below. The FABRIC portal shows StableOK, while the terminal remains pending. I pressed Ctrl+C about 20 minutes after the portal showed StableOK and took a screenshot of where it was stuck. Additionally, if I print the SSH commands after this, SSH always fails:

      fabric@winter:JustasB-FreeRTR-86%$ ssh -i /home/fabric/work/fabric_config/slice_key -F /home/fabric/work/fabric_config/ssh_config ubuntu@2001:400:a100:3090:f816:3eff:fee2:79ca
      Warning: Permanently added 'bastion.fabric-testbed.net' (ED25519) to the list of known hosts.
      jbalcas_0000188368@bastion.fabric-testbed.net: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
      kex_exchange_identification: Connection closed by remote host
      Connection closed by UNKNOWN port 65535

      Slice name: FRR-cern, Slice-ID: b81a4277-1239-4a23-9f29-e2d81a218361

      Could you check what is wrong with it?

       

      #8196
      Justas Balcas
      Participant

        Hello again, I just tried another slice on LOSA (not CERN), and it also got stuck in the Jupyter notebook at slice.submit():

        10710a66-159b-4a25-b2d0-00fb98de85ab l3_meas_net_LOSA LOSA network Ticketed

        There have been no updates for the last 30 minutes. The FABRIC portal shows StableOK, while Jupyter is still stuck.

        Slice name: FRR-losa, Slice ID: 0367f6f3-1331-49dc-9399-722616237a5b

        #8197
        Komal Thareja
        Participant

          Hi,

          Both of your slices are in the Stable state. This looks like a bug or a race condition in fablib that causes it to think the slice is still Configuring.

          I am trying to reproduce this on my end and will work on a fix. Apologies for the inconvenience!

          As a workaround, could you please run the following?


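          # Re-fetch the slice by name, then re-run the post-boot configuration step.
          slice = fablib.get_slice(slice_name)
          slice.post_boot_config()
          # The trailing semicolons suppress the echoed return value in Jupyter.
          slice.list_nodes();
          slice.list_interfaces();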


          Slice Name: FRR-losa Slice ID: 0367f6f3-1331-49dc-9399-722616237a5b Project ID: a57c7715-d871-4369-82e6-408c9a57a6e7 Project Name: UCSD-FABRIC test
          Graph ID: 071abcd4-f292-449d-a69a-da4768780546
          Slice owner: { name: orchestrator, guid: orchestrator-guid, oidc_sub_claim: 91f5ecc3-16ff-4f09-95ac-dfeee0c3b1e3, email: jbalcas@es.net}
          Slice state: StableOK
          Lease time: 2025-02-07 14:24:38+00:00

          Thanks,

          Komal

          #8198
          Komal Thareja
          Participant

            Also, could you please share which JH container you are using?

            Thanks,

            Komal

            #8200
            Justas Balcas
            Participant

              Hi, I tried your commands, and it gets stuck at slice.post_boot_config(). When I press Ctrl+C, it appears to be stuck here [1]. I also tried the SSH commands manually and still cannot connect (as reported previously). As for the JH container: edge for FRR-CERN and stable 1.8 for FRR-LOSA; both have the same issue.

               

              [1]

              File /opt/conda/lib/python3.11/site-packages/fabrictestbed_extensions/fablib/interface.py:1075, in Interface.get_ip_addr(self)
                 1073 if self.get_mac() is None:
                 1074     return None
              -> 1075 return self.get_ip_addr_ssh()
              
              File /opt/conda/lib/python3.11/site-packages/fabrictestbed_extensions/fablib/interface.py:885, in Interface.get_ip_addr_ssh(self, dev)
                  878 """
                  879 Gets the ip addr info for this interface.
                  880 
                  881 :return ip addr info
                  882 :rtype: str
                  883 """
                  884 try:
              --> 885     stdout, stderr = self.get_node().execute("ip -j addr list", quiet=True)
                  887     addrs = json.loads(stdout)
                  889     dev = self.get_device_name()
              
              File /opt/conda/lib/python3.11/site-packages/fabrictestbed_extensions/fablib/node.py:1658, in Node.execute(self, command, retry, retry_interval, username, private_key_file, private_key_passphrase, quiet, read_timeout, timeout, output_file)
                 1653         logging.debug(
                 1654             f"SSH execute fail. Slice: {self.get_slice().get_name()}, Node: {self.get_name()}, trying again"
                 1655         )
                 1656         logging.debug(e, exc_info=True)
              -> 1658     time.sleep(retry_interval)
                 1659     pass
                 1661 # Clean-up of open connections and files.
                 1662 finally:
              #8201
              Komal Thareja
              Participant

                Could you please try this with the Beyond Bleeding Edge container? I wasn't able to reproduce this issue there. I am trying it with the 1.8 Stable container now.

                Thanks,

                Komal

                #8203
                Justas Balcas
                Participant

                  Hi, so with Beyond Bleeding Edge, slice.submit() still gets stuck the same way: it shows state Configuring, and there are two networks in the Ticketed state, while the portal already shows StableOK. I stopped it in Jupyter and executed the commands you provided [1]. It gets stuck at post_boot_config(). I will keep it running (without cancelling it) and will let you know if it moves forward.

                  [1]

                  site = "CERN"
                  slice_name = f'FRR-{site.lower()}'
                  print(1)
                  slice = fablib.get_slice(slice_name)
                  print(2)
                  slice.list_nodes();
                  print(3)
                  slice.post_boot_config()
                  print(4)
                  slice.list_nodes();
                  print(5)
                  slice.list_interfaces();
                  print(6)

                  #8204
                  Komal Thareja
                  Participant

                    Thank you, Justas! I haven't been able to reproduce this, even on the JH Stable 1.8 container. Could you please share the /tmp/fablib/fablib.log file from your container?

                    Also, please share the slice ID of your new slice.

                    Thanks,

                    Komal

                    #8206
                    Justas Balcas
                    Participant

                      I put the log here: /home/fabric/work/JustasB-FreeRTR/fablib.log (if you cannot access it, let me know whether it is OK to upload it here and whether it contains any secret information).

                      Slice Name FRR-cern, sliceid: f01132de-75a7-4edd-bbb2-816ee76e824f

                      I see in the logs a lot of:

                      [16:15:52] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/slice.py:708} INFO – update : FRR-cern, count: 2
                      [16:15:52] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/slice.py:592} INFO – update_slice: FRR-cern, count: 23
                      [16:15:53] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/slice.py:634} INFO – update_topology: FRR-cern, count: 2
                      [16:15:56] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/slice.py:2013} INFO – post_boot_config: slice_name: FRR-cern, slice_id f01132de-75a7-4edd-bbb2-816ee76e824f
                      [16:15:56] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/slice.py:708} INFO – update : FRR-cern, count: 3
                      [16:15:56] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/slice.py:592} INFO – update_slice: FRR-cern, count: 24
                      [16:15:56] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/slice.py:634} INFO – update_topology: FRR-cern, count: 3
                      [16:15:57] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/node.py:1600} WARNING – Attempt 1 failed: Authentication failed.
                      [16:16:08] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/node.py:1600} WARNING – Attempt 2 failed: Authentication failed.
                      [16:16:18] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/node.py:1600} WARNING – Attempt 3 failed: Authentication failed.
                      [16:16:19] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/node.py:1600} WARNING – Attempt 1 failed: Authentication failed.
                      [16:16:30] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/node.py:1600} WARNING – Attempt 2 failed: Authentication failed.
                      [16:16:41] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/node.py:1600} WARNING – Attempt 3 failed: Authentication failed.
                      [16:16:41] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/node.py:1600} WARNING – Attempt 1 failed: Authentication failed.
                      [16:16:52] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/node.py:1600} WARNING – Attempt 2 failed: Authentication failed.
                      [16:17:03] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/node.py:1600} WARNING – Attempt 3 failed: Authentication failed.
                      [16:17:03] {/home/fabric/fabrictestbed-extensions/fabrictestbed_extensions/fablib/interface.py:914} WARNING – Authentication failed.

                      and the authentication failures repeat many times.
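A log like the excerpt above can be summarized instead of read line by line. The following is a minimal sketch, assuming the `[time] {path:line} LEVEL - message` layout seen in the excerpt. The function name `summarize_auth_failures` and the keys of the returned dictionary are illustrative and not part of fablib.

```python
import re
from collections import Counter

# One log line, as seen in the excerpt: [HH:MM:SS] {path:lineno} LEVEL - message
# The separator may be a hyphen or an en dash (U+2013), depending on how the log was copied.
LOG_LINE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+"
    r"\{(?P<path>[^}]+):(?P<lineno>\d+)\}\s+"
    r"(?P<level>[A-Z]+)\s+[-\u2013]\s+"
    r"(?P<message>.*)$"
)


def summarize_auth_failures(lines):
    """Count 'Authentication failed' entries and record where and when they occur."""
    by_source = Counter()
    times = []
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match is None or "Authentication failed" not in match["message"]:
            continue
        # Keep only the file name so the summary stays readable.
        file_name = match["path"].rsplit("/", 1)[-1]
        by_source[f"{file_name}:{match['lineno']}"] += 1
        times.append(match["time"])
    return {
        "total": len(times),
        "first": times[0] if times else None,
        "last": times[-1] if times else None,
        "by_source": dict(by_source),
    }
```

To use it on a real file, pass an open file object, for example `summarize_auth_failures(open("/tmp/fablib/fablib.log"))`. A large `total` spread over a long time window suggests a credential problem rather than a node that is still booting.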

                       

                      #8207
                      Komal Thareja
                      Participant

                        "Authentication failed" would explain the SSH errors you are observing. Could you please re-run this notebook: jupyter-examples-rel1.8.1/configure_and_validate/configure_and_validate.ipynb?
                        This should renew any expired keys. Please try your slice again after this.

                        Thanks,
                        Komal

                        #8210
                        Justas Balcas
                        Participant

                          So, interestingly, my bastion key had expired and been removed. I was still able to submit the slice, but authentication then failed in an endless loop during post_boot_config. It would be nice to report this error back to Jupyter. From now on, I will always validate my config before running my notebook.

                          I confirm that my new bastion key is now in use and has been verified. The new slice submission seems to have worked. Thank you for your help!
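The behavior requested above (stop retrying and report the error instead of looping) can be illustrated with a generic bounded-retry helper. This is only a sketch under stated assumptions and not fablib's implementation: `AuthenticationError`, `SliceAccessError`, and `run_with_bounded_retry` are hypothetical names introduced for illustration.

```python
import time


class AuthenticationError(Exception):
    """Hypothetical stand-in for an SSH authentication failure."""


class SliceAccessError(RuntimeError):
    """Hypothetical error reported to the notebook user when access fails."""


def run_with_bounded_retry(action, max_attempts=3, retry_interval=0.0,
                           fatal=(AuthenticationError,)):
    """Call action() up to max_attempts times and return its result.

    Errors listed in `fatal` are not retried, because waiting cannot fix
    bad or expired credentials. Other errors are retried, and a clear
    error is raised once the attempts are exhausted.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except fatal as err:
            # Fail fast with a message that points at the likely cause.
            raise SliceAccessError(
                f"Authentication failed on attempt {attempt}; "
                "check that your bastion and slice keys are valid."
            ) from err
        except Exception as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(retry_interval)
    raise SliceAccessError(
        f"Gave up after {max_attempts} attempts: {last_error}"
    ) from last_error
```

The design choice here is to separate errors that may resolve on their own (a node that is still booting) from errors that will not (an expired key), so that the second kind surfaces in the notebook immediately.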

                          #8211
                          Komal Thareja
                          Participant

                            Glad to hear that worked! We will work on addressing this and add support for interrupting the operation or returning a meaningful error in such cases.

                            Thanks,

                            Komal
