1. Unable to reach a node at GATECH

  • #6890
    Acheme Acheme
    Participant

      Hello,

       

      I have a node at GATECH on which execute() worked just this morning, but now I get the error below when I call execute():

      ---------------------------------------------------------------------------
      ChannelException                          Traceback (most recent call last)
      Cell In[22], line 2
            1 # Restart services for clean operations
      ----> 2 stdout, stderr = node1.execute("sudo pkill -9 python3")
            3 stdout, stderr = node2.execute("sudo systemctl restart nginx")
            4 stdout, stderr = node3.execute("sudo pkill -9 python3")
      
      File ~/anaconda3/lib/python3.11/site-packages/fabrictestbed_extensions/fablib/node.py:1564, in Node.execute(self, command, retry, retry_interval, username, private_key_file, private_key_passphrase, quiet, read_timeout, timeout, output_file)
         1559 logging.warning(
         1560     f"Exception in node.execute() (attempt #{attempt} of {retry}): {e}"
         1561 )
         1563 if attempt + 1 == retry:
      -> 1564     raise e
         1566 # Fail, try again
         1567 if self.get_fablib_manager().get_log_level() == logging.DEBUG:
      
      File ~/anaconda3/lib/python3.11/site-packages/fabrictestbed_extensions/fablib/node.py:1424, in Node.execute(self, command, retry, retry_interval, username, private_key_file, private_key_passphrase, quiet, read_timeout, timeout, output_file)
         1417 bastion.connect(
         1418     self.get_fablib_manager().get_bastion_host(),
         1419     username=bastion_username,
         1420     key_filename=bastion_key_file,
         1421 )
         1423 bastion_transport = bastion.get_transport()
      -> 1424 bastion_channel = bastion_transport.open_channel(
         1425     "direct-tcpip", dest_addr, src_addr
         1426 )
         1428 client = paramiko.SSHClient()
         1429 # client.load_system_host_keys()
         1430 # client.set_missing_host_key_policy(paramiko.MissingHostKeyPolicy())
      
      File ~/anaconda3/lib/python3.11/site-packages/paramiko/transport.py:1101, in Transport.open_channel(self, kind, dest_addr, src_addr, window_size, max_packet_size, timeout)
         1099 if e is None:
         1100     e = SSHException("Unable to open channel.")
      -> 1101 raise e
      
      ChannelException: ChannelException(2, 'Connect failed')

       

      When I try to ssh to the node I get the following error:

      Warning: Permanently added 'bastion.fabric-testbed.net' (ED25519) to the list of known hosts.
      channel 0: open failed: connect failed: No route to host
      stdio forwarding failed
      kex_exchange_identification: Connection closed by remote host
      Connection closed by UNKNOWN port 65535
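
Note on reading the two errors above: the `2` in `ChannelException(2, 'Connect failed')` is the SSH channel-open failure reason code from RFC 4254, and it describes the same condition as the `open failed: connect failed: No route to host` line from the ssh client. In both cases the bastion was reached, but the bastion could not connect onward to the VM. A minimal sketch of that mapping follows; the function name `explain_channel_failure` is hypothetical and is not part of paramiko or fablib.

```python
# Reason codes for SSH_MSG_CHANNEL_OPEN_FAILURE, as listed in RFC 4254, section 5.1.
CHANNEL_OPEN_FAILURES = {
    1: "administratively prohibited",
    2: "connect failed",
    3: "unknown channel type",
    4: "resource shortage",
}


def explain_channel_failure(code: int) -> str:
    """Return a short, human-readable hint for a channel-open failure code.

    Hypothetical helper for illustration only.
    """
    reason = CHANNEL_OPEN_FAILURES.get(code, "unknown reason code")
    if code == 2:
        # The SSH server (here, the bastion) accepted the session but could
        # not open a TCP connection to the requested destination (the VM).
        return f"{reason}: the jump host could not reach the destination"
    return reason


print(explain_channel_failure(2))
```

A code of 2 therefore points at the VM or its network rather than at the local SSH keys or the bastion login.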

       

      Regards,

      Acheme

      #6891
      Jingjing Fu
      Participant

        I recently ran into these errors and tried a few things; I hope these help.

        For the errors "Connect failed" or "Connection closed by UNKNOWN port 65535", I usually check whether my slice is in the StableOK state; otherwise I go to Jupyter -> File -> Hub Control Panel, then stop and start my server.

        If that does not work, try accessing the VM through the portal and running the command there instead of using node.execute().
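
When checking several nodes, it can help to probe each one without letting the first failure abort the whole notebook cell. The sketch below is only an illustration: `try_execute` is a hypothetical wrapper, and it relies solely on the `retry` and `retry_interval` parameters and the `(stdout, stderr)` return value of `Node.execute()` that are visible in the traceback in the first post.

```python
def try_execute(node, command, retry=1, retry_interval=10):
    """Run a command on a node and report failure instead of raising.

    Hypothetical helper. `node` is assumed to be a fablib Node (or anything
    with a compatible execute() method). Returns (ok, stdout, stderr).
    """
    try:
        stdout, stderr = node.execute(
            command, retry=retry, retry_interval=retry_interval
        )
        return True, stdout, stderr
    except Exception as exc:  # e.g. paramiko's ChannelException
        # Surface the exception type and message so unreachable nodes
        # can be told apart from nodes where the command itself failed.
        return False, "", f"{type(exc).__name__}: {exc}"
```

For example, looping over `node1`, `node2`, and `node3` with `try_execute(node, "hostname")` shows which of them is unreachable while the others keep working.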

         

        #6892
        Hussam Nasir
        Moderator

          Can you please provide some details about your slice and VM? What is the IP of the node you are trying to ssh into?

          #6893
          Acheme Acheme
          Participant

            Yes. Slice ID: 97c1aa87-f5e5-4ccb-9b63-a73c1be85825. Management IP of the VM at GATECH: 2610:148:1f00:9f01:f816:3eff:fe49:cd8d

            #6894
            Hussam Nasir
            Moderator

              Can you check now? The VM had crashed. We restored it, but it may be missing the IP configuration on its data plane NICs, which you may have to restore.
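
One possible way to restore a data plane address is to bring the interface up and re-add the address with the Linux `ip` tool. The sketch below only builds the command strings; the helper name `restore_commands`, the interface name `enp7s0`, and the address `192.0.2.10/24` are placeholders and are not taken from this slice.

```python
import ipaddress


def restore_commands(iface: str, cidr: str) -> list[str]:
    """Build the shell commands that re-apply one address to one interface.

    Hypothetical helper. Raises ValueError if `cidr` is not a valid
    IPv4/IPv6 address with a prefix length.
    """
    interface = ipaddress.ip_interface(cidr)  # validates the address
    return [
        f"sudo ip link set dev {iface} up",
        f"sudo ip addr add {interface.with_prefixlen} dev {iface}",
    ]


# Placeholder values; substitute the interface and address the node had before.
for cmd in restore_commands("enp7s0", "192.0.2.10/24"):
    print(cmd)
```

Each resulting string could then be passed to `node.execute()` once the node is reachable again, followed by a ping to a neighbouring node to confirm the data plane works.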

              #6895
              Acheme Acheme
              Participant

                I configured the IP address as it was before the crash, but I cannot ping the other nodes over the data plane. Is there something else I should do?

                #6896
                Hussam Nasir
                Moderator

                  The physical node has crashed again. I am working on restoring it now. Please restore your data plane IPs in about 2 hours, check again, and let me know.
