1. Fail to connect node due to “No route to host”

  • #1687
    Guanlong Wu
    Participant

      Hi,

      Just a few hours ago, my connection to the node was closed by the remote host, and I cannot ssh to the node anymore.

      (base) fabric@jupyter-gw6sh-40virginia-2eedu:~/work$ ssh -i /home/fabric/work/BGPserverkey -J gw6sh_0000005018@bastion-1.fabric-testbed.net rocky@2001:1948:417:7:f816:3eff:fe37:1eaf
      channel 0: open failed: connect failed: No route to host
      stdio forwarding failed
      kex_exchange_identification: Connection closed by remote host

      Also, when I try to use paramiko to execute through jupyterhub, it gives me:

      Exception: ChannelException(2, 'Connect failed')

      At first, I thought it might be something wrong with the server (considering I didn’t make a change to the settings), but the node status shows that it’s still active. Could you please help me with this? Thanks so much!!

       

      Best,

      Guanlong

      #1688
      Paul Ruth
      Keymaster

        I think your slice and nodes have expired and no longer exist. By default a slice will expire after 24 hours.

        If you have a fablib application that queried for your slice before it expired you might see a resulting status from the old query. Try getting the slice again and I suspect you will see that it no longer exists.
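A minimal sketch of the check suggested here: compare the slice's lease end with the current time instead of trusting a status that may come from an old query. The helper name `lease_expired` and the timestamp format are assumptions for illustration. In a notebook, the lease-end string would come from a freshly fetched slice (for example after calling `fablib.get_slice(name=slice_name)` again), and the format should be checked against what your fablib version actually returns.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumed timestamp format; verify it against your fablib version.
LEASE_FORMAT = "%Y-%m-%d %H:%M:%S"


def lease_expired(lease_end: str, now: Optional[datetime] = None) -> bool:
    """Return True if the lease end (assumed to be UTC) is at or before `now`."""
    end = datetime.strptime(lease_end, LEASE_FORMAT).replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return end <= now


# Illustrative dates only, not taken from the thread.
reference = datetime(2022, 3, 2, 12, 0, 0, tzinfo=timezone.utc)
print(lease_expired("2022-03-01 12:00:00", now=reference))  # True
print(lease_expired("2022-03-03 12:00:00", now=reference))  # False
```

If this reports an expired lease while a cached slice object still shows an active state, the cached object is probably stale.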

        Paul

        #1690
        Guanlong Wu
        Participant

          Thank you so much Paul!!

          Actually, I renewed my slice at the very beginning, but the lease end didn’t change. I guess it’s just a display problem, because when I tried to renew with an earlier lease end, it gave me “HTTP response body: Attempted new term end time is shorter than current slice end time” (please refer to the attached).

          Also, the slice still exists now, whether I check FabricPortal/Experiments/Myslices or call the get_slice function in a fablib application. The slice state is still “StableOK” and the lease end is still the old one (please refer to the attached).

          Sorry for the late reply due to different timezones 🙂

           

          Thanks,

          Guanlong

          #1694
          Paul Ruth
          Keymaster

            We checked and your VM is there, but we can’t ping it or ssh to it. It seems like something is misconfigured in the VM, maybe an IP or a route.

            Were you ever able to ssh to the VM? Does your experiment involve re-configuring IPs or routes? I’m wondering whether a change your experiment made to the VM misconfigured an IP or route and locked you out.

            Paul

            #1695
            Guanlong Wu
            Participant

              Yes the ssh worked perfectly before, and the experiments are just some local simulation/data-analysis work which definitely has nothing to do with IP or route.

              I remember I was running a Python-based data analysis experiment at that time (one I have successfully run on the VM before). After a while it reported that the connection was closed by the remote host, and I could no longer access the VM.

              Thanks,

              Guanlong

              #1713
              Paul Ruth
              Keymaster

                I’m not really sure what happened here but it seems like something inside the VM changed. I recommend recreating the slice.

                Let us know if you see issues like this again. It would be helpful if you could create a Jupyter notebook that recreates the error so we can investigate further.

                #1722
                Guanlong Wu
                Participant

                  Got it. Thanks so much!

                  Best,

                  Guanlong

                  #1730
                  Guanlong Wu
                  Participant

                    Hi Paul,

                    Sorry to bother you again, but I have now seen the same issue several times. I tried different sites (including Utah and TACC), and each time the connection closes on its own after a certain period of time.

                    I feel like this is a renewal issue, but I renew the slice exactly as shown in the Jupyter examples. The status of the slice/node is also still active. Could you please help me with it? Thanks!

                    Renew the slice:

                    import datetime

                    # Set the slice end date to now plus 2 days
                    end_date = (datetime.datetime.now() + datetime.timedelta(days=2)).strftime("%Y-%m-%d %H:%M:%S")

                    try:
                        slice = fablib.get_slice(name=slice_name)
                        slice.renew(end_date)
                    except Exception as e:
                        print(f"Exception: {e}")
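One possible guard against the “Attempted new term end time is shorter than current slice end time” error quoted earlier in the thread is to compute the requested end date in UTC and confirm that it actually extends the lease before calling `slice.renew(...)`. This is only a sketch: the helper name `next_lease_end` is hypothetical, and treating lease times as UTC strings in the format below is an assumption, not something confirmed by the fablib documentation.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S"  # same format as the renewal snippet above


def next_lease_end(current_end: str, days: int, now: datetime) -> str:
    """Return an end date `days` after `now`, or raise if it would not extend the lease.

    `now` must be timezone-aware; `current_end` is assumed to be in UTC.
    """
    current = datetime.strptime(current_end, FMT).replace(tzinfo=timezone.utc)
    requested = now.astimezone(timezone.utc) + timedelta(days=days)
    if requested <= current:
        requested_text = requested.strftime(FMT)
        raise ValueError(
            f"requested end {requested_text} does not extend current end {current_end}"
        )
    return requested.strftime(FMT)
```

The returned string could then be passed to `slice.renew(...)`, and reading the lease end back afterwards would show whether the renewal really took effect.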
                    #1732
                    Paul Ruth
                    Keymaster

                      There are some issues related to renewing a slice that will be fixed with an update that is coming in the next few weeks. I don’t know if this is part of the fix, but it might be. It might be best not to rely on renewal until then.

                      In general, it is best to automate the deployment of any experiment using Jupyter notebooks or other scripts. This way you can shut down your slice when you are not actively working on it and easily restart it the next day when you start working again. This is especially important if you are using scarce resources that other experimenters might want to use. It also helps you navigate any scheduled (or unscheduled) downtime the testbed might experience, and it helps you publish repeatable experiments that others can easily re-run on their own.

                      Long-lived slices that require renewals should be reserved for times when you are running an experiment that needs to actively run and collect data for longer periods of time. If you are just trying to set up an experiment, you will be more successful if you incorporate automation into your workflow from the beginning. In my experience with testbeds, users who do not automate the deployment of their experiment have a much harder time getting things to work and, in the end, don’t really know what is in the base environment they are using for their experiment.
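The automation described above could start from a small "get or create" helper, sketched here. The method names (`get_slice`, `new_slice`, `add_node`, `submit`) are assumed to match the fablib Jupyter examples; the exact signatures, the default node name, and the behaviour noted in the comments are assumptions to verify against your installed version. The function name `ensure_slice` is hypothetical.

```python
def ensure_slice(fablib, slice_name, site, node_name="node1"):
    """Return the existing slice named `slice_name`, or create and submit a new one.

    `fablib` is expected to be a FablibManager-like object. It is passed in
    as an argument so the function can also be exercised with a stand-in.
    """
    try:
        # Assumed behaviour: get_slice raises if no slice has this name.
        return fablib.get_slice(name=slice_name)
    except Exception:
        new_slice = fablib.new_slice(name=slice_name)
        new_slice.add_node(name=node_name, site=site)
        new_slice.submit()  # assumed to wait until the slice is provisioned
        return new_slice
```

In a notebook this might be called as `ensure_slice(fablib, slice_name, site="TACC")` at the top of each session, followed by whatever setup commands the experiment needs, so that a lost or expired VM costs a re-run rather than a manual rebuild.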

                      #1749
                      Guanlong Wu
                      Participant

                        Thank you for being such a great help! It’s just that my experiment involves a huge dataset, and I have to re-download it every time I recreate the slice. Anyway, I think automation could be a great help, and I’ll incorporate it into my workflow right away.

                        Best,

                        Guanlong

                        #1750
                        Paul Ruth
                        Keymaster

                          Yes, the bigger and more complicated the slice, the more automation will make it easier to be resilient to any problems external to, or within, your experiment.

                          Longer term, FABRIC will have persistent storage for larger data sets. This is not available yet, but watch out for it. This capability should make it easier to stage large data sets without relying on persistent VMs. Generally, relying on persistent VMs for storing and serving large data sets can be unreliable.

                          Maybe I can help design the best way to store and serve your data. I have a few questions:

                          • How much data do you need to store?
                          • Where is the data currently stored (i.e. where do you need to copy it from?)
                          • How does your application consume/process the data? Does the app need all the data or just a subset? Does the app need the data locally or can it be served remotely?

                          I might have a few more questions but this will get us started.

                          thanks,

                          Paul

                           

                           
