FABRIC General Questions and Discussion › Fail to connect node due to “No route to host”
This topic has 10 replies, 2 voices, and was last updated 2 years, 7 months ago by Paul Ruth.
April 24, 2022 at 9:01 am #1687
Hi,
Just a few hours ago, I found that my connection to the node was closed by the remote host, and I can not ssh to the node anymore.
(base) fabric@jupyter-gw6sh-40virginia-2eedu:~/work$ ssh -i /home/fabric/work/BGPserverkey -J gw6sh_0000005018@bastion-1.fabric-testbed.net rocky@2001:1948:417:7:f816:3eff:fe37:1eaf
channel 0: open failed: connect failed: No route to host
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
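For reference, the jump-host setup in that command can also be captured in `~/.ssh/config`. This is only a sketch built from the values in the command above; the host aliases (`fabric-bastion`, `fabric-node`) are made up, and it assumes the same key is accepted for both hops, as the single `-i` option implies:

```
Host fabric-bastion
    HostName bastion-1.fabric-testbed.net
    User gw6sh_0000005018
    IdentityFile /home/fabric/work/BGPserverkey

Host fabric-node
    HostName 2001:1948:417:7:f816:3eff:fe37:1eaf
    User rocky
    IdentityFile /home/fabric/work/BGPserverkey
    ProxyJump fabric-bastion
```

With that in place, `ssh fabric-node` reproduces the same connection attempt, which makes it easier to isolate whether the failure is at the bastion hop or the node hop.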
Also, when I try to use paramiko to execute through jupyterhub, it gives me:
Exception: ChannelException(2, 'Connect failed')
At first, I thought it might be something wrong with the server (considering I didn’t make a change to the settings), but the node status shows that it’s still active. Could you please help me with this? Thanks so much!!
Best,
Guanlong
April 24, 2022 at 10:14 am #1688
I think your slice and nodes have expired and no longer exist. By default a slice will expire after 24 hours.
If your fablib application queried the slice before it expired, you might be seeing stale status from that old query. Try getting the slice again and I suspect you will see that it no longer exists.
Paul
April 24, 2022 at 9:21 pm #1690
Thank you so much Paul!!
Actually I renewed my slice at the very beginning, but the lease end didn’t change. I guess it’s just a display problem, because when I tried to renew with a shorter lease end it gave me “HTTP response body: Attempted new term end time is shorter than current slice end time” (please refer to the attached).
Also, the slice still exists, whether viewed from the FABRIC Portal (Experiments › My Slices) or via the get_slice function in a fablib application. The slice state is still “StableOK” and the lease end is still the old one (please refer to the attached).
Sorry for the late reply due to different timezones 🙂
Thanks,
Guanlong
April 25, 2022 at 9:24 am #1694
We checked and your VM is there, but we can’t ping it or ssh to it. It seems like something is misconfigured in the VM, maybe a misconfigured IP or route.
Were you ever able to ssh to the VM? Does your experiment involve reconfiguring IPs or routes? I’m wondering if changes your experiment made to the VM misconfigured an IP or route and locked you out.
Paul
April 25, 2022 at 10:07 am #1695
Yes, ssh worked perfectly before, and the experiments are just some local simulation/data-analysis work, which definitely has nothing to do with IPs or routes.
I remember I was running a Python-based data analysis experiment at the time (which I had successfully run before on the VM), and after a while it said the connection was closed by the remote host, and I could not access the VM anymore.
Thanks,
Guanlong
April 26, 2022 at 9:53 am #1713
I’m not really sure what happened here but it seems like something inside the VM changed. I recommend recreating the slice.
Let us know if you see issues like this again. It would be helpful if you could create a Jupyter notebook that recreates the error so we can investigate further.
April 26, 2022 at 9:33 pm #1722
Got it. Thanks so much!
Best,
Guanlong
May 3, 2022 at 6:42 am #1730
Hi Paul,
Sorry to bother you again, but I have actually seen the same issue several times. I tried different sites (including Utah and TACC), and the connection automatically closes after a certain period of time.
I feel like this is a renewal issue, but I renew the slice exactly as in the Jupyter examples. Also, the status of the slice/node is still active. Could you please help me with it? Thanks!
Renew the slice:
import datetime

# Set lease end to now plus 2 days
end_date = (datetime.datetime.now() + datetime.timedelta(days=2)).strftime("%Y-%m-%d %H:%M:%S")

try:
    slice = fablib.get_slice(name=slice_name)
    slice.renew(end_date)
except Exception as e:
    print(f"Exception: {e}")
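One thing worth checking in a snippet like the one above: if the testbed interprets the lease-end timestamp as UTC (this is an assumption, not confirmed in this thread), then computing it from local time can make the requested extension shorter than intended. A sketch of the same timestamp computation pinned to UTC, which the hypothetical renew call would then receive unchanged:

```python
from datetime import datetime, timedelta, timezone

# Compute a lease end 2 days from now in UTC, using the same
# string format as the renew() example above.
end_date = (datetime.now(timezone.utc) + timedelta(days=2)).strftime("%Y-%m-%d %H:%M:%S")
print(end_date)
```

If the backend expects UTC, this avoids an off-by-timezone lease end; if it expects local time, the original snippet is already correct.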
May 3, 2022 at 8:56 am #1732
There are some issues related to renewing a slice that will be fixed by an update coming in the next few weeks. I don’t know if this is part of the fix, but it might be. It might be best not to rely on renewal until then.
In general, it is best to automate the deployment of any experiment using Jupyter notebooks or other scripts. This way you can shut down your slice when you are not actively working on it and easily restart it the next day when you start working again. This is especially important if you are using scarce resources that other experimenters might want to use. It also helps you navigate any scheduled (or unscheduled) downtime the testbed might experience. In addition, it helps you publish repeatable experiments that can easily be shared with others, who can re-run your experiment on their own.
Long-lived slices that require renewals should be reserved for experiments that need to actively run and collect data for longer periods of time. If you are just trying to set up an experiment, you will be more successful if you incorporate automation into your workflow from the beginning. In my experience with testbeds, users who do not automate the deployment of their experiment have a much harder time getting things to work and, in the end, don’t really know what is in the base environment they are using for their experiment.
May 4, 2022 at 6:35 am #1749
Thank you for being such a great help! It’s just that my experiment involves a huge dataset and I have to re-download it every time I recreate the slice. Anyway, I think automation could be a great help and I’ll incorporate it into my workflow in no time.
Best,
Guanlong
May 4, 2022 at 9:26 am #1750
Yes, the bigger and more complicated the slice, the more automation will help you stay resilient to any problems external to, or within, your experiment.
Longer-term, FABRIC will have persistent storage for larger data sets. This is not available yet, but watch for it. This capability should make it easier to stage large data sets without relying on persistent VMs. Generally, relying on persistent VMs to store and serve large data sets can be unreliable.
Maybe I can help design the best way to store and serve your data. I have a few questions:
- How much data do you need to store?
- Where is the data currently stored (i.e. where do you need to copy it from?)
- How does your application consume/process the data? Does the app need all the data or just a subset? Does the app need the data locally or can it be served remotely?
I might have a few more questions but this will get us started.
thanks,
Paul