FabLib API calls take a long time to complete

    #3088
    Xusheng Ai
    Participant

      Hello,

      This afternoon, when I ran the slice.delete() command, it took much longer than it used to. Also, when I tried to submit a slice request, it took more than 20 minutes. I noticed that there are some changes happening to FABRIC. I was wondering if that is the case and if there is a way to reduce the impact.
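
      For reference, a rough sketch of capping how long submit() waits, so a slow request gives up instead of blocking the notebook; the slice and node names here are placeholders, and wait_timeout is a submit() option in recent FabLib releases, so check the version installed in your environment:

          import time
          from fabrictestbed_extensions.fablib.fablib import FablibManager

          fablib = FablibManager()

          # Placeholder slice: one node at a random site, purely for timing.
          slice = fablib.new_slice(name="timing-test")
          slice.add_node(name="node1", site=fablib.get_random_site())

          start = time.time()
          # Wait at most 30 minutes for the slice to become active, then give up.
          slice.submit(wait=True, wait_timeout=1800, progress=True)
          print(f"submit took {(time.time() - start) / 60:.1f} minutes")

          start = time.time()
          slice.delete()
          print(f"delete took {(time.time() - start) / 60:.1f} minutes")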

      Thanks,
      Best Regards,
      Xusheng

      #3089
      Manas Das
      Participant

        Hello, it may be due to the system changes described in this announcement:

        FABRIC Production Infrastructure Instability this weekend 09/16-09/18

        #3093
        Brandon Rice
        Participant

          @Xusheng Manas is likely correct.

          The FABRIC development team ran a load test this morning to test scaling of the framework (i.e., having hundreds of people all try to submit slices at the same time). This broke or slowed some things throughout the day. Hopefully, by end of day Sunday, FABRIC will be improved with the changes mentioned in the post and back to being stable.

          #3094
          Xusheng Ai
          Participant

            Thanks for the information!

            #4949
            Amy Babay
            Participant

              Hi,

              I’ve just started using the JupyterHub, and it seems that API calls are taking longer to complete than I would expect. For example, a single “fablib.get_random_site()” call takes 2-3 minutes to finish and “fablib.list_sites()” takes much longer. Is this normal? Or is there something I can do to make this better? Thanks!
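
              For what it's worth, a minimal timing sketch for exactly these two calls (assuming a working fabric_rc/token setup in the JupyterHub environment) makes it easy to report concrete numbers:

                  import time
                  from fabrictestbed_extensions.fablib.fablib import FablibManager

                  fablib = FablibManager()

                  # Time the two calls from the post; the first call also pays any
                  # one-time cost of querying the testbed for available resources.
                  for name, call in [("get_random_site", fablib.get_random_site),
                                     ("list_sites", fablib.list_sites)]:
                      start = time.time()
                      call()
                      print(f"{name} took {time.time() - start:.1f} s")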

              #4951
              Ilya Baldin
              Participant

                These calls can sometimes take a longish time (the results are cached, so the first caller sees a delay, but subsequent callers do not for a while). However, for the past couple of days we have been seeing connectivity issues between our JupyterHub, hosted in Google, and the rest of the testbed, manifesting as various connection retries, which can also cause additional delays. We are investigating the reasons for it. The issue appears to be specific to the JupyterHub environment.

                #4952
                Amy Babay
                Participant

                  Ok, thanks! The connectivity/retry issue may be what I’m hitting, since I also was getting a few timeout errors.

                  #4972
                  James McCauley
                  Participant

                    In my experience, the FABRIC API has never been *fast*, but it’s definitely been notably slower recently in JupyterHub.

                    I, too, am getting retries doing various operations. Specifically, this seems to be caused by timeouts to both cm.fabric-testbed.net and orchestrator.fabric-testbed.net.

                    This has resulted in a notebook that reliably completed in maybe 17 minutes or so becoming one that unreliably completes in upwards of 30.

                    Manually making connections to orchestrator.fabric-testbed.net from the JupyterHub terminal confirms that sometimes these connections just hang (it’s hard to say exactly what’s going on since I can’t run tcpdump in this environment, but it’s certainly the case that the TCP connection isn’t getting established).

                    Manually making connections to orchestrator.fabric-testbed.net from elsewhere seems to reliably work just fine.
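
                    For anyone who wants to repeat this kind of check, here is a rough sketch; the host name is taken from above, but TCP port 443 is an assumption on my part:

                        import socket
                        import time

                        HOST, PORT, TIMEOUT = "orchestrator.fabric-testbed.net", 443, 5

                        # Try a handful of plain TCP connections and report which ones
                        # complete quickly and which fail or hang until the timeout.
                        for attempt in range(10):
                            start = time.time()
                            try:
                                with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
                                    print(f"attempt {attempt}: connected in {time.time() - start:.2f} s")
                            except OSError as exc:
                                print(f"attempt {attempt}: failed after {time.time() - start:.2f} s ({exc})")
                            time.sleep(1)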

                    tracerouting to orchestrator.fabric-testbed.net from the JupyterHub terminal gets paths like this *most* of the time:

                     4  ws-gw-to-hntvl-gw.ncren.net (128.109.9.22)  32.705 ms  32.858 ms  32.481 ms  32.101 ms  33.195 ms
                     5  renci-to-ws-gw.ncren.net (128.109.70.174)  35.452 ms  35.676 ms  35.631 ms  35.553 ms  35.630 ms
                     6  152.54.15.60 (152.54.15.60)  35.950 ms !X  35.669 ms !X  35.478 ms !X  35.570 ms !X  36.120 ms !X
                    

                    (Hop 6 is orchestrator.fabric-testbed.net.)

                    However, there are sometimes timeouts starting at renci-to-ws-gw.ncren.net. Indeed, that hop usually either shows up fine or not at all (all timeouts). 152.54.15.60/orchestrator shows occasional timeouts, which seem to be correlated with the timeouts seen at renci-to-ws-gw.ncren.net.

                    When I ran the tests from elsewhere, the path didn’t go through either of these ncren routers. I didn’t see any unusual timeouts via traceroute or ping.

                    I don’t know what diagnosis y’all have done so far, but could this be as simple as packet loss between those two ncren routers? I can’t ping from JupyterHub and am sort of shooting in the dark, but I estimate there might be something like 3% or 4% loss there.

                    #4973
                    James McCauley
                    Participant

                      .. and *now* I see the recent announcement that you’d tracked it down to a network problem. 🙂

                      #4974
                      Ilya Baldin
                      Participant

                        Thank you for your analysis. We are about where you are: there is either route flapping or, perhaps, some kind of non-trivial packet loss specific to the path from JH to RENCI (we have been testing with plain curl to various hosts at RENCI and the results are the same). We have notified the MCNC NOC as well as UNC ITS and are waiting to see what they say.

                        The problem does not appear to manifest itself from the worker nodes hosting JH, only from the Docker containers inside, so we suspect there may be a middle-box somewhere that is dropping some connections from JH-originating IPs because they have a high rate of transactions to our infrastructure compared to the background of regular IPs. But this is just a theory. We will continue our investigation, and we apologize for the inconvenience.

                        #4987
                        Ilya Baldin
                        Participant

                          Just updating here for completeness:

                          Reachability issues between JH and FABRIC infrastructure

                          #4990
                          James McCauley
                          Participant

                            To me, it looks to be back to its old self! Thank you!
