Forum Replies Created
To me, it looks to be back to its old self! Thank you!
… and *now* I see the recent announcement that you’d tracked it down to a network problem. 🙂
In my experience, the FABRIC API has never been *fast*, but it’s definitely been notably slower recently in JupyterHub.
I, too, am getting retries doing various operations. Specifically, this seems to be caused by timeouts to both cm.fabric-testbed.net and orchestrator.fabric-testbed.net.
This has turned a notebook that reliably completed in roughly 17 minutes into one that unreliably completes in upwards of 30 minutes.
Manually making connections to orchestrator.fabric-testbed.net from the JupyterHub terminal confirms that sometimes these connections just hang (it’s hard to say exactly what’s going on since I can’t run tcpdump in this environment, but it’s certainly the case that the TCP connection isn’t getting established).
Manually making connections to orchestrator.fabric-testbed.net from elsewhere seems to reliably work just fine.
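For reference, the kind of manual check I mean is just a bare TCP connect with a timeout; the snippet below is a sketch rather than the exact commands I ran, and it assumes the API is served over HTTPS on port 443:

# Hypothetical probe, not the exact check I ran: attempt a TCP connection to the
# orchestrator (assuming HTTPS on port 443) with a timeout, so a hung connection
# shows up as a timeout instead of a silent stall.
import socket

HOST = "orchestrator.fabric-testbed.net"
PORT = 443          # assumption: the API endpoint is HTTPS
TIMEOUT_S = 10      # generous; successful connects return in well under a second

try:
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT_S):
        print("TCP connection established")
except socket.timeout:
    print("timed out -- TCP connection never established")
except OSError as exc:
    print(f"connection failed: {exc}")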
Tracerouting to orchestrator.fabric-testbed.net from the JupyterHub terminal gets paths like this *most* of the time:
 4  ws-gw-to-hntvl-gw.ncren.net (128.109.9.22)  32.705 ms  32.858 ms  32.481 ms  32.101 ms  33.195 ms
 5  renci-to-ws-gw.ncren.net (128.109.70.174)  35.452 ms  35.676 ms  35.631 ms  35.553 ms  35.630 ms
 6  152.54.15.60 (152.54.15.60)  35.950 ms !X  35.669 ms !X  35.478 ms !X  35.570 ms !X  36.120 ms !X
(Hop 6 is orchestrator.fabric-testbed.net.)
However, there are sometimes timeouts starting at renci-to-ws-gw.ncren.net. Indeed, that hop usually shows up either fine or not at all (every probe timing out). The orchestrator itself (152.54.15.60) shows occasional timeouts, which seem to be correlated with the timeouts at renci-to-ws-gw.ncren.net.
When I ran the tests from elsewhere, the path didn’t go through either of these ncren routers. I didn’t see any unusual timeouts via traceroute or ping.
I don’t know what diagnosis y’all have done so far, but could this be as simple as packet loss between those two ncren routers? I can’t ping from JupyterHub and am sort of shooting in the dark, but I estimate there might be something like 3% or 4% loss there.
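To put a rough number on that: the estimate comes from repeating the same connect attempt many times and counting how often it hangs. Something like the sketch below is what I have in mind; the attempt count and timeout are arbitrary choices, and this measures connect failures rather than true per-packet loss (SYN retransmits hide some of it), so treat the percentage as a rough indicator only:

# Rough failure-rate estimate: repeat TCP connects and count how many hang.
# ATTEMPTS and TIMEOUT_S are arbitrary; this is a sketch, not a proper loss test.
import socket

HOST = "orchestrator.fabric-testbed.net"
PORT = 443
ATTEMPTS = 100
TIMEOUT_S = 5

failures = 0
for _ in range(ATTEMPTS):
    try:
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT_S):
            pass
    except OSError:  # socket.timeout is an OSError subclass, so hangs land here too
        failures += 1

print(f"{failures}/{ATTEMPTS} attempts failed ({100.0 * failures / ATTEMPTS:.1f}%)")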
Yup, it’s working again. Thanks!
Just ran again on the Fall 2023 container (now showing FIM 1.5.4), and it appears to be back to working as expected. Thanks!