FabLib API calls take a long time to complete

    #3088
    Xusheng Ai
    Participant

      Hello,

      This afternoon, when I ran the slice.delete() command, it took much longer than it used to. Also, when I tried to submit a slice request, it took more than 20 minutes. I noticed that there are some changes happening to FABRIC. I was wondering if that is the case and if there is a way to reduce the impact.
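
      For reference, a rough sketch of capping how long submit() waits, so a slow request gives up instead of blocking the notebook; the slice and node names here are placeholders, and wait_timeout is a submit() option in recent FabLib releases, so check the version installed in your environment:

          import time
          from fabrictestbed_extensions.fablib.fablib import FablibManager

          fablib = FablibManager()

          # Placeholder slice: one node at a random site, purely for timing.
          slice = fablib.new_slice(name="timing-test")
          slice.add_node(name="node1", site=fablib.get_random_site())

          start = time.time()
          # Wait at most 30 minutes for the slice to become active, then give up.
          slice.submit(wait=True, wait_timeout=1800, progress=True)
          print(f"submit took {(time.time() - start) / 60:.1f} minutes")

          start = time.time()
          slice.delete()
          print(f"delete took {(time.time() - start) / 60:.1f} minutes")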

      Thanks,
      Best Regards,
      Xusheng

      #3089
      Manas Das
      Participant

        Hello, it may be due to the system changes described in this announcement:

        FABRIC Production Infrastructure Instability this weekend 09/16-09/18

        #3093
        Brandon Rice
        Participant

          @Xusheng Manas is likely correct.

          The FABRIC development team ran a load test this morning to test scaling of the framework (i.e., having hundreds of people all try to submit slices at the same time). This broke or slowed some things throughout the day. Hopefully, by end of day Sunday, FABRIC will be improved with the changes mentioned in the post and back to being stable.

          #3094
          Xusheng Ai
          Participant

            Thanks for the information!

            #4949
            Amy Babay
            Participant

              Hi,

              I’ve just started using the JupyterHub, and it seems that API calls are taking longer to complete than I would expect. For example, a single “fablib.get_random_site()” call takes 2-3 minutes to finish and “fablib.list_sites()” takes much longer. Is this normal? Or is there something I can do to make this better? Thanks!
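
              For what it's worth, a minimal timing sketch for exactly these two calls (assuming a working fabric_rc/token setup in the JupyterHub environment) makes it easy to report concrete numbers:

                  import time
                  from fabrictestbed_extensions.fablib.fablib import FablibManager

                  fablib = FablibManager()

                  # Time the two calls from the post; the first call also pays any
                  # one-time cost of querying the testbed for available resources.
                  for name, call in [("get_random_site", fablib.get_random_site),
                                     ("list_sites", fablib.list_sites)]:
                      start = time.time()
                      call()
                      print(f"{name} took {time.time() - start:.1f} s")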

              #4951
              Ilya Baldin
              Participant

                These calls can sometimes take a longish time (the results are cached, so the first caller sees a delay, but subsequent callers do not for a while). However, for the past couple of days we have been seeing connectivity issues between our JupyterHub, hosted in Google, and the rest of the testbed, manifesting as various connection retries, which can also cause additional delays. We are investigating the reasons for it. The issue appears to be specific to the JupyterHub environment.

                #4952
                Amy Babay
                Participant

                  Ok, thanks! The connectivity/retry issue may be what I’m hitting, since I also was getting a few timeout errors.

                  #4972
                  James McCauley
                  Participant

                    In my experience, the FABRIC API has never been *fast*, but it’s definitely been notably slower recently in JupyterHub.

                    I, too, am getting retries doing various operations. Specifically, this seems to be caused by timeouts to both cm.fabric-testbed.net and orchestrator.fabric-testbed.net.

                    This has resulted in a notebook that reliably completed in maybe 17 minutes or so becoming one that unreliably completes in upwards of 30.

                    Manually making connections to orchestrator.fabric-testbed.net from the JupyterHub terminal confirms that sometimes these connections just hang (it’s hard to say exactly what’s going on since I can’t run tcpdump in this environment, but it’s certainly the case that the TCP connection isn’t getting established).

                    Manually making connections to orchestrator.fabric-testbed.net from elsewhere seems to reliably work just fine.
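
                    For anyone who wants to repeat this kind of check, here is a rough sketch; the host name is taken from above, but TCP port 443 is an assumption on my part:

                        import socket
                        import time

                        HOST, PORT, TIMEOUT = "orchestrator.fabric-testbed.net", 443, 5

                        # Try a handful of plain TCP connections and report which ones
                        # complete quickly and which fail or hang until the timeout.
                        for attempt in range(10):
                            start = time.time()
                            try:
                                with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
                                    print(f"attempt {attempt}: connected in {time.time() - start:.2f} s")
                            except OSError as exc:
                                print(f"attempt {attempt}: failed after {time.time() - start:.2f} s ({exc})")
                            time.sleep(1)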

                    tracerouting to orchestrator.fabric-testbed.net from the JupyterHub terminal gets paths like this *most* of the time:

                     4  ws-gw-to-hntvl-gw.ncren.net (128.109.9.22)  32.705 ms  32.858 ms  32.481 ms  32.101 ms  33.195 ms
                     5  renci-to-ws-gw.ncren.net (128.109.70.174)  35.452 ms  35.676 ms  35.631 ms  35.553 ms  35.630 ms
                     6  152.54.15.60 (152.54.15.60)  35.950 ms !X  35.669 ms !X  35.478 ms !X  35.570 ms !X  36.120 ms !X
                    

                    (Hop 6 is orchestrator.fabric-testbed.net.)

                    However, there are sometimes timeouts starting at renci-to-ws-gw.ncren.net. Indeed, that hop usually either shows up fine or not at all (all timeouts). 152.54.15.60/orchestrator shows occasional timeouts, which seem to be correlated with the timeouts seen at renci-to-ws-gw.ncren.net.

                    When I ran the tests from elsewhere, the path didn’t go through either of these ncren routers. I didn’t see any unusual timeouts via traceroute or ping.

                    I don’t know what diagnosis y’all have done so far, but could this be as simple as packet loss between those two ncren routers? I can’t ping from JupyterHub and am sort of shooting in the dark, but I estimate there might be something like 3% or 4% loss there.

                    #4973
                    James McCauley
                    Participant

                      .. and *now* I see the recent announcement that you’d tracked it down to a network problem. 🙂

                      #4974
                      Ilya Baldin
                      Participant

                        Thank you for your analysis. We are about where you are: there is either route flapping or, perhaps, some kind of non-trivial packet loss specific to the path from JH to RENCI (we have been testing with plain curl to various hosts at RENCI and the results are the same). We have notified the MCNC NOC as well as UNC ITS and are waiting to see what they say.

                        The problem does not appear to manifest itself from the worker nodes hosting JH, only from the Docker containers inside, so we suspect there may be a middle-box somewhere that is dropping some connections from JH-originating IPs because they have a high rate of transactions to our infrastructure compared to the background of regular IPs. But this is just a theory. We will continue our investigation, and we apologize for the inconvenience.

                        #4987
                        Ilya Baldin
                        Participant

                          Just updating here for completeness:

                          Reachability issues between JH and FABRIC infrastructure

                          #4990
                          James McCauley
                          Participant

                            To me, it looks to be back to its old self! Thank you!
