1. Error in creating a cluster with multiple nodes

  • #6435
    Manas Das
    Participant

      Hello Fabric team,

      I want to create a cluster of machines, but whenever I try to create it I get the following error:

      redeem predecessor reservation# de323d01-977a-45c5-9f99-d7f72f01cc0b is in a terminal state, failing the reservation# 6c92e8a8-e2ae-4728-aebf-310edf1599be#

      If you need any more information I am happy to provide. Thank you.

      #6437
      Komal Thareja
      Participant

        Hi Manas,

        Thank you for sharing your observations. I see that one of the VMs failed to provision on UCSD, which resulted in the rest of the slivers being closed by the orchestrator. Could you please try creating your slice again, or share your notebook? I don’t have enough information to debug this further and would like to reproduce it in our environment.

        Appreciate your help with this!

        Thanks,

        Komal

        #6449
        Manas Das
        Participant

          I tried different sites too, but I still get the same issue. I have shared a notebook with you; please look into it.

          #6450
          Komal Thareja
          Participant

            Hi Manas,

            I tried your notebook and was able to figure out the issue. You are explicitly passing flavor names in your notebook. We do not recommend that; instead, we ask users to pass the specific cores, RAM, and disk needed.

            In release 1.6, the underlying flavors were re-provisioned to allow for more disk/RAM/core combinations, which resulted in your slivers being closed due to incorrect configuration.

            Making the following changes in your notebook to explicitly pass the cores, RAM, and disk resolves this issue.

            I have also filed a bug against the Control Framework software so that it returns a more informative error in such cases, for easier debugging. Thank you for reporting this and helping us make the testbed better.


            node = slice.add_node(name=node_names,
                                  site=site,
                                  # instance_type=instance_master,  # flavor name no longer passed
                                  cores=8,
                                  ram=12,
                                  disk=500,
                                  image=image)
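
The change above can be wrapped in a small helper for a multi-node cluster. This is only a minimal sketch, not the code from the shared notebook: the helper name `add_cluster_nodes` and its default sizes are illustrative assumptions, and `slice_obj` is assumed to be any object exposing the same `add_node(name=..., site=..., cores=..., ram=..., disk=..., image=...)` call shown in the snippet.

```python
from typing import Any, Iterable, List


def add_cluster_nodes(slice_obj: Any, names: Iterable[str], site: str, image: str,
                      cores: int = 8, ram: int = 12, disk: int = 500) -> List[Any]:
    """Add one node per name, sized by explicit cores/ram/disk.

    Hypothetical helper: no instance_type (flavor name) is passed,
    following the recommendation in the post above.
    """
    nodes = []
    for name in names:
        # Same keyword arguments as the snippet in the post, minus instance_type.
        node = slice_obj.add_node(name=name, site=site, cores=cores,
                                  ram=ram, disk=disk, image=image)
        nodes.append(node)
    return nodes
```

In a notebook, `slice_obj` would presumably be the slice object that is already being built, and the slice would be submitted afterwards as before.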

            NOTE: I have emailed you the updated notebook.

            P.S. I am still looking into the Redeem Timeout issues, in case you run into them; I will share an update on those tomorrow.

            Thanks,

            Komal

            #6451
            Manas Das
            Participant

              Hello Komal,

              Thank you for the quick response. I tried again with the updated notebook. The hosts are getting ticketed, but in the network part I still get the same error, “redeem predecessor reservation”. Please look into it; I hope the issue gets resolved soon.

              Please update me after the issue is resolved.

              Thank you once again,

              Manas

              #6465
              Komal Thareja
              Participant

                Hi Manas,

                I sent you an updated notebook by email. It has some more improvements over the last version, including a somewhat easier network configuration. I was able to create a slice on UCSD with it. Could you please try it and let me know how it goes?

                Thanks and Regards,

                Komal

                #6471
                Manas Das
                Participant

                  Hello Komal,

                  Thank you for the update! I am able to create a slice, but now I am running into a different problem. This is also an old problem that I face from time to time. Jupyter times out, while in the FABRIC portal the slice status is StableOK. When I click on the nodes, a few of them show management IPs and a few do not. I think the slice may be StableOK but not fully instantiated with all the requested resources. The slice ID is: 5c551f62-ead4-4f2e-b91d-efe8c34e032e

                  Please look into it. Thank you for your time, really appreciate it.

                  Regards,

                  Manas

                  #6473
                  Komal Thareja
                  Participant

                    Hi Manas,

                    I noticed that some of your slivers are in the Closed state with the error: Last ticket update: Redeem/Ticket timeout

                    I applied a patch to address this last night. Could you please delete this slice and try again? We are monitoring the system to see if the patch addresses the issue. As of now, we do not have a consistent way to reproduce this problem. Please keep us informed if you run into this again.
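
To spot this situation from a notebook instead of clicking through the portal, the nodes of a slice can be scanned for ones that are not active or have no management IP. The following is a rough sketch under stated assumptions: the helper name `find_incomplete_nodes` is hypothetical, each node is assumed to expose `get_name()`, `get_reservation_state()`, `get_management_ip()` and `get_error_message()`, and `"Active"` is assumed to be the healthy sliver state.

```python
from typing import Any, Dict, Iterable, List


def find_incomplete_nodes(nodes: Iterable[Any]) -> List[Dict[str, Any]]:
    """Return a summary of nodes that look only partially provisioned.

    Hypothetical helper: a node is reported if its reservation state is
    not "Active" (assumed healthy state) or it has no management IP.
    """
    problems = []
    for node in nodes:
        state = node.get_reservation_state()
        mgmt_ip = node.get_management_ip()
        if state != "Active" or not mgmt_ip:
            problems.append({
                "name": node.get_name(),
                "state": state,
                "management_ip": mgmt_ip,
                "error": node.get_error_message(),
            })
    return problems
```

An empty result would suggest that every node came up; any entries list the affected slivers together with the error they reported, such as the Redeem/Ticket timeout above.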

                    Thanks,

                    Komal

                    #6481
                    Manas Das
                    Participant

                      Hello Komal,

                      I experimented with it, but I am sorry to say the problem persists. As you said, it is not consistent; right now the experiment is running after a few tries.

                      Regards,

                      Manas

                      #6499
                      Komal Thareja
                      Participant

                        Hi Manas,

                        Thank you for bearing with me. I think I finally have a fix for the issue. I have applied the patch on UCSD and STAR for now and have not been able to reproduce the problem there.

                        I would appreciate it if you could also try at these two sites and share your observations. Hopefully it works consistently now.

                        Appreciate your help!

                        Thanks,

                        Komal
