manual cleanup needed?
  • #4565
    Fengping Hu
    Participant

I created a slice named ‘ServiceXSlice’ at the CERN site and then deleted it. Then I tried to create a slice with the same name again. This time it gives me this error:

      redeem predecessor reservation# 22b63439-f677-436b-842a-8834035f62c6 is in a terminal state, failing the reservation# 34987a45-8f98-4cbb-a0c1-a065be03ead9#

It seems the slice deletion may be stuck and thus I can’t create a new slice. Maybe a manual cleanup is needed? I can no longer list the slice. Please advise on what I can do in order to create a slice.

      Thanks,

      Fengping

       

      #4566
      Komal Thareja
      Participant

        Hi Fengping,

Your second slice failed with the error "Insufficient resources", as shown below. Please note that slice deletion is not synchronous; it may take some time for all the resources associated with a slice to be deleted. Please consider adding a slight delay between subsequent slice creation attempts when both slices request resources from the same site, since the first slice may not have released those resources yet (see the sketch below).

        Resource Type: VM Notices: Reservation 113cd41c-26df-461e-8dc9-f93ed92fcebf (Slice ServiceXSlice(66a78e70-ecf2-41e7-be12-740561904991) Graph Id:cc871ebc-e290-4b44-ab36-046d3cd2da00 Owner:fengping@uchicago.edu) is in state (Closed,None_) (Last ticket update: Insufficient resources : ['disk'])

         

For the second slice, you can view the failure reasons on the portal by selecting the check box ‘Include Dead/Closed Slices’.

        Please try creating the slice again and let us know if you still see errors.
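Something along these lines (an untested sketch using fablib; the delay value is arbitrary) waits between deleting the old slice and submitting the new one:

import time
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
slice_name = "ServiceXSlice"

# Delete the old slice; deletion is asynchronous, so give the site some time
# to release its resources before creating a slice with the same requirements.
old_slice = fablib.get_slice(name=slice_name)
old_slice.delete()

time.sleep(120)  # adjust as needed

# Recreate the slice
new_slice = fablib.new_slice(name=slice_name)
# ... add nodes/networks here as usual ...
new_slice.submit()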

         

        Thanks,

        Komal

        #4568
        Fengping Hu
        Participant

          Hi Komal,

          I tried to recreate the slice requesting only 100G disks but it still fails.

The portal option to show dead slices works. It now lists 2 dead slices and 6 configuring slices for me. Is there a way for me to delete all of them? I wonder if these dead slices are still holding resources and keeping them from becoming available.

          Thanks,

          Fengping

           

          #4569
          Komal Thareja
          Participant

I looked at your slices and found that you have 2 Dead slices and 6 Closing slices. All of them request VMs on a single site, CERN, with either 120 or 60 cores per VM. Regardless of the disk size, the requested core/RAM combinations map to the flavors below. Considering that there are other slices on the CERN site as well, your slice cannot be accommodated by the CERN site alone. Please consider either spanning your slice across multiple sites or reducing the size of the VMs, not only with respect to disk but also cores/RAM.

We currently only have a limited number of flavors, and your core/RAM requests are being mapped to flavors with huge disks:

core: 120, ram: 480 G  ==>  fabric.c64.m384.d4000

core: 60, ram: 360 G   ==>  fabric.c60.m384.d2000

NOTE: No manual cleanup is needed; the software is behaving as designed.

            Thanks,

            Komal

            #4570
            Komal Thareja
            Participant

I looked at the instance types; please try setting cores='62', ram='384', disk='100'.

              FYI: https://github.com/fabric-testbed/InformationModel/blob/master/fim/slivers/data/instance_sizes.json this might be useful for VM sizing.
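If you want a quick look at the defined sizes, you can fetch and pretty-print that file (a small sketch; it just uses the raw GitHub URL for the same path):

import json
import urllib.request

# Raw counterpart of the GitHub link above
url = ("https://raw.githubusercontent.com/fabric-testbed/InformationModel/"
       "master/fim/slivers/data/instance_sizes.json")

with urllib.request.urlopen(url) as resp:
    instance_sizes = json.load(resp)

# Print all defined instance sizes to help pick a cores/ram/disk combination
print(json.dumps(instance_sizes, indent=2))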

              Thanks,

              Komal

              #4571
              Fengping Hu
              Participant

                Hi Komal,

The CERN site is more or less dedicated to the ServiceX deployment. I will need to create my slice there for data-access reasons, and I don’t think there should be other slices at CERN besides the ServiceX slice I created. I would like to create big VMs that basically map to physical machines, so 6 VMs for the 6 physical machines at CERN.

I noticed the available CPUs are 408/768, i.e. 360 less than the total, which is exactly the number of CPUs I requested for my slice this morning. This made me wonder if that slice is still holding up the resources. If the resources are not held up by the dead slices but by active slices, would you be able to relocate them so I can create my slice there?

Also, what resource request should I use to make a VM take up a whole physical machine?

                Thanks,

                Fengping

                #4572
                Komal Thareja
                Participant

                  With the current flavor definition, I would recommend requesting VMs with the configuration:

                  cores='62', ram='384', disk='2000'

Anything bigger than this maps to fabric.c64.m384.d4000, and only one of the workers, cern-w1, can accommodate a 4 TB disk; the rest of the workers can accommodate at most a 2 TB disk. I will discuss this internally to work on providing a better flavor to accommodate your slice.

                  Thanks,

                  Komal

                  P.S: I was able to successfully create a slice with the above configuration.
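Roughly along these lines (an untested sketch, one VM per CERN worker with that configuration):

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()

# One VM per CERN worker with the recommended cores/ram/disk
slice = fablib.new_slice(name="ServiceXSlice")
for i in range(6):
    slice.add_node(name=f"node{i+1}", site="CERN", cores=62, ram=384, disk=2000)
slice.submit()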

                  #4574
                  Fengping Hu
                  Participant

                    Hi Komal,

Thanks for looking into this for me. This config (cores='62', ram='384', disk='2000') indeed works to create 6 VMs. But it won’t work if I try to create 12 VMs, even if I request half the RAM (192), because of the flavor mapping. So yes, we do need a better flavor in my case. I may need only one big-disk node to serve as an XCache node; the rest of the nodes can have just limited disks, unless we want to use all the disks to set up distributed storage (Ceph, etc.).

Please let me know once you have discussed this with your team and have recommendations. The goal is to allocate all the resources using only a few VM flavors (one or two, maybe).

                    Thanks,

                    Fengping

                    #4575
                    Komal Thareja
                    Participant

Please try this to create 12 VMs; it should let you use almost the entire worker with respect to cores. I will keep you posted about the flavor details.

                      
                      
from ipaddress import IPv4Network
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()

# Names/site as used earlier in this thread
slice_name = "ServiceXSlice"
network_name = "net1"
site = "CERN"

# Create Slice
slice = fablib.new_slice(name=slice_name)

# Network shared by all nodes
net1 = slice.add_l2network(name=network_name, subnet=IPv4Network("192.168.1.0/24"))

node_name = "Node"
number_of_nodes = 12
for x in range(number_of_nodes):
    # First node gets the large disk (e.g. for XCache); the rest get 500 GB
    disk = 500
    if x == 0:
        disk = 4000
    node = slice.add_node(name=f'{node_name}{x}', site=site, cores='62', ram='128', disk=disk)
    iface = node.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
    iface.set_mode('auto')
    net1.add_interface(iface)

# Submit Slice Request
slice.submit()
                      

                      Thanks,
                      Komal

                      #4577
                      Fengping Hu
                      Participant

                        Hi Komal,

I tried your recipe and was able to create 10 VMs with 60 cores each, but it failed to create 11 or 12 VMs due to insufficient CPUs. This is a bit counterintuitive, since there were 766 CPUs available and each of the 6 hosts should be able to run 2 VMs. Nevertheless, we are in better shape now with 600+ cores. Thank you so much for the help. I will try the new flavor when it’s available.

                         

                        Thanks,

                        Fengping

                        #4578
                        Fengping Hu
                        Participant

                          Hi Komal,

It seems the slice lost its public IPv6 network connection overnight. I can’t even ping the gateway. The link lost the IPs I configured statically, even though I had disabled DHCP and RA for the link. So I re-added the IPs and routes, and tried both network3.change_public_ip(ipv6=list(map(str,networkips[0:50]))) and network3.make_ip_publicly_routable(ipv6=list(map(str,networkips[0:50]))) to make the IPs public, but neither seemed to work.

Any suggestions on how to fix this network?

                          Thanks,

                          Fengping

Here’s the slice information and symptoms:

Slice ID: 08d05419-e99b-4ebe-b4a1-88c07cf2bfa3
Name: ServiceXSlice

Network ID: 06d92831-1f58-4548-9d24-9284b1273912
Name: NET3
Layer: L3
Type: FABNetv6Ext
Site: CERN
Subnet: 2602:fcfb:1d:3::/64
Gateway: 2602:fcfb:1d:3::1
State: Active

                           

ubuntu@node1:~$ ping6 2602:fcfb:1d:3::1
PING 2602:fcfb:1d:3::1(2602:fcfb:1d:3::1) 56 data bytes
^C
--- 2602:fcfb:1d:3::1 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2056ms

                          ubuntu@node1:~$ ip -6 neigh | grep 2602
                          2602:fcfb:1d:3::7 dev ens9 lladdr 02:d2:f1:99:87:98 router REACHABLE
                          2602:fcfb:1d:3::9 dev ens9 lladdr 02:80:38:25:66:c0 router REACHABLE
                          2602:fcfb:1d:3::4 dev ens9 lladdr 02:1d:b9:31:e7:23 router STALE
                          2602:fcfb:1d:3::b dev ens9 lladdr 06:d3:95:0b:44:81 router REACHABLE
                          2602:fcfb:1d:3::6 dev ens9 lladdr 0a:b1:19:54:14:e7 router REACHABLE
                          2602:fcfb:1d:3::1 dev ens9 router FAILED

                          #4579
                          Komal Thareja
                          Participant

                            Hi Fengping,

Thank you so much for reporting this issue. There was a bug that led to the same subnet being allocated to multiple slices, so when a second slice was allocated the same subnet, traffic stopped working for your slice.

                            I have applied the fix for the bug on production. Could you please delete your slice and recreate it? Apologies for the inconvenience.
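If it helps, the deletion step from a notebook looks roughly like this (untested sketch); recreating is the same loop as in my earlier post, followed by re-requesting the publicly routable IPv6 addresses on the new NET3:

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()

# Delete the existing slice
old_slice = fablib.get_slice(name="ServiceXSlice")
old_slice.delete()

# Once it is gone, recreate the slice as in the earlier example, then
# re-request the publicly routable addresses on the new network, e.g.:
# net3 = new_slice.get_network(name="NET3")
# net3.make_ip_publicly_routable(ipv6=list(map(str, networkips[0:50])))
# new_slice.submit()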

                            Appreciate your help with making the system better.

                            Thanks,
                            Komal

                            #4582
                            Fengping Hu
                            Participant

                              Hi Komal,

                              Thank you so much for looking into the issue and quick fix. I will delete the slice and recreate tomorrow.

                              Appreciate your help:)

                              Fengping
