manual cleanup needed?
June 20, 2023 at 11:55 am #4565
I created a slice with the name 'ServiceXSlice' at the CERN site and then deleted it. Then I tried to create a slice with the same name again. This time it gives me this error:
redeem predecessor reservation# 22b63439-f677-436b-842a-8834035f62c6 is in a terminal state, failing the reservation# 34987a45-8f98-4cbb-a0c1-a065be03ead9#
It seems the slice deletion may be stuck, and thus I can't create a new slice. Maybe a manual cleanup is needed? I can no longer list the slice. Please advise on what I can do in order to create a slice.
Thanks,
Fengping
June 20, 2023 at 1:45 pm #4566
Hi Fengping,
Your second slice failed with the error:
Insufficient resources
as depicted below. Please note that slice deletion is not synchronous; it may take some time for all the resources associated with a slice to be deleted. Please consider adding a slight delay between subsequent slice creation attempts if both slices request resources from the same site, which might not have been released yet by the first slice.
Resource Type: VM Notices: Reservation 113cd41c-26df-461e-8dc9-f93ed92fcebf (Slice ServiceXSlice(66a78e70-ecf2-41e7-be12-740561904991) Graph Id:cc871ebc-e290-4b44-ab36-046d3cd2da00 Owner:fengping@uchicago.edu) is in state (Closed,None_) (Last ticket update: Insufficient resources : ['disk'])
For the second slice, you can view the failure reasons from the portal by selecting the 'Include Dead/Closed Slices' checkbox.
Please try creating the slice again and let us know if you still see errors.
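If it helps, here is a minimal fablib sketch (assuming the FABRIC JupyterHub environment; the slice name is the one from this thread, and the exact state strings reported by fablib are an assumption) of deleting a slice and waiting for it to reach a terminal state before resubmitting:

import time
from fabrictestbed_extensions.fablib.fablib import FablibManager

fablib = FablibManager()

# Ask for the existing slice to be deleted.
old_slice = fablib.get_slice(name="ServiceXSlice")
old_slice.delete()

# Deletion is asynchronous: poll the slice state until it reports "Dead",
# which should mean its reservations have been closed, before creating a
# new slice with the same name.
while True:
    old_slice.update()
    if old_slice.get_state() == "Dead":
        break
    time.sleep(30)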
Thanks,
Komal
June 20, 2023 at 2:09 pm #4568
Hi Komal,
I tried to recreate the slice requesting only 100G disks, but it still fails.
The portal option to show dead slices works. The portal now lists 2 dead slices and 6 configuring slices for me. Is there a way for me to delete all of them? I wonder if these dead slices keep holding resources and prevent them from becoming available.
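In case explicit cleanup is wanted, a sketch (assuming the fablib JupyterHub environment; by default get_slices() should already leave out Dead/Closing slices) of iterating over the slices fablib still lists and deleting them:

from fabrictestbed_extensions.fablib.fablib import FablibManager

fablib = FablibManager()

# Delete every slice fablib still considers active for this project.
for s in fablib.get_slices():
    print(f"Deleting slice {s.get_name()} (state: {s.get_state()})")
    s.delete()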
Thanks,
Fengping
June 20, 2023 at 2:45 pm #4569
I looked at your slices and found that you have 2 Dead slices and 6 Closing slices. All the slices are requesting VMs on a single site, CERN. All the slice requests ask for either 120 or 60 cores. Regardless of the disk size, the requested cores/RAM are mapped to the flavors below. Considering that there are other slices on the CERN site as well, your slice cannot be accommodated by the single CERN site. Please consider either spanning your slice across multiple sites or reducing the size of the VMs, not only w.r.t. disk but also cores/RAM. We currently only have a limited number of flavors, and your core/RAM request is being mapped to a huge disk.
core: 120, ram: 480 G ==> fabric.c64.m384.d4000
core: 60, ram: 360 G ==> fabric.c60.m384.d2000
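For sizing against what a site can still accommodate, the advertised resources can be queried first (a sketch, assuming the fablib JupyterHub environment; output formatting may vary):

from fabrictestbed_extensions.fablib.fablib import FablibManager

fablib = FablibManager()

# Show advertised capacity and current availability (cores, RAM, disk, ...) for CERN.
fablib.show_site("CERN")

# Or list all sites for comparison.
fablib.list_sites()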
NOTE: No manual cleanup is needed; the software is behaving as designed.
Thanks,
Komal
June 20, 2023 at 2:57 pm #4570
I looked at the instance types; please try setting:
core='62', ram='384', disk='100'
FYI: https://github.com/fabric-testbed/InformationModel/blob/master/fim/slivers/data/instance_sizes.json might be useful for VM sizing.
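As an aside on how those names encode sizes: a flavor name of the form fabric.c<cores>.m<ram>.d<disk> spells out its cores, RAM (GB), and disk (GB), so it is easy to check whether a flavor can hold a given request. A sketch (only the two flavors mentioned above are listed; the full set lives in instance_sizes.json, and this is not the orchestrator's actual mapping logic):

import re

# Flavors mentioned earlier in this thread; the full list is in instance_sizes.json.
flavor_names = ["fabric.c60.m384.d2000", "fabric.c64.m384.d4000"]

def dims(name):
    # fabric.c<cores>.m<ram_gb>.d<disk_gb>
    cores, ram, disk = map(int, re.match(r"fabric\.c(\d+)\.m(\d+)\.d(\d+)$", name).groups())
    return cores, ram, disk

# Check which flavors can hold a 62-core, 384 GB RAM, 100 GB disk request.
request = (62, 384, 100)
for name in flavor_names:
    fits = all(have >= want for have, want in zip(dims(name), request))
    print(name, "fits" if fits else "is too small", "for", request)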
Thanks,
Komal
June 20, 2023 at 3:03 pm #4571
Hi Komal,
The CERN site is more or less dedicated to the ServiceX deployment. I will need to create my slice there for data-access reasons. I don't think there should be other slices at CERN besides the ServiceX slice I created. I would like to create big VMs that basically map to physical machines, so 6 VMs for the 6 physical machines at CERN.
I noticed the available CPUs are 408/768, i.e. 360 less than the total, which is exactly the number of CPUs I requested for my slice this morning. This made me wonder if that slice is still holding up the resources. If the resources are not held up by the dead slice but by active slices, would you be able to relocate them so I can create my slice there?
Also, what resource request should I use to make a VM take up a whole physical machine?
Thanks,
Fengping
June 20, 2023 at 3:32 pm #4572
With the current flavor definition, I would recommend requesting VMs with the configuration:
cores='62', ram='384', disk='2000'
Anything bigger than this maps to fabric.c64.m384.d4000, and only one of the workers, i.e. cern-w1, can accommodate 4 TB disks; the rest of the workers can accommodate at most 2 TB of disk. I will discuss this internally to work on providing a better flavor to accommodate your slice.
Thanks,
Komal
P.S: I was able to successfully create a slice with the above configuration.
June 20, 2023 at 4:47 pm #4574
Hi Komal,
Thanks for looking into this for me. This config, cores='62', ram='384', disk='2000', indeed works to create 6 VMs. But it won't work if I try to create 12 VMs, even if I request half the RAM (192), because of the flavor mapping. So yes, we do need a better flavor in my case. I may need only one big-disk node to serve as an XCache node; the rest of the nodes can have just limited disks, unless we want to use all the disks to set up distributed storage (Ceph etc.).
Please let me know once you have discussed this with your team and have recommendations. The goal is to allocate all the resources with only a few VM flavors (one or two, maybe).
Thanks,
Fengping
June 20, 2023 at 5:19 pm #4575
Please try this to create 12 VMs; it should let you use almost the entire worker w.r.t. cores. I will keep you posted about the flavor details.
#Create Slice
slice = fablib.new_slice(name=slice_name)

# Network
net1 = slice.add_l2network(name=network_name, subnet=IPv4Network("192.168.1.0/24"))

node_name = "Node"
number_of_nodes = 12
for x in range(number_of_nodes):
    disk = 500
    if x == 0:
        disk = 4000
    node = slice.add_node(name=f'{node_name}{x}', site=site, cores='62', ram='128', disk=disk)
    iface = node.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
    iface.set_mode('auto')
    net1.add_interface(iface)

#Submit Slice Request
slice.submit();
Thanks,
Komal
June 21, 2023 at 8:43 pm #4577
Hi Komal,
I tried your recipe and was able to create 10 VMs with 60 cores each, but it failed to create 11 or 12 VMs due to insufficient CPUs. This is a bit counterintuitive, since there were 766 CPUs available and each of the 6 hosts should be able to run 2 VMs. Nevertheless, we are in better shape now with 600+ cores. Thank you so much for the help. I will try the new flavor when it's available.
Thanks,
Fengping
June 22, 2023 at 1:19 pm #4578
Hi Komal,
It seems the slice lost its public IPv6 network connection overnight. I can't even ping the gateway. The link lost the IPs I had configured statically, even though I had disabled DHCP and RA on the link. So I re-added the IPs and routes, and also tried both network3.change_public_ip(ipv6=list(map(str,networkips[0:50]))) and network3.make_ip_publicly_routable(ipv6=list(map(str,networkips[0:50]))) to make the IPs public, but none of it seemed to work (roughly the sequence sketched below).
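For context, roughly what that recovery attempt looks like in fablib (a sketch, assuming the JupyterHub environment; the node name "Node0", the interface lookup, and the choice of addresses are placeholders, and the exact call sequence may differ):

from ipaddress import IPv6Network
from fabrictestbed_extensions.fablib.fablib import FablibManager

fablib = FablibManager()
slice = fablib.get_slice(name="ServiceXSlice")
network3 = slice.get_network(name="NET3")

# Ask for a batch of addresses from the allocated subnet to be made
# publicly routable, then resubmit the slice so the change takes effect.
networkips = network3.get_available_ips()[0:50]
network3.make_ip_publicly_routable(ipv6=[str(ip) for ip in networkips])
slice.submit()

# Re-apply the static address and default route on one node
# ("Node0" and the first available address are placeholders).
node = slice.get_node(name="Node0")
iface = node.get_interface(network_name="NET3")
iface.ip_addr_add(addr=networkips[0], subnet=network3.get_subnet())
node.ip_route_add(subnet=IPv6Network("::/0"), gateway=network3.get_gateway())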
Any suggestion on how to fix this network?
Thanks,
Fengping
Here's the slice information and symptoms:
Slice ID: 08d05419-e99b-4ebe-b4a1-88c07cf2bfa3
Name: ServiceXSlice

Network ID: 06d92831-1f58-4548-9d24-9284b1273912
Name: NET3
Layer: L3
Type: FABNetv6Ext
Site: CERN
Subnet: 2602:fcfb:1d:3::/64
Gateway: 2602:fcfb:1d:3::1
State: Active

ubuntu@node1:~$ ping6 2602:fcfb:1d:3::1
PING 2602:fcfb:1d:3::1(2602:fcfb:1d:3::1) 56 data bytes
^C
— 2602:fcfb:1d:3::1 ping statistics —
3 packets transmitted, 0 received, 100% packet loss, time 2056ms

ubuntu@node1:~$ ip -6 neigh | grep 2602
2602:fcfb:1d:3::7 dev ens9 lladdr 02:d2:f1:99:87:98 router REACHABLE
2602:fcfb:1d:3::9 dev ens9 lladdr 02:80:38:25:66:c0 router REACHABLE
2602:fcfb:1d:3::4 dev ens9 lladdr 02:1d:b9:31:e7:23 router STALE
2602:fcfb:1d:3::b dev ens9 lladdr 06:d3:95:0b:44:81 router REACHABLE
2602:fcfb:1d:3::6 dev ens9 lladdr 0a:b1:19:54:14:e7 router REACHABLE
2602:fcfb:1d:3::1 dev ens9 router FAILED
June 22, 2023 at 7:05 pm #4579
Hi Fengping,
Thank you so much for reporting this issue. There was a bug which led to allocating the same subnet to multiple slices, so when a second slice was allocated the same subnet, traffic stopped working for your slice.
I have applied the fix for the bug on production. Could you please delete your slice and recreate it? Apologies for the inconvenience.
Appreciate your help with making the system better.
Thanks,
Komal
June 22, 2023 at 10:57 pm #4582
Hi Komal,
Thank you so much for looking into the issue and for the quick fix. I will delete the slice and recreate it tomorrow.
Appreciate your help :)
Fengping