1. Unable to allocate resources after the updates/maintenance.

    #3729
    Manas Das
    Participant

      Hi, for our experiments we used to create a cluster of 7-8 VMs, each with 20 cores, 128GB RAM, and a 500GB disk. Since last week's updates/maintenance (23-25 Jan 2023) we have not been able to create such a cluster. list_sites() shows that the resources we want are available, but we cannot create a slice out of them. We tried most of the sites, with no results.

      Suppose we want to create a cluster of 7 VMs with the above-mentioned configuration; then 3-4 nodes fail with the following error message:

      “failed lease update- all units failed priming: Exception during create for unit: 01b29815-06fe-4b8a-a813-0bd9a0223ef2 Playbook has failed tasks: Error in creating the server (no further information available)#all units failed priming: Exception during create for unit: 01b29815-06fe-4b8a-a813-0bd9a0223ef2 Playbook has failed tasks: Error in creating the server (no further information available)#”

      Please advise what we should do, since we were able to create such a slice previously.

      Thank you.
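
      For reference, the request is roughly along the following lines (a simplified FABlib sketch, not the exact notebook; the slice/node names and the site are placeholders):

      # Simplified sketch of the slice request; names and site are placeholders.
      from fabrictestbed_extensions.fablib.fablib import FablibManager

      fablib = FablibManager()
      slice = fablib.new_slice(name="cluster_gatk")

      # 7 VMs, each with 20 cores, 128 GB RAM, 500 GB disk, all on one site
      for i in range(7):
          slice.add_node(name=f"node{i}", site="TACC", cores=20, ram=128, disk=500)

      slice.submit()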

      #3733
      Komal Thareja
      Participant

        Hello,

        I was debugging this and noticed that the VMs requested by your slice are being mapped to the capacities indicated below, which seem to exhaust the worker where your request lands. The capacity mapping from requested to allocated looks strange.
        I am unable to reproduce this. Could you please share your notebook?

        Also, could you please share the output of the command pip3 list | grep fabric from your environment?
        We expect the following versions to be present:

        
        fabric                        3.0.0
        fabric-credmgr-client         1.3.2
        fabric-fim                    1.4.2
        fabric-fss-utils              1.4.0
        fabric-orchestrator-client    1.4.3
        fabrictestbed                 1.4.3
        fabrictestbed-extensions      1.4.0
        

        Capacity Allocations for your slice: cluster_gatk(d8bfce5c-c721-4d41-a7fa-e8d658a40a43)

        
        'capacities': '{ core: 2 , ram: 8 G, disk: 10 G}',                      ===> Requested by User
        'capacity_allocations': '{ core: 16 , ram: 128 G, disk: 500 G}',        ===> Allocated by Orchestrator
        'capacity_hints': '{ instance_type: fabric.c16.m128.d500}'
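
        On your side, you can cross-check what each node actually received with something like the following (a rough sketch using FABlib getters; method names are from fabrictestbed-extensions 1.4 and may differ slightly between versions):

        # Sketch: print per-node site and allocated capacities for the slice.
        from fabrictestbed_extensions.fablib.fablib import FablibManager

        fablib = FablibManager()
        slice = fablib.get_slice(name="cluster_gatk")

        for node in slice.get_nodes():
            print(node.get_name(), node.get_site(),
                  node.get_cores(), node.get_ram(), node.get_disk())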
        

        Thanks,
        Komal

        #3739
        Manas Das
        Participant

          @Komal Thank you for your reply. The output matches the versions listed above:

          (base) fabric@jupyter-mjdbz….:~/work$ pip3 list | grep fabric
          fabric 3.0.0
          fabric-credmgr-client 1.3.2
          fabric-fim 1.4.2
          fabric-fss-utils 1.4.0
          fabric-orchestrator-client 1.4.3
          fabrictestbed 1.4.3
          fabrictestbed-extensions 1.4.0

          The notebook is shared with you.

          Thanking you.

          Manas

           

           

          #3747
          Komal Thareja
          Participant

            @Manas – It looks like you are requesting Cores=20, RAM=128GB, Disk=2000GB. Even though a site may have enough overall storage to account for such a request, some of your VMs exhaust the worker they land on; specifically, the requested disk cannot be served.

            I was able to create this slice on SALT with the disk set to 500GB instead, but your storage volume doesn't exist there. Could you please try this slice with a smaller disk?

            #3751
            Manas Das
            Participant

              @Komal Thank you for the update. As you suggested, I tried a lower disk capacity (500GB) and it works. But for our experiments we need much more disk space. I see three possible strategies going forward:

              1) Resolve the issue so that each VM in the experiment can have the larger disk (2000GB), since the disk space is available on the site.

              2) Allow more VMs, e.g. 14-16, to be created on a single site (within the site's maximum capacity for cores and memory), each with 500GB of disk.

              3) As a temporary fix, allow 1000GB disks, if disk capacity is the problem.

              If you have any other strategy in mind, please let us know, and please help resolve this issue; the project is stuck right now.

              Thanking you

              Manas

              #3752
              Komal Thareja
              Participant

                @Manas – Is it possible for you to span your slice across multiple sites instead of a single site, using network services such as FABNet, L2STS, or L2PTP?
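
                For example, a two-site layout with a routed FABNetv4 service at each site could look roughly like the following (a sketch only; the sites, names, VM count, and NIC model are placeholders, not a tested configuration):

                # Sketch: spread the VMs over two sites and give each site a FABNetv4 (routed IPv4) network.
                from fabrictestbed_extensions.fablib.fablib import FablibManager

                fablib = FablibManager()
                slice = fablib.new_slice(name="cluster_gatk_multisite")

                for site in ["SALT", "UTAH"]:                # placeholder sites
                    ifaces = []
                    for i in range(4):                       # e.g. 4 VMs per site
                        node = slice.add_node(name=f"{site.lower()}-node{i}", site=site,
                                              cores=20, ram=128, disk=500)
                        nic = node.add_component(model="NIC_Basic", name="nic1")
                        ifaces.append(nic.get_interfaces()[0])
                    slice.add_l3network(name=f"net-{site.lower()}", interfaces=ifaces, type="IPv4")

                slice.submit()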

                In the meantime, I will discuss this within our team and share our approach going forward.

                #3753
                Paul Ruth
                Keymaster

                  @Manas –

                  Can you try using the NVMe drives? They are 1 TB each and you can have multiple per VM. Like all other components, you can only create VMs composed of components that are on the same physical host, so just because a site has 10+ NVMe drives does not mean you can put them all on one VM. Two NVMe drives in a VM is possible on most sites. The other bonus of the NVMe drives is that they are very fast.
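
                  In FABlib that looks roughly like this (a sketch; the component model name NVME_P4510 is the one used in the FABlib component examples, and the slice/node names and site are placeholders):

                  # Sketch: one VM with two NVMe drives attached; names and site are placeholders.
                  from fabrictestbed_extensions.fablib.fablib import FablibManager

                  fablib = FablibManager()
                  slice = fablib.new_slice(name="nvme_test")

                  node = slice.add_node(name="node0", site="SALT", cores=20, ram=128, disk=100)
                  node.add_component(model="NVME_P4510", name="nvme1")   # ~1 TB NVMe drive
                  node.add_component(model="NVME_P4510", name="nvme2")   # second drive, where the host can fit it

                  slice.submit()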

                  Also, you might try using large persistent volumes. These can be very large; they are mounted across the network, but within a site. You would need to pick a few sites where we can create the volumes, and then you can mount them from VMs on those sites. The bonus with these volumes is that the data is persistent: if you shut down a slice and come back tomorrow or next week, the data will still be there.
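
                  Attaching such a volume looks roughly like the snippet below (a sketch following the FABlib persistent-storage example; it assumes a volume named my-project-volume has already been created for your project on that site, and the attach call and device name may differ by version):

                  # Sketch: attach an existing project persistent volume to a VM.
                  from fabrictestbed_extensions.fablib.fablib import FablibManager

                  fablib = FablibManager()
                  slice = fablib.new_slice(name="storage_test")

                  node = slice.add_node(name="node0", site="SALT", cores=20, ram=128, disk=100)
                  node.add_storage(name="my-project-volume")   # placeholder volume name created by FABRIC staff

                  slice.submit()

                  # Inside the VM, the volume shows up as an extra block device to format and mount, e.g.:
                  #   sudo mkfs.ext4 /dev/vdb && sudo mkdir -p /mnt/vol && sudo mount /dev/vdb /mnt/vol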

                  Paul

                   

                  #3763
                  Manas Das
                  Participant

                    @Komal Thank you once again. Yes, we can do multi-site experiments, but we are currently focused on a single site for several reasons. Please discuss the above problem with the FABRIC team and let us know the outcome; we are eagerly waiting for it.

                    With Regards,

                    Manas

                    #3764
                    Manas Das
                    Participant

                      @Paul Thank you for replying to the thread. Yes, all your points are valid; we may have to request NVMe permission. As for persistent storage, we have 1TB volumes on 3 sites, and they store data for the experiment, but we need more storage per VM because of the nature of the experiment. Alternatively, a way to increase the number of VMs to, say, 14-16 (20 cores, 128GB RAM, 500GB disk each) would also solve our problem.

                      The disk space is available per site (more than our requirement); I don't know why it cannot be added to a slice. Please look into it.

                      Thank you

                      #3871
                      Praveen Rao
                      Participant

                        We could attach one NVMe drive to each VM but can only get 6 VMs in our cluster. We need at least 16 VMs with 1TB storage (on each VM) to run the genomics experiments. Could you please suggest how we can get this done?

                        Note that we can do large-scale experiments on CloudLab with 16 bare-metal nodes. And we want to reproduce our previous results on FABRIC. Thanks a lot.

                        #3875
                        Ilya Baldin
                        Participant

                          Praveen (and the team), just to close the loop and post a version of my private reply:

                          Individual FABRIC sites are not as large as CloudLab; they typically have between 3 and 6 worker nodes, and each worker has 64 cores. If you ask for VMs with more than 32 cores, at most one such VM can be accommodated per worker node. For your storage requirements I suspect you should rely on persistent storage in some cases: not every worker's internal storage is the same, so some combinations of cores/RAM/disk are possible only on some workers, not all. We can create multiple persistent volumes for you on each site if required.
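
                          To make the packing constraint concrete, here is a back-of-the-envelope check (only the 64-core figure comes from the discussion above; the per-worker RAM and disk numbers are illustrative placeholders, since they vary between workers):

                          # Bin-packing check. Only the 64-core figure is from the thread;
                          # per-worker RAM/disk values are illustrative placeholders.
                          worker = {"cores": 64, "ram_gb": 512, "disk_gb": 3000}
                          vm = {"cores": 20, "ram_gb": 128, "disk_gb": 2000}

                          vms_per_worker = min(worker[k] // vm[k] for k in vm)
                          print(vms_per_worker)   # 1 here: disk (3000 // 2000) binds, not cores or RAM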

                          Another alternative is to use a combination of resources from FABRIC and other testbeds. Chameleon@Chicago is already reachable, and we will shortly be adding access to Chameleon@TACC (a much larger installation) as well as the CloudLab Utah, Wisconsin, and Clemson locations.

                           
