Cannot SSH into NS2 and NS4 nodes, need to preserve data (PhD simulations)

#9260
Danilo

    I am having problems connecting to slices NS2 and NS4, and I urgently need assistance.

    🖥️ Updating node: NS4
    ⚠️ Error while processing slice 'NS4':
    POA – 1f2b954c-ba98-4e62-a520-57c4ee414956/addkey failed with error:
    "Exception during POA for unit: 527af509-fb62-41ae-a0c0-191e3c7f6525 — Playbook has failed tasks: All items completed."

    🔧 Accessing slice: NS2
    🖥️ Updating node: NS2
    ⚠️ Error while processing slice 'NS2':
    POA – 3592d5c3-f88b-4e82-844c-c5def5301e00/addkey failed with error:
    "Exception during POA for unit: be7426fd-7ebe-4e6b-bc65-56e30d7e8e50 — Playbook has failed tasks: All items completed."

    I have important simulation data stored on these nodes, and I cannot lose this data.
    Please help me recover access to NS2 and NS4 as soon as possible.

    #9263
    Mert Cevik
    Moderator

      Hello Danilo,

      I checked both VMs (NS2 and NS4) and they are up and online. However, there appear to have been changes to the SSH key(s) injected by the system, and I cannot log in to the VMs. I am not sure of the root cause of the exceptions you posted; they indicate that the FABRIC orchestration system cannot perform actions on the VMs, and the issue with the SSH keys may be the main cause. We need to learn from you what might have changed on the VMs.

      I'm posting the reservation info here. Both VM reservations are valid until December 25, 2025.

      {
        "sliver_id": "be7426fd-7ebe-4e6b-bc65-56e30d7e8e50",
        "slice_id": "53cfa2bd-5110-420d-8bdb-053c1af45801",
        "type": "VM",
        "notices": "Reservation be7426fd-7ebe-4e6b-bc65-56e30d7e8e50 (Slice NS2(53cfa2bd-5110-420d-8bdb-053c1af45801) Graph Id:2aced1b1-3843-4809-b9a2-9ed9e2ff0317 Owner:daniloassis@utfpr.edu.br) is in state (Active,None_)",
        "start": "2025-08-02 12:44:10 +0000",
        "end": "2025-12-25 18:11:58 +0000",
        "requested_end": "2025-12-25 18:11:58 +0000",
        "units": 1,
        "state": 4,
        "pending_state": 11,
        "sliver": {
          "Name": "NS2",
          "Type": "VM",
          "Capacities": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
          "CapacityHints": "{\"instance_type\": \"fabric.c16.m16.d1000\"}",
          "CapacityAllocations": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
          "LabelAllocations": "{\"instance\": \"instance-000040ec\", \"instance_parent\": \"star-w1.fabric-testbed.net\"}",
          "ReservationInfo": "{\"reservation_id\": \"be7426fd-7ebe-4e6b-bc65-56e30d7e8e50\", \"reservation_state\": \"Active\"}",
          "NodeMap": "[\"e2b1e451-45b4-4691-9527-aae18cad3b19\", \"BBQSH63\"]",
          "StitchNode": "false",
          "ImageRef": "default_ubuntu_22,qcow2",
          "MgmtIp": "2001:400:a100:3030:f816:3eff:fed9:14ee",
          "Site": "STAR"
        }
      }
      
      {
        "sliver_id": "527af509-fb62-41ae-a0c0-191e3c7f6525",
        "slice_id": "059939f1-19be-49ee-ab5b-bf4504639c13",
        "type": "VM",
        "notices": "Reservation 527af509-fb62-41ae-a0c0-191e3c7f6525 (Slice NS4(059939f1-19be-49ee-ab5b-bf4504639c13) Graph Id:14a3d7f7-e797-4e75-bafb-00282ea63896 Owner:daniloassis@utfpr.edu.br) is in state (Active,None_)",
        "start": "2025-08-02 12:44:10 +0000",
        "end": "2025-12-25 18:12:40 +0000",
        "requested_end": "2025-12-25 18:12:40 +0000",
        "units": 1,
        "state": 4,
        "pending_state": 11,
        "sliver": {
          "Name": "NS4",
          "Type": "VM",
          "Capacities": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
          "CapacityHints": "{\"instance_type\": \"fabric.c16.m16.d1000\"}",
          "CapacityAllocations": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
          "LabelAllocations": "{\"instance\": \"instance-00000cf6\", \"instance_parent\": \"toky-w2.fabric-testbed.net\"}",
          "ReservationInfo": "{\"reservation_id\": \"527af509-fb62-41ae-a0c0-191e3c7f6525\", \"reservation_state\": \"Active\"}",
          "NodeMap": "[\"e2b1e451-45b4-4691-9527-aae18cad3b19\", \"FW696S3\"]",
          "StitchNode": "false",
          "ImageRef": "default_ubuntu_22,qcow2",
          "MgmtIp": "133.69.160.21",
          "Site": "TOKY"
        }
      }
      
      
      #9264
      Danilo

      Hi,

      I believe this is the same issue that was discussed and resolved by Komal in the FABRIC forum thread:

      Home › Forums › FPGAs in FABRIC › “Cannot SSH into NS1 and NS5 nodes, need to preserve data (PhD simulations)”

      In that case, the authorized_keys file was found to be empty, SSH access was manually restored, and POA started working again.

      Given that NS2 and NS4 present identical symptoms, I strongly suspect the same root cause here as well. Since this was successfully resolved before, I believe the same approach could be applied.

      These nodes also contain important PhD simulation data that I cannot afford to lose.

      Best regards,
      Danilo

      #9265
      Danilo

      I unintentionally made the same mistake on NS1 and NS5, where I copied the /root/.ssh directory from my local machine to the remote VMs. As a result, the Nova-injected SSH keys were overwritten and SSH/POA access stopped working.

      In that case, support was able to restore the authorized_keys, and both NS1 and NS5 regained access successfully.

      Given this, I believe NS2 and NS4 are affected by the same issue, and I kindly ask if the Nova/POA SSH keys can be restored there as well.
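
      For reference, the safe pattern is to append extra keys rather than replace the file. Below is a minimal fablib sketch, assuming SSH/POA access has already been restored and that node names match slice names as in this thread; the local key path is hypothetical:

      from fabrictestbed_extensions.fablib.fablib import FablibManager

      fablib = FablibManager()

      # Hypothetical path to an extra public key to authorize on the VMs.
      extra_pubkey = open("/home/fabric/work/my_extra_key.pub").read().strip()

      for name in ["NS2", "NS4"]:
          node = fablib.get_slice(name=name).get_node(name=name)
          # Append (>>) instead of overwriting, so the keys injected by
          # Nova/the Control Framework survive; skip if already present.
          node.execute(
              f"grep -qxF '{extra_pubkey}' ~/.ssh/authorized_keys || "
              f"echo '{extra_pubkey}' >> ~/.ssh/authorized_keys"
          )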

      #9267
      yoursunny
      Participant

        Cannot SSH into NS1 and NS5 nodes, need to preserve data (PhD simulations)

        I found that the authorized_keys file on both NS1 and NS5 was empty, which is why SSH (whether with the admin key or via the Control Framework) was failing, resulting in the POA/addKey failure. It seems this may have happened unintentionally as part of the experiment.

        Please be careful not to remove or overwrite the authorized_keys file in the process.

        Given that this is a common user error, maybe the OS images should include a separate account for Control Framework / POA access?
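
        In the meantime, an experimenter can approximate this today by creating a second account with its own authorized_keys, so that clobbering the default account's keys does not cause a complete lockout. A minimal fablib sketch (this is not an existing image feature; the account name and key path are hypothetical):

        from fabrictestbed_extensions.fablib.fablib import FablibManager

        fablib = FablibManager()
        node = fablib.get_slice(name="NS2").get_node(name="NS2")

        # Hypothetical rescue key; keep it separate from the experiment's keys.
        rescue_pubkey = open("/home/fabric/work/rescue_key.pub").read().strip()

        # Create a dedicated account whose authorized_keys the experiment never
        # touches, so it remains usable if the default account gets locked out.
        node.execute(
            "sudo useradd -m -s /bin/bash rescue && "
            "sudo mkdir -p /home/rescue/.ssh && "
            f"echo '{rescue_pubkey}' | sudo tee /home/rescue/.ssh/authorized_keys && "
            "sudo chmod 700 /home/rescue/.ssh && "
            "sudo chmod 600 /home/rescue/.ssh/authorized_keys && "
            "sudo chown -R rescue:rescue /home/rescue/.ssh"
        )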


        I have important simulation data stored on these nodes, and I cannot lose this data.

        While I hope you can get the data back, you should set up automated backups for important data. FABRIC and CloudLab machines should be considered ephemeral and are not suitable for storing important data.
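
        For example, results can be pulled off the VMs on a schedule with fablib. A minimal sketch, assuming fablib is configured (e.g., in the FABRIC JupyterHub); the data directory and backup paths are hypothetical:

        import time
        from fabrictestbed_extensions.fablib.fablib import FablibManager

        fablib = FablibManager()

        def backup(slice_name, node_name, remote_dir, local_tar):
            node = fablib.get_slice(name=slice_name).get_node(name=node_name)
            # Pack the results on the VM, then pull the archive off the testbed.
            node.execute(f"tar czf /tmp/backup.tgz -C {remote_dir} .")
            node.download_file(local_tar, "/tmp/backup.tgz")

        # Hypothetical layout: one results directory per node. Run this from a
        # notebook or cron, keeping multiple timestamped copies.
        for name in ["NS2", "NS4"]:
            backup(name, name, "/home/ubuntu/simulations", f"{name}-{int(time.time())}.tgz")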

        I learned the importance of full backups during my PhD simulations: while I downloaded both the program code and the outcome files, I neglected to save the parameters used to launch the program. After a disk failure, I had to spend multiple weeks reconstructing the input parameters and command lines.
