Cannot SSH into NS2 and NS4 nodes, need to preserve data (PhD simulations)

#9260
Danilo

    I am having problems connecting to slices NS2 and NS4, and I urgently need assistance.

    🖥️ Updating node: NS4
    ⚠️ Error while processing slice 'NS4':
    POA – 1f2b954c-ba98-4e62-a520-57c4ee414956/addkey failed with error:
    "Exception during POA for unit: 527af509-fb62-41ae-a0c0-191e3c7f6525 — Playbook has failed tasks: All items completed."

    🔧 Accessing slice: NS2
    🖥️ Updating node: NS2
    ⚠️ Error while processing slice 'NS2':
    POA – 3592d5c3-f88b-4e82-844c-c5def5301e00/addkey failed with error:
    "Exception during POA for unit: be7426fd-7ebe-4e6b-bc65-56e30d7e8e50 — Playbook has failed tasks: All items completed."

    I have important simulation data stored on these nodes, and I cannot lose this data.
    Please help me recover access to NS2 and NS4 as soon as possible.

    #9263
    Mert Cevik
    Moderator

      Hello Danilo,

      I checked both VMs (NS2 and NS4) and they are up and online. However, there appear to have been changes to the SSH key(s) injected by the system, and I cannot log in to the VMs. I am not sure of the root cause of the exceptions you posted; they indicate that the FABRIC orchestration system cannot perform actions on the VMs, and the issue with the SSH keys may be the main cause. We need to learn from you what might have changed on the VMs.

      I'm posting the reservation info here. Both VM reservations are valid until December 25, 2025.

      {
        "sliver_id": "be7426fd-7ebe-4e6b-bc65-56e30d7e8e50",
        "slice_id": "53cfa2bd-5110-420d-8bdb-053c1af45801",
        "type": "VM",
        "notices": "Reservation be7426fd-7ebe-4e6b-bc65-56e30d7e8e50 (Slice NS2(53cfa2bd-5110-420d-8bdb-053c1af45801) Graph Id:2aced1b1-3843-4809-b9a2-9ed9e2ff0317 Owner:daniloassis@utfpr.edu.br) is in state (Active,None_)",
        "start": "2025-08-02 12:44:10 +0000",
        "end": "2025-12-25 18:11:58 +0000",
        "requested_end": "2025-12-25 18:11:58 +0000",
        "units": 1,
        "state": 4,
        "pending_state": 11,
        "sliver": {
          "Name": "NS2",
          "Type": "VM",
          "Capacities": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
          "CapacityHints": "{\"instance_type\": \"fabric.c16.m16.d1000\"}",
          "CapacityAllocations": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
          "LabelAllocations": "{\"instance\": \"instance-000040ec\", \"instance_parent\": \"star-w1.fabric-testbed.net\"}",
          "ReservationInfo": "{\"reservation_id\": \"be7426fd-7ebe-4e6b-bc65-56e30d7e8e50\", \"reservation_state\": \"Active\"}",
          "NodeMap": "[\"e2b1e451-45b4-4691-9527-aae18cad3b19\", \"BBQSH63\"]",
          "StitchNode": "false",
          "ImageRef": "default_ubuntu_22,qcow2",
          "MgmtIp": "2001:400:a100:3030:f816:3eff:fed9:14ee",
          "Site": "STAR"
        }
      }
      
      {
        "sliver_id": "527af509-fb62-41ae-a0c0-191e3c7f6525",
        "slice_id": "059939f1-19be-49ee-ab5b-bf4504639c13",
        "type": "VM",
        "notices": "Reservation 527af509-fb62-41ae-a0c0-191e3c7f6525 (Slice NS4(059939f1-19be-49ee-ab5b-bf4504639c13) Graph Id:14a3d7f7-e797-4e75-bafb-00282ea63896 Owner:daniloassis@utfpr.edu.br) is in state (Active,None_)",
        "start": "2025-08-02 12:44:10 +0000",
        "end": "2025-12-25 18:12:40 +0000",
        "requested_end": "2025-12-25 18:12:40 +0000",
        "units": 1,
        "state": 4,
        "pending_state": 11,
        "sliver": {
          "Name": "NS4",
          "Type": "VM",
          "Capacities": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
          "CapacityHints": "{\"instance_type\": \"fabric.c16.m16.d1000\"}",
          "CapacityAllocations": "{\"core\": 16, \"disk\": 1000, \"ram\": 16}",
          "LabelAllocations": "{\"instance\": \"instance-00000cf6\", \"instance_parent\": \"toky-w2.fabric-testbed.net\"}",
          "ReservationInfo": "{\"reservation_id\": \"527af509-fb62-41ae-a0c0-191e3c7f6525\", \"reservation_state\": \"Active\"}",
          "NodeMap": "[\"e2b1e451-45b4-4691-9527-aae18cad3b19\", \"FW696S3\"]",
          "StitchNode": "false",
          "ImageRef": "default_ubuntu_22,qcow2",
          "MgmtIp": "133.69.160.21",
          "Site": "TOKY"
        }
      }
      
      
      #9264
      Danilo

      Hi,

      I believe this is the same issue that was discussed and resolved by Komal in the FABRIC forum thread:

      Home › Forums › FPGAs in FABRIC › “Cannot SSH into NS1 and NS5 nodes, need to preserve data (PhD simulations)”

      In that case, the authorized_keys file was found to be empty, SSH access was manually restored, and POA started working again.

      Given that NS2 and NS4 present identical symptoms, I strongly suspect the same root cause here as well. Since this was successfully resolved before, I believe the same approach could be applied.

      These nodes also contain important PhD simulation data that I cannot afford to lose.

      Best regards,
      Danilo

      #9265
      Danilo

      I unintentionally made the same mistake on NS1 and NS5, where I copied the /root/.ssh directory from my local machine to the remote VMs. As a result, the Nova-injected SSH keys were overwritten and SSH/POA access stopped working.

      In that case, support was able to restore the authorized_keys, and both NS1 and NS5 regained access successfully.

      Given this, I believe NS2 and NS4 are affected by the same issue, and I kindly ask if the Nova/POA SSH keys can be restored there as well.
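
      For reference, the safe pattern is to append extra keys rather than replace the file. Below is a minimal fablib sketch, assuming SSH/POA access has already been restored and that node names match slice names as in this thread; the local key path is hypothetical:

      from fabrictestbed_extensions.fablib.fablib import FablibManager

      fablib = FablibManager()

      # Hypothetical path to an extra public key to authorize on the VMs.
      extra_pubkey = open("/home/fabric/work/my_extra_key.pub").read().strip()

      for name in ["NS2", "NS4"]:
          node = fablib.get_slice(name=name).get_node(name=name)
          # Append (>>) instead of overwriting, so the keys injected by
          # Nova/the Control Framework survive; skip if already present.
          node.execute(
              f"grep -qxF '{extra_pubkey}' ~/.ssh/authorized_keys || "
              f"echo '{extra_pubkey}' >> ~/.ssh/authorized_keys"
          )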

      #9267
      yoursunny
      Participant

        Cannot SSH into NS1 and NS5 nodes, need to preserve data (PhD simulations)

        I found that the authorized_keys file on both NS1 and NS5 was empty, which is why SSH (whether with the admin key or via the Control Framework) was failing, resulting in the POA/addKey failure. It seems this may have happened unintentionally as part of the experiment.

        Please be careful not to remove or overwrite the authorized_keys file in the process.

        Given that this is a common user error, maybe the OS images should include a separate account for Control Framework / POA access?
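
        In the meantime, an experimenter can approximate this today by creating a second account with its own authorized_keys, so that clobbering the default account's keys does not cause a complete lockout. A minimal fablib sketch (this is not an existing image feature; the account name and key path are hypothetical):

        from fabrictestbed_extensions.fablib.fablib import FablibManager

        fablib = FablibManager()
        node = fablib.get_slice(name="NS2").get_node(name="NS2")

        # Hypothetical rescue key; keep it separate from the experiment's keys.
        rescue_pubkey = open("/home/fabric/work/rescue_key.pub").read().strip()

        # Create a dedicated account whose authorized_keys the experiment never
        # touches, so it remains usable if the default account gets locked out.
        node.execute(
            "sudo useradd -m -s /bin/bash rescue && "
            "sudo mkdir -p /home/rescue/.ssh && "
            f"echo '{rescue_pubkey}' | sudo tee /home/rescue/.ssh/authorized_keys && "
            "sudo chmod 700 /home/rescue/.ssh && "
            "sudo chmod 600 /home/rescue/.ssh/authorized_keys && "
            "sudo chown -R rescue:rescue /home/rescue/.ssh"
        )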


        I have important simulation data stored on these nodes, and I cannot lose this data.

        While I hope you can get the data back, you should set up automated backups for important data. FABRIC and CloudLab machines should be considered ephemeral and are not suitable for storing important data.
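
        For example, results can be pulled off the VMs on a schedule with fablib. A minimal sketch, assuming fablib is configured (e.g., in the FABRIC JupyterHub); the data directory and backup paths are hypothetical:

        import time
        from fabrictestbed_extensions.fablib.fablib import FablibManager

        fablib = FablibManager()

        def backup(slice_name, node_name, remote_dir, local_tar):
            node = fablib.get_slice(name=slice_name).get_node(name=node_name)
            # Pack the results on the VM, then pull the archive off the testbed.
            node.execute(f"tar czf /tmp/backup.tgz -C {remote_dir} .")
            node.download_file(local_tar, "/tmp/backup.tgz")

        # Hypothetical layout: one results directory per node. Run this from a
        # notebook or cron, keeping multiple timestamped copies.
        for name in ["NS2", "NS4"]:
            backup(name, name, "/home/ubuntu/simulations", f"{name}-{int(time.time())}.tgz")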

        I learned the importance of full backups during my PhD simulations: while I downloaded both the program code and the outcome files, I neglected to save the parameters used to launch the program. After a disk failure, I had to spend multiple weeks reconstructing the input parameters and command lines.
