Lost SSH access to some nodes after node.os_reboot()

#8556
Sunjay Cauligi
Participant

Hello, I was following the examples for NUMA tuning, which call node.os_reboot() after requesting CPU pinning and NUMA tuning.
Most of my nodes have come back up, but node2d2, node2d3, and node3d2 are still inaccessible via SSH.

Slice name: ei-network-20250515140559
Slice ID: b94cdedd-c230-4059-a172-a1ff45fd85e8
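
For reference, this is roughly how I'm checking whether a node is reachable (plain FABlib calls; the "hostname" command is just an arbitrary probe):

    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()
    slc = fablib.get_slice(name="ei-network-20250515140559")

    # try a trivial command over the management network on each node
    for node in slc.get_nodes():
        try:
            node.execute("hostname")
            print(f"{node.get_name()}: reachable")
        except Exception as e:
            print(f"{node.get_name()}: SSH failed ({e})")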

#8559
Mert Cevik
Moderator

Hello Sunjay,

I checked the 3 VMs. I was able to bring node3d2 back online manually, but the other two did not come back. I suggest re-creating the slice (or modifying it to re-create those VMs).

We also need to look into this on our side. Can you point us to the notebook you used? I assume it is one of the notebooks in the fabric-examples GitHub repo, but if it is a customized notebook, please let me know; I can reach out to you via email to get the notebook, or you can attach it to this thread if that works for you.
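
For the re-create path, the FABlib steps would roughly be to delete the current slice and then re-provision it with your usual scripts:

    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()

    # tear down the existing slice (ID from your first post)...
    fablib.get_slice(slice_id="b94cdedd-c230-4059-a172-a1ff45fd85e8").delete()

    # ...then re-run your provisioning scripts to create it again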

#8562
Sunjay Cauligi
Participant

Hi Mert,

I was following the NUMA steps based on this example notebook: iperf3_optimized.ipynb
It really just boiled down to running the following:

    # slc is the FABlib slice object for this slice (I get it through my wrapper library)

    # pin each VM's vCPUs to the host cores local to the 'einic' NIC
    for node in slc.get_nodes():
        node.pin_cpu(component_name='einic')
    # one node failed to pin CPUs here; investigated that with node.get_cpu_info()

    # pin the VM's memory to the same NUMA node as the pinned cores
    for node in slc.get_nodes():
        node.numa_tune()
    # a couple of nodes failed to pin memory here; investigated that with node.get_numa_info()

    # reboot each VM so the pinning and tuning take effect
    for node in slc.get_nodes():
        node.os_reboot()

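For the failures mentioned in the comments above, "investigated" just means dumping what those calls report for the node in question, along these lines (node2d2 is used only as an example name):

    # inspect CPU pinning and NUMA placement for one node
    node = slc.get_node(name="node2d2")
    print(node.get_cpu_info())
    print(node.get_numa_info())
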

The slice I ran this on is a long-running project slice that has been up for the last two weeks. Instead of a notebook, I use a wrapper library and Python scripts run directly from my local machine to provision and manage slices. I can email you my library code and an example script of my typical usage if you would like to take a look.

This slice was provisioned with three nodes at each of three different sites; each node is provisioned identically with a FABNETv4 NIC ("internal", with a 10.* IP address) and a FABNETv4Ext NIC ("public", with a publicly routable IP address).
I install identical software and run identical code on each node as well.
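
Roughly, each node's networking looks like this (a simplified sketch for one node; the site, node name, network names, and the "extnic" component name are illustrative, and in reality this goes through my wrapper library):

    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()
    slc = fablib.new_slice(name="example-slice")

    site = "STAR"  # illustrative; the real slice spans three sites
    node = slc.add_node(name="node1d1", site=site)

    # "internal" NIC attached to FABNETv4 (10.* address)
    int_nic = node.add_component(model="NIC_Basic", name="einic")
    slc.add_l3network(name=f"fabnet_v4_{site}",
                      interfaces=[int_nic.get_interfaces()[0]],
                      type="IPv4")

    # "public" NIC attached to FABNETv4Ext (publicly routable address)
    ext_nic = node.add_component(model="NIC_Basic", name="extnic")
    slc.add_l3network(name=f"fabnet_v4ext_{site}",
                      interfaces=[ext_nic.get_interfaces()[0]],
                      type="IPv4Ext")

    slc.submit()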
