Lost SSH access to some nodes after node.os_reboot()

#8556
Sunjay Cauligi
Participant

Hello, I was following the examples for NUMA tuning, which call node.os_reboot() after requesting CPU pinning and NUMA tuning.
Most of my nodes have come back up, but node2d2, node2d3, and node3d2 are still inaccessible via SSH.

Slice name: ei-network-20250515140559
Slice ID: b94cdedd-c230-4059-a172-a1ff45fd85e8
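
For reference, this is roughly how I'm checking whether a node is reachable (plain FABlib calls; the "hostname" command is just an arbitrary probe):

    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()
    slc = fablib.get_slice(name="ei-network-20250515140559")

    # try a trivial command over the management network on each node
    for node in slc.get_nodes():
        try:
            node.execute("hostname")
            print(f"{node.get_name()}: reachable")
        except Exception as e:
            print(f"{node.get_name()}: SSH failed ({e})")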

#8559
Mert Cevik
Moderator

Hello Sunjay,

I checked the 3 VMs. I was able to bring node3d2 back online manually, but the other two did not come back. I suggest re-creating the slice (or modifying it to re-create those VMs).

We also need to look into this on our side. Can you point us to the notebook you used? I assume it is one of the notebooks in the fabric-examples GitHub repo, but if it is a customized notebook, please let me know; I can reach out to you via email to get the notebook, or you can attach it to this thread if that works for you.
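
For the re-create path, the FABlib steps would roughly be to delete the current slice and then re-provision it with your usual scripts:

    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()

    # tear down the existing slice (ID from your first post)...
    fablib.get_slice(slice_id="b94cdedd-c230-4059-a172-a1ff45fd85e8").delete()

    # ...then re-run your provisioning scripts to create it again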

#8562
Sunjay Cauligi
Participant

Hi Mert,

I was following the NUMA steps based on this example notebook: iperf3_optimized.ipynb
It really just boiled down to running the following:

    # slc is the FABlib slice object for this slice (I get it through my wrapper library)

    # pin each VM's vCPUs to the host cores local to the 'einic' NIC
    for node in slc.get_nodes():
        node.pin_cpu(component_name='einic')
    # one node failed to pin CPUs here; investigated that with node.get_cpu_info()

    # pin the VM's memory to the same NUMA node as the pinned cores
    for node in slc.get_nodes():
        node.numa_tune()
    # a couple of nodes failed to pin memory here; investigated that with node.get_numa_info()

    # reboot each VM so the pinning and tuning take effect
    for node in slc.get_nodes():
        node.os_reboot()

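For the failures mentioned in the comments above, "investigated" just means dumping what those calls report for the node in question, along these lines (node2d2 is used only as an example name):

    # inspect CPU pinning and NUMA placement for one node
    node = slc.get_node(name="node2d2")
    print(node.get_cpu_info())
    print(node.get_numa_info())
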

The slice I ran this on is a long-running project slice that has been up for the last two weeks. Instead of a notebook, I use a wrapper library and Python scripts run directly from my local machine to provision and manage slices. I can email you my library code and an example script of my typical usage if you would like to take a look.

This slice was provisioned with three nodes at each of three different sites; each node is provisioned identically with a FABNETv4 NIC ("internal", with a 10.* IP address) and a FABNETv4Ext NIC ("public", with a publicly routable IP address).
I install identical software and run identical code on each node as well.
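
Roughly, each node's networking looks like this (a simplified sketch for one node; the site, node name, network names, and the "extnic" component name are illustrative, and in reality this goes through my wrapper library):

    from fabrictestbed_extensions.fablib.fablib import FablibManager

    fablib = FablibManager()
    slc = fablib.new_slice(name="example-slice")

    site = "STAR"  # illustrative; the real slice spans three sites
    node = slc.add_node(name="node1d1", site=site)

    # "internal" NIC attached to FABNETv4 (10.* address)
    int_nic = node.add_component(model="NIC_Basic", name="einic")
    slc.add_l3network(name=f"fabnet_v4_{site}",
                      interfaces=[int_nic.get_interfaces()[0]],
                      type="IPv4")

    # "public" NIC attached to FABNETv4Ext (publicly routable address)
    ext_nic = node.add_component(model="NIC_Basic", name="extnic")
    slc.add_l3network(name=f"fabnet_v4ext_{site}",
                      interfaces=[ext_nic.get_interfaces()[0]],
                      type="IPv4Ext")

    slc.submit()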
