1. I cannot access some of my nodes

  • #9735

    Dear FABRIC team,

    I hope you are doing well. I suddenly lost access to some of my nodes. I was previously able to connect, but now SSH fails with a “No route to host” error through the bastion.

    Could you please check whether there is any issue with these nodes?

    Slice ID: d7ac84d6-d791-423f-88a0-90ccacd7880a

    node name: r-2-1 Node ID: 7b4c35dd-c7d1-4d29-9ca0-c71d21e6089e

    node name: r-2-3 Node ID: c834417a-7393-4cae-bd62-722358b6451f

    Thank you.

    Best regards,

    Fatih Berkay Sarpkaya

    #9739
    Mert Cevik
    Moderator

      Both VMs had crashed. I’m attaching the console outputs.
      console.7b4c35dd-c7d1-4d29-9ca0-c71d21e6089e-r-2-1
      console.c834417a-7393-4cae-bd62-722358b6451f-r-2-3

      I restarted them, and they are back online. I also attached their PCI devices (IP addresses need to be re-assigned).

      #9740

      Thank you so much. I checked the nodes, and they are working now. However, I lost connection to r-4-1 this time. Could you please also check this node?

      Slice ID: d7ac84d6-d791-423f-88a0-90ccacd7880a

      node name: r-4-1 Node ID: 33186378-c0a9-48de-a382-0e78cb209d6b

      Thank you for your time.

      Best regards,

      Fatih Berkay Sarpkaya

      #9741
      Mert Cevik
      Moderator

        Same situation. Rebooted, devices attached.

        I’m not sure what is causing this. The worker node is not heavily loaded, but inside the VMs there seem to be Mellanox driver issues. If you share some context about the actual experiment and the traffic it generates or exchanges, we can try to understand the cause and find a way to keep it running reliably. Otherwise, I don’t have any clues right now. You can reach out to me directly if you prefer.

        #9742

        Thank you so much for your help.

        We are running a multi-AS IPv6 routing experiment with multiple router VMs organized into 6 ASes and a few endpoint VMs. The routers run FRR for BGP and OSPFv3, and we use SRv6 to steer some flows along specific paths. The crashes seem to happen when we push routing/SRv6 configuration changes across all routers at once. So far, three different routers have crashed in this way: r-2-1, r-2-3, and now r-4-1.
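Since the crashes seem tied to pushing changes to all routers at once, one thing we could try is applying them one router at a time and stopping at the first failure, to narrow down which change triggers the crash. This is only a minimal sketch: `apply_config` and `is_healthy` are hypothetical callables (for example, an SSH command that loads the FRR/SRv6 configuration and a simple reachability probe), and the settle time is an arbitrary assumption. It is not known whether staggering the changes would avoid the crash.

```python
import time
from typing import Callable, Iterable, List


def staged_rollout(
    routers: Iterable[str],
    apply_config: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    settle_seconds: float = 30.0,
) -> List[str]:
    """Apply a change to one router at a time and stop at the first failure.

    Returns the routers that were updated and still passed the health check.
    """
    updated = []
    for name in routers:
        apply_config(name)           # hypothetical: push config to a single router
        time.sleep(settle_seconds)   # assumed wait before probing the router
        if not is_healthy(name):     # hypothetical: e.g. an SSH or ping probe
            print(f"{name} failed its health check; stopping rollout")
            break
        updated.append(name)
    return updated
```

The router that fails the check, together with the change that was just applied to it, would be the first place to look.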

        The console output you sent seems to point to kernel-level issues; one looked like a Mellanox driver issue, and another looked like a possible SRv6 kernel bug. We are using Ubuntu 22.04 with kernel 5.15.0-143-generic.
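To compare the console outputs more systematically, a small script could group lines by the kind of kernel message they contain. The sketch below is an illustration only: the marker patterns are assumptions (common kernel log strings plus the `mlx5` and `seg6` name prefixes), not strings confirmed to appear in the attached console files, and they would need adjusting to the real output.

```python
import re
from typing import Dict, List

# Assumed markers; adjust to whatever actually appears in the console output.
MARKERS = {
    "panic": re.compile(r"Kernel panic", re.IGNORECASE),
    "oops": re.compile(r"\bOops\b|BUG:"),
    "mlx5": re.compile(r"mlx5", re.IGNORECASE),   # Mellanox driver messages
    "seg6": re.compile(r"seg6", re.IGNORECASE),   # SRv6-related kernel symbols
}


def scan_console(text: str) -> Dict[str, List[str]]:
    """Return the console lines that match each marker, keyed by marker label."""
    hits: Dict[str, List[str]] = {}
    for line in text.splitlines():
        for label, pattern in MARKERS.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(line.strip())
    return hits
```

Running `scan_console` on the text of each console file would show which crashes mention the driver and which mention SRv6 code paths.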

        I am not sure whether this is something we can reliably fix from our side, but please let me know if you have any suggestions.

        Thank you.

        Best regards,

        Fatih Berkay Sarpkaya

        #9743

        For node “r-4-1”, which you have just restored, could you please check again whether its connections are correct? It was connected directly to “r-1-4”, but I currently cannot ping between these nodes, so I thought there could be an issue with the L2 link between them.

        Thank you for your time.

        Best regards,

        Fatih Berkay Sarpkaya

        #9744

        Hi,

        Sorry, this could be my mistake. After the reboot, the Linux interface assignment may have changed. I can now see the connection through a different interface than before the crash.
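To avoid relying on interface names that may change after a reboot, the interfaces can be looked up by MAC address instead. The sketch below reads the standard Linux sysfs tree under `/sys/class/net`; it assumes the MAC addresses stay the same after the devices are re-attached, which I have not verified.

```python
from pathlib import Path
from typing import Dict


def interfaces_by_mac(sys_net: str = "/sys/class/net") -> Dict[str, str]:
    """Map each MAC address to the interface name that currently holds it."""
    mapping: Dict[str, str] = {}
    for entry in sorted(Path(sys_net).iterdir()):
        address_file = entry / "address"
        if not address_file.is_file():
            continue
        mac = address_file.read_text().strip().lower()
        # Skip the all-zero address used by loopback.
        if mac and mac != "00:00:00:00:00:00":
            # If several interfaces share a MAC, the last name in sorted order wins.
            mapping[mac] = entry.name
    return mapping
```

Recording this mapping before an experiment and comparing it after a reboot would show which name each link ended up on, so the IP addresses can be re-assigned to the right interface.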

        Thank you.

        Kind regards,

        Fatih Berkay Sarpkaya
