I cannot access some of my nodes

This topic has 6 replies, 2 voices, and was last updated 2 months, 1 week ago by Fatih Berkay Sarpkaya.

Viewing 7 posts - 1 through 7 (of 7 total)

Author

Posts
May 1, 2026 at 12:52 am #9735
Fatih Berkay Sarpkaya
Participant
Dear FABRIC team,

I hope you are doing well. I suddenly lost access to some of my nodes. I was previously able to connect, but now SSH fails with a “No route to host” error through the bastion.

Could you please check whether there is any issue with these nodes?

Slice ID: d7ac84d6-d791-423f-88a0-90ccacd7880a

node name: r-2-1 Node ID: 7b4c35dd-c7d1-4d29-9ca0-c71d21e6089e

node name: r-2-3 Node ID: c834417a-7393-4cae-bd62-722358b6451f

Thank you.

Best regards,

Fatih Berkay Sarpkaya
May 1, 2026 at 12:03 pm #9739
Mert Cevik
Moderator
Both VMs had crashed. I’m attaching the console outputs.
console.7b4c35dd-c7d1-4d29-9ca0-c71d21e6089e-r-2-1
console.c834417a-7393-4cae-bd62-722358b6451f-r-2-3

I restarted them, and they are back online. I also re-attached their PCI devices (IP addresses need to be re-assigned).
May 1, 2026 at 1:04 pm #9740
Fatih Berkay Sarpkaya
Participant
Thank you so much. I checked the nodes, and they are working now. However, I lost connection to r-4-1 this time. Could you please also check this node?

Slice ID: d7ac84d6-d791-423f-88a0-90ccacd7880a

node name: r-4-1 Node ID: 33186378-c0a9-48de-a382-0e78cb209d6b

Thank you for your time.

Best regards,

Fatih Berkay Sarpkaya
May 1, 2026 at 1:22 pm #9741
Mert Cevik
Moderator
Same situation. Rebooted, devices attached.

I’m not sure what is causing this. The worker node is not heavily loaded, but inside the VMs there seem to be Mellanox driver issues. If you share some context about the actual experiment and the traffic it generates or exchanges, we can try to understand the problem and find a way to keep it running reliably. Otherwise, I don’t have any clues right now. You can reach out to me directly if you prefer.
May 1, 2026 at 1:37 pm #9742
Fatih Berkay Sarpkaya
Participant
Thank you so much for your help.

We are running a multi-AS IPv6 routing experiment with multiple router VMs organized into 6 ASes and a few endpoint VMs. The routers run FRR for BGP and OSPFv3, and we use SRv6 to steer some flows along specific paths. The crashes seem to happen when we push routing/SRv6 configuration changes across all routers at once. So far, three different routers have crashed in this way: r-2-1, r-2-3, and now r-4-1.
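
Since the crashes seem to coincide with pushing changes to all routers at once, one way to narrow this down might be to apply the change to one router at a time and stop at the first router that stops responding. The following is only a minimal sketch of that idea, not something suggested in the thread. `staggered_push`, `apply_config`, `is_healthy`, and `settle_seconds` are hypothetical names: the two hooks would have to be supplied by the experimenter (for example, running `vtysh` over SSH and probing the node with ping).

```python
import time
from typing import Callable, List, Sequence


def staggered_push(
    routers: Sequence[str],
    apply_config: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    settle_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Apply a change to one router at a time, stopping at the first failure.

    Returns the routers that were configured and passed the health check.
    Raises RuntimeError naming the first router that failed the check.
    """
    completed: List[str] = []
    for name in routers:
        apply_config(name)        # hypothetical hook, e.g. run vtysh over SSH
        sleep(settle_seconds)     # wait before probing, so routing can settle
        if not is_healthy(name):  # hypothetical hook, e.g. ping or SSH probe
            raise RuntimeError(
                f"{name} failed its health check; completed so far: {completed}"
            )
        completed.append(name)
    return completed
```

If a crash still occurs with this kind of rollout, the router named in the error and the change applied to it would be a concrete starting point for comparing against the console output.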

The console output you sent seems to point to kernel-level issues; one looked like a Mellanox driver issue, and another looked like a possible SRv6 kernel bug. We are using Ubuntu 22.04 with kernel 5.15.0-143-generic.
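
To compare the console outputs from the three crashes, a rough triage script could count lines that match a few crash signatures. This is a sketch under assumptions: the patterns below (`mlx5_core` for the Mellanox driver, `seg6` for the kernel's SRv6 code, and generic kernel fault markers) are guesses at what the logs contain, not strings quoted from the attached files, and `triage_console` and `SIGNATURES` are hypothetical names.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumed signatures; adjust them after reading the real console output.
SIGNATURES: Dict[str, "re.Pattern[str]"] = {
    "mellanox_driver": re.compile(r"mlx5_core", re.IGNORECASE),
    "srv6": re.compile(r"seg6", re.IGNORECASE),
    "kernel_fault": re.compile(r"kernel panic|BUG:|Oops", re.IGNORECASE),
}


def triage_console(lines: Iterable[str]) -> Counter:
    """Count how many lines match each signature (at most once per line)."""
    counts: Counter = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

It could be run on a saved console file with `with open(path, errors="replace") as f: print(triage_console(f))`, where `path` points to one of the attached console outputs.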

I am not sure whether this is something we can reliably fix from our side, but please let me know if you have any suggestions.

Thank you.

Best regards,

Fatih Berkay Sarpkaya
May 1, 2026 at 3:10 pm #9743
Fatih Berkay Sarpkaya
Participant
For node “r-4-1”, which you have just restored, could you please check again whether its connections are correct? It was connected directly to “r-1-4”, but I currently cannot ping between these nodes, so I suspect there could be an issue with the L2 link between them.

Thank you for your time.

Best regards,

Fatih Berkay Sarpkaya
May 1, 2026 at 3:28 pm #9744
Fatih Berkay Sarpkaya
Participant
Hi,

Sorry, this could be my mistake. After the reboot, the Linux interface assignment may have changed. I can now see the connection through a different interface than before the crash.
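
For anyone hitting the same thing: because interface names can change after a reboot or device re-attach, it may be safer to identify links by MAC address than by name. The sketch below assumes the MAC addresses of the attached devices stay the same across the reboot (not confirmed in this thread) and reads them from the standard Linux sysfs path `/sys/class/net/<iface>/address`. `interfaces_by_mac` is a hypothetical helper name.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def interfaces_by_mac(sysfs_root: str = "/sys/class/net") -> Dict[str, List[str]]:
    """Map each MAC address (lowercase) to the interface names that carry it."""
    result: Dict[str, List[str]] = defaultdict(list)
    for entry in sorted(Path(sysfs_root).iterdir()):
        address_file = entry / "address"
        # Skip entries that are not interfaces (no readable "address" file).
        if address_file.is_file():
            mac = address_file.read_text().strip().lower()
            # Several interfaces can share a MAC, so keep a list per address.
            result[mac].append(entry.name)
    return dict(result)
```

Recording this mapping before a configuration push and comparing it after a reboot would show which name each link ended up with, so addresses can be re-assigned to the right interface.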

Thank you.

Kind regards,

Fatih Berkay Sarpkaya