Long running slice stability issue
This topic has 18 replies, 4 voices, and was last updated 9 months, 3 weeks ago by Mert Cevik.
July 17, 2023 at 4:53 pm #4702
It looks like a couple of VMs (node1 and node2) in my slice lost contact. Both the management IP and a public dataplane IP stopped responding to ping. Unfortunately, the head node is one of the nodes that is lost. The slice information is attached. Before I try to rebuild the slice, is it possible to understand how the VMs got lost? Rebuilding is pretty time-consuming (many hours) because the node.execute() calls take a very long time to run, so it would be great if the VMs that lost contact could be brought back.
Thanks,
Fengping
ID: 2d12324d-66bc-410a-8dda-3c00d1ea0d48
Name: ServiceXSlice
Lease Expiration (UTC): 2023-06-27 13:55:25 +0000
Lease Start (UTC): 2023-06-26 13:55:26 +0000
Project ID: aac04e0e-e0fe-4421-8985-068c117d7437
State: StableOK

July 18, 2023 at 12:42 pm #4731
Fengping,
We are a bit thinly staffed this week, but I will ask someone on the operations team to look into it. Can you post any relevant configuration bits – what you may have done right before you lost access, post-boot scripts, or node.execute() scripts? The most likely cause is a change in routing inside the node that fried the default route pointing back to the management interface. Other options are possible, of course.
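For anyone debugging this kind of lockout, a minimal way to check whether a post-boot script has replaced the management default route, and to restore it, looks roughly like the following. The interface names are the ones used later in this thread; the route to delete is an assumed example, not taken from this slice:

# Show the current default routes. On these VMs the management default is
# learned via router advertisements (RA) on the management interface (ens3).
ip route show default
ip -6 route show default

# If a dataplane interface (ens7 assumed here) has taken over the default,
# deleting that route lets the RA-learned default on ens3 win again.
ip -6 route del default dev ens7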
July 18, 2023 at 1:04 pm #4733
2 VMs were stopped by the hypervisor and I started them. Can you please check the status and your access?
The root cause of this problem is a known issue that we were able to correct on the Phase-1 sites last month, but some of the Phase-2 sites have not received this correction yet. We will find a convenient time in the next few weeks; for now, we will be able to help when this occurs.
July 19, 2023 at 2:12 pm #4775
Thank you for getting back to me despite your busy schedules. It's great that this is a known problem and a fix will be deployed.
The restarted VMs are accessible via the management interface now, but it seems they lost all the other links (we have 3 dataplane links). Is there a way to reattach those links?
Thanks,
Fengping
July 19, 2023 at 5:24 pm #4780
Hello Fengping,
I have re-attached the PCI devices for the VMs: node1 and node2. You would need to reassign the IP addresses on them for your links to work. Please let us know if the links are working as expected after configuring the IP addresses.
Thanks,
Komal
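As a general sketch (not specific to this slice's addressing), once the PCI devices are re-attached the NICs reappear inside the VM and can be brought up and re-addressed by hand:

# List the links to confirm the dataplane NICs are back after the re-attach
ip -o link show

# Bring an interface up and re-add its address; the interface name and
# address below are illustrative examples, use the ones from your own slice
ip link set dev ens7 up
ip -6 addr add 2602:fcfb:1d:2::2/64 dev ens7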
July 20, 2023 at 11:04 am #4782
Hi Komal,
I configured the IP, but it seems the link is not working. Unfortunately, I changed the default route to use this public dataplane link before I put in the policy-based routing (which should keep the management interface working), so I can't access the VM now.
My theory is that the names of the links have probably changed; I assumed ens9 is NET3, but that may have changed after the reboot. I will wait to see if the RA will clean up the default route so I can get in and do the fix. Otherwise I can rebuild the slice if it's too much trouble.
Thanks,
Fengping
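For reference, a rough sketch of the policy-based routing Fengping mentions, which keeps replies going out the management interface even when the main default route points at a dataplane link. The example values are node1's management address and RA gateway from the ip output later in this thread; the table number is arbitrary:

# Example values taken from node1's management interface (ens3) output below
MGMT_ADDR=2001:400:a100:3090:f816:3eff:fe1c:385f
MGMT_GW=fe80::f816:3eff:feac:1ca0

# Any traffic sourced from the management address uses a dedicated table
ip -6 rule add from "$MGMT_ADDR" table 100

# That table's default route points back out the management interface
ip -6 route add default via "$MGMT_GW" dev ens3 table 100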
July 20, 2023 at 1:35 pm #4785
Hi Fengping,
I think:
ens7 -> net1
ens8 -> net3
ens9 -> net2
Please let me know once you get the public access back. I can help figure out the interfaces.
Thanks,
Komal
July 20, 2023 at 2:00 pm #4786
You can confirm the interfaces for Node1 and Node2 via the MAC addresses:
Node1
02:7F:AE:44:CB:C9 => NIC3
06:E3:D6:00:5B:06 => NIC2
02:BC:A6:3F:C7:CB => NIC1
Node2
02:15:60:C2:7A:AD => NIC3
02:1D:B9:31:E7:23 => NIC2
02:B5:53:89:2C:E6 => NIC1
Thanks,
Komal
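A quick way to match those MAC addresses to interface names inside each VM (the example MAC is Node1's NIC3 from the list above):

# List every interface with its MAC address
ip -o link show

# Filter for the NIC of interest, e.g. Node1's NIC3
ip -o link show | grep -i '02:7f:ae:44:cb:c9'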
July 21, 2023 at 10:53 am #4797
Hi Komal,
Thanks for looking up that information for me. Can you kindly reboot both Node1 and Node2? For Node1, it seems I can't regain access, but a reboot should help because it would clear the default route I added. For Node2, I was actually never able to log in, so hopefully a reboot can fix that as well.
Thanks,
Fengping
July 24, 2023 at 11:57 am #4802
Hi Fengping,
I have rebooted both Node1 and Node2. They should be accessible now. Please set up the IPs as per the MAC addresses shared above. Please do let me know if anything else is needed from my side.
Thanks,
Komal
July 24, 2023 at 4:12 pm #4810
Hi Komal,
Thanks for the help. I can log in to both VMs via the management interface now, and I can see the 3 dataplane links present on both VMs. However, after I configured the IP on the public dataplane network (NET3 – so the NIC with MAC 02:7F:AE:44:CB:C9), it can't reach the gateway or other VMs on the same network. I also tried putting this IP on the other two links, and none of them works. Is there another step needed to get the links attached to those networks?
Thanks,
Fengping
root@node1:/home/ubuntu# ip -6 route
::1 dev lo proto kernel metric 256 pref medium
2001:400:a100:3090::/64 dev ens3 proto ra metric 100 expires 86337sec pref medium
2602:fcfb:1d:2::2 dev ens7 proto kernel metric 256 pref medium
fe80::a9fe:a9fe via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 1024 expires 237sec pref medium
fe80::/64 dev ens9 proto kernel metric 256 pref medium
fe80::/64 dev ens3 proto kernel metric 256 pref medium
fe80::/64 dev ens8 proto kernel metric 256 pref medium
fe80::/64 dev ens7 proto kernel metric 256 pref medium
default via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 100 expires 237sec mtu 9000 pref medium
root@node1:/home/ubuntu# ip -6 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 state UP qlen 1000
inet6 2001:400:a100:3090:f816:3eff:fe1c:385f/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 86327sec preferred_lft 14327sec
inet6 fe80::f816:3eff:fe1c:385f/64 scope link
valid_lft forever preferred_lft forever
3: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 2602:fcfb:1d:2::2/128 scope global
valid_lft forever preferred_lft forever
inet6 fe80::7f:aeff:fe44:cbc9/64 scope link
valid_lft forever preferred_lft forever
4: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 fe80::bc:a6ff:fe3f:c7cb/64 scope link
valid_lft forever preferred_lft forever
5: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 fe80::4e3:d6ff:fe00:5b06/64 scope link
valid_lft forever preferred_lft forever
root@node1:/home/ubuntu# ping6 2602:fcfb:1d:2::4
PING 2602:fcfb:1d:2::4(2602:fcfb:1d:2::4) 56 data bytes
^C
— 2602:fcfb:1d:2::4 ping statistics —
5 packets transmitted, 0 received, 100% packet loss, time 4098ms

July 24, 2023 at 5:15 pm #4811
Hi Fengping,
Node1:
ens7 maps to NIC3. It was configured as below. NOTE the prefixlen is set to 128 instead of 64.
ens7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 2602:fcfb:1d:2::2 prefixlen 128 scopeid 0x0
inet6 fe80::7f:aeff:fe44:cbc9 prefixlen 64 scopeid 0x20
ether 02:7f:ae:44:cb:c9 txqueuelen 1000 (Ethernet)
RX packets 28126 bytes 2617668 (2.6 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2581 bytes 208710 (208.7 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
I brought this interface down and re-configured the IP address using the following command:
ip -6 addr add 2602:fcfb:1d:2::2/64 dev ens7
After this I can ping the gateway as well as other nodes.
root@node1:~# ping 2602:fcfb:1d:2::4
PING 2602:fcfb:1d:2::4(2602:fcfb:1d:2::4) 56 data bytes
64 bytes from 2602:fcfb:1d:2::4: icmp_seq=1 ttl=64 time=0.186 ms
^C
--- 2602:fcfb:1d:2::4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.186/0.186/0.186/0.000 ms
root@node1:~# ping 2602:fcfb:1d:2::1
PING 2602:fcfb:1d:2::1(2602:fcfb:1d:2::1) 56 data bytes
64 bytes from 2602:fcfb:1d:2::1: icmp_seq=1 ttl=64 time=0.555 ms
^C
--- 2602:fcfb:1d:2::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.555/0.555/0.555/0.000 ms
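The prefixlen is the key detail here: with a /128 the kernel only installs a host route for the address itself and no on-link /64 route, so the gateway and the other VMs on the same network are never treated as directly reachable. A minimal check-and-fix sketch, using the addresses from this thread:

# Show the addresses and prefix lengths on the dataplane interface
ip -6 addr show dev ens7

# Replace a /128 with the /64 the network uses; the kernel then installs the
# connected route for 2602:fcfb:1d:2::/64 and on-link neighbors become reachable
ip -6 addr del 2602:fcfb:1d:2::2/128 dev ens7
ip -6 addr add 2602:fcfb:1d:2::2/64 dev ens7
ip -6 route show dev ens7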
Node2:
IP was configured on ens7. However, the MAC address for NIC3, 02:15:60:C2:7A:AD, maps to ens9. I configured ens9 with the command:
ip -6 addr add 2602:fcfb:1d:2::3/64 dev ens9
and can now ping the gateway and other nodes.
root@node2:~# ping 2602:fcfb:1d:2::1
PING 2602:fcfb:1d:2::1(2602:fcfb:1d:2::1) 56 data bytes
64 bytes from 2602:fcfb:1d:2::1: icmp_seq=1 ttl=64 time=0.948 ms
64 bytes from 2602:fcfb:1d:2::1: icmp_seq=2 ttl=64 time=0.440 ms
^C
--- 2602:fcfb:1d:2::1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1007ms
rtt min/avg/max/mdev = 0.440/0.694/0.948/0.254 ms
root@node2:~# ping 2602:fcfb:1d:2::2
PING 2602:fcfb:1d:2::2(2602:fcfb:1d:2::2) 56 data bytes
64 bytes from 2602:fcfb:1d:2::2: icmp_seq=1 ttl=64 time=0.146 ms
64 bytes from 2602:fcfb:1d:2::2: icmp_seq=2 ttl=64 time=0.082 ms
^C
--- 2602:fcfb:1d:2::2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1010ms
rtt min/avg/max/mdev = 0.082/0.114/0.146/0.032 ms
Please configure the IPs on other interfaces or share the IPs and I can help configure them.
Thanks,
Komal

July 24, 2023 at 5:22 pm #4814
Hi Komal,
Thanks for looking into this, and for the note. It works nicely now!
Thanks,
Fengping
February 16, 2024 at 5:58 pm #6576
3 nodes in this very long-running slice are down. Can someone bring the nodes back online when possible?
The nodes are node2, node3 and node4. The slice information is here just in case:
ID: 2d12324d-66bc-410a-8dda-3c00d1ea0d48
Name: ServiceXSlice
Project ID: aac04e0e-e0fe-4421-8985-068c117d7437

Thanks,
Fengping
February 18, 2024 at 3:05 pm #6580
Hello Fengping,
We are working on this problem. We will post updates about the VMs.