Long running slice stability issue

#4702
    Fengping Hu
    Participant

It looks like a couple of VMs (node1 and node2) in my slice lost contact. Both the management IP and a public data plane IP stopped responding to ping. Unfortunately, the head node is one of the nodes that was lost. The slice information is attached. Before I try to rebuild the slice, is it possible to understand how the VMs got lost? Rebuilding is pretty time consuming (many hours) because the node.execute() calls take a very long time to run, so it would be great if the VMs that lost contact could be brought back.

      Thanks,

      Fengping

ID: 2d12324d-66bc-410a-8dda-3c00d1ea0d48
Name: ServiceXSlice
Lease Expiration (UTC): 2023-06-27 13:55:25 +0000
Lease Start (UTC): 2023-06-26 13:55:26 +0000
Project ID: aac04e0e-e0fe-4421-8985-068c117d7437
State: StableOK

      #4731
      Ilya Baldin
      Participant

        Fengping,

We are a bit thinly staffed this week, but I will ask someone on the operations team to look into it. Can you post any relevant configuration bits – what you may have done right before you lost access, post-boot scripts or node.execute() scripts? The most likely cause is a change in routing inside the node that fried the default route pointing back to the management interface. Other options are possible, of course.
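
For reference, one way to avoid losing management access when experimenting with routes is to give the management interface its own routing table before changing the main default route. The following is only an illustrative sketch, using addresses that appear for node1 later in this thread (ens3 = management, ens7 = dataplane); adapt the names and addresses to the actual node:

ip -6 route add default via fe80::f816:3eff:feac:1ca0 dev ens3 table 100   # management default route in its own table
ip -6 rule add from 2001:400:a100:3090:f816:3eff:fe1c:385f/128 table 100   # replies sourced from the management IP use that table
ip -6 route replace default via 2602:fcfb:1d:2::1 dev ens7                 # only then repoint the main default route at the dataplane

With the rule in place, traffic arriving on the management interface is answered via the management gateway even if the main default route later points at a dataplane link.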

        #4733
        Mert Cevik
        Moderator

          2 VMs were stopped by the hypervisor and I started them. Can you please check the status and your access?

The root cause of this problem is a known issue that we were able to correct on Phase-1 sites last month, but some of the Phase-2 sites have not received the correction yet. We will find a convenient time for it in the next few weeks; for now, we will be able to help whenever this occurs.

          #4775
          Fengping Hu
          Participant

Thank you for getting back to me despite your busy schedules. It’s great that this is a known problem and that a fix will be deployed.

The restarted VMs are accessible via the management interface now. However, it seems they lost all the other links (we have 3 data plane links). Is there a way to reattach those links?

Thanks,

Fengping

            #4780
            Komal Thareja
            Participant

              Hello Fengping,

I have re-attached the PCI devices for the VMs node1 and node2. You will need to reassign the IP addresses on them for your links to work. Please let us know if the links are working as expected after configuring the IP addresses.

              Thanks,

              Komal

              #4782
              Fengping Hu
              Participant

                Hi Komal,

I configured the IP but it seems the link is not working. Unfortunately, I changed the default route to use this public dataplane link before I put in the policy-based routing (which would have kept the management interface working), so I can’t access the VM now.

My theory is that the names of the links have probably changed; I assumed ens9 was NET3, but that has probably changed after the reboot. I will wait and see whether the RA clears the default route so I can get back in and apply the fix. Otherwise I can rebuild the slice if it’s too much trouble.

                Thanks,

                Fengping

                #4785
                Komal Thareja
                Participant

                  Hi Fengping,

I think ens7 -> net1, ens8 -> net3, and ens9 -> net2. Please let me know once you get public access back; I can help figure out the interfaces.

                  Thanks,

                  Komal

                  #4786
                  Komal Thareja
                  Participant

You can confirm the interfaces for Node1 and Node2 via their MAC addresses:

Node1

02:7F:AE:44:CB:C9 => NIC3
06:E3:D6:00:5B:06 => NIC2
02:BC:A6:3F:C7:CB => NIC1

Node2

02:15:60:C2:7A:AD => NIC3
02:1D:B9:31:E7:23 => NIC2
02:B5:53:89:2C:E6 => NIC1
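
After a reboot the ensX names can shift, so it is worth confirming the mapping on the node itself. A quick illustrative check (any of the MACs above can be substituted into the grep):

ip -br link show                                # lists each interface name with its MAC address
ip -br link show | grep -i 02:7f:ae:44:cb:c9    # e.g. find which interface carries Node1's NIC3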

                    Thanks,

                    Komal

                    #4797
                    Fengping Hu
                    Participant

                      Hi Komal,

Thanks for looking up that information for me. Can you kindly reboot both Node1 and Node2? For Node1, it seems I can’t regain access, but a reboot would help because it would clear the default route I added. For Node2, I was actually never able to log in, so hopefully a reboot can fix that as well.

                      Thanks,

                      Fengping

                      #4802
                      Komal Thareja
                      Participant

                        Hi Fengping,

I have rebooted both Node1 and Node2. They should be accessible now. Please set up the IPs as per the MAC addresses shared above. Please do let me know if anything else is needed from my side.

                        Thanks,

Komal

                        #4810
                        Fengping Hu
                        Participant

                          Hi Komal,

Thanks for the help. I can log in to both VMs via the management interface now. I can also see that the 3 dataplane links are present on both VMs. However, after I configured the IP on the public dataplane network (NET3, i.e. the NIC with MAC 02:7F:AE:44:CB:C9), it can’t reach the gateway or other VMs in the same network. I also tried putting this IP on the other two links and none of them works. Is there another step needed to attach the links to those networks?

                          Thanks,

Fengping

                          root@node1:/home/ubuntu# ip -6 route
                          ::1 dev lo proto kernel metric 256 pref medium
                          2001:400:a100:3090::/64 dev ens3 proto ra metric 100 expires 86337sec pref medium
                          2602:fcfb:1d:2::2 dev ens7 proto kernel metric 256 pref medium
                          fe80::a9fe:a9fe via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 1024 expires 237sec pref medium
                          fe80::/64 dev ens9 proto kernel metric 256 pref medium
                          fe80::/64 dev ens3 proto kernel metric 256 pref medium
                          fe80::/64 dev ens8 proto kernel metric 256 pref medium
                          fe80::/64 dev ens7 proto kernel metric 256 pref medium
                          default via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 100 expires 237sec mtu 9000 pref medium
                          root@node1:/home/ubuntu# ip -6 a
                          1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
                          inet6 ::1/128 scope host
                          valid_lft forever preferred_lft forever
                          2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 state UP qlen 1000
                          inet6 2001:400:a100:3090:f816:3eff:fe1c:385f/64 scope global dynamic mngtmpaddr noprefixroute
                          valid_lft 86327sec preferred_lft 14327sec
                          inet6 fe80::f816:3eff:fe1c:385f/64 scope link
                          valid_lft forever preferred_lft forever
                          3: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                          inet6 2602:fcfb:1d:2::2/128 scope global
                          valid_lft forever preferred_lft forever
                          inet6 fe80::7f:aeff:fe44:cbc9/64 scope link
                          valid_lft forever preferred_lft forever
                          4: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                          inet6 fe80::bc:a6ff:fe3f:c7cb/64 scope link
                          valid_lft forever preferred_lft forever
                          5: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                          inet6 fe80::4e3:d6ff:fe00:5b06/64 scope link
                          valid_lft forever preferred_lft forever
                          root@node1:/home/ubuntu# ping6 2602:fcfb:1d:2::4
                          PING 2602:fcfb:1d:2::4(2602:fcfb:1d:2::4) 56 data bytes
                          ^C
--- 2602:fcfb:1d:2::4 ping statistics ---
                          5 packets transmitted, 0 received, 100% packet loss, time 4098ms

                          #4811
                          Komal Thareja
                          Participant

Hi Fengping,

Node1: ens7 maps to NIC3. It was configured as below:

                            NOTE the prefixlen is set to 128 instead of 64.

ens7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 2602:fcfb:1d:2::2  prefixlen 128  scopeid 0x0<global>
        inet6 fe80::7f:aeff:fe44:cbc9  prefixlen 64  scopeid 0x20<link>
        ether 02:7f:ae:44:cb:c9  txqueuelen 1000  (Ethernet)
        RX packets 28126  bytes 2617668 (2.6 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2581  bytes 208710 (208.7 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

                            I brought this interface down and re-configured the IP address using the following command:

                            ip -6 addr add 2602:fcfb:1d:2::2/64 dev ens7

                            After this I can ping the gateway as well as other nodes.

                            root@node1:~# ping 2602:fcfb:1d:2::4
                            PING 2602:fcfb:1d:2::4(2602:fcfb:1d:2::4) 56 data bytes
                            64 bytes from 2602:fcfb:1d:2::4: icmp_seq=1 ttl=64 time=0.186 ms
                            ^C
                            --- 2602:fcfb:1d:2::4 ping statistics ---
                            1 packets transmitted, 1 received, 0% packet loss, time 0ms
                            rtt min/avg/max/mdev = 0.186/0.186/0.186/0.000 ms
                            root@node1:~# ping 2602:fcfb:1d:2::1
                            PING 2602:fcfb:1d:2::1(2602:fcfb:1d:2::1) 56 data bytes
                            64 bytes from 2602:fcfb:1d:2::1: icmp_seq=1 ttl=64 time=0.555 ms
                            ^C
                            --- 2602:fcfb:1d:2::1 ping statistics ---
                            1 packets transmitted, 1 received, 0% packet loss, time 0ms
                            rtt min/avg/max/mdev = 0.555/0.555/0.555/0.000 ms
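
For reference, the sequence described above (interface down, stale address removed, correct /64 re-added) looks roughly like this; ens7 and the address are Node1's values from this thread, so adapt as needed:

ip link set ens7 down
ip -6 addr flush dev ens7 scope global          # drop the stale /128 address
ip link set ens7 up
ip -6 addr add 2602:fcfb:1d:2::2/64 dev ens7    # re-add with the correct /64 prefix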

Node2: the IP was configured on ens7. However, the MAC address for NIC3 (02:15:60:C2:7A:AD) maps to ens9.
I configured ens9 with the command ip -6 addr add 2602:fcfb:1d:2::3/64 dev ens9 and can now ping the gateway and other nodes.

                            root@node2:~# ping 2602:fcfb:1d:2::1
                            PING 2602:fcfb:1d:2::1(2602:fcfb:1d:2::1) 56 data bytes
                            64 bytes from 2602:fcfb:1d:2::1: icmp_seq=1 ttl=64 time=0.948 ms
                            64 bytes from 2602:fcfb:1d:2::1: icmp_seq=2 ttl=64 time=0.440 ms
                            ^C
                            --- 2602:fcfb:1d:2::1 ping statistics ---
                            2 packets transmitted, 2 received, 0% packet loss, time 1007ms
                            rtt min/avg/max/mdev = 0.440/0.694/0.948/0.254 ms
                            root@node2:~# ping 2602:fcfb:1d:2::2
                            PING 2602:fcfb:1d:2::2(2602:fcfb:1d:2::2) 56 data bytes
                            64 bytes from 2602:fcfb:1d:2::2: icmp_seq=1 ttl=64 time=0.146 ms
                            64 bytes from 2602:fcfb:1d:2::2: icmp_seq=2 ttl=64 time=0.082 ms
                            ^C
                            --- 2602:fcfb:1d:2::2 ping statistics ---
                            2 packets transmitted, 2 received, 0% packet loss, time 1010ms
                            rtt min/avg/max/mdev = 0.082/0.114/0.146/0.032 ms

                            Please configure the IPs on other interfaces or share the IPs and I can help configure them.

                            Thanks,
                            Komal

                            #4814
                            Fengping Hu
                            Participant

                              Hi Komal,

Thanks for looking into this and for the note. It works nicely now!

                              Thanks,

                              Fengping

                              #6576
                              Fengping Hu
                              Participant

3 nodes in this very long-running slice are down. Can someone bring the nodes back online when possible?

The nodes are node2, node3, and node4. The slice information is here just in case:

ID: 2d12324d-66bc-410a-8dda-3c00d1ea0d48
Name: ServiceXSlice
Project ID: aac04e0e-e0fe-4421-8985-068c117d7437

                                Thanks,

Fengping

                                #6580
                                Mert Cevik
                                Moderator

                                  Hello Fengping,

                                  We are working on this problem. We will post updates about the VMs.
