Long running slice stability issue

#4702
    Fengping Hu
    Participant

It looks like a couple of VMs (node1 and node2) in my slice lost contact. Both the management IP and a public data plane IP stopped responding to ping. Unfortunately, the head node is one of the nodes that was lost. The slice information is attached. Before I try to rebuild the slice, is it possible to understand how the VMs got lost? Rebuilding is pretty time consuming (many hours) because the node.execute() calls take a very long time to run, so it would be great if the VMs that lost contact could be brought back.

      Thanks,

      Fengping

ID: 2d12324d-66bc-410a-8dda-3c00d1ea0d48
Name: ServiceXSlice
Lease Expiration (UTC): 2023-06-27 13:55:25 +0000
Lease Start (UTC): 2023-06-26 13:55:26 +0000
Project ID: aac04e0e-e0fe-4421-8985-068c117d7437
State: StableOK

      #4731
      Ilya Baldin
      Participant

        Fengping,

We are a bit thinly staffed this week, but I will ask someone on the operations team to look into it. Can you post any relevant configuration bits – what you may have done right before you lost access, post-boot scripts or node.execute() scripts? The most likely cause is a change in routing inside the node that fried the default route pointing back to the management interface. Other options are possible, of course.
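
For reference, one way to avoid losing management access when experimenting with routes is to give the management interface its own routing table before changing the main default route. The following is only an illustrative sketch, using addresses that appear for node1 later in this thread (ens3 = management, ens7 = dataplane); adapt the names and addresses to the actual node:

ip -6 route add default via fe80::f816:3eff:feac:1ca0 dev ens3 table 100   # management default route in its own table
ip -6 rule add from 2001:400:a100:3090:f816:3eff:fe1c:385f/128 table 100   # replies sourced from the management IP use that table
ip -6 route replace default via 2602:fcfb:1d:2::1 dev ens7                 # only then repoint the main default route at the dataplane

With the rule in place, traffic arriving on the management interface is answered via the management gateway even if the main default route later points at a dataplane link.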

        #4733
        Mert Cevik
        Moderator

          2 VMs were stopped by the hypervisor and I started them. Can you please check the status and your access?

The root cause of this problem is a known issue that we were able to correct on Phase-1 sites last month, but some of the Phase-2 sites have not received the correction yet. We will find a convenient time for it in the next few weeks; for now, we will be able to help whenever this occurs.

          #4775
          Fengping Hu
          Participant

Thank you for getting back to me despite your busy schedules. It’s great that this is a known problem and that a fix will be deployed.

The restarted VMs are accessible via the management interface now. However, it seems they lost all the other links (we have 3 data plane links). Is there a way to reattach those links?

Thanks,

Fengping

            #4780
            Komal Thareja
            Participant

              Hello Fengping,

I have re-attached the PCI devices for the VMs node1 and node2. You will need to reassign the IP addresses on them for your links to work. Please let us know if the links are working as expected after configuring the IP addresses.

              Thanks,

              Komal

              #4782
              Fengping Hu
              Participant

                Hi Komal,

I configured the IP but it seems the link is not working. Unfortunately, I changed the default route to use this public dataplane link before I put in the policy-based routing (which would have kept the management interface working), so I can’t access the VM now.

My theory is that the names of the links have probably changed; I assumed ens9 was NET3, but that has probably changed after the reboot. I will wait and see whether the RA clears the default route so I can get back in and apply the fix. Otherwise I can rebuild the slice if it’s too much trouble.

                Thanks,

                Fengping

                #4785
                Komal Thareja
                Participant

                  Hi Fengping,

I think ens7 -> net1, ens8 -> net3, and ens9 -> net2. Please let me know once you get public access back; I can help figure out the interfaces.

                  Thanks,

                  Komal

                  #4786
                  Komal Thareja
                  Participant

You can confirm the interfaces for Node1 and Node2 via their MAC addresses:

Node1

02:7F:AE:44:CB:C9 => NIC3
06:E3:D6:00:5B:06 => NIC2
02:BC:A6:3F:C7:CB => NIC1

Node2

02:15:60:C2:7A:AD => NIC3
02:1D:B9:31:E7:23 => NIC2
02:B5:53:89:2C:E6 => NIC1
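
After a reboot the ensX names can shift, so it is worth confirming the mapping on the node itself. A quick illustrative check (any of the MACs above can be substituted into the grep):

ip -br link show                                # lists each interface name with its MAC address
ip -br link show | grep -i 02:7f:ae:44:cb:c9    # e.g. find which interface carries Node1's NIC3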

                    Thanks,

                    Komal

                    #4797
                    Fengping Hu
                    Participant

                      Hi Komal,

Thanks for looking up that information for me. Can you kindly reboot both Node1 and Node2? For Node1, it seems I can’t regain access, but a reboot would help because it would clear the default route I added. For Node2, I was actually never able to log in, so hopefully a reboot can fix that as well.

                      Thanks,

                      Fengping

                      #4802
                      Komal Thareja
                      Participant

                        Hi Fengping,

I have rebooted both Node1 and Node2. They should be accessible now. Please set up the IPs as per the MAC addresses shared above. Please do let me know if anything else is needed from my side.

                        Thanks,

Komal

                        #4810
                        Fengping Hu
                        Participant

                          Hi Komal,

Thanks for the help. I can log in to both VMs via the management interface now. I can also see that the 3 dataplane links are present on both VMs. However, after I configured the IP on the public dataplane network (NET3, i.e. the NIC with MAC 02:7F:AE:44:CB:C9), it can’t reach the gateway or other VMs in the same network. I also tried putting this IP on the other two links and none of them works. Is there another step needed to attach the links to those networks?

                          Thanks,

Fengping

                          root@node1:/home/ubuntu# ip -6 route
                          ::1 dev lo proto kernel metric 256 pref medium
                          2001:400:a100:3090::/64 dev ens3 proto ra metric 100 expires 86337sec pref medium
                          2602:fcfb:1d:2::2 dev ens7 proto kernel metric 256 pref medium
                          fe80::a9fe:a9fe via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 1024 expires 237sec pref medium
                          fe80::/64 dev ens9 proto kernel metric 256 pref medium
                          fe80::/64 dev ens3 proto kernel metric 256 pref medium
                          fe80::/64 dev ens8 proto kernel metric 256 pref medium
                          fe80::/64 dev ens7 proto kernel metric 256 pref medium
                          default via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 100 expires 237sec mtu 9000 pref medium
                          root@node1:/home/ubuntu# ip -6 a
                          1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
                          inet6 ::1/128 scope host
                          valid_lft forever preferred_lft forever
                          2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 state UP qlen 1000
                          inet6 2001:400:a100:3090:f816:3eff:fe1c:385f/64 scope global dynamic mngtmpaddr noprefixroute
                          valid_lft 86327sec preferred_lft 14327sec
                          inet6 fe80::f816:3eff:fe1c:385f/64 scope link
                          valid_lft forever preferred_lft forever
                          3: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                          inet6 2602:fcfb:1d:2::2/128 scope global
                          valid_lft forever preferred_lft forever
                          inet6 fe80::7f:aeff:fe44:cbc9/64 scope link
                          valid_lft forever preferred_lft forever
                          4: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                          inet6 fe80::bc:a6ff:fe3f:c7cb/64 scope link
                          valid_lft forever preferred_lft forever
                          5: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                          inet6 fe80::4e3:d6ff:fe00:5b06/64 scope link
                          valid_lft forever preferred_lft forever
                          root@node1:/home/ubuntu# ping6 2602:fcfb:1d:2::4
                          PING 2602:fcfb:1d:2::4(2602:fcfb:1d:2::4) 56 data bytes
                          ^C
--- 2602:fcfb:1d:2::4 ping statistics ---
                          5 packets transmitted, 0 received, 100% packet loss, time 4098ms

                          #4811
                          Komal Thareja
                          Participant

Hi Fengping,

Node1: ens7 maps to NIC3. It was configured as below:

                            NOTE the prefixlen is set to 128 instead of 64.

ens7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 2602:fcfb:1d:2::2  prefixlen 128  scopeid 0x0<global>
        inet6 fe80::7f:aeff:fe44:cbc9  prefixlen 64  scopeid 0x20<link>
        ether 02:7f:ae:44:cb:c9  txqueuelen 1000  (Ethernet)
        RX packets 28126  bytes 2617668 (2.6 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2581  bytes 208710 (208.7 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

                            I brought this interface down and re-configured the IP address using the following command:

                            ip -6 addr add 2602:fcfb:1d:2::2/64 dev ens7

                            After this I can ping the gateway as well as other nodes.

                            root@node1:~# ping 2602:fcfb:1d:2::4
                            PING 2602:fcfb:1d:2::4(2602:fcfb:1d:2::4) 56 data bytes
                            64 bytes from 2602:fcfb:1d:2::4: icmp_seq=1 ttl=64 time=0.186 ms
                            ^C
                            --- 2602:fcfb:1d:2::4 ping statistics ---
                            1 packets transmitted, 1 received, 0% packet loss, time 0ms
                            rtt min/avg/max/mdev = 0.186/0.186/0.186/0.000 ms
                            root@node1:~# ping 2602:fcfb:1d:2::1
                            PING 2602:fcfb:1d:2::1(2602:fcfb:1d:2::1) 56 data bytes
                            64 bytes from 2602:fcfb:1d:2::1: icmp_seq=1 ttl=64 time=0.555 ms
                            ^C
                            --- 2602:fcfb:1d:2::1 ping statistics ---
                            1 packets transmitted, 1 received, 0% packet loss, time 0ms
                            rtt min/avg/max/mdev = 0.555/0.555/0.555/0.000 ms
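
For reference, the sequence described above (interface down, stale address removed, correct /64 re-added) looks roughly like this; ens7 and the address are Node1's values from this thread, so adapt as needed:

ip link set ens7 down
ip -6 addr flush dev ens7 scope global          # drop the stale /128 address
ip link set ens7 up
ip -6 addr add 2602:fcfb:1d:2::2/64 dev ens7    # re-add with the correct /64 prefix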

Node2: the IP was configured on ens7. However, the MAC address for NIC3 (02:15:60:C2:7A:AD) maps to ens9.
I configured ens9 with the command ip -6 addr add 2602:fcfb:1d:2::3/64 dev ens9 and can now ping the gateway and other nodes.

                            root@node2:~# ping 2602:fcfb:1d:2::1
                            PING 2602:fcfb:1d:2::1(2602:fcfb:1d:2::1) 56 data bytes
                            64 bytes from 2602:fcfb:1d:2::1: icmp_seq=1 ttl=64 time=0.948 ms
                            64 bytes from 2602:fcfb:1d:2::1: icmp_seq=2 ttl=64 time=0.440 ms
                            ^C
                            --- 2602:fcfb:1d:2::1 ping statistics ---
                            2 packets transmitted, 2 received, 0% packet loss, time 1007ms
                            rtt min/avg/max/mdev = 0.440/0.694/0.948/0.254 ms
                            root@node2:~# ping 2602:fcfb:1d:2::2
                            PING 2602:fcfb:1d:2::2(2602:fcfb:1d:2::2) 56 data bytes
                            64 bytes from 2602:fcfb:1d:2::2: icmp_seq=1 ttl=64 time=0.146 ms
                            64 bytes from 2602:fcfb:1d:2::2: icmp_seq=2 ttl=64 time=0.082 ms
                            ^C
                            --- 2602:fcfb:1d:2::2 ping statistics ---
                            2 packets transmitted, 2 received, 0% packet loss, time 1010ms
                            rtt min/avg/max/mdev = 0.082/0.114/0.146/0.032 ms

                            Please configure the IPs on other interfaces or share the IPs and I can help configure them.

                            Thanks,
                            Komal

                            #4814
                            Fengping Hu
                            Participant

                              Hi Komal,

Thanks for looking into this and for the note. It works nicely now!

                              Thanks,

                              Fengping

                              #6576
                              Fengping Hu
                              Participant

3 nodes in this very long-running slice are down. Can someone bring the nodes back online when possible?

The nodes are node2, node3, and node4. The slice information is here just in case:

ID: 2d12324d-66bc-410a-8dda-3c00d1ea0d48
Name: ServiceXSlice
Project ID: aac04e0e-e0fe-4421-8985-068c117d7437

                                Thanks,

Fengping

                                #6580
                                Mert Cevik
                                Moderator

                                  Hello Fengping,

                                  We are working on this problem. We will post updates about the VMs.
