
Fengping Hu

Forum Replies Created

Viewing 15 posts - 1 through 15 (of 41 total)
    in reply to: Long running slice stability issue.  #6684
    Fengping Hu
    Participant

      Just to update: I was able to reboot the node and put it back in service. shutdown -r didn’t work, but reboot -f did the trick.

      root@node4:/home/ubuntu# /sbin/shutdown -r now
      Failed to open initctl fifo: No such device or address
      Failed to talk to init daemon.
      root@node4:/home/ubuntu# reboot -f
      Rebooting.
      Connection to 2001:400:a100:3090:f816:3eff:fe8a:f1d1 closed by remote host.
      Connection to 2001:400:a100:3090:f816:3eff:fe8a:f1d1 closed.
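
      For anyone who hits the same state: once PID 1 stops responding, anything that talks to the init daemon fails, while reboot -f bypasses it entirely. A rough escalation path (the second step assumes sysrq is enabled):

      /sbin/reboot -f                # calls reboot(2) directly, bypassing init
      echo b > /proc/sysrq-trigger   # last resort: immediate reboot from the kernel, no sync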


      Thanks,

      Fengping

      in reply to: Long running slice stability issue.  #6591
      Fengping Hu
      Participant

        Hi Mert,

        Thanks for getting the VMs online again. I was able to put node2 and node3 back in service, but I still have some issues with node4. It looks like this node is having trouble with systemd-resolved. I tried to reboot it, but that isn’t possible either. Can you reboot this one for me?

        root@node4:/home/ubuntu# systemctl status systemd-resolved
        Failed to get properties: Connection timed out
        root@node4:/home/ubuntu# /sbin/shutdown -r now
        Failed to open initctl fifo: No such device or address
        Failed to talk to init daemon.
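
        Before the reboot, in case it helps with diagnosis: the journal can still be read even when systemctl’s D-Bus connection times out, since journalctl reads the on-disk log files directly. A minimal check:

        journalctl -u systemd-resolved -n 50 --no-pager   # recent resolved logs, no D-Bus needed
        cat /etc/resolv.conf                              # confirm which resolver the node points at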


        Thanks,

        Fengping

        in reply to: Long running slice stability issue.  #6576
        Fengping Hu
        Participant

          Three nodes in this very long-running slice are down. Can someone bring them back online when possible?

          The nodes are node2, node3, and node4. The slice information is here just in case:

          ID: 2d12324d-66bc-410a-8dda-3c00d1ea0d48
          Name: ServiceXSlice
          Project ID: aac04e0e-e0fe-4421-8985-068c117d7437

          Thanks,

          Fengping

          in reply to: revive the ServiceXSlice? #5224
          Fengping Hu
          Participant

            Hi Mert,

            Thank you so much for the help. I have reconfigured everything and the slice is back in service.

            Thanks,

            Fengping

            in reply to: revive the ServiceXSlice? #5216
            Fengping Hu
            Participant

              Hi Mert,

              Thanks for looking into it for me. Indeed, I can log in to the VMs now, and the network is also fine, so you can withdraw the inquiry to your network team. I was using the wrong IPs.

              The actual problem seems to be that node1 (2001:400:a100:3090:f816:3eff:fe1c:385f) was rebooted and thus lost its three network links. Can you reattach the links for me?

              For example, on a node with all of its links, it looks like this:

              ubuntu@node9:~$ ip link
              1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
              link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
              2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
              link/ether fa:16:3e:56:ac:b7 brd ff:ff:ff:ff:ff:ff
              3: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
              link/ether 02:e1:a2:04:48:a3 brd ff:ff:ff:ff:ff:ff
              4: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
              link/ether 06:d3:95:0b:44:81 brd ff:ff:ff:ff:ff:ff
              5: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
              link/ether 0a:df:cf:c5:fd:f5 brd ff:ff:ff:ff:ff:ff

              but on node1 I get this:

              ubuntu@node1:~$ ip link
              1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
              link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
              2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
              link/ether fa:16:3e:1c:38:5f brd ff:ff:ff:ff:ff:ff
              ubuntu@node1:~$


              If you could reattach ens7, ens8, and ens9 to node1, that would be great.
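
              Once they are reattached, it’s worth matching links by MAC rather than by ensN name, since the names can shift across reboots. A quick check:

              ip -br link   # one line per interface: name, state, MAC; ens3 plus the three dataplane NICs should all appear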

              Thanks,

              Fengping

              in reply to: Long running slice stability issue.  #4814
              Fengping Hu
              Participant

                Hi Komal,

                Thanks for looking into this, and for the note. It works nicely now!

                Thanks,

                Fengping

                in reply to: Long running slice stability issue.  #4810
                Fengping Hu
                Participant

                  Hi Komal,

                  Thanks for the help. I can log in to both VMs via the management interface now, and I can see that the 3 dataplane links are present on both VMs. However, after I configured the IP on the public dataplane network (NET3, i.e. the NIC with MAC 02:7F:AE:44:CB:C9), it can’t reach the gateway or other VMs on the same network. I also tried putting this IP on the other two links, and none of them works. Is there another step needed to attach the links to those networks?

                  Thanks,

                  Fengping

                  root@node1:/home/ubuntu# ip -6 route
                  ::1 dev lo proto kernel metric 256 pref medium
                  2001:400:a100:3090::/64 dev ens3 proto ra metric 100 expires 86337sec pref medium
                  2602:fcfb:1d:2::2 dev ens7 proto kernel metric 256 pref medium
                  fe80::a9fe:a9fe via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 1024 expires 237sec pref medium
                  fe80::/64 dev ens9 proto kernel metric 256 pref medium
                  fe80::/64 dev ens3 proto kernel metric 256 pref medium
                  fe80::/64 dev ens8 proto kernel metric 256 pref medium
                  fe80::/64 dev ens7 proto kernel metric 256 pref medium
                  default via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 100 expires 237sec mtu 9000 pref medium
                  root@node1:/home/ubuntu# ip -6 a
                  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
                  inet6 ::1/128 scope host
                  valid_lft forever preferred_lft forever
                  2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 state UP qlen 1000
                  inet6 2001:400:a100:3090:f816:3eff:fe1c:385f/64 scope global dynamic mngtmpaddr noprefixroute
                  valid_lft 86327sec preferred_lft 14327sec
                  inet6 fe80::f816:3eff:fe1c:385f/64 scope link
                  valid_lft forever preferred_lft forever
                  3: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                  inet6 2602:fcfb:1d:2::2/128 scope global
                  valid_lft forever preferred_lft forever
                  inet6 fe80::7f:aeff:fe44:cbc9/64 scope link
                  valid_lft forever preferred_lft forever
                  4: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                  inet6 fe80::bc:a6ff:fe3f:c7cb/64 scope link
                  valid_lft forever preferred_lft forever
                  5: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                  inet6 fe80::4e3:d6ff:fe00:5b06/64 scope link
                  valid_lft forever preferred_lft forever
                  root@node1:/home/ubuntu# ping6 2602:fcfb:1d:2::4
                  PING 2602:fcfb:1d:2::4(2602:fcfb:1d:2::4) 56 data bytes
                  ^C
                  --- 2602:fcfb:1d:2::4 ping statistics ---
                  5 packets transmitted, 0 received, 100% packet loss, time 4098ms
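
                  One detail stands out in the output above: the NET3 address landed on ens7 (whose link-local fe80::7f:aeff:fe44:cbc9 matches MAC 02:7F:AE:44:CB:C9), but with a /128 prefix, so there is no on-link route covering the gateway or the peer VMs. A sketch of a possible fix, assuming the NET3 subnet is 2602:fcfb:1d:2::/64:

                  ip addr del 2602:fcfb:1d:2::2/128 dev ens7   # drop the host-only address
                  ip addr add 2602:fcfb:1d:2::2/64 dev ens7    # re-add with the full /64 so the subnet is on-link
                  ping6 -c 3 2602:fcfb:1d:2::4                 # retest against a peer VM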

                  in reply to: Long running slice stability issue.  #4797
                  Fengping Hu
                  Participant

                    Hi Komal,

                    Thanks for looking up that information for me. Can you kindly reboot both Node1 and Node2? For Node1, it seems I can’t regain access, but a reboot should help because it would clear the default route I added. For Node2, I was actually never able to log in, so hopefully a reboot can fix that as well.

                    Thanks,

                    Fengping

                    in reply to: Long running slice stability issue.  #4782
                    Fengping Hu
                    Participant

                      Hi Komal,

                      I configured the IP, but it seems the link is not working. Unfortunately, I changed the default route to use this public dataplane link before I put in the policy-based routing (which should keep the management interface working), so I can’t access the VM now.

                      My theory is that the names of the links have probably changed: I assumed ens9 is NET3, but that may have changed after the reboot. I will wait and see if RA clears the default route so I can get in and do the fix; otherwise, I can rebuild the slice if it’s too much trouble. The policy-based routing step I skipped is sketched below.
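
                      Roughly, the skipped step would have been (MGMT_IP and MGMT_GW are placeholders for the node’s management address and gateway):

                      ip -6 rule add from "$MGMT_IP" table 100                      # route management-sourced traffic via table 100
                      ip -6 route add default via "$MGMT_GW" dev ens3 table 100    # keep ens3 as its default path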


                      Thanks,

                      Fengping

                      in reply to: Long running slice stability issue.  #4775
                      Fengping Hu
                      Participant

                        Thank you for getting back to me despite your busy schedules. It’s great that this is a known problem and that a fix will be deployed.

                        The restarted VMs are accessible via the management interface now, but it seems they lost all the other links (we have 3 dataplane links). Is there a way to reattach those links?

                        Thanks,

                        Fengping


                        in reply to: manual cleanup needed? #4582
                        Fengping Hu
                        Participant

                          Hi Komal,

                          Thank you so much for looking into the issue and for the quick fix. I will delete the slice and recreate it tomorrow.

                          Appreciate your help :)

                          Fengping

                          in reply to: manual cleanup needed? #4578
                          Fengping Hu
                          Participant

                            Hi Komal,

                            It seems the slice lost its public IPv6 network connection overnight. I can’t even ping the gateway. The link lost the IPs I had configured statically, even though I had disabled DHCP and RA on it. So I tried to re-add the IPs and routes, and also tried both network3.change_public_ip(ipv6=list(map(str, networkips[0:50]))) and network3.make_ip_publicly_routable(ipv6=list(map(str, networkips[0:50]))) to make the IPs public, but neither seemed to work.

                            Any suggestions on how to fix this network?

                            Thanks,

                            Fengping

                            Here’s the slice information and the symptoms:

                            Slice ID: 08d05419-e99b-4ebe-b4a1-88c07cf2bfa3
                            Name: ServiceXSlice

                            Network ID: 06d92831-1f58-4548-9d24-9284b1273912
                            Name: NET3
                            Layer: L3
                            Type: FABNetv6Ext
                            Site: CERN
                            Subnet: 2602:fcfb:1d:3::/64
                            Gateway: 2602:fcfb:1d:3::1
                            State: Active


                            ubuntu@node1:~$ ping6 2602:fcfb:1d:3::1
                            PING 2602:fcfb:1d:3::1(2602:fcfb:1d:3::1) 56 data bytes
                            ^C
                            --- 2602:fcfb:1d:3::1 ping statistics ---
                            3 packets transmitted, 0 received, 100% packet loss, time 2056ms

                            ubuntu@node1:~$ ip -6 neigh | grep 2602
                            2602:fcfb:1d:3::7 dev ens9 lladdr 02:d2:f1:99:87:98 router REACHABLE
                            2602:fcfb:1d:3::9 dev ens9 lladdr 02:80:38:25:66:c0 router REACHABLE
                            2602:fcfb:1d:3::4 dev ens9 lladdr 02:1d:b9:31:e7:23 router STALE
                            2602:fcfb:1d:3::b dev ens9 lladdr 06:d3:95:0b:44:81 router REACHABLE
                            2602:fcfb:1d:3::6 dev ens9 lladdr 0a:b1:19:54:14:e7 router REACHABLE
                            2602:fcfb:1d:3::1 dev ens9 router FAILED
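
                            For reference, the manual re-add attempt looked roughly like this (ADDR stands for one of the statically configured addresses from our allocated range; the subnet and gateway are from the NET3 info above):

                            ip addr add "$ADDR"/64 dev ens9               # restore a static address with the full /64
                            ip -6 route replace 2602:fcfb:1d:3::/64 dev ens9   # make the subnet on-link again
                            ping6 -c 3 2602:fcfb:1d:3::1                  # retest the gateway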

                            in reply to: manual cleanup needed? #4577
                            Fengping Hu
                            Participant

                              Hi Komal,

                              I tried your recipe and was able to create 10 VMs with 60 cores each, but it failed to create an 11th or 12th VM due to insufficient CPUs. This is a bit counterintuitive, since there were 766 CPUs available (and 12 × 60 = 720) and each of the 6 hosts should be able to run two such VMs. Nevertheless, we are in better shape now with 600+ cores. Thank you so much for the help. I will try the new flavor when it’s available.


                              Thanks,

                              Fengping

                              in reply to: manual cleanup needed? #4574
                              Fengping Hu
                              Participant

                                Hi Komal,

                                Thanks for looking into this for me. This config (cores='62', ram='384', disk='2000') indeed works to create 6 VMs. But it won’t work if I try to create 12 VMs, even if I request half the RAM (192), because of the flavor mapping. So yes, we do need a better flavor in my case. I may need only one big-disk node, to serve as an XCache node; the rest of the nodes can have limited disk, unless we want to use all the disks to set up distributed storage (Ceph, etc.).

                                Please let me know once you have discussed this with your team and have recommendations. The goal is to allocate all the resources with only a few VM flavors (one or two, maybe).

                                Thanks,

                                Fengping

                                in reply to: manual cleanup needed? #4571
                                Fengping Hu
                                Participant

                                  Hi Komal,

                                  The CERN site is essentially dedicated to the ServiceX deployment. I will need to create my slice there for data-access reasons, and there shouldn’t be any slices at CERN other than the ServiceX slice I created. I would like to create big VMs that basically map to the physical machines, so 6 VMs for the 6 physical machines at CERN.

                                  I noticed the available CPUs are 408/768, i.e. 360 less than the total, which is exactly the number of CPUs I requested for my slice this morning. This made me wonder if that slice is still holding the resources. If the resources are held not by the dead slice but by active slices, would you be able to relocate them so I can create my slice there?

                                  Also, what resource request should I use to make a VM take up a whole physical machine?

                                  Thanks,

                                  Fengping
