lost management network connection
- This topic has 12 replies, 5 voices, and was last updated 1 year, 7 months ago by Fengping Hu.
May 9, 2023 at 3:48 pm #4184
Two VMs (node1 and node4) of the four VMs in a slice lost management network connectivity. This is a long-running slice, and we are trying to understand what might have caused this. We can still get into the VMs via their dataplane public IPs. This is more about understanding the issue than an actual problem affecting the application we run. Please let us know if you need any other information for troubleshooting.
Here’s the slice information:
Slice
ID 37dffa35-36ee-4da1-a1bd-e84f36e2f69f
Name ServiceXSlice
Lease Expiration (UTC) 2023-04-07 22:56:21 +0000
Lease Start (UTC) 2023-04-06 22:56:23 +0000
Project ID aac04e0e-e0fe-4421-8985-068c117d7437
State StableOK
Nodes
ID | Name | Cores | RAM | Disk | Image | Image Type | Host | Site | Username | Management IP | State | Error | SSH Command | Public SSH Key File | Private SSH Key File
978f6eff-42e7-48c1-86fc-e694ccb367e6 | node1 | 60 | 384 | 100 | default_ubuntu_20 | qcow2 | cern-w4.fabric-testbed.net | CERN | ubuntu | 2001:400:a100:3090:f816:3eff:fe5c:dd28 | Active |  | ssh … @{Management IP} | /home/fabric/work/fabric_config/slice_key.pub | /home/fabric/work/fabric_config/slice_key
010fab90-cf7a-40be-aa74-1bfb2442a611 | node2 | 60 | 384 | 100 | default_ubuntu_20 | qcow2 | cern-w2.fabric-testbed.net | CERN | ubuntu | 2001:400:a100:3090:f816:3eff:fe2d:38d0 | Active |  | ssh … @{Management IP} | /home/fabric/work/fabric_config/slice_key.pub | /home/fabric/work/fabric_config/slice_key
bf79de38-f8c2-413c-82ae-0daf50ebf76f | node3 | 60 | 384 | 100 | default_ubuntu_20 | qcow2 | cern-w6.fabric-testbed.net | CERN | ubuntu | 2001:400:a100:3090:f816:3eff:feff:b39d | Active |  | ssh … @{Management IP} | /home/fabric/work/fabric_config/slice_key.pub | /home/fabric/work/fabric_config/slice_key
a1001ee4-c1d5-4157-a60d-d5dbc491ff05 | node4 | 60 | 384 | 100 | default_ubuntu_20 | qcow2 | cern-w3.fabric-testbed.net | CERN | ubuntu | 2001:400:a100:3090:f816:3eff:fe99:a061 | Active |  | ssh … @{Management IP} | … | …

“ip -6 nei” output from the two VMs that lost connectivity:
fe80::f816:3eff:feac:1ca0 dev ens3 lladdr fa:16:3e:ac:1c:a0 router STALE

Thanks,
Fengping

May 9, 2023 at 4:04 pm #4185
Hello Fengping,
From one VM, we need to see the output of “ip a”, “ip route”, and “ip -6 route”. It would also be good to see the status of the network service (network or NetworkManager, whichever is active).
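If you need to gather the same diagnostics from several VMs, a small collector can save some copy-pasting. This is a minimal sketch, not a FABRIC tool: it assumes systemd-networkd is the active network service (the default on Ubuntu 20.04 images), so swap in NetworkManager or network if your image uses one of those.

```python
import shutil
import subprocess

# Diagnostics requested above. Assumption: systemd-networkd is the active
# network service; replace it with NetworkManager/network if needed.
DIAGNOSTICS = [
    ["ip", "a"],
    ["ip", "route"],
    ["ip", "-6", "route"],
    ["systemctl", "status", "systemd-networkd", "--no-pager"],
]


def collect(commands=DIAGNOSTICS):
    """Run each command and return {command line: combined stdout and stderr}."""
    report = {}
    for argv in commands:
        key = " ".join(argv)
        if shutil.which(argv[0]) is None:
            # Record missing tools instead of raising, so the rest still runs.
            report[key] = f"{argv[0]}: not found"
            continue
        # check=False: `systemctl status` exits non-zero for inactive units,
        # and we still want whatever it printed.
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        report[key] = result.stdout + result.stderr
    return report


if __name__ == "__main__":
    for command, output in collect().items():
        print(f"$ {command}\n{output}")
```

Pasting the printed report keeps each command next to its output, which is what makes outputs like the ones below easy to read.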
May 9, 2023 at 5:15 pm #4186
Hi Mert,
Here are the outputs of those commands.
Thanks,
Fengping

root@node1:/home/ubuntu# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP group default qlen 1000
link/ether fa:16:3e:5c:dd:28 brd ff:ff:ff:ff:ff:ff
inet 10.30.6.141/23 brd 10.30.7.255 scope global dynamic ens3
valid_lft 76573sec preferred_lft 76573sec
inet6 2001:400:a100:3090:f816:3eff:fe5c:dd28/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 86351sec preferred_lft 14351sec
inet6 fe80::f816:3eff:fe5c:dd28/64 scope link
valid_lft forever preferred_lft forever
3: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 02:34:c0:a0:19:80 brd ff:ff:ff:ff:ff:ff
inet 10.143.1.2/24 scope global ens7
valid_lft forever preferred_lft forever
inet6 fe80::34:c0ff:fea0:1980/64 scope link
valid_lft forever preferred_lft forever
4: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 02:78:91:60:1e:80 brd ff:ff:ff:ff:ff:ff
inet6 2602:fcfb:100::10/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::78:91ff:fe60:1e80/64 scope link
valid_lft forever preferred_lft forever
5: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 0a:39:34:14:6a:36 brd ff:ff:ff:ff:ff:ff
inet6 2602:fcfb:1d:2::2/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::839:34ff:fe14:6a36/64 scope link
valid_lft forever preferred_lft forever
6: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether f2:c7:6e:00:6e:83 brd ff:ff:ff:ff:ff:ff
inet 10.233.0.1/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.0.3/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.42.207/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.29.52/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.45.34/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.16.205/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.61.6/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.17.21/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.46.5/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.51.50/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.15.146/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.8.168/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.60.15/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.24.72/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.57.146/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.11.198/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.39.214/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.30.20/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.23.34/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.53.112/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.55.86/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.7.33/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.1.29/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.233.32.37/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet6 2602:fcfb:1d:2::31/128 scope global
valid_lft forever preferred_lft forever
inet6 fd85:ee78:d8a6:8607::11d7/128 scope global
valid_lft forever preferred_lft forever
inet6 2602:fcfb:1d:2::30/128 scope global
valid_lft forever preferred_lft forever
inet6 fd85:ee78:d8a6:8607::1417/128 scope global
valid_lft forever preferred_lft forever
10: cali5260a6a960b@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-f19391b0-0d22-5e0a-091b-f26982131f4a
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
11: nodelocaldns: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether e6:a9:23:78:b6:d9 brd ff:ff:ff:ff:ff:ff
inet 169.254.25.10/32 scope global nodelocaldns
valid_lft forever preferred_lft forever
23: califc30ed8b084@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-0e75fb3c-1182-2e6e-16c9-80ba6ce3844f
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
821: cali0894f966cdf@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-ee474322-80e1-7721-a798-bc29a08f2339
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
855: cali68b9b56a958@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-e232ac9b-3c1d-fe2e-dd33-cc9f06d121a2
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
741: cali3de775f9cb2@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-3b3f6547-73f2-7f83-add2-d43a23bc1a83
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
747: cali0fe3bc102a1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-9d7ab47a-4e5c-cb1b-86da-65fb323732fb
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
754: calif913c9d9a21@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-7b67f49b-0c8a-cf9d-e976-0439a6057cdd
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
767: calieb3159a0e6d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-0841de8e-b94c-b3cb-50d0-60b2156da119
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
root@node1:/home/ubuntu# ip route
default via 10.143.1.1 dev ens7
10.30.6.0/23 dev ens3 proto kernel scope link src 10.30.6.141
10.143.1.0/24 dev ens7 proto kernel scope link src 10.143.1.2
10.233.71.0/26 via 10.143.1.4 dev ens7 proto bird
10.233.74.64/26 via 10.143.1.5 dev ens7 proto bird
10.233.75.0/26 via 10.143.1.3 dev ens7 proto bird
blackhole 10.233.102.128/26 proto bird
10.233.102.129 dev cali5260a6a960b scope link
10.233.102.134 dev cali68b9b56a958 scope link
10.233.102.139 dev califc30ed8b084 scope link
10.233.102.161 dev cali3de775f9cb2 scope link
10.233.102.166 dev cali0fe3bc102a1 scope link
10.233.102.173 dev calif913c9d9a21 scope link
10.233.102.185 dev cali0894f966cdf scope link
10.233.102.188 dev calieb3159a0e6d scope link
root@node1:/home/ubuntu# ip -6 route
::1 dev lo proto kernel metric 256 pref medium
2001:400:a100:3090::/64 dev ens3 proto ra metric 100 expires 86399sec pref medium
2602:fcfb:1d:2::/64 dev ens9 proto kernel metric 256 pref medium
2602:fcfb:1d:2::/64 dev ens9 metric 1024 pref medium
2602:fcfb:100::/64 dev ens8 proto kernel metric 256 pref medium
2602:fcfb:100::/64 dev ens8 metric 1024 pref medium
fd85:ee78:d8a6:8607::1:340/122 via 2602:fcfb:1d:2::5 dev ens9 proto bird metric 1024 pref medium
fd85:ee78:d8a6:8607::1:6800/122 via 2602:fcfb:1d:2::3 dev ens9 proto bird metric 1024 pref medium
fd85:ee78:d8a6:8607::1:8700/122 via 2602:fcfb:1d:2::4 dev ens9 proto bird metric 1024 pref medium
fd85:ee78:d8a6:8607::1:a681 dev cali5260a6a960b metric 1024 pref medium
fd85:ee78:d8a6:8607::1:a686 dev cali68b9b56a958 metric 1024 pref medium
fd85:ee78:d8a6:8607::1:a68b dev califc30ed8b084 metric 1024 pref medium
fd85:ee78:d8a6:8607::1:a6a1 dev cali3de775f9cb2 metric 1024 pref medium
fd85:ee78:d8a6:8607::1:a6a6 dev cali0fe3bc102a1 metric 1024 pref medium
fd85:ee78:d8a6:8607::1:a6ad dev calif913c9d9a21 metric 1024 pref medium
fd85:ee78:d8a6:8607::1:a6b9 dev cali0894f966cdf metric 1024 pref medium
fd85:ee78:d8a6:8607::1:a6bc dev calieb3159a0e6d metric 1024 pref medium
blackhole fd85:ee78:d8a6:8607::1:a680/122 dev lo proto bird metric 1024 pref medium
fe80::a9fe:a9fe via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 1024 expires 299sec pref medium
fe80::/64 dev ens3 proto kernel metric 256 pref medium
fe80::/64 dev ens7 proto kernel metric 256 pref medium
fe80::/64 dev ens8 proto kernel metric 256 pref medium
fe80::/64 dev ens9 proto kernel metric 256 pref medium
fe80::/64 dev cali5260a6a960b proto kernel metric 256 pref medium
fe80::/64 dev califc30ed8b084 proto kernel metric 256 pref medium
fe80::/64 dev cali3de775f9cb2 proto kernel metric 256 pref medium
fe80::/64 dev cali0fe3bc102a1 proto kernel metric 256 pref medium
fe80::/64 dev calif913c9d9a21 proto kernel metric 256 pref medium
fe80::/64 dev calieb3159a0e6d proto kernel metric 256 pref medium
fe80::/64 dev cali0894f966cdf proto kernel metric 256 pref medium
fe80::/64 dev cali68b9b56a958 proto kernel metric 256 pref medium
default via 2602:fcfb:1d:2::1 dev ens9 metric 10 pref medium
default via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 100 expires 299sec mtu 9000 pref medium
root@node1:/home/ubuntu# ip -6 rule show
0: from all lookup local
32761: from 2602:fcfb:100::/64 lookup v6peering
32762: from 2001:400:a100:3090::/64 lookup admin
32766: from all lookup main
root@node1:/home/ubuntu# ip -6 route show table admin
default via fe80::f816:3eff:feac:1ca0 dev ens3 metric 1024 pref medium
root@node1:/home/ubuntu# systemctl status systemd-networkd
● systemd-networkd.service - Network Service
Loaded: loaded (/lib/systemd/system/systemd-networkd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2023-04-07 06:46:29 UTC; 1 months 2 days ago
TriggeredBy: ● systemd-networkd.socket
Docs: man:systemd-networkd.service(8)
Main PID: 313692 (systemd-network)
Status: "Processing requests..."
Tasks: 1 (limit: 464085)
Memory: 21.9M
CGroup: /system.slice/systemd-networkd.service
└─313692 /lib/systemd/systemd-networkd

May 09 19:29:21 node1 systemd-networkd[313692]: califcd611d2e5a: Lost carrier
May 09 19:29:23 node1 systemd-networkd[313692]: cali6d43fe7c75d: Link DOWN
May 09 19:29:23 node1 systemd-networkd[313692]: cali6d43fe7c75d: Lost carrier
May 09 19:54:08 node1 systemd-networkd[313692]: cali0905f2a129d: Link DOWN
May 09 19:54:08 node1 systemd-networkd[313692]: cali0905f2a129d: Lost carrier
May 09 19:54:45 node1 systemd-networkd[313692]: calidba7a9afee0: Link DOWN
May 09 19:54:45 node1 systemd-networkd[313692]: calidba7a9afee0: Lost carrier
May 09 19:55:37 node1 systemd-networkd[313692]: cali68b9b56a958: Link UP
May 09 19:55:37 node1 systemd-networkd[313692]: cali68b9b56a958: Gained carrier
May 09 19:55:39 node1 systemd-networkd[313692]: cali68b9b56a958: Gained IPv6LL

May 9, 2023 at 5:30 pm #4187
Thank you for the output. I will look at it more carefully later, but at a quick glance it looks like the default IPv6 route has been switched to the dataplane network (possibly FABv6); see the ip -6 route output. Management (SSH) access is routed through the ens3 interface (IPv6 2001:400:a100:3090:f816:3eff:fe5c:dd28/64), and the default IPv6 gateway for this subnet is 2001:400:a100:3090::1. It looks like a configuration change occurred.
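The observation above comes down to which of node1's two default routes in the main table wins. A simplified sketch, assuming only the metric matters (it ignores router preference, ECMP, and neighbor reachability, which the kernel also weighs for RA-learned routes): the lowest metric is chosen, and a route printed without a metric gets the IPv6 default of 1024.

```python
# The two default routes from node1's `ip -6 route` output above (main table).
NODE1_DEFAULTS = """\
default via 2602:fcfb:1d:2::1 dev ens9 metric 10 pref medium
default via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 100 expires 299sec mtu 9000 pref medium
"""


def preferred_default(route_text):
    """Return the default route with the lowest metric (simplified kernel choice)."""
    candidates = []
    for line in route_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default":
            continue
        # 1024 is what the kernel assigns to IPv6 routes added without a metric.
        route = {"via": None, "dev": None, "metric": "1024"}
        for key in ("via", "dev", "metric"):
            if key in fields:
                route[key] = fields[fields.index(key) + 1]
        route["metric"] = int(route["metric"])
        candidates.append(route)
    if not candidates:
        raise LookupError("no default route in input")
    return min(candidates, key=lambda r: r["metric"])


if __name__ == "__main__":
    print(preferred_default(NODE1_DEFAULTS))
```

With metric 10 against 100, the dataplane route via ens9 wins, which matches the reading that the main-table default has moved off the management interface.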
May 9, 2023 at 5:48 pm #4188
Yep. We configured the default route to go over the public network on the dataplane. The management network is routed via policy-based routing.
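To make the policy-based routing concrete, here is a simplified model of how the kernel walks node1's posted rules (the ip -6 rule and admin-table output above). It is a sketch, not the kernel's algorithm: it only models default routes, and the contents of the v6peering table were not posted, so that table is treated as empty (a hypothetical simplification).

```python
import ipaddress

# node1's IPv6 policy rules as posted: (priority, source prefix or None for "all", table).
RULES = [
    (0, None, "local"),
    (32761, "2602:fcfb:100::/64", "v6peering"),
    (32762, "2001:400:a100:3090::/64", "admin"),
    (32766, None, "main"),
]

# Default route per table. admin and main come from the posted output;
# v6peering is left out (hypothetical: treated as empty), and "local"
# only holds local routes, so it never yields a default here.
DEFAULT_ROUTES = {
    "admin": ("fe80::f816:3eff:feac:1ca0", "ens3"),
    "main": ("2602:fcfb:1d:2::1", "ens9"),  # lowest-metric default in main
}


def route_for_source(src):
    """Walk the rules in priority order and return (table, gateway, device)."""
    addr = ipaddress.ip_address(src)
    for _priority, prefix, table in sorted(RULES, key=lambda rule: rule[0]):
        if prefix is not None and addr not in ipaddress.ip_network(prefix):
            continue  # this rule's source selector does not match
        if table in DEFAULT_ROUTES:
            gateway, device = DEFAULT_ROUTES[table]
            return table, gateway, device
        # No usable route in this table: fall through to the next rule.
    raise LookupError(f"no default route for {src}")
```

In this model, replies sourced from the management address (for example SSH replies from 2001:400:a100:3090:f816:3eff:fe5c:dd28) still leave through ens3 via the admin table, while everything else follows the main default out of ens9. Management access therefore depends entirely on the admin table's route through the link-local router staying usable.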
May 9, 2023 at 7:17 pm #4189
It seems the root cause of the connectivity issue is known, then; you probably need to review your policy-based routing. Do you still need anything from us on this?
May 9, 2023 at 7:51 pm #4190
It appears the neighbor entry for the router's link-local address is STALE, so it looks as if the VM lost its layer 2 connection to the router. The policy-based routing seems OK, and it's the same as on the other two nodes, which are working.
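To compare the router's neighbor state across all four nodes, the ip -6 neigh output can be checked mechanically. A minimal sketch, assuming iproute2's default output format (state as the last field). One caveat worth keeping in mind: per RFC 4861, STALE on its own only means reachability has not been confirmed recently; the kernel re-probes when it next sends traffic to that neighbor. Since the main default points at ens9, little traffic may be sent to the ens3 router, so STALE is a hint rather than proof, while FAILED or INCOMPLETE would mean probes went unanswered.

```python
# `ip -6 neigh` output posted from the two affected VMs.
AFFECTED = "fe80::f816:3eff:feac:1ca0 dev ens3 lladdr fa:16:3e:ac:1c:a0 router STALE"

# Linux neighbor states meaning resolution/probing went unanswered (cf. RFC 4861).
UNREACHABLE = {"INCOMPLETE", "FAILED"}


def router_entries(neigh_text):
    """Return (address, device, state) for every neighbor flagged as a router."""
    entries = []
    for line in neigh_text.splitlines():
        fields = line.split()
        if "router" not in fields:
            continue
        device = fields[fields.index("dev") + 1] if "dev" in fields else None
        # In the default output format, the state is the last field.
        entries.append((fields[0], device, fields[-1]))
    return entries


def classify(state):
    """Rough troubleshooting reading of a neighbor state."""
    if state in UNREACHABLE:
        return "unreachable"
    if state == "STALE":
        # Not confirmed recently; the next packet to this neighbor triggers
        # DELAY -> PROBE and either REACHABLE or FAILED.
        return "unconfirmed"
    return "ok"


if __name__ == "__main__":
    for address, device, state in router_entries(AFFECTED):
        print(address, device, state, classify(state))
```

Running the same check on the two working nodes and the two broken ones would show whether the router entry state actually differs between them.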
ip -6 nei
fe80::f816:3eff:feac:1ca0 dev ens3 lladdr fa:16:3e:ac:1c:a0 router STALE

May 9, 2023 at 8:43 pm #4192
Fengping,
Is this Ubuntu?
“We configured the default route to be on the public network on the dataplane. The management network is routed via policy based routing.”
Can you post that config? Are the PBR rules based on interface or on IP address?
May 10, 2023 at 2:12 pm #4196
Hi David,
Yep, it's Ubuntu.
root@node1:/home/ubuntu# ip -6 rule list
0: from all lookup local
32761: from 2602:fcfb:100::/64 lookup v6peering
32762: from 2001:400:a100:3090::/64 lookup admin
32766: from all lookup main
root@node1:/home/ubuntu# ip -6 route show table admin
default via fe80::f816:3eff:feac:1ca0 dev ens3 metric 1024 pref medium

Thanks,
Fengping

May 11, 2023 at 3:38 pm #4215
@Fengping
This seems like an issue with the policy-based routing. Is there a way to make your setup more resilient to external factors? The management network connects to the local campus or provider network, and I suspect we can't control everything on it.
Paul
May 11, 2023 at 3:55 pm #4220
Thanks for the clarification. We do still have access to the VM via the dataplane public IP, so I think our setup is resilient in that sense.
Thanks,
Fengping

May 11, 2023 at 4:33 pm #4221
@Fengping: our authority at individual sites with respect to management plane connections ends at our switch, so if the campus network blinks, we have no control over that. Is there a way for you to force PBR to pick the default route via the management connection again?

May 12, 2023 at 6:04 pm #4229
Thanks, Ilya. I tried taking the link down and back up with ip link, and that brought the management network back online. I just need to re-add the PBR configuration afterwards, since taking the link down seems to clear it. But that's not a problem. Given what you said, I think I will just handle potential blinks manually and not worry about it too much.
Fengping
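The manual recovery described above (bounce the management link, then restore PBR) can be scripted so it is repeatable after a blink. This is a hedged sketch, not the configuration used in this thread: the interface, prefix, router address, and rule priority are taken from node1's posted output and should be checked on each node, and it assumes a table named admin already exists in /etc/iproute2/rt_tables. Because it is unclear which PBR pieces the link bounce removes, the sketch re-adds the rule only if ip -6 rule show no longer lists it, and always re-installs the admin default route through ens3.

```python
import shutil
import subprocess

# Values from node1's posted output; verify them on each node (assumptions).
DEV = "ens3"
MGMT_PREFIX = "2001:400:a100:3090::/64"
ROUTER_LL = "fe80::f816:3eff:feac:1ca0"
TABLE = "admin"      # assumes "admin" is defined in /etc/iproute2/rt_tables
PRIORITY = "32762"   # same priority the existing rule uses


def recovery_commands(rule_text):
    """Build iproute2 commands to bounce the link and restore the PBR pieces."""
    commands = [
        ["ip", "link", "set", "dev", DEV, "down"],
        ["ip", "link", "set", "dev", DEV, "up"],
    ]
    if f"from {MGMT_PREFIX} lookup {TABLE}" not in rule_text:
        commands.append(["ip", "-6", "rule", "add", "from", MGMT_PREFIX,
                         "table", TABLE, "priority", PRIORITY])
    # "replace" adds the route or overwrites an existing one, so re-running is safe.
    commands.append(["ip", "-6", "route", "replace", "default", "via", ROUTER_LL,
                     "dev", DEV, "table", TABLE])
    return commands


def current_rules():
    """Return `ip -6 rule show` output, or "" where iproute2 is unavailable."""
    if shutil.which("ip") is None:
        return ""
    return subprocess.run(["ip", "-6", "rule", "show"],
                          capture_output=True, text=True, check=False).stdout


if __name__ == "__main__":
    # Dry run: only print the commands. To apply them, run each as root with
    # subprocess.run(argv, check=True), giving ens3 a moment to come back up
    # before the route is added.
    for argv in recovery_commands(current_rules()):
        print(" ".join(argv))
```

Keeping it as a dry run by default fits the "handle blinks manually" approach: you can review the commands before applying them over the dataplane connection.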