
Fengping Hu

Forum Replies Created

Viewing 15 posts - 1 through 15 (of 41 total)
    in reply to: Long running slice stability issue.  #6684
    Fengping Hu
    Participant

      Just to update: I was able to reboot the node and put it back in service. shutdown -r didn’t work, but reboot -f did the trick.

      root@node4:/home/ubuntu# /sbin/shutdown -r now
      Failed to open initctl fifo: No such device or address
      Failed to talk to init daemon.
      root@node4:/home/ubuntu# reboot -f
      Rebooting.
      Connection to 2001:400:a100:3090:f816:3eff:fe8a:f1d1 closed by remote host.
      Connection to 2001:400:a100:3090:f816:3eff:fe8a:f1d1 closed.
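
      For anyone who hits the same state: once PID 1 stops responding, anything that talks to the init daemon fails, while reboot -f bypasses it entirely. A rough escalation path (the second step assumes sysrq is enabled):

      /sbin/reboot -f                # calls reboot(2) directly, bypassing init
      echo b > /proc/sysrq-trigger   # last resort: immediate reboot from the kernel, no sync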


      Thanks,

      Fengping

      in reply to: Long running slice stability issue.  #6591
      Fengping Hu
      Participant

        Hi Mert,

        Thanks for getting the VMs online again. I was able to put node2 and node3 back in service, but I still have some issues with node4. It looks like this node is having trouble with systemd-resolved. I tried to reboot it, but that isn’t possible either. Can you reboot this one for me?

        root@node4:/home/ubuntu# systemctl status systemd-resolved
        Failed to get properties: Connection timed out
        root@node4:/home/ubuntu# /sbin/shutdown -r now
        Failed to open initctl fifo: No such device or address
        Failed to talk to init daemon.
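
        Before the reboot, in case it helps with diagnosis: the journal can still be read even when systemctl’s D-Bus connection times out, since journalctl reads the on-disk log files directly. A minimal check:

        journalctl -u systemd-resolved -n 50 --no-pager   # recent resolved logs, no D-Bus needed
        cat /etc/resolv.conf                              # confirm which resolver the node points at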


        Thanks,

        Fengping

        in reply to: Long running slice stability issue.  #6576
        Fengping Hu
        Participant

          Three nodes in this very long-running slice are down. Can someone bring them back online when possible?

          The nodes are node2, node3, and node4. The slice information is here just in case:

          ID: 2d12324d-66bc-410a-8dda-3c00d1ea0d48
          Name: ServiceXSlice
          Project ID: aac04e0e-e0fe-4421-8985-068c117d7437

          Thanks,

          Fengping

          in reply to: revive the ServiceXSlice? #5224
          Fengping Hu
          Participant

            Hi Mert,

            Thank you so much for the help. I have reconfigured everything and the slice is back in service.

            Thanks,

            Fengping

            in reply to: revive the ServiceXSlice? #5216
            Fengping Hu
            Participant

              Hi Mert,

              Thanks for looking into it for me. Indeed, I can log in to the VMs now, and the network is also fine, so you can withdraw the inquiry to your network team. I was using the wrong IPs.

              The actual problem seems to be that node1 (2001:400:a100:3090:f816:3eff:fe1c:385f) was rebooted and thus lost its three network links. Can you reattach the links for me?

              For example, on a node with all of its links, it looks like this:

              ubuntu@node9:~$ ip link
              1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
              link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
              2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
              link/ether fa:16:3e:56:ac:b7 brd ff:ff:ff:ff:ff:ff
              3: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
              link/ether 02:e1:a2:04:48:a3 brd ff:ff:ff:ff:ff:ff
              4: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
              link/ether 06:d3:95:0b:44:81 brd ff:ff:ff:ff:ff:ff
              5: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
              link/ether 0a:df:cf:c5:fd:f5 brd ff:ff:ff:ff:ff:ff

              but on node1 I get this:

              ubuntu@node1:~$ ip link
              1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
              link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
              2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
              link/ether fa:16:3e:1c:38:5f brd ff:ff:ff:ff:ff:ff
              ubuntu@node1:~$


              If you could reattach ens7, ens8, and ens9 to node1, that would be great.
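
              Once they are reattached, it’s worth matching links by MAC rather than by ensN name, since the names can shift across reboots. A quick check:

              ip -br link   # one line per interface: name, state, MAC; ens3 plus the three dataplane NICs should all appear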

              Thanks,

              Fengping

              in reply to: Long running slice stability issue.  #4814
              Fengping Hu
              Participant

                Hi Komal,

                Thanks for looking into this, and for the note. It works nicely now!

                Thanks,

                Fengping

                in reply to: Long running slice stability issue.  #4810
                Fengping Hu
                Participant

                  Hi Komal,

                  Thanks for the help. I can log in to both VMs via the management interface now, and I can see that the 3 dataplane links are present on both VMs. However, after I configured the IP on the public dataplane network (NET3, i.e. the NIC with MAC 02:7F:AE:44:CB:C9), it can’t reach the gateway or other VMs on the same network. I also tried putting this IP on the other two links, and none of them works. Is there another step needed to attach the links to those networks?

                  Thanks,

                  Fengping

                  root@node1:/home/ubuntu# ip -6 route
                  ::1 dev lo proto kernel metric 256 pref medium
                  2001:400:a100:3090::/64 dev ens3 proto ra metric 100 expires 86337sec pref medium
                  2602:fcfb:1d:2::2 dev ens7 proto kernel metric 256 pref medium
                  fe80::a9fe:a9fe via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 1024 expires 237sec pref medium
                  fe80::/64 dev ens9 proto kernel metric 256 pref medium
                  fe80::/64 dev ens3 proto kernel metric 256 pref medium
                  fe80::/64 dev ens8 proto kernel metric 256 pref medium
                  fe80::/64 dev ens7 proto kernel metric 256 pref medium
                  default via fe80::f816:3eff:feac:1ca0 dev ens3 proto ra metric 100 expires 237sec mtu 9000 pref medium
                  root@node1:/home/ubuntu# ip -6 a
                  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
                  inet6 ::1/128 scope host
                  valid_lft forever preferred_lft forever
                  2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 state UP qlen 1000
                  inet6 2001:400:a100:3090:f816:3eff:fe1c:385f/64 scope global dynamic mngtmpaddr noprefixroute
                  valid_lft 86327sec preferred_lft 14327sec
                  inet6 fe80::f816:3eff:fe1c:385f/64 scope link
                  valid_lft forever preferred_lft forever
                  3: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                  inet6 2602:fcfb:1d:2::2/128 scope global
                  valid_lft forever preferred_lft forever
                  inet6 fe80::7f:aeff:fe44:cbc9/64 scope link
                  valid_lft forever preferred_lft forever
                  4: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                  inet6 fe80::bc:a6ff:fe3f:c7cb/64 scope link
                  valid_lft forever preferred_lft forever
                  5: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
                  inet6 fe80::4e3:d6ff:fe00:5b06/64 scope link
                  valid_lft forever preferred_lft forever
                  root@node1:/home/ubuntu# ping6 2602:fcfb:1d:2::4
                  PING 2602:fcfb:1d:2::4(2602:fcfb:1d:2::4) 56 data bytes
                  ^C
                  --- 2602:fcfb:1d:2::4 ping statistics ---
                  5 packets transmitted, 0 received, 100% packet loss, time 4098ms
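
                  One detail stands out in the output above: the NET3 address landed on ens7 (whose link-local fe80::7f:aeff:fe44:cbc9 matches MAC 02:7F:AE:44:CB:C9), but with a /128 prefix, so there is no on-link route covering the gateway or the peer VMs. A sketch of a possible fix, assuming the NET3 subnet is 2602:fcfb:1d:2::/64:

                  ip addr del 2602:fcfb:1d:2::2/128 dev ens7   # drop the host-only address
                  ip addr add 2602:fcfb:1d:2::2/64 dev ens7    # re-add with the full /64 so the subnet is on-link
                  ping6 -c 3 2602:fcfb:1d:2::4                 # retest against a peer VM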

                  in reply to: Long running slice stability issue.  #4797
                  Fengping Hu
                  Participant

                    Hi Komal,

                    Thanks for looking up that information for me. Can you kindly reboot both Node1 and Node2? For Node1, it seems I can’t regain access, but a reboot should help because it would clear the default route I added. For Node2, I was actually never able to log in, so hopefully a reboot can fix that as well.

                    Thanks,

                    Fengping

                    in reply to: Long running slice stability issue.  #4782
                    Fengping Hu
                    Participant

                      Hi Komal,

                      I configured the IP, but it seems the link is not working. Unfortunately, I changed the default route to use this public dataplane link before I put in the policy-based routing (which should keep the management interface working), so I can’t access the VM now.

                      My theory is that the names of the links have probably changed: I assumed ens9 is NET3, but that may have changed after the reboot. I will wait and see if RA clears the default route so I can get in and do the fix; otherwise, I can rebuild the slice if it’s too much trouble. The policy-based routing step I skipped is sketched below.
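
                      Roughly, the skipped step would have been (MGMT_IP and MGMT_GW are placeholders for the node’s management address and gateway):

                      ip -6 rule add from "$MGMT_IP" table 100                      # route management-sourced traffic via table 100
                      ip -6 route add default via "$MGMT_GW" dev ens3 table 100    # keep ens3 as its default path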


                      Thanks,

                      Fengping

                      in reply to: Long running slice stability issue.  #4775
                      Fengping Hu
                      Participant

                        Thank you for getting back to me despite your busy schedules. It’s great that this is a known problem and that a fix will be deployed.

                        The restarted VMs are accessible via the management interface now, but it seems they lost all the other links (we have 3 dataplane links). Is there a way to reattach those links?

                        Thanks,

                        Fengping


                        in reply to: manual cleanup needed? #4582
                        Fengping Hu
                        Participant

                          Hi Komal,

                          Thank you so much for looking into the issue and for the quick fix. I will delete the slice and recreate it tomorrow.

                          Appreciate your help :)

                          Fengping

                          in reply to: manual cleanup needed? #4578
                          Fengping Hu
                          Participant

                            Hi Komal,

                            It seems the slice lost its public IPv6 network connection overnight. I can’t even ping the gateway. The link lost the IPs I had configured statically, even though I had disabled DHCP and RA on it. So I tried to re-add the IPs and routes, and also tried both network3.change_public_ip(ipv6=list(map(str, networkips[0:50]))) and network3.make_ip_publicly_routable(ipv6=list(map(str, networkips[0:50]))) to make the IPs public, but neither seemed to work.

                            Any suggestions on how to fix this network?

                            Thanks,

                            Fengping

                            Here’s the slice information and the symptoms:

                            Slice ID: 08d05419-e99b-4ebe-b4a1-88c07cf2bfa3
                            Name: ServiceXSlice

                            Network ID: 06d92831-1f58-4548-9d24-9284b1273912
                            Name: NET3
                            Layer: L3
                            Type: FABNetv6Ext
                            Site: CERN
                            Subnet: 2602:fcfb:1d:3::/64
                            Gateway: 2602:fcfb:1d:3::1
                            State: Active


                            ubuntu@node1:~$ ping6 2602:fcfb:1d:3::1
                            PING 2602:fcfb:1d:3::1(2602:fcfb:1d:3::1) 56 data bytes
                            ^C
                            --- 2602:fcfb:1d:3::1 ping statistics ---
                            3 packets transmitted, 0 received, 100% packet loss, time 2056ms

                            ubuntu@node1:~$ ip -6 neigh | grep 2602
                            2602:fcfb:1d:3::7 dev ens9 lladdr 02:d2:f1:99:87:98 router REACHABLE
                            2602:fcfb:1d:3::9 dev ens9 lladdr 02:80:38:25:66:c0 router REACHABLE
                            2602:fcfb:1d:3::4 dev ens9 lladdr 02:1d:b9:31:e7:23 router STALE
                            2602:fcfb:1d:3::b dev ens9 lladdr 06:d3:95:0b:44:81 router REACHABLE
                            2602:fcfb:1d:3::6 dev ens9 lladdr 0a:b1:19:54:14:e7 router REACHABLE
                            2602:fcfb:1d:3::1 dev ens9 router FAILED
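
                            For reference, the manual re-add attempt looked roughly like this (ADDR stands for one of the statically configured addresses from our allocated range; the subnet and gateway are from the NET3 info above):

                            ip addr add "$ADDR"/64 dev ens9               # restore a static address with the full /64
                            ip -6 route replace 2602:fcfb:1d:3::/64 dev ens9   # make the subnet on-link again
                            ping6 -c 3 2602:fcfb:1d:3::1                  # retest the gateway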

                            in reply to: manual cleanup needed? #4577
                            Fengping Hu
                            Participant

                              Hi Komal,

                              I tried your recipe and was able to create 10 VMs with 60 cores each, but it failed to create an 11th or 12th VM due to insufficient CPUs. This is a bit counterintuitive, since there were 766 CPUs available (and 12 × 60 = 720) and each of the 6 hosts should be able to run two such VMs. Nevertheless, we are in better shape now with 600+ cores. Thank you so much for the help. I will try the new flavor when it’s available.


                              Thanks,

                              Fengping

                              in reply to: manual cleanup needed? #4574
                              Fengping Hu
                              Participant

                                Hi Komal,

                                Thanks for looking into this for me. This config (cores='62', ram='384', disk='2000') indeed works to create 6 VMs. But it won’t work if I try to create 12 VMs, even if I request half the RAM (192), because of the flavor mapping. So yes, we do need a better flavor in my case. I may need only one big-disk node, to serve as an XCache node; the rest of the nodes can have limited disk, unless we want to use all the disks to set up distributed storage (Ceph, etc.).

                                Please let me know once you have discussed this with your team and have recommendations. The goal is to allocate all the resources with only a few VM flavors (one or two, maybe).

                                Thanks,

                                Fengping

                                in reply to: manual cleanup needed? #4571
                                Fengping Hu
                                Participant

                                  Hi Komal,

                                  The CERN site is essentially dedicated to the ServiceX deployment. I will need to create my slice there for data-access reasons, and there shouldn’t be any slices at CERN other than the ServiceX slice I created. I would like to create big VMs that basically map to the physical machines, so 6 VMs for the 6 physical machines at CERN.

                                  I noticed the available CPUs are 408/768, i.e. 360 less than the total, which is exactly the number of CPUs I requested for my slice this morning. This made me wonder if that slice is still holding the resources. If the resources are held not by the dead slice but by active slices, would you be able to relocate them so I can create my slice there?

                                  Also, what resource request should I use to make a VM take up a whole physical machine?

                                  Thanks,

                                  Fengping
