renew slice did not fully work
- This topic has 14 replies, 4 voices, and was last updated 2 years, 2 months ago by Ilya Baldin.
September 15, 2022 at 9:45 am #3038
I was able to execute ‘renew slice’ for my Slice, and it appeared to succeed: the Slice still exists, it shows in the Portal display, and the Lease end appears to be extended:
MySliceSep12B StableOK 2022-09-17 22:54:26
However, after the original Lease end the nodes of the Slice are no longer reachable via ssh, and their OS Interfaces are now None. So it appears that not all elements of the Slice / Nodes are maintained/extended.
For some additional details, the interfaces of the Nodes now appear as
---------------------  -----------------------
Name                   CMBS4Node_ncsa1-nic1-p1
Network                net1
Bandwidth              0
VLAN
MAC                    02:FA:69:DF:46:DD
Physical OS Interface
OS Interface
---------------------  -----------------------
---------------------  -----------------------
Name                   CMBS4Node_tacc2-nic2-p1
Network                net1
Bandwidth              0
VLAN
MAC                    06:34:B9:B2:55:E2
Physical OS Interface
OS Interface
---------------------  -----------------------
September 15, 2022 at 9:54 am #3040
FABRIC has recently migrated to Google Cloud. Might this be because of the recent migration? Not sure. If so, this issue may not happen again if you reserve a new slice.
September 15, 2022 at 9:57 am #3041
That is a possible cause of disruption; I have a second Slice that will expire today, so I will try “renew slice” on that one and see how it proceeds.
September 15, 2022 at 12:50 pm #3049
Generally, the Hub update should have had nothing to do with it. Let me know the slice ID and I can extend it with operator tools; in the meantime we will check whether anything is going wrong in the extend operation.
September 15, 2022 at 1:36 pm #3052
The slice ID is 4cb0a209-ca5e-4479-b0b7-e192fe257964, and the Lease end is listed as 2022-09-17 22:54:26.
September 15, 2022 at 2:43 pm #3062
I decided to leave it as is since we have a couple of days; we will have someone look at this in the interim.
September 15, 2022 at 8:51 pm #3076
Hello Gregory,
I can see that both VMs for your slice are ACTIVE. However, I am unable to SSH into them. I will seek help from the operations team and keep you posted.
Reservation ID: da5faa94-1c50-41b7-abf5-47578a82b87b
Slice ID: 4cb0a209-ca5e-4479-b0b7-e192fe257964
Resource Type: VM
Notices: Reservation da5faa94-1c50-41b7-abf5-47578a82b87b (Slice MySliceSep12B(4cb0a209-ca5e-4479-b0b7-e192fe257964) Graph Id:d854653d-cf7b-407d-ae4d-149c5113262b Owner:aroy59@asu.edu) is in state (Active,None_)
Start: 2022-09-13 15:50:24 +0000
End: 2022-09-18 03:54:26 +0000
Requested End: 2022-09-18 03:54:26 +0000
Units: 1
State: Active
Pending State: None_
Sliver: {'capacities': '{ core: 32 , ram: 128 G, disk: 100 G}', 'capacity_allocations': '{ core: 32 , ram: 128 G, disk: 100 G}', 'capacity_hints': '{ instance_type: fabric.c32.m128.d100}', 'image_ref': 'default_rocky_8', 'image_type': 'qcow2', 'label_allocations': '{ instance: instance-0000132a, instance_parent: tacc-w3.fabric-testbed.net}', 'management_ip': '129.114.110.85', 'name': 'CMBS4Node_tacc2', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', '8QQBZC3')", 'reservation_info': '{"error_message": "", "reservation_id": "da5faa94-1c50-41b7-abf5-47578a82b87b", "reservation_state": "Active"}', 'site': 'TACC', 'type': 'VM'}
('CMBS4Node_tacc2-nic2', {'capacity_allocations': '{ unit: 1 }', 'details': 'Mellanox ConnectX-6 VPI MCX653 dual port 100Gbps', 'label_allocations': '{ bdf: 0000:e2:0f.6}', 'model': 'ConnectX-6', 'name': 'CMBS4Node_tacc2-nic2', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', '8QQBZC3-slot7')", 'type': 'SharedNIC'})

Reservation ID: 2571ddf7-f838-46b8-9095-ed0d36cfec55
Slice ID: 4cb0a209-ca5e-4479-b0b7-e192fe257964
Resource Type: L2STS
Notices: Reservation 2571ddf7-f838-46b8-9095-ed0d36cfec55 (Slice MySliceSep12B(4cb0a209-ca5e-4479-b0b7-e192fe257964) Graph Id:d854653d-cf7b-407d-ae4d-149c5113262b Owner:aroy59@asu.edu) is in state (Active,None_)
Start: 2022-09-13 15:50:25 +0000
End: 2022-09-18 03:54:26 +0000
Requested End: 2022-09-18 03:54:26 +0000
Units: 1
State: Active
Pending State: None_
Sliver: {'layer': 'L2', 'name': 'net1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'node+tacc-data-sw:ip+192.168.16.3-ns')", 'reservation_info': '{"error_message": "", "reservation_id": "2571ddf7-f838-46b8-9095-ed0d36cfec55", "reservation_state": "Active"}', 'type': 'L2STS'}
{'capacities': '{ unit: 1 }', 'label_allocations': '{ mac: 02:FA:69:DF:46:DD, vlan: 2121, local_name: HundredGigE0/0/0/5, device_name: ncsa-data-sw}', 'labels': '{ mac: 02:FA:69:DF:46:DD, vlan: 2121, local_name: HundredGigE0/0/0/5, device_name: ncsa-data-sw}', 'name': 'CMBS4Node_ncsa1-CMBS4Node_ncsa1-nic1-p1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'port+ncsa-data-sw:HundredGigE0/0/0/5')", 'type': 'ServicePort'}
{'capacities': '{ unit: 1 }', 'label_allocations': '{ mac: 06:34:B9:B2:55:E2, vlan: 2124, local_name: HundredGigE0/0/0/9, device_name: tacc-data-sw}', 'labels': '{ mac: 06:34:B9:B2:55:E2, vlan: 2124, local_name: HundredGigE0/0/0/9, device_name: tacc-data-sw}', 'name': 'CMBS4Node_tacc2-CMBS4Node_tacc2-nic2-p1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'port+tacc-data-sw:HundredGigE0/0/0/9')", 'type': 'ServicePort'}

Reservation ID: 430e4832-f048-4368-b8b6-51ff6a5b6932
Slice ID: 4cb0a209-ca5e-4479-b0b7-e192fe257964
Resource Type: VM
Notices: Reservation 430e4832-f048-4368-b8b6-51ff6a5b6932 (Slice MySliceSep12B(4cb0a209-ca5e-4479-b0b7-e192fe257964) Graph Id:d854653d-cf7b-407d-ae4d-149c5113262b Owner:aroy59@asu.edu) is in state (Active,None_)
Start: 2022-09-13 15:50:24 +0000
End: 2022-09-18 03:54:26 +0000
Requested End: 2022-09-18 03:54:26 +0000
Units: 1
State: Active
Pending State: None_
Sliver: {'capacities': '{ core: 32 , ram: 128 G, disk: 100 G}', 'capacity_allocations': '{ core: 32 , ram: 128 G, disk: 100 G}', 'capacity_hints': '{ instance_type: fabric.c32.m128.d100}', 'image_ref': 'default_rocky_8', 'image_type': 'qcow2', 'label_allocations': '{ instance: instance-0000072a, instance_parent: ncsa-w1.fabric-testbed.net}', 'management_ip': '2620:0:c80:1001:f816:3eff:feef:a24c', 'name': 'CMBS4Node_ncsa1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'F1FSZB3')", 'reservation_info': '{"error_message": "", "reservation_id": "430e4832-f048-4368-b8b6-51ff6a5b6932", "reservation_state": "Active"}', 'site': 'NCSA', 'type': 'VM'}
('CMBS4Node_ncsa1-nic1', {'capacity_allocations': '{ unit: 1 }', 'details': 'Mellanox ConnectX-6 VPI MCX653 dual port 100Gbps', 'label_allocations': '{ bdf: 0000:a1:1f.2}', 'model': 'ConnectX-6', 'name': 'CMBS4Node_ncsa1-nic1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'F1FSZB3-slot6')", 'type': 'SharedNIC'})
- This reply was modified 2 years, 2 months ago by Komal Thareja.
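For anyone who wants to repeat the check made in the reply above (every reservation Active, all with the same lease end), here is a minimal Python sketch. It assumes the operator output keeps the "Reservation ID: … Resource Type: … End: … Units: … State: …" field order seen in this post; the names `DUMP`, `parse_reservations`, and `lease_is_consistent` are hypothetical, and the sample is an abridged excerpt (Notices and Sliver fields omitted), not the full output.

```python
import re
from datetime import datetime

# Abridged excerpt of the operator output quoted in the post
# (Notices and Sliver fields omitted for brevity).
DUMP = (
    "Reservation ID: da5faa94-1c50-41b7-abf5-47578a82b87b "
    "Slice ID: 4cb0a209-ca5e-4479-b0b7-e192fe257964 Resource Type: VM "
    "Start: 2022-09-13 15:50:24 +0000 End: 2022-09-18 03:54:26 +0000 "
    "Requested End: 2022-09-18 03:54:26 +0000 Units: 1 State: Active "
    "Pending State: None_ "
    "Reservation ID: 2571ddf7-f838-46b8-9095-ed0d36cfec55 "
    "Slice ID: 4cb0a209-ca5e-4479-b0b7-e192fe257964 Resource Type: L2STS "
    "Start: 2022-09-13 15:50:25 +0000 End: 2022-09-18 03:54:26 +0000 "
    "Requested End: 2022-09-18 03:54:26 +0000 Units: 1 State: Active "
    "Pending State: None_"
)

# The negative lookbehind skips "Requested End:" so only "End:" is captured.
RESERVATION = re.compile(
    r"Reservation ID: (?P<rid>[0-9a-f-]{36})"
    r".*?Resource Type: (?P<rtype>\w+)"
    r".*?(?<!Requested )End: (?P<end>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d [+-]\d{4})"
    r".*?Units: \d+ State: (?P<state>\w+)",
    re.DOTALL,
)


def parse_reservations(dump):
    """Return one dict per reservation found in the text dump."""
    rows = []
    for m in RESERVATION.finditer(dump):
        end = datetime.strptime(m["end"], "%Y-%m-%d %H:%M:%S %z")
        rows.append({"id": m["rid"], "type": m["rtype"],
                     "end": end, "state": m["state"]})
    return rows


def lease_is_consistent(rows):
    """True when every reservation is Active and all share one lease end."""
    return (bool(rows)
            and all(r["state"] == "Active" for r in rows)
            and len({r["end"] for r in rows}) == 1)


rows = parse_reservations(DUMP)
for r in rows:
    print(r["type"], r["state"], r["end"].isoformat())
print("consistent:", lease_is_consistent(rows))
```

A consistent result only says the control framework considers the slivers extended; as this thread shows, the VMs can still be unreachable.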
September 15, 2022 at 9:38 pm #3078
Also, from a previous conversation, I found that the slice was renewed on Sep 8. Could you please let us know when you lost SSH connectivity?
September 16, 2022 at 4:09 pm #3084
Hello,
We looked into it. Based on the console messages, this is the last thing that happened on both VMs (they are both running but inaccessible):
[ 1036.151967] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 1036.157749] Bridge firewalling registered
[ 1036.585103] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[ 1073.856805] docker0: port 1(veth1d3b361) entered blocking state
[ 1073.857386] docker0: port 1(veth1d3b361) entered disabled state
[ 1073.857932] device veth1d3b361 entered promiscuous mode
[ 1073.858555] IPv6: ADDRCONF(NETDEV_UP): veth1d3b361: link is not ready
[ 1073.859104] docker0: port 1(veth1d3b361) entered blocking state
[ 1073.859636] docker0: port 1(veth1d3b361) entered forwarding state
[ 1073.860777] docker0: port 1(veth1d3b361) entered disabled state
[ 1073.929285] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
[ 1074.037119] eth0: renamed from veth77fda88
[ 1074.048595] IPv6: ADDRCONF(NETDEV_CHANGE): veth1d3b361: link becomes ready
[ 1074.049455] docker0: port 1(veth1d3b361) entered blocking state
[ 1074.050216] docker0: port 1(veth1d3b361) entered forwarding state
[ 1074.050968] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
[ 1074.155801] veth77fda88: renamed from eth0
[ 1074.176569] docker0: port 1(veth1d3b361) entered disabled state
[ 1074.179168] docker0: port 1(veth1d3b361) entered disabled state
[ 1074.180904] device veth1d3b361 left promiscuous mode
[ 1074.181585] docker0: port 1(veth1d3b361) entered disabled state
This suggests that the Docker daemon stomped over the management network configuration in both VMs and made the management interface inaccessible. We are currently unable to access these VMs to undo this, so the only suggestion I have is that you create a new slice and be careful when turning Docker on.
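As a reading aid for console output like the above, the following Python sketch extracts the bridge-port state transitions and reports the last state seen for each port. It only summarizes the log; it does not confirm or rule out Docker as the cause. The names `CONSOLE`, `port_events`, and `final_states` are hypothetical, and the sample is a subset of the lines quoted in the post. The bracketed numbers are kernel timestamps in seconds since boot.

```python
import re

# A subset of the console lines quoted in the post.
CONSOLE = """\
[ 1073.856805] docker0: port 1(veth1d3b361) entered blocking state
[ 1073.857386] docker0: port 1(veth1d3b361) entered disabled state
[ 1074.050216] docker0: port 1(veth1d3b361) entered forwarding state
[ 1074.155801] veth77fda88: renamed from eth0
[ 1074.181585] docker0: port 1(veth1d3b361) entered disabled state
"""

# Matches kernel bridge messages of the form
# "[ <secs>] <bridge>: port N(<dev>) entered <state> state".
PORT_EVENT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<bridge>\w+): "
    r"port \d+\((?P<dev>\w+)\) entered (?P<state>\w+) state"
)


def port_events(log):
    """Return (timestamp, bridge, device, state) for each transition."""
    return [(float(m["ts"]), m["bridge"], m["dev"], m["state"])
            for m in PORT_EVENT.finditer(log)]


def final_states(events):
    """Map (bridge, device) to the last (state, timestamp) observed."""
    last = {}
    for ts, bridge, dev, state in events:
        last[(bridge, dev)] = (state, ts)
    return last


for (bridge, dev), (state, ts) in final_states(port_events(CONSOLE)).items():
    print(f"{bridge}/{dev}: {state} at {ts:.3f}s after boot")
```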
September 16, 2022 at 5:58 pm #3087
Also, I don’t know if this is related to this problem, but you may find this article useful:
September 16, 2022 at 7:46 pm #3092
Yes, I can delete this particular Slice or let it expire; it was just a matter of understanding what had happened so I can apply that to future Slices. I will look into the issues with the Docker configuration. Thanks!
September 19, 2022 at 2:09 pm #3129
I think that I am still seeing the original issue. I have another Slice, MySliceSep18A (1ae8fdff-9514-4042-a9af-e826d0c4b646), that was created yesterday. The Slice was renewed and the Lease End now states 2022-09-23 16:23:41.
It is now around the time that the Slice was originally intended to expire, and I see that I have lost the ability to ssh to the nodes. The nodes of this Slice have had no Docker installation at all, from the beginning. Can this be examined in any way?
September 19, 2022 at 3:06 pm #3131
OK, I’m not seeing anything obvious: your slice is good through 2022-09-23 21:23:41+00:00, but at least the TACC VM is not responding to pings. We will create a ticket for this and copy you on it.
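The two lease-end values quoted for this slice (2022-09-23 16:23:41 and 2022-09-23 21:23:41+00:00) differ by exactly five hours, as do the two quoted earlier for MySliceSep12B. A minimal Python sketch of the comparison follows. It assumes the user-side value is local time at UTC-5; that offset is an inference from the numbers in this thread, not something anyone stated, and `ASSUMED_LOCAL` and `to_utc` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the user-side lease end is shown in local time at UTC-5.
ASSUMED_LOCAL = timezone(timedelta(hours=-5))


def to_utc(local_str, tz=ASSUMED_LOCAL):
    """Interpret 'YYYY-MM-DD HH:MM:SS' in tz and convert it to UTC."""
    naive = datetime.strptime(local_str, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)


# Lease ends as reported by the user for the two slices in this thread.
for shown in ("2022-09-17 22:54:26", "2022-09-23 16:23:41"):
    print(shown, "->", to_utc(shown).isoformat(sep=" "))
```

Under that assumption both sides are reporting the same instant, so the renewed lease end itself is not in dispute; the open question is why the VMs become unreachable before it.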
September 23, 2022 at 11:35 am #3157
Following up with the latest test results. I did a very synthetic test: I started up a Slice
MySliceSep22A 5e995249-8f5b-45b4-ac11-6b968e9a3f66
with a single node at a site (MICH). No L2/L3 networks added, no additional software installs etc.
I was able to log in with
ssh -F ~/.ssh/fabric-ssh-config -i ${FABRIC_SLICE_PRIVATE_KEY_FILE} rocky@2607:f018:110:11:f816:3eff:fe9e:4eb4
for the first day. The original end date was 2022-09-23 10:21:47, the extended end date is 2022-09-25 19:56:40.
This node of the slice is now unreachable:
> ssh -F ~/.ssh/fabric-ssh-config -i ${FABRIC_SLICE_PRIVATE_KEY_FILE} rocky@2607:f018:110:11:f816:3eff:fe9e:4eb4
Warning: Permanently added 'bastion-1.fabric-testbed.net,2600:2701:5000:a902::c' (ECDSA) to the list of known hosts.
channel 0: open failed: connect failed: No route to host
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
Each day I generate a new token from the FABRIC Credential Manager; hopefully the issue is not that the original token must be kept going for the lifetime of the Slice (I am not even sure that is possible).
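When collecting results from repeated tests like this one, it can help to separate "the bastion was reached but could not connect onward to the VM" from other failures. The Python sketch below classifies ssh stderr text on that basis. It is a rough heuristic: the matched strings are messages OpenSSH commonly prints (including the ones in this post), the category labels are hypothetical, and reading "channel 0: open failed" as a failure on the jump-host side is my interpretation of OpenSSH behavior rather than anything stated in the thread.

```python
def classify_ssh_failure(stderr):
    """Roughly classify an ssh failure from its stderr text."""
    text = stderr.lower()
    # These messages appear when the jump host could not open the
    # forwarded connection to the final destination.
    if "channel 0: open failed" in text or "stdio forwarding failed" in text:
        if "no route to host" in text:
            return "bastion-ok-target-no-route"
        if "connection refused" in text:
            return "bastion-ok-target-refused"
        if "timed out" in text:
            return "bastion-ok-target-timeout"
        return "bastion-ok-forward-failed"
    if "permission denied" in text:
        return "auth-failed"
    if "timed out" in text:
        return "timeout"
    return "unknown"


# The error lines quoted in the post.
STDERR = """\
channel 0: open failed: connect failed: No route to host
stdio forwarding failed
kex_exchange_identification: Connection closed by remote host
"""

print(classify_ssh_failure(STDERR))
```

For the output in this post the function returns `bastion-ok-target-no-route`, which would suggest the failure is between the bastion and the VM rather than in the key or token used to reach the bastion.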
September 23, 2022 at 12:54 pm #3158
Would you mind posting this to FIP-153 (by responding to the email)? We are tracking this case there.