renew slice did not fully work

#3038
    Gregory Daues
    Participant

      I was able to execute ‘renew slice’ for my Slice, and it appeared to succeed: the Slice still exists, it shows in the Portal display, and the Lease end appears to be extended:

      MySliceSep12B   StableOK    2022-09-17 22:54:26

      However, after the original Lease end passes, the nodes in the Slice are no longer reachable via ssh,
      and their OS Interfaces are now None. So it appears that not all elements of the Slice / Nodes are maintained/extended.
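
      For anyone reproducing this, a renewal done through fablib would look roughly like the sketch below. The slice name is taken from the listing above; the date arithmetic and the specific fablib calls are assumptions about the usual workflow (renewing from the Portal is equivalent).

        # Minimal fablib sketch: renew a slice and confirm the new lease end.
        # Assumes fablib is already configured (fabric_rc / tokens in place).
        from datetime import datetime, timedelta, timezone
        from fabrictestbed_extensions.fablib.fablib import FablibManager

        fablib = FablibManager()
        slice = fablib.get_slice(name="MySliceSep12B")

        # Request a new end date a few days out; fablib expects a date string.
        end_date = (datetime.now(timezone.utc) + timedelta(days=5)).strftime("%Y-%m-%d %H:%M:%S %z")
        slice.renew(end_date)

        # The control framework's view of the lease end after renewal.
        print(slice.get_lease_end())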

      For some additional details, the interfaces of the Nodes now appear as

      --------------------- -----------------------
      Name CMBS4Node_ncsa1-nic1-p1
      Network net1
      Bandwidth 0
      VLAN
      MAC 02:FA:69:DF:46:DD
      Physical OS Interface
      OS Interface
      --------------------- -----------------------
      --------------------- -----------------------
      Name CMBS4Node_tacc2-nic2-p1
      Network net1
      Bandwidth 0
      VLAN
      MAC 06:34:B9:B2:55:E2
      Physical OS Interface
      OS Interface
      --------------------- -----------------------
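
      A quick fablib sketch for dumping these per-interface fields, in case it helps compare before/after renewal; the method names reflect the fablib Interface API as I understand it and may differ slightly between versions.

        # Sketch: print name / network / MAC / OS interface for every interface.
        slice = fablib.get_slice(name="MySliceSep12B")
        for node in slice.get_nodes():
            for iface in node.get_interfaces():
                net = iface.get_network()
                print(f"Name: {iface.get_name()}")
                print(f"Network: {net.get_name() if net else None}")
                print(f"MAC: {iface.get_mac()}")
                print(f"OS Interface: {iface.get_os_interface()}")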
      
      #3040
      Mina William Morcos
      Participant

        FABRIC has recently migrated to Google Cloud. Might this be a result of that migration? I'm not sure. If so, the issue may not happen again if you reserve a new slice.

        #3041
        Gregory Daues
        Participant

          That is a possible cause of disruption; I have a second Slice that will expire today, so I will try out the “renew slice” with this one, and see how that proceeds.

          #3049
          Ilya Baldin
          Participant

            Generally, the Hub update should have had nothing to do with it. Let me know the slice ID and I can extend it with operator tools; in the meantime we will look into whether anything is going wrong in the extend operation.

            #3052
            Gregory Daues
            Participant

              The slice id is 4cb0a209-ca5e-4479-b0b7-e192fe257964, and the Lease end is listed as 2022-09-17 22:54:26.

              #3062
              Ilya Baldin
              Participant

                I decided to leave it as is since we have a couple of days – we will have someone look at this in the interim.

                #3076
                Komal Thareja
                Participant

                  Hello Gregory,

                  I can see that both VMs for your slice are ACTIVE. However, I am unable to SSH into them. I will seek help from the operations team and keep you posted.

                  
                  Reservation ID: da5faa94-1c50-41b7-abf5-47578a82b87b Slice ID: 4cb0a209-ca5e-4479-b0b7-e192fe257964
                  Resource Type: VM Notices: Reservation da5faa94-1c50-41b7-abf5-47578a82b87b (Slice MySliceSep12B(4cb0a209-ca5e-4479-b0b7-e192fe257964) Graph Id:d854653d-cf7b-407d-ae4d-149c5113262b Owner:aroy59@asu.edu) is in state (Active,None_)
                  Start: 2022-09-13 15:50:24 +0000 End: 2022-09-18 03:54:26 +0000 Requested End: 2022-09-18 03:54:26 +0000
                  Units: 1 State: Active Pending State: None_
                  Sliver: {'capacities': '{ core: 32 , ram: 128 G, disk: 100 G}', 'capacity_allocations': '{ core: 32 , ram: 128 G, disk: 100 G}', 'capacity_hints': '{ instance_type: fabric.c32.m128.d100}', 'image_ref': 'default_rocky_8', 'image_type': 'qcow2', 'label_allocations': '{ instance: instance-0000132a, instance_parent: tacc-w3.fabric-testbed.net}', 'management_ip': '129.114.110.85', 'name': 'CMBS4Node_tacc2', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', '8QQBZC3')", 'reservation_info': '{"error_message": "", "reservation_id": "da5faa94-1c50-41b7-abf5-47578a82b87b", "reservation_state": "Active"}', 'site': 'TACC', 'type': 'VM'}
                  ('CMBS4Node_tacc2-nic2', {'capacity_allocations': '{ unit: 1 }', 'details': 'Mellanox ConnectX-6 VPI MCX653 dual port 100Gbps', 'label_allocations': '{ bdf: 0000:e2:0f.6}', 'model': 'ConnectX-6', 'name': 'CMBS4Node_tacc2-nic2', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', '8QQBZC3-slot7')", 'type': 'SharedNIC'})
                  
                  Reservation ID: 2571ddf7-f838-46b8-9095-ed0d36cfec55 Slice ID: 4cb0a209-ca5e-4479-b0b7-e192fe257964
                  Resource Type: L2STS Notices: Reservation 2571ddf7-f838-46b8-9095-ed0d36cfec55 (Slice MySliceSep12B(4cb0a209-ca5e-4479-b0b7-e192fe257964) Graph Id:d854653d-cf7b-407d-ae4d-149c5113262b Owner:aroy59@asu.edu) is in state (Active,None_)
                  Start: 2022-09-13 15:50:25 +0000 End: 2022-09-18 03:54:26 +0000 Requested End: 2022-09-18 03:54:26 +0000
                  Units: 1 State: Active Pending State: None_
                  Sliver: {'layer': 'L2', 'name': 'net1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'node+tacc-data-sw:ip+192.168.16.3-ns')", 'reservation_info': '{"error_message": "", "reservation_id": "2571ddf7-f838-46b8-9095-ed0d36cfec55", "reservation_state": "Active"}', 'type': 'L2STS'}
                  {'capacities': '{ unit: 1 }', 'label_allocations': '{ mac: 02:FA:69:DF:46:DD, vlan: 2121, local_name: HundredGigE0/0/0/5, device_name: ncsa-data-sw}', 'labels': '{ mac: 02:FA:69:DF:46:DD, vlan: 2121, local_name: HundredGigE0/0/0/5, device_name: ncsa-data-sw}', 'name': 'CMBS4Node_ncsa1-CMBS4Node_ncsa1-nic1-p1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'port+ncsa-data-sw:HundredGigE0/0/0/5')", 'type': 'ServicePort'}
                  {'capacities': '{ unit: 1 }', 'label_allocations': '{ mac: 06:34:B9:B2:55:E2, vlan: 2124, local_name: HundredGigE0/0/0/9, device_name: tacc-data-sw}', 'labels': '{ mac: 06:34:B9:B2:55:E2, vlan: 2124, local_name: HundredGigE0/0/0/9, device_name: tacc-data-sw}', 'name': 'CMBS4Node_tacc2-CMBS4Node_tacc2-nic2-p1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'port+tacc-data-sw:HundredGigE0/0/0/9')", 'type': 'ServicePort'}
                  
                  Reservation ID: 430e4832-f048-4368-b8b6-51ff6a5b6932 Slice ID: 4cb0a209-ca5e-4479-b0b7-e192fe257964
                  Resource Type: VM Notices: Reservation 430e4832-f048-4368-b8b6-51ff6a5b6932 (Slice MySliceSep12B(4cb0a209-ca5e-4479-b0b7-e192fe257964) Graph Id:d854653d-cf7b-407d-ae4d-149c5113262b Owner:aroy59@asu.edu) is in state (Active,None_)
                  Start: 2022-09-13 15:50:24 +0000 End: 2022-09-18 03:54:26 +0000 Requested End: 2022-09-18 03:54:26 +0000
                  Units: 1 State: Active Pending State: None_
                  Sliver: {'capacities': '{ core: 32 , ram: 128 G, disk: 100 G}', 'capacity_allocations': '{ core: 32 , ram: 128 G, disk: 100 G}', 'capacity_hints': '{ instance_type: fabric.c32.m128.d100}', 'image_ref': 'default_rocky_8', 'image_type': 'qcow2', 'label_allocations': '{ instance: instance-0000072a, instance_parent: ncsa-w1.fabric-testbed.net}', 'management_ip': '2620:0:c80:1001:f816:3eff:feef:a24c', 'name': 'CMBS4Node_ncsa1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'F1FSZB3')", 'reservation_info': '{"error_message": "", "reservation_id": "430e4832-f048-4368-b8b6-51ff6a5b6932", "reservation_state": "Active"}', 'site': 'NCSA', 'type': 'VM'}
                  ('CMBS4Node_ncsa1-nic1', {'capacity_allocations': '{ unit: 1 }', 'details': 'Mellanox ConnectX-6 VPI MCX653 dual port 100Gbps', 'label_allocations': '{ bdf: 0000:a1:1f.2}', 'model': 'ConnectX-6', 'name': 'CMBS4Node_ncsa1-nic1', 'node_map': "('508c3fa3-df17-41ab-bb95-fdf71c105a61', 'F1FSZB3-slot6')", 'type': 'SharedNIC'})
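
                  From the user side, roughly the same per-sliver state can be checked with fablib; a sketch, with method names that may vary by fablib version:

                    # Sketch: user-side view of node reservation state for this slice.
                    slice = fablib.get_slice(name="MySliceSep12B")
                    for node in slice.get_nodes():
                        print(node.get_name(), node.get_reservation_id(),
                              node.get_reservation_state(), node.get_management_ip())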
                  
                  
                  #3078
                  Komal Thareja
                  Participant

                    Also, from a previous conversation, I found that the slice was renewed on Sep 8. Could you please let us know when you lost SSH connectivity?

                    #3084
                    Ilya Baldin
                    Participant

                      Hello,

                      We looked into it. Based on the console messages this is the last thing that happened on both VMs (they are both running but inaccessible):

                      [ 1036.151967] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
                      [ 1036.157749] Bridge firewalling registered
                      [ 1036.585103] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
                      [ 1073.856805] docker0: port 1(veth1d3b361) entered blocking state
                      [ 1073.857386] docker0: port 1(veth1d3b361) entered disabled state
                      [ 1073.857932] device veth1d3b361 entered promiscuous mode
                      [ 1073.858555] IPv6: ADDRCONF(NETDEV_UP): veth1d3b361: link is not ready
                      [ 1073.859104] docker0: port 1(veth1d3b361) entered blocking state
                      [ 1073.859636] docker0: port 1(veth1d3b361) entered forwarding state
                      [ 1073.860777] docker0: port 1(veth1d3b361) entered disabled state
                      [ 1073.929285] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
                      [ 1074.037119] eth0: renamed from veth77fda88
                      [ 1074.048595] IPv6: ADDRCONF(NETDEV_CHANGE): veth1d3b361: link becomes ready
                      [ 1074.049455] docker0: port 1(veth1d3b361) entered blocking state
                      [ 1074.050216] docker0: port 1(veth1d3b361) entered forwarding state
                      [ 1074.050968] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
                      [ 1074.155801] veth77fda88: renamed from eth0
                      [ 1074.176569] docker0: port 1(veth1d3b361) entered disabled state
                      [ 1074.179168] docker0: port 1(veth1d3b361) entered disabled state
                      [ 1074.180904] device veth1d3b361 left promiscuous mode
                      [ 1074.181585] docker0: port 1(veth1d3b361) entered disabled state

                      This suggests that the Docker daemon stomped on the management network configuration in both VMs and made the management interface inaccessible. We are currently unable to access these VMs to undo this, so the only suggestion I have is to create a new slice and be careful when turning Docker on.
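
                      One common mitigation, not verified against these particular VMs: pin Docker's bridge networks to ranges that cannot collide with the VM's management network before starting the daemon. A minimal sketch follows; the address ranges are placeholders, not FABRIC-specific values.

                        # Sketch only (run as root inside the VM): write a Docker daemon config
                        # that pins bridge networks to placeholder ranges. The ranges below are
                        # assumptions and must not overlap the VM's management subnet.
                        import json, pathlib, subprocess

                        cfg = {
                            "bip": "172.26.0.1/24",
                            "default-address-pools": [{"base": "172.27.0.0/16", "size": 24}],
                        }
                        pathlib.Path("/etc/docker").mkdir(parents=True, exist_ok=True)
                        pathlib.Path("/etc/docker/daemon.json").write_text(json.dumps(cfg, indent=2))
                        subprocess.run(["systemctl", "restart", "docker"], check=True)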

                      #3087
                      Ilya Baldin
                      Participant

                        Also, I don't know if this is related to this problem, but you may find this article useful:

                        Network Interfaces in FABRIC VMs

                        #3092
                        Gregory Daues
                        Participant

                           

                          Yes, I can delete this particular Slice or let it expire; it was just a matter of understanding what had happened so that I can apply it to future Slices. I will look into the issues with the Docker configuration. Thanks!

                          #3129
                          Gregory Daues
                          Participant

                            I think that I am still seeing the original issue. I have another Slice, MySliceSep18A (1ae8fdff-9514-4042-a9af-e826d0c4b646), that was created yesterday. The Slice was renewed, and the Lease End now reads 2022-09-23 16:23:41.
                            It is now around the time that the Slice was originally intended to expire, and I see that I have lost the ability to ssh to the nodes. The nodes of this Slice have had no Docker installation at all from the beginning. Can this be examined in any way?

                             

                            #3131
                            Ilya Baldin
                            Participant

                              OK, I’m not seeing anything obvious – your slice is good through 2022-09-23 21:23:41+00:00, but at least the TACC VM is not responding to pings. We will create a ticket for this and copy you on it.

                              #3157
                              Gregory Daues
                              Participant

                                Following up with the latest test results. I did a very synthetic test: I started up a Slice,
                                MySliceSep22A (5e995249-8f5b-45b4-ac11-6b968e9a3f66),
                                with a single node at one site (MICH). No L2/L3 networks were added, no additional software installs, etc.
                                I was able to log in with

                                ssh -F ~/.ssh/fabric-ssh-config -i ${FABRIC_SLICE_PRIVATE_KEY_FILE}   rocky@2607:f018:110:11:f816:3eff:fe9e:4eb4

                                for the first day. The original end date was 2022-09-23 10:21:47; the extended end date is 2022-09-25 19:56:40.
                                This node of the slice is now unreachable:

                                > ssh -F ~/.ssh/fabric-ssh-config -i ${FABRIC_SLICE_PRIVATE_KEY_FILE}   rocky@2607:f018:110:11:f816:3eff:fe9e:4eb4

                                 Warning: Permanently added 'bastion-1.fabric-testbed.net,2600:2701:5000:a902::c' (ECDSA) to the list of known hosts.

                                channel 0: open failed: connect failed: No route to host

                                stdio forwarding failed

                                kex_exchange_identification: Connection closed by remote host

                                 Each day I generate a new token from the FABRIC credential manager; hopefully this is not an issue of needing to keep the original token going for the lifetime of the Slice (I am not even sure that is possible).
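
                                 For reference, the synthetic test above corresponds roughly to this fablib sketch. The slice name and site come from this post; the node name and the remaining calls are assumptions about the usual workflow.

                                   # Sketch of the synthetic test: one VM at MICH, no extra networks or software.
                                   slice = fablib.new_slice(name="MySliceSep22A")
                                   slice.add_node(name="node1", site="MICH")
                                   slice.submit()

                                   # After renewing, confirm the lease end and basic reachability.
                                   slice = fablib.get_slice(name="MySliceSep22A")
                                   print(slice.get_lease_end())
                                   for node in slice.get_nodes():
                                       print(node.get_management_ip())
                                       print(node.execute("hostname"))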

                                 

                                #3158
                                Ilya Baldin
                                Participant

                                  Would you mind posting this to FIP-153 (responding to the email)? We are tracking this case there.
