1. STAR site power loss, connectivity losses


  • #5328
    Ilya Baldin
    Participant

      Dear experimenters,

      Due to a pipe break in a building adjacent to the one hosting our STAR site in Chicago, electrical power has been turned off and the site is down. Also down are the links to INDI, NCSA, EDC and MICH, as well as the Chameleon Chicago facility port, all of which go through STAR and are therefore currently disconnected from the rest of the FABRIC dataplane. Due to the severity of the incident, we do not yet have an estimated time to repair. We have been told an update will be provided by noon Central time tomorrow (Monday).

      #5329
      Ilya Baldin
      Participant

        In addition, the CloudLab Wisconsin Facility Port is now also disconnected from the rest of FABRIC.

        The following sites have been placed into maintenance until we know more: STAR, NCSA, INDI, EDC.

        #5345
        Ilya Baldin
        Participant

          Some of the equipment connecting the STAR site to the FABRIC backbone has sustained damage and is being replaced. We will continue updating this thread as more information becomes available.

          #5347
          yoursunny
          Participant

            The STAR outage seems to be affecting the creation of FABNetv4Ext networks. The control software appears to be trying to access the STAR switch, and the connection times out. This occurs even if the node is in the WASH site, where the FABNetv4Ext peering connection exists.

            Slice Exception: Slice Name: v4gateway@1695137544, Slice ID: f20f1cff-11b0-4db9-9ffb-5b265c3653b6: Slice Exception: Slice Name: v4gateway@1695137544, Slice ID: f20f1cff-11b0-4db9-9ffb-5b265c3653b6: Node: gateway, Site: PSC, State: Active,
            Slice Exception: Slice Name: v4gateway@1695137544, Slice ID: f20f1cff-11b0-4db9-9ffb-5b265c3653b6: Slice Exception: Slice Name: v4gateway@1695137544, Slice ID: f20f1cff-11b0-4db9-9ffb-5b265c3653b6: Node: gateway, Site: PSC, State: Active,

            failed lease update- all units failed priming: Exception during modify for unit: 5a8383f3-30aa-41d8-9874-46b61ebbe621 Playbook has failed tasks: NSO commit returned JSON-RPC error: type: rpc.method.failed, code: -32000, message: Method failed, data: message: Failed to connect to device star-data-sw: connection refused: NEDCOM CONNECT: The kexTimeout (20000 ms) expired. in new state, internal: jsonrpc_tx_commit357#all units failed priming: Exception during modify for unit: 5a8383f3-30aa-41d8-9874-46b61ebbe621 Playbook has failed tasks: NSO commit returned JSON-RPC error: type: rpc.method.failed, code: -32000, message: Method failed, data: message: Failed to connect to device star-data-sw: connection refused: NEDCOM CONNECT: The kexTimeout (20000 ms) expired. in new state, internal: jsonrpc_tx_commit357#

            The control software should choose alternate paths to reach the peering port. The control software should skip switches in maintenance, and attempt to re-apply the configuration when the maintenance mode is lifted.
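The behaviour suggested above can be illustrated with a minimal sketch. It assumes a hypothetical, simplified topology (the adjacency below is for illustration only and is not the real FABRIC backbone map) and a hypothetical helper named `find_path`; it does not describe how the FABRIC control software is actually implemented. A breadth-first search simply refuses to traverse any site flagged as being in maintenance:

```python
from collections import deque

# Hypothetical, simplified topology for illustration only;
# this is NOT the real FABRIC backbone map.
TOPOLOGY = {
    "PSC": ["WASH", "STAR"],
    "WASH": ["PSC", "STAR"],
    "STAR": ["PSC", "WASH", "NCSA"],
    "NCSA": ["STAR"],
}


def find_path(topology, src, dst, maintenance):
    """Return the shortest hop path from src to dst that avoids every
    site in `maintenance`, or None if no such path exists."""
    if src in maintenance or dst in maintenance:
        return None
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for neighbor in topology.get(path[-1], []):
            # Skip sites already explored and sites in maintenance.
            if neighbor not in visited and neighbor not in maintenance:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# With STAR in maintenance, WASH is still reachable from PSC,
# while NCSA (reachable only through STAR in this toy map) is not.
path_to_wash = find_path(TOPOLOGY, "PSC", "WASH", maintenance={"STAR"})
path_to_ncsa = find_path(TOPOLOGY, "PSC", "NCSA", maintenance={"STAR"})
print(path_to_wash)
print(path_to_ncsa)
```

In such a scheme, a request whose result is `None` could be queued and retried once the maintenance flag is lifted, rather than failing on a connection timeout to the unreachable switch.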

            #5348
            Ilya Baldin
            Participant

              We need to look into why this happens, thank you for reporting it.

              #5350
              Ilya Baldin
              Participant

                We updated the configuration so the system should use the WASH peering exclusively.

                #5352
                yoursunny
                Participant

                  FABNetv4Ext establishment is working, but I’m seeing connectivity issues to many destinations.

                  ubuntu@v4gateway:~$ mtr -4bwz -c4 --tcp -P 6363 hobo.cs.arizona.edu
                  Start: 2023-09-20T16:06:38+0000
                  HOST: v4gateway                                                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
                    1. AS398900 23.134.233.81                                                          0.0%     4    0.5   0.5   0.5   0.5   0.0
                    2. AS???    10.133.0.141                                                           0.0%     4   13.2  13.2  13.1  13.2   0.0
                    3. AS11537  hundredge-0-0-0-28.1000.core1.wash.net.internet2.edu (198.71.45.162)   0.0%     4   15.4  15.1  14.5  15.7   0.5
                    4. AS???    ???                                                                   100.0     4    0.0   0.0   0.0   0.0   0.0
                  
                  ubuntu@v4gateway:~$ mtr -4bwz -c4 --tcp -P 5201 ash.speedtest.clouvider.net
                  Start: 2023-09-20T16:07:49+0000
                  HOST: v4gateway              Loss%   Snt   Last   Avg  Best  Wrst StDev
                    1. AS398900 23.134.233.81   0.0%     4    0.5   0.5   0.5   0.5   0.0
                    2. AS???    10.133.0.141    0.0%     4   13.4  13.2  13.1  13.4   0.1
                    3. AS???    ???            100.0     4    0.0   0.0   0.0   0.0   0.0
                  

                  Maybe some routing adjustment is needed too?
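For readers who want to check traces like the ones above automatically, here is a minimal sketch that finds the first hop reporting 100% loss. It assumes the report layout shown in this post (hop number, ASN column from `-z`, host, then seven numeric columns); the function names `parse_mtr_report` and `first_dead_hop` are hypothetical helpers, not part of mtr.

```python
# Second trace from the post above, used as sample input.
SAMPLE = """\
Start: 2023-09-20T16:07:49+0000
HOST: v4gateway              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS398900 23.134.233.81   0.0%     4    0.5   0.5   0.5   0.5   0.0
  2. AS???    10.133.0.141    0.0%     4   13.4  13.2  13.1  13.4   0.1
  3. AS???    ???            100.0     4    0.0   0.0   0.0   0.0   0.0
"""


def parse_mtr_report(text):
    """Parse hop lines into (hop, asn, host, loss_percent) tuples."""
    hops = []
    for line in text.splitlines():
        fields = line.split()
        # Hop lines start with "N." and have at least 10 fields;
        # this skips the "Start:" and "HOST:" header lines.
        if len(fields) < 10 or not fields[0].rstrip(".").isdigit():
            continue
        hop = int(fields[0].rstrip("."))
        asn = fields[1]
        # The last seven fields are Loss, Snt, Last, Avg, Best, Wrst, StDev.
        host = " ".join(fields[2:-7])
        # mtr drops the "%" sign when the value is 100.0, so strip it if present.
        loss = float(fields[-7].rstrip("%"))
        hops.append((hop, asn, host, loss))
    return hops


def first_dead_hop(hops):
    """Return the first hop with 100% loss, or None if every hop replied."""
    for hop in hops:
        if hop[3] >= 100.0:
            return hop
    return None


hops = parse_mtr_report(SAMPLE)
dead = first_dead_hop(hops)
print("last responding hop:", hops[dead[0] - 2][2] if dead else hops[-1][2])
print("first dead hop:", dead)
```

A hop with 100% loss is not always a fault, since some routers do not answer probes, but when every later hop is also silent, as in both traces above, it marks where forwarding appears to stop.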

                  #5362
                  Paul Ruth
                  Keymaster

                    I was experiencing the same problem. It should be fixed now.

                    It was related to the damaged machines at StarLight. All of the ext services should now be moved to the WASH site.

                    Paul

                    #5417
                    Ilya Baldin
                    Participant

                      Dear experimenters,

                      The majority of repairs at StarLight have been completed. The FABRIC dataplane is functioning again, and the EDC and INDI sites are being released from maintenance. We expect STAR and NCSA to come back soon as well.

                      #5418
                      Ilya Baldin
                      Participant

                        NCSA is also now online.

                        #5429
                        Mert Cevik
                        Moderator

                          STAR is online.
