STAR site power loss, connectivity losses
- This topic has 10 replies, 4 voices, and was last updated 1 year, 2 months ago by Mert Cevik.
September 17, 2023 at 10:11 pm #5328
Dear experimenters,
Due to a pipe break in a building adjacent to the one hosting our STAR site in Chicago, electrical power has been turned off and the site is down. Also down are the links to INDI, NCSA, EDC and MICH, as well as the Chameleon Chicago facility port. All of these go through STAR, so they are currently disconnected from the rest of the FABRIC dataplane. Due to the severity of the incident, we do not yet have an estimated time to repair; we have been told an update will be provided by noon Central time tomorrow (Monday).
September 17, 2023 at 10:19 pm #5329
In addition, the CloudLab Wisconsin Facility Port is now also disconnected from the rest of FABRIC.
The following sites have been placed into maintenance until we know more: STAR, NCSA, INDI, EDC.
September 19, 2023 at 10:45 am #5345
Some of the equipment connecting the STAR site to the FABRIC backbone has sustained damage and is being replaced. We will continue updating this thread as more information becomes available.
September 19, 2023 at 11:59 am #5347
The STAR outage seems to be affecting the creation of FABNetv4Ext networks. It appears that the control software tries to access the STAR switch and times out. This occurs even if the node is at the WASH site, where the FABNetv4Ext peering connection exists.
Slice Exception: Slice Name: v4gateway@1695137544, Slice ID: f20f1cff-11b0-4db9-9ffb-5b265c3653b6: Slice Exception: Slice Name: v4gateway@1695137544, Slice ID: f20f1cff-11b0-4db9-9ffb-5b265c3653b6: Node: gateway, Site: PSC, State: Active,failed lease update- all units failed priming: Exception during modify for unit: 5a8383f3-30aa-41d8-9874-46b61ebbe621 Playbook has failed tasks: NSO commit returned JSON-RPC error: type: rpc.method.failed, code: -32000, message: Method failed, data: message: Failed to connect to device star-data-sw: connection refused: NEDCOM CONNECT: The kexTimeout (20000 ms) expired. in new state, internal: jsonrpc_tx_commit357#all units failed priming: Exception during modify for unit: 5a8383f3-30aa-41d8-9874-46b61ebbe621 Playbook has failed tasks: NSO commit returned JSON-RPC error: type: rpc.method.failed, code: -32000, message: Method failed, data: message: Failed to connect to device star-data-sw: connection refused: NEDCOM CONNECT: The kexTimeout (20000 ms) expired. in new state, internal: jsonrpc_tx_commit357#
The control software should choose alternate paths to reach the peering port. The control software should skip switches in maintenance, and attempt to re-apply the configuration when the maintenance mode is lifted.
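To illustrate the behavior requested above, here is a minimal sketch of path selection that skips sites in maintenance. It is not FABRIC's control software: the `find_path` helper, the link list, and the `ALT` site are all hypothetical and chosen only for illustration.

```python
from collections import deque


def find_path(links, src, dst, in_maintenance):
    """Breadth-first search for a shortest path that avoids sites in maintenance.

    Returns a list of site names, or None if no usable path exists.
    """
    blocked = set(in_maintenance)
    if src in blocked or dst in blocked:
        return None

    # Build an undirected adjacency map from the list of links.
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        # Sorted only to make the traversal order deterministic.
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in visited and neighbor not in blocked:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical topology, NOT the real FABRIC backbone.
LINKS = [("PSC", "STAR"), ("STAR", "WASH"), ("PSC", "ALT"), ("ALT", "WASH")]

if __name__ == "__main__":
    # With STAR in maintenance, the search routes around it.
    print(find_path(LINKS, "PSC", "WASH", {"STAR"}))
```

A real implementation would also need to queue the skipped switch's configuration and re-apply it once maintenance is lifted; that part is not shown here.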
September 19, 2023 at 12:07 pm #5348
We need to look into why this happens. Thank you for reporting it.
September 20, 2023 at 7:29 am #5350
We updated the configuration so the system should use the WASH peering exclusively.
September 20, 2023 at 12:10 pm #5352
FABNetv4Ext establishment is working, but I'm seeing connectivity issues to many destinations.
ubuntu@v4gateway:~$ mtr -4bwz -c4 --tcp -P 6363 hobo.cs.arizona.edu
Start: 2023-09-20T16:06:38+0000
HOST: v4gateway                                                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS398900 23.134.233.81                                                          0.0%     4    0.5   0.5   0.5   0.5   0.0
  2. AS???    10.133.0.141                                                           0.0%     4   13.2  13.2  13.1  13.2   0.0
  3. AS11537  hundredge-0-0-0-28.1000.core1.wash.net.internet2.edu (198.71.45.162)   0.0%     4   15.4  15.1  14.5  15.7   0.5
  4. AS???    ???                                                                   100.0     4    0.0   0.0   0.0   0.0   0.0
ubuntu@v4gateway:~$ mtr -4bwz -c4 --tcp -P 5201 ash.speedtest.clouvider.net
Start: 2023-09-20T16:07:49+0000
HOST: v4gateway               Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS398900 23.134.233.81    0.0%     4    0.5   0.5   0.5   0.5   0.0
  2. AS???    10.133.0.141     0.0%     4   13.4  13.2  13.1  13.4   0.1
  3. AS???    ???             100.0     4    0.0   0.0   0.0   0.0   0.0
Maybe some routing adjustment is needed too?
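For anyone who wants to check many destinations at once, here is a rough sketch that scans an mtr report for the first hop showing total loss. The `parse_hops` and `first_total_loss` helpers are my own hypothetical names, and the parser assumes the report layout shown in this thread (hop number, host text, then seven numeric columns). Note that 100% loss at a hop only indicates a break when no later hop responds, as in the traces here.

```python
import re

# Sample taken from the first trace in this thread.
SAMPLE = """\
HOST: v4gateway Loss% Snt Last Avg Best Wrst StDev
  1. AS398900 23.134.233.81 0.0% 4 0.5 0.5 0.5 0.5 0.0
  2. AS??? 10.133.0.141 0.0% 4 13.2 13.2 13.1 13.2 0.0
  3. AS11537 hundredge-0-0-0-28.1000.core1.wash.net.internet2.edu (198.71.45.162) 0.0% 4 15.4 15.1 14.5 15.7 0.5
  4. AS??? ??? 100.0 4 0.0 0.0 0.0 0.0 0.0
"""


def parse_hops(report):
    """Return (hop_number, host_text, loss_percent) for each hop line."""
    hops = []
    for line in report.splitlines():
        fields = line.split()
        # A hop line starts with "N." and ends with seven numeric columns:
        # Loss, Snt, Last, Avg, Best, Wrst, StDev.
        if len(fields) < 9 or not re.fullmatch(r"\d+\.", fields[0]):
            continue
        hop = int(fields[0].rstrip("."))
        loss = float(fields[-7].rstrip("%"))
        host = " ".join(fields[1:-7])
        hops.append((hop, host, loss))
    return hops


def first_total_loss(hops):
    """Return the number of the first hop with 100% loss, or None."""
    for hop, _host, loss in hops:
        if loss >= 100.0:
            return hop
    return None


if __name__ == "__main__":
    print(first_total_loss(parse_hops(SAMPLE)))
```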
September 20, 2023 at 4:28 pm #5362
I was experiencing the same problem. It should be fixed now.
It was related to the damaged machines at StarLight. All of the ext services should now be moved to the WASH site.
Paul
September 22, 2023 at 3:16 pm #5417
Dear experimenters,
The majority of repairs at StarLight have been performed. The FABRIC dataplane is functioning again, and the EDC and INDI sites are being released from maintenance. We expect STAR and NCSA to come back soon as well.
September 22, 2023 at 3:28 pm #5418
NCSA is also now online.
September 22, 2023 at 10:40 pm #5429
STAR is online.
- The topic ‘STAR site power loss, connectivity losses’ is closed to new replies.