STAR site power loss, connectivity losses

This topic has 10 replies, 4 voices, and was last updated 1 year, 10 months ago by Mert Cevik.

Viewing 11 posts - 1 through 11 (of 11 total)

Author

Posts
September 17, 2023 at 10:11 pm #5328
Ilya Baldin
Participant
Dear experimenters,

Due to a pipe break in a building adjacent to the one hosting our STAR site in Chicago, electrical power has been turned off and the site is down. Also down are the links to INDI, NCSA, EDC, and MICH, as well as the Chameleon Chicago facility port. All of these go through STAR, so they are currently disconnected from the rest of the FABRIC dataplane. Due to the severity of the incident, we do not yet know the estimated time to repair. We have been told an update will be provided by noon Central time tomorrow (Monday).
September 17, 2023 at 10:19 pm #5329
Ilya Baldin
Participant
In addition, the CloudLab Wisconsin Facility Port is now also disconnected from the rest of FABRIC.

The following sites have been placed into maintenance until we know more: STAR, NCSA, INDI, and EDC.
September 19, 2023 at 10:45 am #5345
Ilya Baldin
Participant
Some of the equipment connecting the STAR site to the FABRIC backbone has sustained damage and is being replaced. We will continue updating this thread as more information becomes available.
September 19, 2023 at 11:59 am #5347
yoursunny
Participant
The STAR outage seems to be affecting the creation of FABNetv4Ext networks. The control software appears to be trying to access the STAR switch, and the connection times out. This occurs even if the node is in the WASH site, where the FABNetv4Ext peering connection exists.

Slice Exception: Slice Name: v4gateway@1695137544, Slice ID: f20f1cff-11b0-4db9-9ffb-5b265c3653b6: Slice Exception: Slice Name: v4gateway@1695137544, Slice ID: f20f1cff-11b0-4db9-9ffb-5b265c3653b6: Node: gateway, Site: PSC, State: Active,

failed lease update- all units failed priming: Exception during modify for unit: 5a8383f3-30aa-41d8-9874-46b61ebbe621 Playbook has failed tasks: NSO commit returned JSON-RPC error: type: rpc.method.failed, code: -32000, message: Method failed, data: message: Failed to connect to device star-data-sw: connection refused: NEDCOM CONNECT: The kexTimeout (20000 ms) expired. in new state, internal: jsonrpc_tx_commit357#all units failed priming: Exception during modify for unit: 5a8383f3-30aa-41d8-9874-46b61ebbe621 Playbook has failed tasks: NSO commit returned JSON-RPC error: type: rpc.method.failed, code: -32000, message: Method failed, data: message: Failed to connect to device star-data-sw: connection refused: NEDCOM CONNECT: The kexTimeout (20000 ms) expired. in new state, internal: jsonrpc_tx_commit357#

The control software should choose alternate paths to reach the peering port. The control software should skip switches in maintenance, and attempt to re-apply the configuration when the maintenance mode is lifted.
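To illustrate the behavior requested above, here is a minimal sketch of path selection that skips sites in maintenance. It is not how the FABRIC control software works. The site names come from this thread, but the `TOPOLOGY` links and the `find_path` helper are hypothetical and invented for illustration. Re-applying configuration once maintenance is lifted is not covered.

```python
from collections import deque

# Hypothetical adjacency map. Site names appear in this thread, but the
# links are invented for illustration and are NOT the real FABRIC topology.
TOPOLOGY = {
    "PSC": ["STAR", "WASH"],
    "STAR": ["PSC", "WASH", "NCSA"],
    "WASH": ["PSC", "STAR"],
    "NCSA": ["STAR"],
}


def find_path(topology, src, dst, in_maintenance):
    """Breadth-first search for a shortest path that avoids maintenance sites.

    Returns the path as a list of site names, or None if no usable path exists.
    """
    if src in in_maintenance or dst in in_maintenance:
        return None
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for neighbor in topology.get(path[-1], []):
            # Skip sites already explored and sites flagged as in maintenance.
            if neighbor not in visited and neighbor not in in_maintenance:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(find_path(TOPOLOGY, "PSC", "WASH", {"STAR"}))  # ['PSC', 'WASH']
print(find_path(TOPOLOGY, "PSC", "NCSA", {"STAR"}))  # None
```

In this toy topology, a request from PSC to the WASH peering never touches STAR, while a request that can only be served through STAR fails immediately instead of waiting for a device timeout.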
September 19, 2023 at 12:07 pm #5348
Ilya Baldin
Participant
We need to look into why this happens. Thank you for reporting it.
September 20, 2023 at 7:29 am #5350
Ilya Baldin
Participant
We updated the configuration so the system should use the WASH peering exclusively.
September 20, 2023 at 12:10 pm #5352
yoursunny
Participant
FABNetv4Ext establishment is working, but I’m seeing connectivity issues to many destinations.
```
ubuntu@v4gateway:~$ mtr -4bwz -c4 --tcp -P 6363 hobo.cs.arizona.edu
Start: 2023-09-20T16:06:38+0000
HOST: v4gateway                                                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS398900 23.134.233.81                                                          0.0%     4    0.5   0.5   0.5   0.5   0.0
  2. AS???    10.133.0.141                                                           0.0%     4   13.2  13.2  13.1  13.2   0.0
  3. AS11537  hundredge-0-0-0-28.1000.core1.wash.net.internet2.edu (198.71.45.162)   0.0%     4   15.4  15.1  14.5  15.7   0.5
  4. AS???    ???                                                                   100.0     4    0.0   0.0   0.0   0.0   0.0

ubuntu@v4gateway:~$ mtr -4bwz -c4 --tcp -P 5201 ash.speedtest.clouvider.net
Start: 2023-09-20T16:07:49+0000
HOST: v4gateway              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS398900 23.134.233.81   0.0%     4    0.5   0.5   0.5   0.5   0.0
  2. AS???    10.133.0.141    0.0%     4   13.4  13.2  13.1  13.4   0.1
  3. AS???    ???            100.0     4    0.0   0.0   0.0   0.0   0.0
```
Maybe some routing adjustment is needed too?
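For anyone who wants to check many destinations, here is a rough sketch that scans an mtr report like the ones above and returns the first hop with 100% loss, together with the last host that still answered. The `HOP_RE` pattern and the `first_blackhole` helper are hypothetical names. The pattern assumes the column layout shown above (hop, AS, host, Loss%, Snt, and five latency columns) and may need adjusting for other mtr options.

```python
import re

# Matches one hop line of a report in the layout shown in this thread:
# hop number, AS, host, Loss%, Snt, then five latency columns.
HOP_RE = re.compile(
    r"^\s*(\d+)\.\s+(\S+)\s+(.+?)\s+(\d+(?:\.\d+)?)%?\s+(\d+)"
    r"(?:\s+[\d.]+){5}\s*$"
)

# Second trace from the post above, copied verbatim.
SAMPLE = """\
HOST: v4gateway              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS398900 23.134.233.81   0.0%     4    0.5   0.5   0.5   0.5   0.0
  2. AS???    10.133.0.141    0.0%     4   13.4  13.2  13.1  13.4   0.1
  3. AS???    ???            100.0     4    0.0   0.0   0.0   0.0   0.0
"""


def first_blackhole(report):
    """Return (hop, last_responding_host) for the first hop with 100% loss.

    Returns None if no hop in the report shows total loss.
    """
    previous_host = None
    for line in report.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header lines and blank lines are ignored
        hop, host, loss = int(match[1]), match[3], float(match[4])
        if loss >= 100.0:
            return hop, previous_host
        previous_host = host
    return None


print(first_blackhole(SAMPLE))  # (3, '10.133.0.141')
```

For the second trace, this reports that traffic stops right after 10.133.0.141, which matches what the raw output shows.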
September 20, 2023 at 4:28 pm #5362
Paul Ruth
Keymaster
I was experiencing the same problem. It should be fixed now.

It was related to the damaged machines at StarLight. All of the ext services should now be moved to the WASH site.

Paul
September 22, 2023 at 3:16 pm #5417
Ilya Baldin
Participant
Dear experimenters,

The majority of repairs at StarLight have been performed. The FABRIC dataplane is functioning again, and the EDC and INDI sites are being released from maintenance. We expect STAR and NCSA to come back soon as well.
September 22, 2023 at 3:28 pm #5418
Ilya Baldin
Participant
NCSA is also now online.
September 22, 2023 at 10:40 pm #5429
Mert Cevik
Moderator
STAR is online.

The topic ‘STAR site power loss, connectivity losses’ is closed to new replies.