Multiple problems in FABRIC [Partially Resolved]


  • #4249
    Ilya Baldin
    Participant

      Dear experimenters,

      We appear to be experiencing multiple simultaneous problems with the testbed:

      – There is a network provider hardware failure at the WASH site (site has been placed into maintenance until new equipment is installed) (RESOLVED)

      – SRI management is not working reliably (site has been placed into maintenance)

      – GATech worker 3 and STAR worker 6 have malfunctioned and have also been placed into maintenance (STAR worker 6: resolved)

      – In addition, we are experiencing problems with Kafka message delivery across all sites (we are investigating) (resolved)

      For the moment we will place the testbed into maintenance mode until we know more. Existing slices should not be affected, unless there is a management network problem at the site. We will post updates on this thread as we know more.

      • This topic was modified 1 year, 6 months ago by Ilya Baldin.
      • This topic was modified 1 year, 6 months ago by Hussam Nasir.
      #4251
      Ilya Baldin
      Participant

        Dear experimenters,

        We believe we have gotten to the bottom of the Kafka issues, and the testbed is reopening. SRI and WASH will both remain in maintenance, as will the workers at GATech (#3) and STAR (#6). Please also note that a number of workers across multiple sites have planned outages for FPGA installation (these were mentioned in previous announcements over the past few days). As the work is completed, those workers will be taken out of maintenance.

        #4266
        Hussam Nasir
        Moderator

          WASH is back online

          star-w6 issue was resolved

          The SRI MTU issue is being worked on, but this rack is not in production yet.

          #4315
          Mert Cevik
          Moderator

            As an update on this:

            – SRI did not complete acceptance tests (site has been placed into maintenance) and should not be used for experiments.
            – GATECH worker-2 (GPU worker) has a hardware fault and has been placed into maintenance. Other GATECH resources are available for experiments.

            #4465
            Mert Cevik
            Moderator

              Update:

              • The GATECH worker-2 (GPU worker) problem is resolved.
              • Status of SRI will be posted separately. It will remain in maintenance until after the general maintenance next week.