Nishanth Shyamkumar

Forum Replies Created

Viewing 15 posts - 1 through 15 (of 25 total)
  • in reply to: FPGA valid sites for Esnet toolchain #8501
    Nishanth Shyamkumar
    Participant

      Please find attached the logs of the latest error from running it on LOSA.

      in reply to: FPGA valid sites for Esnet toolchain #8500
      Nishanth Shyamkumar
      Participant

        Hi Komal,

        Thanks for the information. I understand what you are saying, but I would like some clarification on it:
        1) When a user acquires the FPGA and flashes their own binary onto it, am I right in understanding that no other user can flash binaries onto that FPGA as long as my slice is active? So acquiring the FPGA via the slice is in effect a lock?
        2) The toolchain will stay consistent as mentioned in the attachment. That is, if I have a bitfile generated using the Esnet toolchain and flash it on a site that says it has Esnet support, then, assuming the bitfile is not corrupted, the flash should succeed. Similarly, if I try to flash my Esnet-generated bitfile onto a site supporting the NEU or XDMA toolchains, it should always fail, correct?

        Right now, what I see is my bitfiles work on TACC. So they are valid bitfiles. However, when I try to run it on other sites that say they support the Esnet toolchain, the same bitfiles are not flashed correctly, and the health checks fail.

        in reply to: FPGA valid sites for Esnet toolchain #8495
        Nishanth Shyamkumar
        Participant

          Hi Komal,

          Thanks for sharing the latest list of sites supported by the Esnet toolchain.

          You mentioned the following,

          “Kindly note that users have the ability to flash their own binaries, so the actual state of the infrastructure may differ from what is captured in the attached sheet”

          I didn’t understand what you meant. Could you elaborate further?

          in reply to: Unable to connect to http://linux.mirrors.es.net/ubuntu #8494
          Nishanth Shyamkumar
          Participant

            Yes, it’s working now for me as well.

            It seems to have been a transient issue. Thanks.

            in reply to: Tofino bf_switchd process gets killed. #8463
            Nishanth Shyamkumar
            Participant

              @yoursunny, Yes, the SIGHUP is sent when the user closes the terminal.
              I was confused because I wasn’t doing anything myself, so I couldn’t see how it was being generated. Now it makes sense: the node.execute_thread for this specific interactive mode uses an SSHClientInteraction, which terminates if it doesn’t see the prompt before the timeout.

              @Komal, I think just informing the user that bf_switchd will exit after 300 seconds (or whatever the timeout is set to), by adding extra information to the comment

              # Keep the session open to prevent exit

              should be enough guidance for us to increase the timeout as required.

              in reply to: bastion key fails authentication #8252
              Nishanth Shyamkumar
              Participant

                Hi,

                I regenerated a fresh new keypair and it works now. Thanks.

                in reply to: Infrastructure-metrics queries #7888
                Nishanth Shyamkumar
                Participant

                  Hi,

                  A follow-up question on this:
                  1) Does this mean that the HC counter always holds the correct value?
                  2) What happens to a non-HC counter when it exceeds 32 bits? Does it get pinned at 2^32 – 1, or does it overflow so that we see the remainder (true value % (2^32)) in this field?
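
To make the two possibilities in question 2 concrete, here is a minimal Python sketch. It does not claim which behavior the metrics actually follow; the function names are hypothetical and exist only for illustration. The last helper shows how a difference between two samples could be recovered if the counter does wrap, assuming at most one rollover between samples.

```python
MODULUS = 2 ** 32  # a 32-bit counter holds values 0 .. 2**32 - 1


def saturating_counter(true_value: int) -> int:
    """Possibility 1: the counter sticks at its maximum value."""
    return min(true_value, MODULUS - 1)


def wrapping_counter(true_value: int) -> int:
    """Possibility 2: the counter rolls over and reports the remainder."""
    return true_value % MODULUS


def wrapped_delta(previous: int, current: int) -> int:
    """Difference between two samples of a wrapping counter.

    Assumes the counter rolled over at most once between the samples.
    """
    return (current - previous) % MODULUS


true_value = 2 ** 32 + 5
print(saturating_counter(true_value))  # 4294967295
print(wrapping_counter(true_value))    # 5
```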

                  in reply to: How to use long-lived tokens in experiments #7283
                  Nishanth Shyamkumar
                  Participant

                    Thanks Komal, I tested it and it is working without any issues after the update.

                    in reply to: No candidate nodes found error #7244
                    Nishanth Shyamkumar
                    Participant

                      Thanks for the info. Is there a way to get the maintenance status of a site through an API, or must the user keep track of it through forum updates?

                      in reply to: Slice resubmit fails with already configured error. #7243
                      Nishanth Shyamkumar
                      Participant

                        Hi Komal,

                        Here is a code snippet. It’s a bit complex since there are some design mechanisms at play, but the essential part is this:
                        A while loop attempts to set up the slice and request port mirror resources by invoking setup_slice(). If that fails, the failed slice is deleted, and on the next attempt the number of VMs requested is reduced and the slice creation is requested again.

                        import random
                        import time


                        def setup_slice(port_reduce_count):
                            # …
                            # A block of code that checks for available SmartNICs and assigns a VM for each.
                            # Splits the total available switch ports on one site into N groups, where N is the number of VMs.
                            # Specifies other resources such as CPU, RAM, etc.
                            # …
                            pmnet = {}
                            num_pmservices = {}     # Track mirrored port count per VM
                            listener_pmservice_name = {}
                            ports_mirrored = {}     # Track mirrored port count per site
                            random.seed(None, 2)
                            for listener_site in listener_sites:
                                pmnet[listener_site] = []
                                # Keep track of ports mirrored on each site, within the port list
                                ports_mirrored[listener_site] = 0
                                j = 0
                                max_active_ports = port_count[listener_site]
                                for listener_node in listener_nodes[listener_site]:
                                    k = 0
                                    listener_interface_idx = 0
                                    listener_pmservice_name[listener_node] = []
                                    node_name = listener_node_name[listener_site][j]
                                    # Each node (VM) monitors an assigned fraction of the total available ports.
                                    avail_port_node_maxcnt = len(mod_port_list[listener_site][node_name])
                                    for listener_interface in listener_interfaces[node_name]:
                                        # print(f'listener_interface = {listener_interface}')
                                        if listener_interface_idx % 2 == 0:
                                            # First listener interface of the NIC randomizes within the first half
                                            random_index = random.randint(0, int(avail_port_node_maxcnt / 2 - 1))
                                        else:
                                            # Second listener interface randomizes within the second half
                                            random_index = random.randint(int(avail_port_node_maxcnt / 2), avail_port_node_maxcnt - 1)
                                        listener_interface_idx += 1
                                        if ports_mirrored[listener_site] < max_active_ports:
                                            mirror_port = mod_port_list[listener_site][node_name][random_index]
                                            listener_pmservice_name[listener_node].append(
                                                f'{listener_site}_{node_name}_pmservice{ports_mirrored[listener_site]}')
                                            pmnet[listener_site].append(pmslice.add_port_mirror_service(
                                                name=listener_pmservice_name[listener_node][k],
                                                mirror_interface_name=mirror_port,
                                                receive_interface=listener_interface,
                                                mirror_direction=listener_direction[listener_site]))
                                            # The with block closes the log file, so no explicit close() is needed
                                            with open(startup_log_file, "a") as slog:
                                                slog.write(f"{listener_site}# mirror interface name: {mirror_port} mirrored to {listener_interface}\n")
                                            ports_mirrored[listener_site] += 1
                                            k += 1
                                        else:
                                            with open(startup_log_file, "a") as slog:
                                                slog.write("No more ports available for mirroring\n")
                                            break
                                    j += 1
                                    num_pmservices[listener_node] = k


                        # Submit slice request, retrying until one submission succeeds
                        port_reduce_count = 0
                        retry = 0
                        while retry != 1:
                            try:
                                setup_slice(port_reduce_count)
                                pmslice.submit(progress=True, wait_timeout=2400, wait_interval=120)
                                if pmslice.get_state() == "StableError":
                                    raise Exception("Slice state is StableError")
                                retry = 1
                            except Exception as e:
                                # Delete the failed slice before the next attempt
                                if pmslice.get_state() == "StableError":
                                    fablib.delete_slice(listener_slice_name)
                                else:
                                    pmslice.delete()
                                time.sleep(120)
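
Stripped of the FABRIC-specific calls, the retry logic described above roughly amounts to the sketch below. It is only an illustration: `submit_with_reduction`, `try_submit`, and `fake_submit` are hypothetical names, and reducing the request by exactly one VM per failed attempt is an assumption, not necessarily what the real program does.

```python
import time


def submit_with_reduction(try_submit, initial_vms, min_vms=1, wait_seconds=0.0):
    """Retry try_submit with one fewer VM after each failure.

    try_submit(vm_count) is a hypothetical callable that returns True on
    success and returns False or raises on failure.
    Returns the VM count that succeeded, or None if every attempt failed.
    """
    vm_count = initial_vms
    while vm_count >= min_vms:
        try:
            if try_submit(vm_count):
                return vm_count
        except Exception as exc:
            print(f"attempt with {vm_count} VMs failed: {exc}")
        # In the real program the failed slice would be deleted here.
        time.sleep(wait_seconds)
        vm_count -= 1  # assumption: shrink the request by one VM per retry
    return None


def fake_submit(vm_count):
    # Stand-in for slice submission: pretend the site can host only 2 VMs.
    return vm_count <= 2


print(submit_with_reduction(fake_submit, initial_vms=4))
```

With the stand-in above, the attempts with 4 and 3 VMs fail and the attempt with 2 succeeds.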


                        in reply to: How to use long-lived tokens in experiments #7207
                        Nishanth Shyamkumar
                        Participant

                          Hi Komal,

                          I tried this and it still does not work. Here are the fabric packages in my environment:

                          [code]

                          pip list | grep fab
                          fabric-credmgr-client 1.6.1
                          fabric_fim 1.6.1
                          fabric_fss_utils 1.5.1
                          fabric-orchestrator-client 1.6.1
                          fabrictestbed 1.6.9
                          fabrictestbed-extensions 1.6.5

                          [/code]

                          fabrictestbed is at 1.6.9, yet slice_manager.py, and specifically __load_tokens, still has the refresh token Exception check.
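
As a cross-check on the pip output, the versions can also be read from inside the same Python interpreter that runs the experiment, which rules out pip and the program pointing at different environments. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("fabrictestbed", "fabrictestbed-extensions"):
    print(name, installed_version(name))
```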

                          in reply to: How to use long-lived tokens in experiments #7125
                          Nishanth Shyamkumar
                          Participant

                            Hi Komal,

                            Looking at the source code, the required change in slice_manager.py is not present on the main branch. It is available in these other branches: adv-res, llt and 1.7.
                            Should I use one of these branches to use the long-lived tokens?
                            Essentially:
                            pip install git+https://github.com/fabric-testbed/fabrictestbed@1.7

                            in reply to: How to use long-lived tokens in experiments #7093
                            Nishanth Shyamkumar
                            Participant

                              Hi Komal,

                              I am using fablib from within a Python program. Can you let me know which branch of fabrictestbed-extensions I should use to get this change? Is it the main branch, or branch 1.7?

                              pip install git+https://github.com/fabric-testbed/fabrictestbed-extensions@main

                              in reply to: TACC always failing with insufficient resources:Disk# #7059
                              Nishanth Shyamkumar
                              Participant

                                Thanks, so it does indeed stand for disk space.

                                When I look at the graphical stats on the Fabric Portal, it says that TACC has 103263/107463 GB free (it may not be the latest info, but I don’t think it varies by much). How can I ask Fabric to place my VM on an underlying server that has enough hard disk space?

                                in reply to: Multi-day FABRIC maintenance (January 1-5, 2024) #6223
                                Nishanth Shyamkumar
                                Participant

                                  “These 4 sites will be placed in pre-maintenance mode several days in advance so that no new experiments can be created after the indicated date. We apologize for any inconvenience this may cause.”

                                  What is the indicated date mentioned here? Is it the date on which these sites go into pre-maintenance, or is it Jan 1st? In other words, can I create new slivers on these sites until Jan 1st?
