Komal Thareja

Forum Replies Created

Viewing 15 posts - 16 through 30 (of 515 total)
  • Komal Thareja
    Participant

      Hi Danilo,

      I found that the authorized_keys file on both NS1 and NS5 was empty, which is why SSH access, whether through the admin key or the Control Framework, was failing and causing the POA addKey operation to fail. It seems this may have happened unintentionally as part of the experiment.

      I’ve manually restored SSH access so the Control Framework should now function properly, including POA. Could you please try adding your keys to these VMs again using POA? That should re-establish your SSH access.
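
      If it helps, below is a minimal sketch of how this can be done from fablib. It assumes the add_public_key helper (which issues the POA addKey) and a sliver key named "my_sliver_key" already registered with your FABRIC account; both names are illustrative.

      from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager
      
      fablib = fablib_manager()
      slice = fablib.get_slice(name="MySlice")  # your slice name here
      
      # Re-push a registered sliver key to each VM; POA addKey adds the key to
      # authorized_keys on the node without you editing the file by hand.
      for node in slice.get_nodes():
          node.add_public_key(sshkey_name="my_sliver_key")  # illustrative key name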

      Please be careful not to remove or overwrite the authorized_keys file in the process.

      Best,

      Komal

      in reply to: Bluefield DPL pull failing due to timeout #9170
      Komal Thareja
      Participant

        I tried running docker pull manually on DALL and SEAT, and it worked fine on both. The artifact also ran successfully on SEAT after the changes listed below. The issue appears to be related to the Docker installation via docker.io.

        I have also passed this to the artifact author so they can make the required updates.

        I made the following changes to get the artifact working:

        • Changed the image to docker_ubuntu_24.
        • Updated Step 34 to remove docker.io from the installation commands.
        # Install build dependencies on both nodes; docker.io is intentionally
        # left out here since the docker_ubuntu_24 image already ships with Docker.
        stdout, stderr = node1.execute('sudo apt-get update', quiet=True)
        stdout, stderr = node1.execute('sudo apt-get install -y build-essential python3-pip net-tools', quiet=True)
        stdout, stderr = node2.execute('sudo apt-get update', quiet=True)
        stdout, stderr = node2.execute('sudo apt-get install -y build-essential python3-pip net-tools', quiet=True)
        
        # Build tooling for the artifact on node1 and scapy on node2.
        stdout, stderr = node1.execute('sudo pip3 install meson ninja', quiet=True)
        stdout, stderr = node2.execute('sudo apt install -y python3-scapy', quiet=True)
        

        Best,
        Komal

        in reply to: Bluefield DPL pull failing due to timeout #9169
        Komal Thareja
        Participant

          Hi Nishanth,

          I tried on UTAH, MICH, MASS and docker pull seems to work.

          Could you please try nslookup nvcr.io and then try the docker pull command?
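
          For reference, a quick way to run that check from fablib (a minimal sketch; "node" stands for whichever VM in your slice is failing the pull):

          # Check that the VM can resolve the NVIDIA registry, then retry the
          # docker pull command from the artifact on the same node.
          stdout, stderr = node.execute('nslookup nvcr.io', quiet=False)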

          I will also check with Mert/Hussam to see if we have any known issues on SEAT and DALL.

          Best,

          Komal

          in reply to: Bluefield DPL pull failing due to timeout #9165
          Komal Thareja
          Participant

            Hi Nishanth,

            Could you please share which Site is your slice running at?

            Best,

            Komal

            in reply to: Establish communication between FPGA to GPU via PCIe #9154
            Komal Thareja
            Participant

              Hi Paresh,

              Currently, FABRIC allows users to create VMs where GPUs or FPGAs can be attached via PCI passthrough. However, direct communication between FPGA and GPU over PCIe (such as peer-to-peer DMA or RDMA transfers) is not supported.

              This is because for true PCIe peer-to-peer access, both devices need to be physically located on the same host and share the same PCIe root complex or switch. At present, none of the FABRIC nodes have both a GPU and an FPGA installed on the same host.

              If you’d like to double-check inventory yourself, you can list host capabilities with fablib:

              from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager
              
              fields = [
                  'name',
                  'fpga_sn1022_capacity', 'fpga_u280_capacity',
                  'rtx6000_capacity', 'tesla_t4_capacity', 'a30_capacity', 'a40_capacity'
              ]
              
              fablib = fablib_manager()
              output_table = fablib.list_hosts(fields=fields)
              

              You’ll see per-host capacities for each device type. It will show that hosts with FPGA capacity don’t also list GPU capacity (and vice versa), confirming that GPU+FPGA co-location isn’t available.

              Best regards,
              Komal

              Komal Thareja
              Participant

                Hi Fatih,

                You should be able to create multiple tunnels on the same NIC by using VLAN-tagged sub-interfaces. Each sub-interface can be assigned to a different L2PTP tunnel, allowing multiple distinct connections over the same physical port.

                Please check out the example notebook fabric_examples/fablib_api/sub_interfaces/sub_interfaces.ipynb for details on how to configure sub-interfaces.
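
                As a rough sketch of the idea (the add_sub_interface call, names, and VLAN IDs below are assumptions on my part; please follow the notebook for the exact API):

                # Carve two VLAN-tagged sub-interfaces out of one physical NIC port,
                # then attach each one to its own L2PTP network. remote_iface1 and
                # remote_iface2 stand for the far-end interfaces.
                nic = node.add_component(model='NIC_ConnectX_6', name='nic1')
                port = nic.get_interfaces()[0]
                
                sub1 = port.add_sub_interface('sub1', vlan='100')
                sub2 = port.add_sub_interface('sub2', vlan='200')
                
                slice.add_l2network(name='tunnel1', interfaces=[sub1, remote_iface1], type='L2PTP')
                slice.add_l2network(name='tunnel2', interfaces=[sub2, remote_iface2], type='L2PTP')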

                Best regards,
                Komal

                Komal Thareja
                Participant

                  Hi Geoff,

                  This appears to be a bug in fablib. As a workaround, could you please modify the call as follows?

                  client_interface = client_node.get_interface(network_name="client-net", refresh=True)
                  

                  This change should prevent the error from occurring. I’ll work on fixing this issue in fablib.

                  Best,
                  Komal

                  in reply to: Availability of DPU-powered SmartNICs #9145
                  Komal Thareja
                  Participant

                    Hi Tanay,

                    BlueField-3 nodes are now available on FABRIC, and we currently offer two variants:

                    • ConnectX-7-100 – 100 G
                    • ConnectX-7-400 – 400 G

                    To provision and use them, your project lead will need to request access through the Portal under Experiment → Project → Request Permissions.

                    Best,
                    Komal

                    Komal Thareja
                    Participant

                      Hi Geoff,

                      Just to confirm my understanding — your slice is in StableOK state, and the nodes display IP addresses as shown in your screenshot, but node.execute is failing with a “no management IP” error. Is that correct?

                      Could you please share your Slice ID here?

                      Thanks,
                      Komal

                      in reply to: pin_cpu & poa(operation=”cpupin”) #9131
                      Komal Thareja
                      Participant

                        Thank you, @yoursunny, for sharing these observations and the detailed steps to reproduce them. This appears to be a bug. I’ll work on addressing it and will update you once the patch is deployed.

                        Best,
                        Komal
                        in reply to: Slice Creation time / Configuring #9124
                        Komal Thareja
                        Participant

                          Hi Jiri,

                          We’ve been investigating two issues related to your recent observation:

                          1. Slice reaches StableOK, but management IPs don’t appear – This behavior seems to be caused by performance degradation in our backend graph database. We’re actively working to address and mitigate this issue.
                          2. Slice stuck in “doing post Boot Config” – This issue was traced to one of the bastion hosts. A fix for this has been applied earlier today.

                          If your slice is still active, could you please share the Slice ID where you observed this behavior? Additionally, if you encounter this issue again, it would be very helpful if you could send us the log file located at /tmp/fablib/fablib.log. This information will help us investigate and debug the issue more effectively.

                          Best regards,

                          Komal

                          in reply to: Clarification on “Host” Meaning in FABRIC Testbed #9100
                          Komal Thareja
                          Participant

                            Hi Fatih,

                            You are absolutely correct — in the FABRIC testbed, the term “host” refers to a single physical machine, not a group of blades or multiple servers.

                            Regarding your question about the core count:
                             The host you mentioned (for example, seat-w2) reports 128 CPUs because the physical server has two AMD EPYC processors, each with 32 physical cores, and hyperthreading is enabled. This means each physical core presents two logical CPUs (threads) to the operating system.

                            So, the breakdown is:

                            • 2 sockets × 32 physical cores per socket = 64 physical cores
                            • With hyperthreading (2 threads per core): 64 × 2 = 128 logical CPUs

                            Inside a VM, you’ll typically see the processor model name (e.g., AMD EPYC 7543 32-Core Processor), which corresponds to the physical CPU model installed in the host. The number of vCPUs visible in the VM depends on the resources allocated to it by the hypervisor, not the total physical core count of the host.
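
                             For example, you can confirm this from inside one of your VMs with fablib (a minimal sketch; "node" is your fablib node handle):

                             # lscpu inside the VM shows the host's CPU model (e.g., AMD EPYC 7543)
                             # together with the number of vCPUs allocated to this VM, not the
                             # host's 128 logical CPUs.
                             stdout, stderr = node.execute('lscpu', quiet=False)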

                            In summary:

                            • Host = one physical machine
                             • 128 logical CPUs = 64 physical cores × 2 threads (hyperthreading)
                            • VM CPU info = underlying processor model, showing only allocated vCPUs

                            You can find more details about the hardware configurations for a FABRIC site here:
                            https://learn.fabric-testbed.net/knowledge-base/fabric-site-hardware-configurations/

                            Best regards,
                            Komal

                            in reply to: Clarification on “Host” Meaning in FABRIC Testbed #9098
                            Komal Thareja
                            Participant

                              Hi Fatih,

                              When requesting a VM in a slice, specifying the host parameter (for example, seat-w2.fabric-testbed.net) ensures that the VM is provisioned on that particular physical host. If multiple VMs in the same slice specify the same host (e.g., seat-w2.fabric-testbed.net), they will all be co-located on that same physical machine.

                              This can be done as follows:

                              slice.add_node(name="node1", host="seat-w2.fabric-testbed.net", ...)
                              slice.add_node(name="node2", host="seat-w2.fabric-testbed.net", ...)
                              

                              If the host parameter is not specified, the FABRIC Orchestrator automatically places the VMs across available hosts based on resource availability, which may result in them being distributed across different physical machines.

                              If the requested host cannot accommodate the VMs (due to limited capacity or resource constraints), the system will return an “Insufficient resources” error.

                              Best regards,
                              Komal

                              Komal Thareja
                              Participant

                                Hello Yuanhao,

                                Thank you for the detailed description of your experiment and topology — that’s very helpful.

                                For your described setup with multiple independent, point-to-point internal links between your own nodes across sites, using L2STS is perfectly valid and appropriate. L2STS provides private Layer-2 connectivity between nodes within your slice and is generally recommended when you want internal point-to-point links without stitching to external networks.

                                L2PTP, on the other hand, is typically used when:

                                • You are connecting Dedicated NICs,
                                • You need guaranteed QoS (bandwidth reservation), or
                                • You want to explicitly control the path (e.g., define intermediate hop sites).

                                Given that, your current L2STS-based slice (l25gclplus-yuanhao5) is indeed a good configuration for your 5G experiment.
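
                                 For reference, a minimal sketch of one such internal link with fablib (the slice object, node names, sites, and network name are illustrative):

                                 # One private point-to-point link between two of your own nodes
                                 # at different sites, carried by an L2STS service.
                                 node_a = slice.add_node(name='core', site='UTAH')
                                 node_b = slice.add_node(name='ran', site='MICH')
                                 
                                 iface_a = node_a.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
                                 iface_b = node_b.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
                                 
                                 slice.add_l2network(name='link-core-ran', interfaces=[iface_a, iface_b], type='L2STS')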

                                 Regarding the L2PTP error, you’re correct: the message indicates that the interface must be tagged with a VLAN ID. There appears to be a bug in the portal’s Slice Builder that currently prevents specifying VLAN tags for L2PTP connections. Thank you for helping us identify it; we will work on a fix.

                                In the meantime, you can continue using L2STS, which should meet your needs for internal connectivity.

                                For additional context and examples, please refer to our Network Services documentation:
                                https://learn.fabric-testbed.net/knowledge-base/network-services-in-fabric/

                                Also, I’d recommend checking out the example notebooks available under the JupyterHub tab on the FABRIC Portal — they provide working examples of various network configurations, including multi-site and point-to-point topologies.

                                Best regards,
                                Komal Thareja

                                Komal Thareja
                                Participant

                                  It prints the interfaces after the slice is completely up.

                                  Could you please share a snippet of code or screenshot where you are observing this?

                                  Best,

                                  Komal
