Performance Drop on ConnectX-6 After Release 1.9

Viewing 7 posts - 1 through 7 (of 7 total)
  #9029

    Hello,

    I have been using ConnectX-6 SmartNICs for my research on high-speed data transfers. Before release 1.9, I was consistently getting around 20 Gbps throughput when running iperf3 directly over SSH (not inside Docker).

    However, since upgrading to release 1.9, I am now only seeing a maximum of about 5 Gbps with the same setup and configuration.

    For context:

    • I am using the iperf3 optimized SmartNIC notebook provided by FABRIC.
    • I run my experiments outside of Docker, after the site creation, using VSCode SSH connections.
    • The only change in my workflow was moving from the previous release to release 1.9.
    • For my research, I require 20–25 Gbps bandwidth between nodes located at two separate sites.

    Any insights into possible causes and potential mitigation plans would be greatly appreciated.

    Thanks,

    Rasman

    #9032
    Komal Thareja
    Participant

      Hi Rasman,

      By default, the standard iperf3 version does not perform well with multiple streams. ESnet provides a patched version that resolves this issue and delivers significantly better performance. This fixed iperf3 is already packaged inside the container.

      If you would like to run it directly on the host, you can install it with the following steps:

      curl -L https://github.com/esnet/iperf/releases/download/3.18/iperf-3.18.tar.gz > iperf-3.18.tar.gz
      tar -zxvf iperf-3.18.tar.gz
      cd iperf-3.18
      sudo apt update
      sudo apt install build-essential
      sudo ./configure
      make
      sudo make install
      

      Additionally, please make sure that the script node_tools/host_tune.sh (included with the notebook) has been executed on the relevant nodes.

      If you continue to see lower bandwidth, kindly share your slice ID so I can take a closer look.

      Thanks,
      Komal

      #9039

      Slice ID: 25c5b6c2-f0f8-4cc9-b4e1-cad570231aca

      One thing I forgot to mention: execution often gets stuck at the slice submission cell. Post-boot config of one node usually completes, but the other hangs — the FIU node never prints the ‘Done!’ message:

      Time to StableOK 246 seconds
      Running post_boot_config ... 
      Running post boot config threads ...
      Post boot config Node-GATECH, Done! (16 sec)
      

      Here’s the code:

      sites = ['GATECH', 'FIU']
      print(f"Sites: {sites}")
      
      node1_name = 'Node1'
      node2_name = 'Node2'
      cores=8
      ram=64
      disk=1000
      image='default_ubuntu_20'
      
      slice_name = 'iPerf3-tuned-nic-x6-64gb-1tb-GF-2'
      nic_name = 'nic1'
      model_name = 'NIC_ConnectX_6'
      network_name='net1'
      from ipaddress import ip_address, IPv4Address, IPv6Address, IPv4Network, IPv6Network
      
      subnet = IPv4Network("192.168.1.0/24")
      available_ips = list(subnet)[1:]
      
      #Create Slice
      slice = fablib.new_slice(name=slice_name)
      net1 = slice.add_l2network(name=network_name, subnet=subnet)
      
      for s in sites:
          # Create a node at each site
          node1 = slice.add_node(name=f"Node-{s}", cores=cores, ram=ram, disk=disk, site=s, image=image)
      
          iface1 = node1.add_component(model=model_name, name=nic_name).get_interfaces()[0]
          node1.add_component(model='NVME_P4510', name='nvme1')
          iface1.set_mode('auto')
          net1.add_interface(iface1)
          net1.set_bandwidth(50)
      
          node1.add_post_boot_upload_directory('node_tools','.')
          node1.add_post_boot_execute('sudo node_tools/host_tune.sh')
          # node1.add_post_boot_execute('node_tools/enable_docker.sh {{ _self_.image }} ')
          # node1.add_post_boot_execute('docker pull fabrictestbed/slice-vm-ubuntu20-network-tools:0.0.1 ')
      
      #Submit Slice Request
      slice.submit();
      
      
      

      I have to stop the execution and move on to the next cell. I’ll report back what I get after running the ESnet iperf3. Let me know if you need anything else to investigate this issue.
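      One small aside on the snippet above (unrelated to the hang): `list(subnet)[1:]` drops the network address but still includes the broadcast address, so the last entry of `available_ips` (192.168.1.255) is not assignable. `IPv4Network.hosts()` yields only usable host addresses:

      ```python
      from ipaddress import IPv4Network

      subnet = IPv4Network("192.168.1.0/24")

      # list(subnet)[1:] skips .0 but keeps the broadcast address .255
      available_ips = list(subnet)[1:]
      print(available_ips[-1])              # 192.168.1.255 -- not assignable

      # hosts() yields only the usable addresses .1 through .254
      usable_ips = list(subnet.hosts())
      print(usable_ips[0], usable_ips[-1])  # 192.168.1.1 192.168.1.254
      ```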

      #9041

      Your provided snippet fails at the ‘make’ step on both nodes:

      ubuntu@Node-GATECH:~/iperf-3.18$ make
      Making all in src
      make[1]: Entering directory '/home/ubuntu/iperf-3.18/src'
      make all-am
      make[2]: Entering directory '/home/ubuntu/iperf-3.18/src'
      CC iperf3-main.o
      main.c:212:1: fatal error: opening dependency file .deps/iperf3-main.Tpo: Permission denied
      212 | }
      | ^
      compilation terminated.
      make[2]: *** [Makefile:974: iperf3-main.o] Error 1
      make[2]: Leaving directory '/home/ubuntu/iperf-3.18/src'
      make[1]: *** [Makefile:733: all] Error 2
      make[1]: Leaving directory '/home/ubuntu/iperf-3.18/src'
      make: *** [Makefile:404: all-recursive] Error 1
      #9042
      Komal Thareja
      Participant

        Hi Rasman,

        I forgot to mention that the steps for installing iperf3 should be run as the root user. On your VM, I did the following:

        sudo su -
        
        curl -L https://github.com/esnet/iperf/releases/download/3.18/iperf-3.18.tar.gz > iperf-3.18.tar.gz
        tar -zxvf iperf-3.18.tar.gz
        cd iperf-3.18
        
        sudo apt update
        sudo apt install build-essential
        
        sudo ./configure; make; make install
        sudo ldconfig
        

        I also applied the following host tuning (node_tools/host_tune.sh) on both VMs:

        #!/bin/bash
        
        # Linux host tuning from https://fasterdata.es.net/host-tuning/linux/
        cat >> /etc/sysctl.conf <<EOL
        # allow testing with buffers up to 512MB
        net.core.rmem_max = 536870912
        net.core.wmem_max = 536870912
        # increase Linux autotuning TCP buffer limit to 512MB
        net.ipv4.tcp_rmem = 4096 87380 536870912
        net.ipv4.tcp_wmem = 4096 65536 536870912
        # recommended default congestion control is htcp or bbr
        net.ipv4.tcp_congestion_control = bbr
        # recommended for hosts with jumbo frames enabled
        net.ipv4.tcp_mtu_probing = 1
        # recommended to enable 'fair queueing'
        net.core.default_qdisc = fq
        #net.core.default_qdisc = fq_codel
        EOL
        
        sysctl --system
        
        # Turn on jumbo frames
        for dev in $(basename -a /sys/class/net/*); do
            ip link set dev $dev mtu 9000
        done
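
        After the script runs, you can confirm the settings took effect by reading them back from /proc/sys — a minimal check, assuming a Linux host (the values to expect are the ones written into /etc/sysctl.conf above):

        ```python
        # Read back tuned kernel parameters via /proc/sys (Linux only).
        # Each sysctl key maps to a file path: dots become slashes.
        def read_sysctl(key: str) -> str:
            path = "/proc/sys/" + key.replace(".", "/")
            with open(path) as f:
                return f.read().strip()

        for key in [
            "net.core.rmem_max",
            "net.core.wmem_max",
            "net.ipv4.tcp_congestion_control",
            "net.core.default_qdisc",
        ]:
            print(f"{key} = {read_sysctl(key)}")
        ```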
        

        With these changes, I’m now seeing bandwidth close to 10G (see snapshot below).

        [Screenshot: iperf3 results, 2025-09-24]

        According to fablib.list_links(), links from GATECH are capped at 8G. I’d suggest trying a different site instead of GATECH.

        [Screenshot: fablib.list_links() output, 2025-09-24]

        Regarding the slice getting stuck at submit: your keys may have expired. Please try running the notebook jupyter-examples-rel1.9.0/configure_and_validate/configure_and_validate.ipynb, which should automatically renew your keys if needed.

        If it still hangs at submit, please check /tmp/fablib/fablib.log for errors and share here.

        Best,

        Komal

        #9046

        I have booked a slice (d6065a22-c893-425f-b12f-3bc0fe4d2481) with NEWY and CERN nodes, whose link is listed as 320 Gbps. This time it did not get stuck; everything went smoothly. However, I am still only getting around 3 Gbps.

        Could you please have a look?
        I also saw another 8 Gbps link listed for (NEWY, CERN). Can you guide me on how to pick sites so that I get the fastest network speed?

        ('NEWY', 'CERN') link:local-port+cern-data-sw:FourHundredGigE0/0/0/26.3733:remote-port+newy-data-sw:FourHundredGigE0/0/0/60.3733 320 N/A L2
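
        As a sketch of one way to pick sites: extract the capacity column from the fablib.list_links() output and prefer the highest-capacity pair. The (site_a, site_b, gbps) tuples below are illustrative placeholders, not real measurements:

        ```python
        # Illustrative link data: (site_a, site_b, capacity_gbps).
        # Real values would come from fablib.list_links(); these are made up.
        links = [
            ("NEWY", "CERN", 320),
            ("NEWY", "CERN", 8),    # a second, slower parallel link
            ("GATECH", "FIU", 8),
        ]

        # Keep the best capacity seen for each unordered site pair,
        # then sort pairs by capacity, fastest first.
        best = {}
        for a, b, cap in links:
            pair = tuple(sorted((a, b)))
            best[pair] = max(best.get(pair, 0), cap)

        for pair, cap in sorted(best.items(), key=lambda kv: -kv[1]):
            print(pair, cap, "Gbps")  # fastest pair printed first
        ```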


        #9047

        It worked after manually doing the steps you described. Thanks.
