1. Difference between throughput after maintenance

  • #6325

I have noticed that, after the maintenance and the update of FABlib to version 1.6, the throughput between some nodes/sites appears to be hitting some kind of limit.

Below is a link to a Google Docs document showing the topology and the iperf3 tests carried out yesterday (01/11/2024).

    https://docs.google.com/document/d/1DSZDqPEBVjQ_we717I_5n0nedCiT0JqbKwAD7-K0LaQ/edit?usp=sharing

    Is there any new feature that could cause this limitation?

    #6341

    I recreated the same slice and got the following results.

    
    h1  >  r1
    [  5]   0.00-0.94   sec  50.2 MBytes   448 Mbits/sec    9             sender
    [  5]   0.00-0.98   sec  48.7 MBytes   416 Mbits/sec                  receiver
    r1  >  r2
    [  5]   0.00-10.43  sec  50.6 MBytes  40.7 Mbits/sec   44             sender
    [  5]   0.00-10.47  sec  46.8 MBytes  37.5 Mbits/sec                  receiver
    r1  >  r3
    [  5]   0.00-82.03  sec  50.1 MBytes  5.13 Mbits/sec  356             sender
    [  5]   0.00-82.07  sec  49.8 MBytes  5.09 Mbits/sec                  receiver
    r1  >  r5
    [  5]   0.00-454.26 sec  50.1 MBytes   925 Kbits/sec  3455             sender
    [  5]   0.00-454.30 sec  49.8 MBytes   919 Kbits/sec                  receiver
    r2  >  r3
    [  5]   0.00-0.37   sec  50.8 MBytes  1.17 Gbits/sec    0             sender
    [  5]   0.00-0.41   sec  50.0 MBytes  1.03 Gbits/sec                  receiver
    r3  >  r4
    [  5]   0.00-1.16   sec  50.2 MBytes   362 Mbits/sec    0             sender
    [  5]   0.00-1.21   sec  49.4 MBytes   343 Mbits/sec                  receiver
    r4  >  r5
    [  5]   0.00-0.48   sec  51.2 MBytes   892 Mbits/sec    0             sender
    [  5]   0.00-0.52   sec  50.2 MBytes   811 Mbits/sec                  receiver
    r4  >  h2
    [  5]   0.00-0.13   sec  51.0 MBytes  3.34 Gbits/sec    0             sender
    [  5]   0.00-0.17   sec  49.7 MBytes  2.52 Gbits/sec                  receiver
    
    #6345
    Paul Ruth
    Keymaster

      Edgard,

I don’t think there is anything that would limit you to bandwidth that low. All but a few sites should support 100 Gbps (a few can only provide 10 Gbps). I would expect much higher bandwidth than you are seeing; even going through multiple software routers, I would expect tens of Gbps.

      What NIC types are you using?

      What VM size are you using?

How are you forwarding traffic in your routers?

      Are you tuning the TCP/IP configuration of your nodes (congestion control algorithm, MTU, buffer sizes, etc)?

Also, are you pinning your nodes to the NIC’s NUMA domain? (See the NUMA pinning example.)

      Paul

      #6347

      Hi Paul, thanks for answering me!

      What NIC types are you using?

      I’m using NIC_ConnectX_5 NICs on this test.

      What VM size are you using?

      All nodes are default_rocky_8 with 2 cores and 8 GB RAM.

How are you forwarding traffic in your routers?

In these tests, I’m using static routes, with different routes selected by TOS. All tests are run with iperf3 (TCP).
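Roughly, the router setup looks like the sketch below, driven from the FABlib notebook with node.execute(). The interface subnets, next hops, and the TOS value are illustrative placeholders, not the exact configuration used.

# Sketch: enable forwarding and install static / TOS-based routes on one router.
# Subnets, next hops, and the TOS value are placeholders for illustration.
r1 = slice.get_node(name="r1")

commands = [
    # let the VM forward IPv4 traffic between its dataplane interfaces
    "sudo sysctl -w net.ipv4.ip_forward=1",
    # ordinary static route toward the next router
    "sudo ip route add 192.168.3.0/24 via 192.168.2.2",
    # alternate path for traffic marked with TOS 0x10 (policy routing)
    "sudo ip rule add tos 0x10 table 100",
    "sudo ip route add 192.168.3.0/24 via 192.168.4.2 table 100",
]
for cmd in commands:
    stdout, stderr = r1.execute(cmd)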

      Are you tuning the TCP/IP configuration of your nodes (congestion control algorithm, MTU, buffer sizes, etc)?

We are investigating congestion control with the Cubic and BBR algorithms; these tests use Cubic. All other settings are left at their defaults.
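Switching the algorithm between runs is done with sysctl; a minimal sketch (assuming the slice and node objects already exist):

# Sketch: select the TCP congestion control algorithm on every node.
# "cubic" is the Rocky 8 default; set algo = "bbr" for the BBR runs
# (BBR may first need: sudo modprobe tcp_bbr).
algo = "cubic"
for node in slice.get_nodes():
    node.execute(f"sudo sysctl -w net.ipv4.tcp_congestion_control={algo}")
    node.execute("sysctl net.ipv4.tcp_congestion_control")  # verify the active algorithm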

Also, are you pinning your nodes to the NIC’s NUMA domain? (See the NUMA pinning example.)

One of the main objectives of this experiment, in addition to investigating congestion control, is that all tests be reproducible, so we have not been pinning cores to NUMA domains so far.

      Apparently, in the topology presented, the main bottleneck is the node in Los Angeles (LOSA).

      #6360
      Paul Ruth
      Keymaster

One more question… were you able to replicate this before? By replicate I mean run it once in one slice, then delete that slice and run it again in a new slice.

I think the main issue here is a combination of VMs that are too small (memory/cores) to achieve good bandwidth and the fact that you are not pinning cores to NUMA domains. Without pinning you will likely not get repeatable performance. The issue is that if your VM cores are not in the same NUMA domain as your NIC, you will get worse performance. This is especially true for the router nodes. When you create a slice, your virtual cores will float between the available physical cores, and since there are other users on the host, you will not know anything about the placement of your virtual cores.

        I suggest using much larger VMs and pinning the cores to the appropriate NUMA domains.
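Roughly, that looks like the sketch below, following the FABRIC NUMA pinning example. The core/RAM/disk values are just illustrative, and the exact method names may vary between FABlib versions.

# Sketch: request a larger router VM and pin it to the NUMA domain of its NIC.
# Sizes are illustrative; pin_cpu()/numa_tune()/os_reboot() follow the FABRIC
# NUMA pinning example notebook and may differ slightly by FABlib version.
r1 = slice.add_node(name="r1", site="LOSA", cores=16, ram=64, disk=100,
                    image="default_rocky_8")
r1.add_component(model="NIC_ConnectX_5", name="nic1")
slice.submit()

r1 = slice.get_node(name="r1")
r1.pin_cpu(component_name="nic1")  # pin vCPUs to the NIC's NUMA domain
r1.numa_tune()                     # pin the VM's memory to the same domain
r1.os_reboot()                     # reboot so the pinning takes effect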

One more thing: which version of iPerf3 are you using? The iPerf3 that is available in most Linux repos is single threaded. I recommend using the new version supported by ESnet (https://github.com/esnet/iperf).
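Building it from source on a node is a standard autotools build; a sketch (the dnf package list and the ldconfig step are assumptions for a default_rocky_8 image, and `node` is any node object from your slice):

# Sketch: build and install iperf3 from the ESnet repository on one node.
build_cmds = (
    "sudo dnf install -y git gcc make autoconf automake libtool && "
    "git clone https://github.com/esnet/iperf.git && "
    "cd iperf && ./configure && make -j && sudo make install && "
    "echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/local.conf && sudo ldconfig"
)
node.execute(build_cmds)
node.execute("iperf3 --version")  # confirm the newly installed version is picked up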

        #6537

        Hi Paul! Thanks for the feedback.

Sorry for the delay in responding to you. Our group was writing a paper on “Sliced WANs for Data-Intensive Science”, and all of its experiments were carried out on FABRIC. After we submitted the paper, I found some time to go back to running other experiments.

        I split the previous topology [LINK] into two different ones.
        Topology 1: SEAT (h1), MASS (r1), SALT (r2), STAR (r3), NEWY (h2).
        Topology 2: LOSA (h1), DALL (r1), ATLA (r2), WASH (r3), NEWY (h2).

In topology 2, I obtained the following results (100 MB per TCP flow, iPerf3 v3.5.0):

h1  >  h2
[  5]   0.00-230.51 sec   100 MBytes  3.64 Mbits/sec  697             sender
[  5]   0.00-230.55 sec  99.6 MBytes  3.62 Mbits/sec                  receiver
r1  >  r2
[  5]   0.00-0.69   sec   101 MBytes  1.23 Gbits/sec    0             sender
[  5]   0.00-0.73   sec   100 MBytes  1.15 Gbits/sec                  receiver
r2  >  r3
[  5]   0.00-0.53   sec   101 MBytes  1.60 Gbits/sec    0             sender
[  5]   0.00-0.57   sec  99.5 MBytes  1.47 Gbits/sec                  receiver
r3  >  h2
[  5]   0.00-0.22   sec   100 MBytes  3.88 Gbits/sec  172             sender
[  5]   0.00-0.26   sec  99.4 MBytes  3.26 Gbits/sec                  receiver
h1  >  h2
[  5]   0.00-459.85 sec   100 MBytes  1.83 Mbits/sec  740             sender
[  5]   0.00-459.92 sec  99.5 MBytes  1.82 Mbits/sec                  receiver

        Again, results that pass through LOSA show a decrease in the transmission rate.

One question I still have is whether there is a difference between the L2STS and L2PTP L2 overlay services for throughput tests.

        #6546
        Paul Ruth
        Keymaster

          There is nothing in particular that is different about LOSA.

What jumps out at me about these results is that they are at least an order of magnitude too low. With dedicated ConnectX-5 cards you should be seeing nearly 25 Gbps. I suspect that your test case is too small: your 100 MB test probably doesn’t get out of the TCP ramp-up phase of the connection. You should try transferring several hundred GB… or better yet, run the tests for a set amount of time (at least 1 min). You should also use much larger VMs, set the MTUs to 9000, and consider adjusting your buffer sizes.
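A sketch of that kind of test setup (the interface name, buffer sizes, server address, and stream count are illustrative, and h1/h2 are assumed to be the end-host node objects):

# Sketch: jumbo frames, larger TCP buffers, and a time-based multi-stream run.
tuning_cmds = [
    "sudo ip link set dev eth1 mtu 9000",
    "sudo sysctl -w net.core.rmem_max=536870912",
    "sudo sysctl -w net.core.wmem_max=536870912",
    "sudo sysctl -w net.ipv4.tcp_rmem='4096 87380 268435456'",
    "sudo sysctl -w net.ipv4.tcp_wmem='4096 65536 268435456'",
]
for node in (h1, h2):
    for cmd in tuning_cmds:
        node.execute(cmd)

h2.execute("iperf3 -s -D")                            # server in daemon mode
h1.execute("iperf3 -c 192.168.5.2 -P 4 -t 60 -O 5")   # 60 s, 4 streams, skip 5 s of ramp-up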

Try running the example iPerf3 notebook but manually set the sites to LOSA and DALL. You should see much higher bandwidths. Then tweak that test, in small steps, toward your desired configuration and see what causes the bandwidth to drop.
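In the notebook, fixing the sites is just a matter of passing them explicitly when the nodes are added; a minimal sketch (VM sizes are illustrative, the NIC model matches the one used in this thread):

# Sketch: a two-node iPerf3 test slice with the sites fixed to LOSA and DALL.
from fabrictestbed_extensions.fablib.fablib import FablibManager

fablib = FablibManager()
slice = fablib.new_slice(name="iperf3-losa-dall")

ifaces = []
for name, site in [("h1", "LOSA"), ("h2", "DALL")]:
    node = slice.add_node(name=name, site=site, cores=8, ram=32, disk=100,
                          image="default_rocky_8")
    nic = node.add_component(model="NIC_ConnectX_5", name="nic1")
    ifaces.append(nic.get_interfaces()[0])

slice.add_l2network(name="net1", interfaces=ifaces)
slice.submit()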

          I think your tests are really testing the performance capabilities of the VMs, buffers, etc. but not the network.

Also, if you really want repeatability, you will need to use the NUMA pinning examples. Without explicitly choosing the NUMA domain for your cores, you will get random physical cores, which may result in much lower performance.

For reference, here is the output of the example iPerf3 notebook using LOSA and DALL. Note that you can get nearly 100 Gbps if you increase the VM size and pin the cores to the correct NUMA domain:

          
Connecting to host 10.137.3.2, port 5201
          [  5] local 10.133.130.2 port 56288 connected to 10.137.3.2 port 5201
          [  7] local 10.133.130.2 port 56294 connected to 10.137.3.2 port 5201
          [  9] local 10.133.130.2 port 56310 connected to 10.137.3.2 port 5201
          [ 11] local 10.133.130.2 port 56318 connected to 10.137.3.2 port 5201
          [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
          [  5]   0.00-10.01  sec  14.3 GBytes  12.3 Gbits/sec  11207   52.6 MBytes       (omitted)
          [  7]   0.00-10.01  sec  15.4 GBytes  13.2 Gbits/sec  12714   63.5 MBytes       (omitted)
          [  9]   0.00-10.01  sec  15.6 GBytes  13.4 Gbits/sec  11597   64.3 MBytes       (omitted)
          [ 11]   0.00-10.01  sec  20.3 GBytes  17.4 Gbits/sec  31095    201 MBytes       (omitted)
          [SUM]   0.00-10.01  sec  65.5 GBytes  56.2 Gbits/sec  66613             (omitted)
          - - - - - - - - - - - - - - - - - - - - - - - - -
          [  5]   0.00-10.01  sec  11.4 GBytes  9.77 Gbits/sec  2531   84.9 MBytes       
          [  7]   0.00-10.01  sec  15.7 GBytes  13.4 Gbits/sec  3213    123 MBytes       
          [  9]   0.00-10.01  sec  17.7 GBytes  15.2 Gbits/sec  3833    143 MBytes       
          [ 11]   0.00-10.01  sec  18.4 GBytes  15.8 Gbits/sec  3280    145 MBytes       
          [SUM]   0.00-10.01  sec  63.2 GBytes  54.2 Gbits/sec  12857             
          - - - - - - - - - - - - - - - - - - - - - - - - -
          [  5]  10.01-20.01  sec  11.4 GBytes  9.79 Gbits/sec    0   89.5 MBytes       
          [  7]  10.01-20.01  sec  16.4 GBytes  14.1 Gbits/sec    0    124 MBytes       
          [  9]  10.01-20.01  sec  18.7 GBytes  16.1 Gbits/sec    0    144 MBytes       
          [ 11]  10.01-20.01  sec  18.7 GBytes  16.0 Gbits/sec    0    142 MBytes       
          [SUM]  10.01-20.01  sec  65.2 GBytes  56.0 Gbits/sec    0             
          - - - - - - - - - - - - - - - - - - - - - - - - -
          [  5]  20.01-30.00  sec  11.0 GBytes  9.43 Gbits/sec  3639   86.7 MBytes       
          [  7]  20.01-30.00  sec  15.7 GBytes  13.5 Gbits/sec  5665    124 MBytes       
          [  9]  20.01-30.00  sec  17.9 GBytes  15.4 Gbits/sec  6044    139 MBytes       
          [ 11]  20.01-30.00  sec  17.6 GBytes  15.1 Gbits/sec  6159    139 MBytes       
          [SUM]  20.01-30.00  sec  62.1 GBytes  53.4 Gbits/sec  21507             
          - - - - - - - - - - - - - - - - - - - - - - - - -
          [ ID] Interval           Transfer     Bitrate         Retr
          [  5]   0.00-30.00  sec  33.8 GBytes  9.66 Gbits/sec  6170             sender
          [  5]   0.00-30.05  sec  33.6 GBytes  9.61 Gbits/sec                  receiver
          [  7]   0.00-30.00  sec  47.7 GBytes  13.7 Gbits/sec  8878             sender
          [  7]   0.00-30.05  sec  48.0 GBytes  13.7 Gbits/sec                  receiver
          [  9]   0.00-30.00  sec  54.3 GBytes  15.5 Gbits/sec  9877             sender
          [  9]   0.00-30.05  sec  54.5 GBytes  15.6 Gbits/sec                  receiver
          [ 11]   0.00-30.00  sec  54.7 GBytes  15.7 Gbits/sec  9439             sender
          [ 11]   0.00-30.05  sec  54.6 GBytes  15.6 Gbits/sec                  receiver
          [SUM]   0.00-30.00  sec   190 GBytes  54.5 Gbits/sec  34364             sender
          [SUM]   0.00-30.05  sec   191 GBytes  54.5 Gbits/sec                  receiver
          

           
