Anyone get lucky with ~100 Gbps bandwidth?
August 11, 2022 at 9:10 am #2623
Hi,
I have run a lot of tests on several nodes, and I still have no luck getting ~100 Gbps. Basically, I first reserve two nodes close to each other (UTAH and SALT), each with a ConnectX_6 NIC. Then I do TCP tuning following the guidance from ESnet (https://fasterdata.es.net/host-tuning/linux/), run 32 parallel streams, set a jumbo MTU of 9000, and cap the maximum flow rate to avoid bursts of packets. Across all of these tests, I can get at most 56 Gbps.
I don’t know what to try next. Have any of you had luck getting close to 100 Gbps?
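For concreteness, my test looks roughly like the sketch below. This is only a sketch: ens7 and 10.0.0.2 are placeholders for the dataplane interface and server address in my slice, and it assumes iperf3 is installed on both VMs.
# On the server VM (placeholder address 10.0.0.2):
iperf3 -s
# On the client VM:
sudo ip link set ens7 mtu 9000                     # jumbo frames
sudo tc qdisc add dev ens7 root fq maxrate 30gbit  # pace flows to avoid packet bursts
iperf3 -c 10.0.0.2 -P 32 -t 60                     # 32 parallel streams for 60 seconds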
August 11, 2022 at 10:02 am #2624
100G is quite difficult to achieve, especially across wide-area links. Getting 56G is actually not bad.
What size are the VMs you are using? You probably need the biggest one available: 64 cores, 385G RAM.
Look at your dropped packets. Getting 100G probably requires zero dropped packets, and I suspect the connection is not that clean yet. Also, 100G will be impossible if other users are consuming any bandwidth at all.
Also, I suspect we need to set CPU affinity to ensure the streams run in the same NUMA domain as the 100G NIC. We haven’t looked into that yet.
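If you want to check those two things, something along these lines should work. It is only a rough sketch: ens7 is a placeholder for your dataplane interface, and it assumes numactl is installed.
# Look for drops on the dataplane interface
ip -s link show ens7               # RX/TX "dropped" counters
ethtool -S ens7 | grep -i drop     # per-queue drop counters, if the driver exposes them
# Find the NUMA node the NIC sits on, then pin iperf3 to it
cat /sys/class/net/ens7/device/numa_node
numactl --cpunodebind=<node> --membind=<node> iperf3 -c <server> -P 32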
August 11, 2022 at 3:40 pm #2628
Hi Chengyi,
Something has definitely changed as I once achieved ~98 Gbps when running iPerf in parallel on the SALT-UTAH link. This was the only link I personally was ever able to get anywhere close to 100 Gbps.
I suspect that the increase in active users has a lot to do with it, as my previous test was on May 16, 2022. Unless there were changes to the backend configuration at SALT-UTAH that I do not know about @Paul?
Running the same notebook today yielded 28.435 Gbps (combined over 10 parallel streams). I cannot remember if I previously used ConnectX-6s or Basic NICs to achieve the 98 Gbps, but both ConnectX-6s at SALT and UTAH were reserved today, so I had to run the test with Basic NICs. I’ll rerun the tests with ConnectX-6s once they become available again.
August 11, 2022 at 3:46 pm #2629
Thank you, Rice, for the test! It was me who had occupied the two NICs between SALT and UTAH. I tested this morning with 32 parallel streams on the ConnectX-6s and achieved 56 Gbps. I don’t want to release my reservation, so would you mind sharing your notebook so I can test with your settings on my side? Thank you!
August 11, 2022 at 8:32 pm #2630
Okay, the plot thickens!
I just ran more tests and achieved 98.061 Gbps from UTAH -> SALT (with 32 parallel streams), but only 30.653 Gbps from SALT -> UTAH. (The one thing I fixed to get back to ~97 Gbps was fair queuing; I had accidentally turned it off before.)
Time to put on our thinking caps!
@Chengyi, yes I’ll see if I can upload my notebook, but in case I can’t, here are the tuning parameters:
I’m using ifconfig:
sudo ifconfig ens7 mtu 8900 up
to set the MTU to 8900. 8900 was (is?) the max MTU size for FABRIC in May 2022; I’m not sure if they have increased it since then. If you were at 9000, this would explain why you had low throughput (dropped packets).
I’m using
sudo tc qdisc add dev ens7 root fq maxrate 30gbit
to use a fair-queuing model.
And I write the following lines to /etc/sysctl.conf (on Ubuntu 20, if that matters; I don’t think so…):
# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 536870912
net.core.wmem_max = 536870912
# increase Linux autotuning TCP buffer limit
net.ipv4.tcp_rmem = 4096 87380 536870912
net.ipv4.tcp_wmem = 4096 65536 536870912
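After writing those lines, you can load them without rebooting and sanity-check the other settings. A quick sketch, again assuming the interface is named ens7:
sudo sysctl -p                                 # apply the /etc/sysctl.conf changes
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem     # confirm the new buffer limits
ip link show ens7 | grep mtu                   # confirm the MTU is 8900
tc qdisc show dev ens7                         # confirm fq with the 30gbit maxrate is active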
Hope this helps!
– Brandon
August 11, 2022 at 9:19 pm #2631
Hi Brandon,
Thank you for your help! With your settings and choice of sites, I can also get 90 or so Gbps, cheers! And on my ConnectX-6, it reaches that number too!
I will test again tomorrow morning and see, since I suspect no one is using FABRIC late at night right now 🙂
Anyway, thank you for your help. Have a good night~
Best,
Chengyi
August 11, 2022 at 11:08 pm #2632
Never mind, I just re-ran the tests, now during off-hours, and got:
SALT -> UTAH: 91.98 Gbps
UTAH -> SALT: 96.938 Gbps
So it must have been other users or something.
August 11, 2022 at 11:11 pm #2633
Edit: I forgot to refresh the page, so I didn’t see your reply, @Chengyi.
That’s great, I’m glad you got it working! None of the other sites I’ve tested come close to 90 Gbps; most are around 10 Gbps. Maybe it’s the NUMA domain issue @Paul suggested.
– Brandon
August 12, 2022 at 8:52 am #2635
BTW @Paul, does FABRIC have plots of network usage over time? That way we could see how busy the links are.
– Chengyi
August 12, 2022 at 12:45 pm #2637
@Chengyi: We do not currently have usage info about the network links. We have a monitoring framework that is in the process of being built but is not quite ready yet. Also, WAN bandwidth is currently shared and best-effort. Eventually you will be able to reserve bandwidth, and it will be easy to know your target max bandwidth for these tests.
I’m glad to see you are getting better bandwidths!