1. Expected transfer speed between nodes

  • #1595
    Xusheng Ai
    Participant

      Hello,

      As we are building a Genome Data Lake, suppose we create a slice on FABRIC where one node acts as the client and the other nodes are Hydra nodes. Users may insert/fetch genome files to/from the Hydra nodes. Right now, the NFD forwarder we use limits the transfer speed to about 5 Mbps. In the future, we will use a forwarder that should allow us to reach ~100 Gbps.

      So I was wondering: what transfer speed should we expect between FABRIC nodes? Will it make a big difference when the nodes are at different sites?

       

      Thanks for the help,

      Xusheng

      #1600
      Xusheng Ai
      Participant

        We ran iperf to test the transfer speed between FABRIC nodes over an L2 network connection.
        Here are the test results:

        • Transferring 19.8 GBytes between nodes at the same site (UTAH), the speed was 17.0 Gbits/sec.
        • Transferring 870 MBytes between nodes at two different sites (UTAH and STAR), the speed was 729 Mbits/sec.

        The second result is much lower than the first. Is that because of the file size, or because of other factors?
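
        For reference, a test like this can be driven from a FABlib notebook roughly as in the sketch below. The slice name, node names, and data-plane IP are placeholders, and the FABlib calls should be checked against the installed FABlib version.

```python
# Sketch: run an iperf3 test between two existing slice nodes via FABlib.
# Slice name, node names, and the data-plane IP are placeholders; verify the
# FABlib method names against the FABlib version you have installed.
from fabrictestbed_extensions.fablib.fablib import FablibManager

fablib = FablibManager()
my_slice = fablib.get_slice(name="genome-data-lake")   # placeholder slice name

server = my_slice.get_node(name="hydra1")              # placeholder node names
client = my_slice.get_node(name="client1")
server_ip = "10.0.0.2"                                 # server address on the L2 network

# Start the iperf3 server as a daemon, then run the client for 30 seconds
# with 8 parallel streams (parallel streams usually help at high bandwidth).
server.execute("iperf3 -s -D")
stdout, stderr = client.execute(f"iperf3 -c {server_ip} -P 8 -t 30")
print(stdout)
```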

         

        #1601
        Paul Ruth
        Keymaster

          There are a few things to unpack here so I hope this helps:

          Existing FABRIC Links: Currently, we are deploying links as they become available. None of the 1 Tbps links are ready yet. Many of the links we have deployed are dedicated 100 Gbps links, but some of those are also not ready, so we are temporarily using Internet2 AL2S L2 connections until our dedicated links are in place. The AL2S connections we are using vary in bandwidth; they depend on the level of AL2S service at the site’s host institution. The most common level of service is 10 Gbps, so if you are getting exactly 10 Gbps, you are likely using one of these links. Also, some of the current links do not match the final topology that we are working toward.

          Summary: a few of our links are 100 Gbps, but some are 10 Gbps. These are your current theoretical limits.

          Achieving theoretical bandwidth limits:  This can be quite challenging in practice.  There are several resources online that can help.  These are a good start:

          https://fasterdata.es.net/host-tuning/linux/

          https://srcc.stanford.edu/100g-network-adapter-tuning
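
          As a quick starting point on each node, something like the sketch below can be used to compare the current kernel buffer limits against the kind of values those guides suggest. The target numbers here are only illustrative placeholders; take the actual recommendations from the fasterdata page.

```python
# Sketch: compare current Linux TCP/socket buffer limits against illustrative
# high-bandwidth targets (in the spirit of the fasterdata-style tuning guides).
# The suggested values below are placeholders; use the guides for real numbers.
from pathlib import Path

targets = {
    "/proc/sys/net/core/rmem_max": "134217728",            # max receive socket buffer (bytes)
    "/proc/sys/net/core/wmem_max": "134217728",            # max send socket buffer (bytes)
    "/proc/sys/net/ipv4/tcp_rmem": "4096 87380 67108864",  # TCP receive buffer min/default/max
    "/proc/sys/net/ipv4/tcp_wmem": "4096 65536 67108864",  # TCP send buffer min/default/max
    "/proc/sys/net/ipv4/tcp_congestion_control": "htcp",   # often suggested for high-BDP paths
}

for path, suggested in targets.items():
    current = Path(path).read_text().strip()
    status = "OK  " if current.split() == suggested.split() else "DIFF"
    print(f"{status} {path}: current='{current}' suggested='{suggested}'")
```

          Values that differ can then be raised with sysctl (persistently via /etc/sysctl.conf) before re-running iperf.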

          Eventually, we will need to do a full performance test of all of our high-bandwidth links. We started looking at them a bit but haven’t really been able to complete that work yet. We have shared some of our testing/debugging notebooks in the Jupyter examples repo. The notebook at the link below is not really complete and is a bit dated in terms of the FABlib version it uses, but I think it might help you tune for iperf tests. When we were playing with it earlier, we were getting 25-30 Gbps across 100 Gbps wide-area links (which isn’t great, but is a good start). It’s worth noting that most of the tuning targets IP and TCP, so it might not apply directly to your NDN work; you will probably need to figure out how to apply the TCP/IP tuning concepts to your NDN configuration.

          https://github.com/fabric-testbed/jupyter-examples/blob/master/fabric_examples/testing_and_debugging/test-wan-networks.ipynb
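
          One concept from that tuning work which maps directly onto the same-site vs. UTAH-to-STAR numbers above is the bandwidth-delay product: with a fixed TCP window, throughput is capped at roughly window/RTT, so settings that saturate a low-latency local path can fall far short on a wide-area path. A small illustration (the window size and RTT values are made-up examples, not FABRIC measurements):

```python
# Sketch: bandwidth-delay product (BDP) and the throughput ceiling imposed by
# a fixed TCP window. Window size and RTTs are made-up example values.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes in flight needed to keep a path of this bandwidth and RTT full."""
    return bandwidth_bps * rtt_s / 8

def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput ceiling for a given effective window size: window / RTT."""
    return window_bytes * 8 / rtt_s

window = 4 * 1024 * 1024  # assume a 4 MB effective TCP window
for label, rtt_s in [("same-site path, ~0.5 ms RTT", 0.0005),
                     ("wide-area path, ~30 ms RTT", 0.030)]:
    print(f"{label}: ceiling ~ {window_limited_bps(window, rtt_s) / 1e9:.2f} Gbps")

# Window needed to fill a 100 Gbps path at 30 ms RTT:
print(f"BDP for 100 Gbps at 30 ms: ~{bdp_bytes(100e9, 0.030) / 1e6:.0f} MB")
```

          With those example numbers, the same 4 MB window supports roughly 67 Gbps locally but only about 1.1 Gbps across a 30 ms path, which is the same shape as the 17.0 Gbits/sec vs. 729 Mbits/sec results reported above; raising the socket buffer limits, as the tuning guides describe, is what lifts the wide-area ceiling.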

          Other considerations:

          • VM size: You won’t be able to get high bandwidths without many cores and plenty of memory. What size VMs are you using? Try at least 16 cores/64 GB of memory (even more would be better). This will be especially true for your NDN routers which, I assume, do some processing to help find/get/put files where they need to be.
          • VM placement: Which hosts are your VMs on? When using Basic NICs you are using 100 Gbps SR-IOV virtual functions. The 100 Gbps of bandwidth used by these VFs is shared by all VFs on that particular host. If many of your VMs are on the same host, you may be competing with yourself for bandwidth.
          • Consider limiting the bandwidth your application tries to use. One thing that happens with higher-bandwidth end hosts (common at 40/100 Gbps) is that you try to send 100 Gbps but some step along the path to the destination is slower (often significantly slower) than your end host. This causes a lot of packet loss at that step. High packet loss, in turn, triggers TCP to back off to extremely low bandwidth. This has been shown to happen with even small amounts of packet loss in high-latency environments (see the rough sketch after this list). I assume your NDN work is not using TCP. If you are tunneling, you are probably using UDP, and if you are trying a native deployment (which FABRIC is perfect for), you are using a lower-level packet-sending protocol. If that protocol does not back off to a reasonable speed, you will likely see a lot of packet loss and very low speeds.
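
          To give a rough sense of how sharply small loss rates bite at high latency, the classic Mathis approximation for loss-limited TCP throughput (roughly MSS / (RTT · √loss)) can be evaluated with a few example numbers. The RTT and loss values below are illustrative, not measurements, and the formula models standard loss-based TCP rather than NDN or UDP transports:

```python
# Sketch: Mathis et al. approximation for loss-limited TCP throughput,
#   throughput ≈ (MSS / RTT) * (1 / sqrt(loss_rate))
# (constant factor near 1 omitted). Illustrative numbers only; this models
# standard loss-based TCP, not NDN or UDP transports.
from math import sqrt

def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    return (mss_bytes * 8 / rtt_s) / sqrt(loss_rate)

MSS = 1460  # typical Ethernet MSS in bytes

for rtt_ms in (1, 30):
    for loss in (1e-5, 1e-3):
        gbps = mathis_throughput_bps(MSS, rtt_ms / 1000, loss) / 1e9
        print(f"RTT {rtt_ms:>2} ms, loss rate {loss:.0e}: ~{gbps:6.2f} Gbps ceiling")
```

          In this model, even a 0.001% loss rate caps a 30 ms path well below 1 Gbps, which is why pacing the sender down to what the path can actually carry often raises, rather than lowers, the achieved throughput.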

          I’ve probably left out something, but I think this should be useful.
