Forum Replies Created
We are still in the construction phase of FABRIC. Although there are plans for 30+ sites, currently only 10 of them are deployed on the production testbed. Many more sites are being delivered and/or configured, and they will be released to users as soon as they are ready. There will eventually be a site at Clemson, and there are plans for a site in Los Angeles, but there are no plans for one at UCLA.
Regarding your question about latency: Every site has a management network that allows you to ssh to your VMs and pull software/data to them. This network is intended for configuring experiments, not for experimentation itself. The management network is connected to the public Internet (although protected by a bastion host). If you ping “clemson.edu” or “ucla.edu” from your VMs, you are not pinging anything inside FABRIC. Instead, you are pinging a host owned and managed by a university, and you are sending traffic across the management network and the public Internet.
If you want to test latency across FABRIC, you will need to set up VMs at multiple FABRIC sites and ping between them over a network you set up within FABRIC. Note that on FABRIC you can deploy L2 or L3 networks and might see different latency depending on how you design your experiment. With L2 networks, you could even design multiple networks with different paths between the same pair of sites. You could even experiment with dynamically choosing between these paths to optimize your application.
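Once the VMs are up, an easy way to measure latency is to run plain `ping` over the dataplane network and post-process its output. Here is a minimal, hypothetical helper (the function name is my own, not part of fablib) that pulls the average RTT out of Linux ping's summary line:

```python
import re

def parse_avg_rtt(ping_output: str) -> float:
    """Extract the average RTT (ms) from Linux ping's summary line,
    e.g. 'rtt min/avg/max/mdev = 10.1/10.4/10.9/0.2 ms'."""
    match = re.search(r"= [\d.]+/([\d.]+)/", ping_output)
    if match is None:
        raise ValueError("no rtt summary found in ping output")
    return float(match.group(1))

# Example with canned output from `ping -c 4 <remote dataplane IP>`:
sample = "rtt min/avg/max/mdev = 10.112/10.437/10.901/0.284 ms"
print(parse_avg_rtt(sample))  # → 10.437
```

Running this against the output of pings sent over different L2 paths lets you compare the paths quantitatively.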
- This reply was modified 2 years, 4 months ago by Paul Ruth.
You get full physical access to the NIC. The connectx-6 cards are 2x100G, connectx-5 cards are 2x25G, and the Basic NICs are connectx-6 SR-IOV VFs (100G, but bandwidth is shared with other Basic NICs). We don’t artificially limit bandwidth on any NICs. Eventually, we plan to have dedicated QoS for bandwidth across WAN connections, but the NICs themselves are physical NICs and have whatever bandwidth they were designed to have.
You may see different bandwidths between sites. Some of that could be because other users are sharing the links, and a couple of sites don’t yet have their permanent physical connections. However, we have not ramped up usage yet and nearly all of our network links are minimally used. I would not expect other users to significantly affect your WAN bandwidth right now, and if they do, it will be temporary.
I do expect that achieving high bandwidths will require tuning of end hosts (and maybe core switches). Soon we will try to do this ourselves and provide suggested tuning parameters, but for now there is nothing artificial that prevents any user from achieving 100G across WAN FABRIC links. We just haven’t looked into the right tuning parameters yet.
If you are interested in high bandwidths, I suggest starting with pairs of sites that are physically close to each other (e.g., UTAH-SALT or WASH-MAX). Low latency makes higher bandwidth a lot easier to achieve.
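To see why latency matters so much here, it helps to compute the bandwidth-delay product: the amount of data TCP must keep in flight to fill the pipe grows linearly with RTT, so longer paths need much larger buffers and more aggressive tuning. A quick back-of-the-envelope in Python:

```python
def required_window_bytes(bandwidth_gbit: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_gbit * 1e9 / 8 * (rtt_ms / 1e3)

# Filling a 100G pipe between nearby sites (~2 ms RTT) vs a long
# cross-country path (~60 ms RTT):
print(required_window_bytes(100, 2) / 2**20)   # ~23.8 MiB in flight
print(required_window_bytes(100, 60) / 2**20)  # ~715 MiB in flight
```

The RTT values are illustrative, but the ratio is the point: a nearby site pair needs roughly 30x less buffered data to saturate the same link.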
July 21, 2022 at 2:11 pm in reply to: fablib : No such file or directory: ‘/tmp/fablib/fablib.log’ #2529
This is a temporary bug from an update we just pushed. Either wait a bit for the fix to be pushed, or type “mkdir /tmp/fablib” and it will work for now.
We have simplified the environment configuration process and this is a side effect. There are a couple ways to remedy this.
- Get the updated jupyter-examples notebooks and edit/run the configure environment notebook. This will create the config files you need and put them in fabric_config.
- Alternatively, you could delete the existing fabric_config folder and stop/start your jupyter container. This will reset the fabric_config files to sane defaults that allow existing notebooks to work with the new fablib.
- This reply was modified 2 years, 4 months ago by Paul Ruth.
July 19, 2022 at 9:51 am in reply to: When Creating a Slice, Sometimes Fails to Get NIC Components Correctly #2378
I’m trying to recreate this but cannot seem to intentionally trigger it. My guess is that it is an issue related to the library having a temporary problem creating an ssh connection to the VM.
One thing you can try is to manually re-run the post_boot_config step when you see this. You can do this by calling
slice.post_boot_config()
. If this fixes the slice, then this is probably a temporary ssh issue. Another thing to do is to look at the log file. By default it is at
/tmp/fablib/fablib.log
. There might be something in there that hints at what is happening. Be warned that fablib retries the ssh connection a few times on failure, so you may see ssh failures that were resolved. If you do see this again, could you try to include any relevant section of the log file in the message?
Ah… there is a typo… “get_intefaces()” is missing an ‘r’.
It should be:
[ifaceRouterC,ifaceRouterS] = nodeRouter.add_component(model="NIC_ConnectX_5", name="cx5_nic").get_interfaces()
- This reply was modified 2 years, 4 months ago by Paul Ruth.
I think the forum markup messed up the quotes in the code snippet I sent. Paste it in and re-type the quotes. It will work.
Also, that error is an unnecessary exception that is thrown by the currently deployed version of fablib when you have an interface that is not attached to a network. In your case, it is caused by the second port of the connectx-5’s on your nodes. You can safely ignore the error and it will work fine. Basically, fablib gets confused when it tries to configure an interface that you are not using. This error will be suppressed in the next version of fablib.
Yeah, that slice request requires 4 connectx-5’s, so you will need to be aware of their availability.
I have a few observations that may help:
– All 3 of your nodes are being sent to the same site. You might try putting them on different sites. From your perspective it will work about the same; the main difference will be greater latency between nodes.
– If you only need low level configuration for some interfaces, you could mix-and-match NIC_ConnectX_5 with NIC_Basic. For example, maybe your router needs a ConnectX_5 but the nodes can use a NIC_Basic (or the other way around).
– Your router node is asking for 2 connectx-5’s. Each connectx-5 has two ports. If you really only need two ports, you can use 1 connectx-5 for your router. The code for that will look something like:
connectx5_interfaces = nodeRouter.add_component(model="NIC_ConnectX_5", name="cx5_nic").get_interfaces()
ifaceRouterC = connectx5_interfaces[0]
ifaceRouterS = connectx5_interfaces[1]
or you can shorten it to:
[ifaceRouterC,ifaceRouterS] = nodeRouter.add_component(model="NIC_ConnectX_5", name="cx5_nic").get_interfaces()
- This reply was modified 2 years, 4 months ago by Paul Ruth.
Your project did not have permissions required to use smart NICs. I just added the permissions. Please try again.
I think that might be a side effect of NIC_Basic’s being SRIOV VFs with limited low level configuration available. You will have a bit more control if you use FABRIC’s dedicated NICs.
I tried running that ethtool command on a NIC_ConnectX_5 and it worked. I think the lower level control you need requires NIC_ConnectX_5 or NIC_ConnectX_6 NICs.
Let me know if that works for you.
Paul
Yeah, that’s interesting. We probably don’t want passphrases stored in notebooks. There is an update to fablib coming in a week or so that should streamline a lot of these config issues. Part of it includes creating a fabric_rc file and a more sophisticated ssh config file. Together, these will remove nearly all of the env vars and other config from the notebooks. I will note that we should probably do something clever with the notebook that creates the config files so that passphrases don’t get stored in them.
Also, the upload call should be given complete paths, including the file names. Like this:
node1.upload_file("/home/fabric/work/test_file", "/home/rocky/test_file")
GitHub is one of the few major sites that doesn’t support IPv6.
- An easy way to work around this is to use a public NAT64 like this: https://nat64.net/
- More reliable but complicated solutions can be found here: https://learn.fabric-testbed.net/knowledge-base/using-ipv4-only-resources-like-github-or-docker-hub-from-ipv6-fabric-sites
- The permanent solution is to wait for GitHub to enable IPv6.
I would try the nat64.net option. The only real negative is that it is a free public service and could disappear someday. In practice we have found it to work quite well.
One other option is to create a tarball of whatever you need to install and just transfer it in manually.
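For the nat64.net option, the setup on the VM amounts to pointing DNS at a DNS64 resolver so IPv4-only hostnames resolve to synthesized IPv6 addresses. A hypothetical sketch (the resolver address below is an example; substitute the current addresses published at nat64.net):

```shell
# Point the VM at a public DNS64 resolver so IPv4-only sites like
# github.com resolve via NAT64. Address is illustrative -- check
# https://nat64.net/ for the currently published resolvers.
sudo tee /etc/resolv.conf <<'EOF'
nameserver 2a00:1098:2c::1
EOF
```

Note that some distros regenerate /etc/resolv.conf on boot, so you may need to make the change in your network manager's config instead.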
Oh, I misunderstood.
I think I saw in one of your other messages that you were using keys with passphrases. I suspect the handling of the passphrase is the issue here. Or maybe one (or both) of your keys doesn’t match our requirements exactly.
I’ll need to create some keys with passphrases to check and make sure that still works. I suspect that if passphrases didn’t work I would have heard complaints by now. While I’m doing that, could you try some simple non-passphrase keypairs? For the bastion key, can you let the portal generate the key? This would help narrow down where we need to look.
For reference, the key reqs are here: https://learn.fabric-testbed.net/knowledge-base/logging-into-fabric-vms/#ssh-keypair-primer-creating-identifying-fingerprinting-keypairs
Those are all great resources.
mtu = 9000 (jumbo frames) is important too.
With 100G NICs this part from fasterdata.es.net is important too:
We also strongly recommend reducing the maximum flow rate to avoid bursts of packets that could overflow switch and receive host buffers. For example, for a 10G host, add this to a boot script:
/sbin/tc qdisc add dev ethN root fq maxrate 8gbit
For a host running data transfer tools that use 4 parallel streams, do this:
/sbin/tc qdisc add dev ethN root fq maxrate 2gbit
Where 'ethN' is the name of the ethernet device on your system.
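Pulling those suggestions together, a sketch of what a host tuning boot script might look like. These are illustrative starting points in line with the fasterdata.es.net guidance, not validated FABRIC settings; `ethN` and the maxrate value need to be adjusted for your interface and link:

```shell
#!/bin/bash
# Illustrative end-host tuning for high-throughput WAN transfers.
# Values are starting points -- measure with iperf3 and adjust.
ETH=ethN                                   # replace with your dataplane interface

ip link set dev "$ETH" mtu 9000            # jumbo frames (must match end to end)
sysctl -w net.core.rmem_max=2147483647     # allow large TCP receive buffers
sysctl -w net.core.wmem_max=2147483647     # allow large TCP send buffers
sysctl -w net.ipv4.tcp_rmem="4096 87380 2147483647"
sysctl -w net.ipv4.tcp_wmem="4096 65536 2147483647"
tc qdisc add dev "$ETH" root fq maxrate 20gbit  # pace flows below line rate
```

Remember the buffers only help if both ends of the transfer are tuned.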
You are not allowed to ssh to our bastion host directly. You can only jump through it. The
node.execute
call does this for you by creating a “channel” with paramiko. If you want to replicate this in your own code, an example is here: https://github.com/fabric-testbed/fabrictestbed-extensions/blob/30175ec0c5d05d93000443448d8abfb554f99c7c/fabrictestbed_extensions/fablib/node.py#L646
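The pattern in that link boils down to two paramiko sessions: one to the bastion, then a direct-tcpip channel from the bastion to the VM that the second session rides on. A hypothetical sketch (function and parameter names are my own, not fablib's):

```python
def open_via_bastion(bastion_host, bastion_user, bastion_key,
                     vm_host, vm_user, vm_key):
    """Sketch: SSH to a VM by jumping through the bastion, mirroring
    the channel approach fablib's node.execute uses. Not tested
    against FABRIC -- adapt hostnames, users, and key paths."""
    import paramiko  # imported here so the sketch stays self-contained

    # First hop: authenticate to the bastion host.
    bastion = paramiko.SSHClient()
    bastion.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    bastion.connect(bastion_host, username=bastion_user,
                    key_filename=bastion_key)

    # Open a direct-tcpip channel from the bastion to the VM's ssh port...
    channel = bastion.get_transport().open_channel(
        "direct-tcpip", dest_addr=(vm_host, 22), src_addr=("", 0))

    # ...and run the second SSH session over that channel.
    vm = paramiko.SSHClient()
    vm.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    vm.connect(vm_host, username=vm_user, key_filename=vm_key, sock=channel)
    return vm   # use vm.exec_command(...) as with any SSHClient
```

The equivalent on the command line is OpenSSH's ProxyJump (`ssh -J user@bastion user@vm`).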
Paul