Forum Replies Created
Eventually, the sliver keys will be managed by the portal and will be automatically deployed to the VMs. For now, you need to specify a key pair in the API.
At the top of every example notebook there is this:
Set the keypair FABRIC will install in your slice:

os.environ['FABRIC_SLICE_PRIVATE_KEY_FILE']=os.environ['HOME']+'/.ssh/id_rsa'
os.environ['FABRIC_SLICE_PUBLIC_KEY_FILE']=os.environ['HOME']+'/.ssh/id_rsa.pub'
In the JupyterHub, the ~/.ssh/id_rsa key pair is created for you and used by default. If you set up the API somewhere else (e.g. your laptop), you will need to specify a path to a pair of keys (I’m just now noticing that it is called a “slice” key here and a “sliver” key on the portal).
When key management is set up on the portal, you will only need to specify the name of the key pair to use.
For now, ssh/scp from a command line needs your bastion key in the ~/.ssh/config file and the “slice/sliver” key following the -i on the command line.
RE: copying large files off VMs.
Option 1: Transfer through the bastion host
All VMs have a management network that you are using to ssh to your VMs. You probably noticed that we are requiring users to jump through a bastion host to access their VMs (this increases security).
The easiest way to transfer data to/from a VM is with scp (or a similar tool). The trick is that you still need to jump through the bastion host. To do this you will need to configure the external machine (your laptop, etc.) to use the bastion host as a jump host for regular ssh.
SSH config info can be found here: https://learn.fabric-testbed.net/knowledge-base/logging-into-fabric-vms/
The main thing you need to get from that document is that you need to create (or add to) your ~/.ssh/config. Add config info about the bastion hosts. Basic config could look like this:
Host bastion-*.fabric-testbed.net
    User <bastion_username>
    IdentityFile /path/to/bastion/private/key
Then you should be able to ssh like this:
ssh -i /path/to/vm/private/key -J bastion-1.fabric-testbed.net <vm_username>@<vm_ip>
You should also be able to scp like this:
scp -i /path/to/vm/private/key -o "ProxyJump bastion-1.fabric-testbed.net" <vm_username>@<vm_ip>:file.txt .
Note that if the VM IP is IPv6, the colons confuse the parsing and you need to do something like this (the escape characters are necessary on OSX but not in bash; using them works either way. I’m not sure how to do this on Windows):
scp -i /path/to/vm/private/key -o "ProxyJump bastion-1.fabric-testbed.net" ubuntu@\[2001:400:a100:3030:f816:3eff:fe93:a1a0\]:file.txt .
Option 2: Push directly from the VM
You can push data from a VM to a service on the Internet. The tricky part is that some VMs are connected to the IPv4 Internet using NAT, but most use IPv6. If you are using an IPv4 site, your transfer will need to work through a NAT. If you are on an IPv6 site, your transfer will need to be to a service that supports IPv6.
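As a rough sketch, you could drive the push from a notebook with FABlib’s execute method (node is a FABlib node object; the destination host and user are hypothetical, and the VM would need credentials for that server):

# Minimal sketch: push a results file from the VM to an external server.
# The destination is hypothetical; '-6' forces IPv6 on IPv6-only sites.
stdout, stderr = node.execute('scp -6 results.tar.gz user@data.example.org:/data/')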
Also, FABRIC is itself an IPv6 AS and will soon peer with the public IPv6 Internet. We also have a small IPv4 allocation. Eventually, it will be possible to connect experiment dataplanes to the public Internet (although security concerns will likely limit what you can do with this feature).
The /home/fabric/work dir in the JupyterHub environment is intended to be persistent storage of code, notebooks, scripts, etc. related to configuring and running experiments. This includes adding extra Python modules. However, it is not really intended to support large data sets.
We are currently implementing a service for storing larger data sets in FABRIC. This service will look more like a persistent (or long-lived) storage sliver that can be attached and re-attached to many slices over time.
It is also important to note that, currently, the JupyterHub is hosted on a local machine at RENCI, but we are in the process of moving it to an existing production Kubernetes cluster managed by our IT staff. This should increase the reliability and performance of the whole system. We are still discussing the resource caps that will be possible on this new cluster, but the intent is to support code and very small data, not large data sets. Also, the transition to the production Kubernetes will be tricky. I would recommend creating backups of any valuable data before then, and probably for a little while after, until we discover/fix any bugs in the new setup.
When this is all finalized, the target is to have Jupyter environments with persistent storage quotas big enough to install any packages needed to run and manage experiments. Larger data sets should use the yet-to-be-deployed persistent storage service.
There are some issues related to renewing a slice that will be fixed with an update coming in the next few weeks. I don’t know if this is part of that fix, but it might be. It might be best not to rely on renewal until then.
In general, it is best to automate the deployment of any experiment using Jupyter notebooks or other scripts. This way you can shut down your slice when you are not actively working on it and easily restart it the next day when you start working again. This is especially important if you are using scarce resources that other experimenters might want to use. It also helps you navigate any scheduled (or unscheduled) downtime the testbed might experience. In addition, it helps you publish repeatable experiments that can easily be shared with others, who can re-run your experiment on their own.
Long-lived slices, which require renewals, should be reserved for times when you are running an experiment that needs to actively run and collect data for longer periods of time. If you are just trying to set up an experiment, you will be more successful if you incorporate automation into your workflow from the beginning. In my experience with testbeds, the users who do not automate the deployment of their experiment have a much harder time getting things to work and, in the end, don’t really know what is in the base environment they are using for their experiment.
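As a rough sketch of what that automation can look like with FABlib (the slice name, node name, and site are just placeholders, and the calls follow the example notebooks):

from fabrictestbed_extensions.fablib.fablib import fablib

# Build the experiment from scratch each working session
slice = fablib.new_slice(name='MyExperiment')
slice.add_node(name='node1', site='TACC')
slice.submit()

# ... configure and run the experiment ...

# Tear everything down when you stop for the day
slice = fablib.get_slice(name='MyExperiment')
slice.delete()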
I’m not really sure what happened here but it seems like something inside the VM changed. I recommend recreating the slice.
Let us know if you see issues like this again. It would be helpful if you could create a Jupyter notebook that recreates the error so we can investigate further.
We checked and your VM is there, but we can’t ping it or ssh to it. It seems like something is misconfigured in the VM, perhaps an IP or route.
Were you ever able to ssh to the VM? Does your experiment involve re-configuring IPs or routes? I’m wondering if any changes that your experiment made to the VM misconfigured an IP or route and locked you out.
Paul
I think your slice and nodes have expired and no longer exist. By default a slice will expire after 24 hours.
If you have a fablib application that queried for your slice before it expired, you might see a stale status from the old query. Try getting the slice again and I suspect you will see that it no longer exists.
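For example, re-query by name instead of reusing the old object (the slice name is just a placeholder); if the slice has expired, this will fail or report that the slice no longer exists:

# Query the current state of the slice rather than trusting a stale object
slice = fablib.get_slice(name='MySlice')
print(slice.get_state())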
Paul
This will be as stable as we can make it. The idea is that this is our long-term user-facing API that can hide any changes in the underlying control framework.
There will likely be a lot of additions to the API but our goal is to keep existing features as stable as possible. If/when we find a feature that must be changed, we will try to keep a deprecated version around for a while in parallel to the updated feature.
Feel free to write applications against this API and suggest any additions/changes.
thanks,
Paul

Your code is trying to unpack the return value of upload_file into the tuple (stdout, stderr). The file upload is not a shell command and does not have a stdout/stderr.
Try setting the return value to a single variable. I think it’s a single string that shows the file attributes. The error occurs because you are trying to unpack that string into a tuple.
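For example (the variable name is just illustrative):

file_attributes = mynode.upload_file('myfile.txt', 'myfile.txt')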
The path should be to wherever the file you want to upload is located.
For example, if you had a file in the base directory of your JupyterHub environment and you wanted to upload it to the home directory of the VM, you might use something like the following:
mynode.upload_file('/home/fabric/work/myfile.txt', 'myfile.txt')
This is a known bug. The problem is with the reporting of the new expiration time. The extension was likely successful but when you try to verify the new time it is not updated. There is a fix for this that will be in the next release.
The way to upload a file is to use a file transfer tool that uses ssh and can hop through the bastion host. The most common of these is scp, but there are others that might have better performance.

FABlib includes an easy way to upload any file using paramiko. Any node object has a method called upload_file. This method will copy a file from the local host (i.e. your laptop or JupyterHub environment) to the remote node and put it in the location specified. It operates as the default user, so there may be permission issues depending on where you try to write the file.

Example:
myNode.upload_file('/path/to/local/file', '/path/to/remote/file')
More information about the FABLib upload method (and other functionality) can be found here: https://learn.fabric-testbed.net/docs/fablib/fablib.html#node.Node.upload_file
Paul
There are a few things to unpack here so I hope this helps:
Existing FABRIC Links: Currently, we are deploying links as they become available. None of the 1 Tbps links are ready yet. Many of the links we have deployed are dedicated 100 Gbps links, but some of those are also not ready, so we are temporarily using Internet2 AL2S L2 connections until our dedicated links are ready. The AL2S connections we are using vary in bandwidth; they depend on the level of AL2S service that exists at the site’s host institution. The most common level of service is 10 Gbps. If you are getting exactly 10 Gbps, you are likely using one of these links. Also, some of the current links do not match the final topology that we are working toward.
Summary: a few of our links are 100 Gbps but some are 10 Gbps. These are your current theoretical limits.
Achieving theoretical bandwidth limits: This can be quite challenging in practice. There are several resources online that can help. These are a good start:
https://fasterdata.es.net/host-tuning/linux/
https://srcc.stanford.edu/100g-network-adapter-tuning
Eventually, we will need to do a full performance test of all of our high-bandwidth links. We have started looking at them a bit but haven’t been able to complete that work yet. We have shared some of our testing/debugging notebooks in the Jupyter examples repo. The notebook at the link below is not really complete and is a bit dated in terms of the FABlib version it uses, but I think it might help you tune for iperf tests. When we were playing with it earlier we were getting 25-30 Gbps across 100 Gbps wide-area links (which isn’t great but is a good start). It’s worth noting that most of the tuning targets IP and TCP, so it might not apply directly to your NDN work. You probably need to figure out how to apply the TCP/IP tuning concepts to your NDN configuration.
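As a rough illustration, applying a few of the fasterdata-style buffer settings from a notebook might look like this (node is a FABlib node object, and the values are taken from the fasterdata 100G guidance, not tuned for any specific FABRIC host):

# Example host tuning for high-bandwidth TCP tests (values from the
# fasterdata.es.net 100G guidance; adjust for your own experiment)
tuning_commands = [
    'sudo sysctl -w net.core.rmem_max=536870912',
    'sudo sysctl -w net.core.wmem_max=536870912',
    'sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"',
    'sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"',
]
for command in tuning_commands:
    stdout, stderr = node.execute(command)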
Other considerations:
- VM size: You won’t be able to get high bandwidths without many cores and lots of memory. What size VMs are you using? Try at least 16 cores/64 GB of memory (even more would be better). This will be especially true for your NDN routers which, I assume, do some processing to help find/get/put files where they need to be.
- VM placement: Which hosts are your VMs on? When using Basic NICs you are using 100 Gbps SR-IOV virtual functions. The 100 Gbps of bandwidth used by these VFs is shared by all VFs on that particular host. If many of your VMs are on the same host, you may be competing with yourself for bandwidth.
- Consider limiting the bandwidth your application is trying to use. One thing that happens with higher-bandwidth end hosts (common at 40/100 Gbps) is that you try to send 100 Gbps but some step along the path to the destination is slower (often significantly slower) than your end host. This causes a lot of packet loss at that step. High packet loss, in turn, triggers TCP to back off to extremely low bandwidth. This has been shown to happen with even small amounts of packet loss in high-latency environments. I assume your NDN work is not using TCP. If you are tunneling you are probably using UDP, and if you are trying a native deployment (which FABRIC is perfect for) you are using a lower-level packet-sending protocol. If that protocol does not back off to a reasonable speed, you will likely see a lot of packet loss and very low speeds (see the iperf3 sketch after this list).
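A sketch of what a rate-capped test could look like between two FABlib nodes (the IP address and rate are placeholders):

# Run a UDP iperf3 test capped below the link rate to avoid loss-driven collapse
node1.execute('iperf3 -s -D')                                        # server as a daemon
stdout, stderr = node2.execute('iperf3 -c 10.0.0.1 -u -b 5G -t 30')  # cap at 5 Gbps
print(stdout)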
I’ve probably left out something, but I think this should be useful.
Re: the unpacking error with file upload: That is because file upload returns a single string that describes the file attributes. It’s very common to assume that it returns a tuple of (stdout, stderr) like the execute method does… but upload is not a command, so there is no stdout/stderr.
The error you are seeing is when you do this:
stdout,stderr = node.upload(...)
It will try to unpack the single string into the tuple (stdout, stderr) and fail. Of course, this happens while processing the return value, so the upload has already happened. Instead, do something like:
file_attributes = node.upload(...)
Re: the warning about hairpins: This is a low-level limitation of using the wide-area (aka site-to-site) networks with Basic NICs (which are implemented as SR-IOV virtual functions). The problem is that if you have an S2S network with multiple Basic NICs that end up on the same physical host, the physical switch can’t connect the two Basic NICs because they are using the same port on the switch (i.e. it can’t create a hairpin on the port).
This warning is printed in a low level of the library and FABLib cannot suppress it (I’ve asked for it to be removed and I think it will be soon). There are several cases where you can be sure there will be no problems.
- It will be fine if you only have one Basic NIC per site. Often this is because you only have two Basic NICs, one on either end of a WAN link connecting two routers or switches (this is your case).
- Alternatively, you can explicitly choose a different host for each node with node.set_host() (see the sketch after this list). This will work, but can be tricky because there are a limited number of hosts per site and you need to ensure that the host you pick has any other component types you might have added to your node.
- Also, you could just use dedicated NICs, but I wouldn’t recommend that if you can avoid it. There are not that many, and you will likely have issues finding sites with enough available for your experiment.
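A minimal sketch of the set_host() approach (the worker host names here are hypothetical and vary by site):

# Pin the two nodes to different physical hosts so their Basic NICs
# land on different switch ports (host names are hypothetical)
node1.set_host('site1-w1.fabric-testbed.net')
node2.set_host('site1-w2.fabric-testbed.net')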
Ultimately, FABlib should be able to help by automatically picking hosts, or at least by testing the request to see if it will work. Right now, that warning appears any time you have a Basic NIC on an S2S network… even if you know it’s fine.
It’s worth mentioning here that this is not an issue with networks that are completely within one site (aka L2Bridges). These are implemented in a slightly different way which allows for hairpins.
Basically, the suggestion is that you should try not to put a lot of nodes on one S2S link unless you really need to. Most real-world scenarios don’t really work that way anyway.
I have a notebook I wrote with you in mind that shows how to deploy a set of routers at a few sites… I’ll let you know when it is ready.
This is great. I think we are on the same page here.
Please use as many sites as you want. It would be great to have more users deploying more complicated topologies. It’s what the testbed is for and will help us find more bugs and usability issues. If you have 5 routers you should use 5 sites!
For now there are very few users using resources anyway. And if you use Basic NICs you can create a fully connected topology without using too many resources. However, a fully connected topology won’t work as well as you might expect (see next paragraph).
For your application, you may want to map your topology to the underlying available topology. In other words, you may want to have routers at many (most? all?) sites and have links between them that correspond to the physical links that are available. It is possible to request direct links between any two sites, but if the sites are not directly connected, the path we provide will bounce through one or more other sites. Particularly for projects like yours, it will be useful to consider which paths in your requested topology are, ultimately, mapped to the same underlying physical links. In these cases, you may wish to add another router to your topology where the actual paths join together.
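As a rough sketch, requesting one router per site with a site-to-site link between a pair of them might look like this (the sites, names, and calls follow the FABlib example notebooks and are only illustrative):

# Two routers at different sites connected by a wide-area L2 network
router1 = slice.add_node(name='router1', site='STAR')
router2 = slice.add_node(name='router2', site='TACC')
iface1 = router1.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
iface2 = router2.add_component(model='NIC_Basic', name='nic1').get_interfaces()[0]
slice.add_l2network(name='net1', interfaces=[iface1, iface2])
slice.submit()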
Also, note that we are currently deploying new sites and links at a rapid pace, so your ideal topology today might be different next month (or even next week). You might just want to design your topology against the full FABRIC deployment. It will work correctly now and will give the highest performance later.