Paul Ruth

Forum Replies Created

Viewing 15 posts - 136 through 150 (of 273 total)
  • in reply to: Fabric nodes Network access #2538
    Paul Ruth
    Keymaster

      We are still in the construction phase of FABRIC.  Although there are plans for 30+ sites, only 10 of them are currently deployed on the production testbed.  Many more sites are being delivered and/or configured, and they will be released to users as soon as they are ready.  There will eventually be a site at Clemson.  There are plans for a site in Los Angeles, but not at UCLA.

      Regarding your question about latency: Every site has a management network that allows you to ssh to your VMs and pull software/data to them.  This network is intended for configuring experiments, not for experimentation itself.  It is connected to the public Internet (although protected by a bastion host).  If you ping “clemson.edu” or “ucla.edu” from your VMs, you are not pinging anything inside FABRIC.  Instead, you are pinging a host owned and managed by a university, and your traffic crosses the management network and the public Internet.

      If you want to test latency across FABRIC, you will need to set up VMs at multiple FABRIC sites and ping between them over a network you set up within FABRIC.  Note that on FABRIC you can deploy L2 or L3 networks, and you might see different latency depending on how you design your experiment.  With L2 networks, you could design multiple networks with different paths between the same pair of sites.  You could even experiment with dynamically choosing between these paths to optimize your application.
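Once ping runs between two VMs over a network inside FABRIC, comparing paths comes down to reading the RTT summary line. The following is a minimal sketch, assuming the summary format printed by the Linux iputils `ping`; the sample output and the address `192.168.1.2` are made up for illustration and are not FABRIC measurements.

```python
import re

# Summary line printed by Linux iputils ping, for example:
# "rtt min/avg/max/mdev = 31.102/31.250/31.400/0.122 ms"
_RTT_RE = re.compile(
    r"(?:rtt|round-trip) min/avg/max/(?:mdev|stddev) = "
    r"([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def parse_ping_rtt(output):
    """Return min/avg/max/mdev RTT in milliseconds from ping's text output."""
    match = _RTT_RE.search(output)
    if match is None:
        raise ValueError("no RTT summary line found in ping output")
    min_ms, avg_ms, max_ms, mdev_ms = (float(x) for x in match.groups())
    return {"min_ms": min_ms, "avg_ms": avg_ms, "max_ms": max_ms, "mdev_ms": mdev_ms}


# Hypothetical output, standing in for what ping prints on a VM.
SAMPLE_PING_OUTPUT = """\
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=31.4 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=31.1 ms

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 31.102/31.250/31.400/0.122 ms
"""

if __name__ == "__main__":
    print(parse_ping_rtt(SAMPLE_PING_OUTPUT))
```

Running the same parser on the output collected over each L2 path gives numbers that can be compared directly.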

      • This reply was modified 2 years, 4 months ago by Paul Ruth.
      in reply to: Default bandwidth for Fabric nodes #2532
      Paul Ruth
      Keymaster

        You get full physical access to the NIC. The connectx-6 cards are 2x100G, connectx-5 are 2x25G, and the Basic NICs are connectx-6 SR-IOV VFs (100G but bandwidth is shared with other Basic NICs).  We don’t artificially limit bandwidth on any NICs.  Eventually, we have plans to have dedicated QoS for bandwidth across WAN connections, but the NICs themselves are physical NICs and have whatever bandwidth they were designed to have.

        You may see different bandwidths between sites.  Some of that could be because other users are sharing the links, and a couple of sites don’t yet have their permanent physical connections.  However, we have not ramped up usage yet and nearly all of our network links are minimally used.  I would not expect other users to significantly affect your WAN bandwidth right now, and if they do, it will be temporary.

        I do expect that achieving high bandwidths will require tuning of end hosts (and maybe core switches).  Soon we will try to do this ourselves and provide suggested tuning parameters, but for now there is nothing artificial that prevents any user from achieving 100G across WAN FABRIC links.  We just haven’t looked into the right tuning parameters yet.

        If you are interested in high bandwidths, I suggest starting with pairs of sites that are physically close to each other (UTAH-SALT, or WASH-MAX).  Low latency makes higher bandwidth a lot easier to achieve.
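One way to see why latency matters is the bandwidth-delay product: to keep a link full, a TCP sender has to keep roughly bandwidth × RTT bytes in flight, so buffers must grow with latency. The sketch below only does that arithmetic; the RTT values are illustrative assumptions, not measurements between FABRIC sites.

```python
def bdp_bytes(bandwidth_gbps, rtt_ms):
    """Bandwidth-delay product in bytes for a link speed (Gbit/s) and RTT (ms)."""
    bits_in_flight = bandwidth_gbps * 1e9 * rtt_ms / 1e3
    return bits_in_flight / 8


if __name__ == "__main__":
    # Assumed RTTs: 1 ms for nearby sites, 60 ms for a long WAN path.
    for rtt_ms in (1, 60):
        megabytes = bdp_bytes(100, rtt_ms) / 1e6
        print(f"100 Gbit/s at {rtt_ms} ms RTT needs about {megabytes:.1f} MB in flight")
```

At the same link speed, the longer path needs a buffer 60 times larger, which is why nearby site pairs are the easier place to start.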

         

        Paul Ruth
        Keymaster

          This is a temporary bug from an update we just pushed.  Either wait a bit and a fix will be pushed or type “mkdir /tmp/fablib” and it will work for now.
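If you would rather apply the workaround from inside a notebook than from a terminal, the equivalent of that `mkdir` is a short Python call. This is a minimal sketch using only the standard library; the helper name `ensure_dir` is made up for illustration.

```python
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any parents) if it does not already exist."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


if __name__ == "__main__":
    # Same effect as typing: mkdir /tmp/fablib
    print(ensure_dir("/tmp/fablib"))
```

Unlike a plain `mkdir`, this does not fail if the directory is already there.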

          in reply to: User is not a member of project: #2397
          Paul Ruth
          Keymaster

            We have simplified the environment configuration process and this is a side effect.  There are a couple of ways to remedy this.

            • Get the updated jupyter-examples notebooks and edit/run the configure environment notebook.  This will create the config files you need and put them in fabric_config.
            • Alternatively, you could delete the existing fabric_config folder and stop/start your Jupyter container.  This will reset the fabric_config files to sane defaults that allow existing notebooks to work with the new fablib.
            • This reply was modified 2 years, 4 months ago by Paul Ruth.
            Paul Ruth
            Keymaster

              I’m trying to recreate this but cannot seem to trigger it intentionally.  My guess is that it is an issue related to the library having a temporary problem creating an ssh connection to the VM.

              One thing you can try is to manually re-run the post_boot_config step when you see this. You can do this by calling slice.post_boot_config(). If this fixes the slice, then this is probably a temporary ssh issue.

              Another thing to do is to look at the log file. By default it is at /tmp/fablib/fablib.log. There might be something in there that hints at what is happening. Be warned that fablib retries the ssh connection a few times on failure, so you may see ssh failures that were resolved.

              If you do see this again, could you try to include any relevant section of the log file in the message?
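To pull out the part of the log worth pasting into a message, a small filter can help. This is a minimal sketch: the keywords are assumptions about what the interesting lines contain, since the exact log format is not described here, and the helper name is made up.

```python
from pathlib import Path


def relevant_log_lines(log_path, keywords=("ssh", "error", "exception"), limit=50):
    """Return up to `limit` of the most recent lines that mention any keyword."""
    if limit <= 0:
        return []
    lines = Path(log_path).read_text(errors="replace").splitlines()
    hits = [line for line in lines if any(k in line.lower() for k in keywords)]
    return hits[-limit:]


if __name__ == "__main__":
    log_file = Path("/tmp/fablib/fablib.log")  # default location mentioned above
    if log_file.exists():
        print("\n".join(relevant_log_lines(log_file)))
```

Because of the retries, an ssh failure followed by later successful activity for the same node is probably one that was resolved.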

              in reply to: modifying device properties #2359
              Paul Ruth
              Keymaster

                Ah… there is a typo… “get_intefaces()” is missing an ‘r’.

                It should be:
                [ifaceRouterC,ifaceRouterS] = nodeRouter.add_component(model="NIC_ConnectX_5", name="cx5_nic").get_interfaces()

                • This reply was modified 2 years, 4 months ago by Paul Ruth.
                in reply to: modifying device properties #2357
                Paul Ruth
                Keymaster

                  I think the forum markup messed up the quotes in the code snippet I sent. Paste it in and re-type the quotes. It will work.

                  Also, that error is an unnecessary exception thrown by the currently deployed version of fablib when you have an interface that is not attached to a network.  In your case, it is caused by the second port of the connectx-5’s on your nodes.  You can safely ignore the error and it will work fine.  Basically, fablib gets confused when it tries to configure an interface that you are not using.  This error will be suppressed in the next version of fablib.

                  in reply to: modifying device properties #2352
                  Paul Ruth
                  Keymaster

                    Yeah, that slice request requires 4 connectx-5’s, so you will need to be aware of their availability.

                    I have a few observations that may help:

                    – All 3 of your nodes are being sent to the same site.  You might try putting them on different sites.  From your perspective it will work about the same.  The main difference will be that the latency between nodes is greater.

                    – If you only need low level configuration for some interfaces, you could mix-and-match NIC_ConnectX_5 with NIC_Basic. For example, maybe your router needs a ConnectX_5 but the nodes can use a NIC_Basic (or the other way around).

                    – Your router node is asking for 2 connectx-5’s.  Each connectx-5 has two ports. If you really only need two ports you can use 1 connectx-5 for your router. The code for that will look something like:

                    connectx5_interfaces = nodeRouter.add_component(model="NIC_ConnectX_5", name="cx5_nic").get_interfaces()
                    ifaceRouterC = connectx5_interfaces[0]
                    ifaceRouterS = connectx5_interfaces[1]

                    or you can shorten it to:
                    [ifaceRouterC,ifaceRouterS] = nodeRouter.add_component(model="NIC_ConnectX_5", name="cx5_nic").get_interfaces()

                     

                    • This reply was modified 2 years, 4 months ago by Paul Ruth.
                    in reply to: modifying device properties #2349
                    Paul Ruth
                    Keymaster

                      Your project did not have permissions required to use smart NICs.  I just added the permissions.  Please try again.

                      in reply to: modifying device properties #2344
                      Paul Ruth
                      Keymaster

                        I think that might be a side effect of NIC_Basic’s being SR-IOV VFs with limited low-level configuration available.  You will have a bit more control if you use FABRIC’s dedicated NICs.

                        I tried running that ethtool command on a NIC_ConnectX_5 and it worked. I think the lower level control you need requires NIC_ConnectX_5 or NIC_ConnectX_6 NICs.

                        Let me know if that works for you.

                        Paul

                        in reply to: cannot login to reserved nodes #2341
                        Paul Ruth
                        Keymaster

                          Yeah, that’s interesting.  We probably don’t want passphrases stored in notebooks.  There is an update to fablib coming in a week or so that should streamline a lot of these config issues.  Part of it includes creating a fabric_rc file and a more sophisticated ssh config file.  Together, these will remove nearly all of the env vars and other config from the notebooks.  I will note that we should probably do something clever with the notebook that creates the config files so that passphrases don’t get stored in them.

                          Also, both arguments to the upload call should be complete paths, including the file names. Like this:

                          node1.upload_file("/home/fabric/work/test_file", "/home/rocky/test_file")
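If you only know the destination directory, you can build the complete remote path from the local file name. This is a minimal sketch; `remote_file_path` is a made-up helper, not part of fablib.

```python
import posixpath
from pathlib import Path


def remote_file_path(local_file, remote_dir):
    """Join a remote directory with the local file's name to get a full path."""
    return posixpath.join(remote_dir, Path(local_file).name)


if __name__ == "__main__":
    local = "/home/fabric/work/test_file"
    remote = remote_file_path(local, "/home/rocky")
    print(remote)
    # The result can then be passed along, e.g.: node1.upload_file(local, remote)
```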

                           

                          in reply to: Unable to pull github repository #2336
                          Paul Ruth
                          Keymaster

                            GitHub is one of the few major sites that doesn’t work with IPv6.

                            I would try the nat64.net option. The only real negative is that it is a free public service and could disappear someday.  In practice we have found it to work quite well.

                            One other option is to create a tarball of whatever you need to install and just transfer it in manually.
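For the tarball route, the archive can be built with Python's standard `tarfile` module. This is a minimal sketch; the directory and archive names in the usage lines are placeholders.

```python
import tarfile
from pathlib import Path


def make_tarball(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tarball and return the archive path."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the top-level folder name inside the archive.
        tar.add(str(source), arcname=source.name)
    return str(archive_path)


if __name__ == "__main__":
    repo = Path("my_repo")  # placeholder: a local checkout of what you need
    if repo.is_dir():
        print(make_tarball(repo, "my_repo.tar.gz"))
```

On the VM, `tar -xzf` on the transferred archive restores the folder.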

                            in reply to: cannot login to reserved nodes #2335
                            Paul Ruth
                            Keymaster

                              Oh, I misunderstood.

                              I think I saw in one of your other messages that you were using keys with passphrases.  I suspect the handling of the passphrase is the issue here.  Or maybe one (or both) of your keys doesn’t match our requirements exactly.

                              I’ll need to create some keys with passphrases to check and make sure that still works.  I suspect that if passphrases didn’t work, I would have heard complaints by now.  While I’m doing that, could you try some simple non-passphrase keypairs?  For the bastion key, can you let the portal generate the key?  This would help narrow down where we need to look.

                              For reference, the key reqs are here: https://learn.fabric-testbed.net/knowledge-base/logging-into-fabric-vms/#ssh-keypair-primer-creating-identifying-fingerprinting-keypairs

                               

                              in reply to: get_physical_os_interface()[‘ifname’] failed #2330
                              Paul Ruth
                              Keymaster

                                Those are all great resources.

                                mtu = 9000 (jumbo frames) is important too.

                                With 100G NICs, this part from fasterdata.es.net matters as well:

                                We also strongly recommend reducing the maximum flow rate to avoid bursts of packets that could overflow switch and receive host buffers.
                                
                                For example for a 10G host, add this to a boot script:
                                
                                /sbin/tc qdisc add dev ethN root fq maxrate 8gbit
                                For a host running data transfer tools that use 4 parallel streams, do this:
                                
                                /sbin/tc qdisc add dev ethN root fq maxrate 2gbit
                                Where 'ethN' is the name of the ethernet device on your system.
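The two quoted figures fit a simple pattern: 80% of the link rate, divided by the number of parallel streams. The sketch below builds the `tc` command on that assumption. The pattern is an inference from the two examples, not something the quoted text states, and whether the same headroom suits a 100G NIC would need testing.

```python
def fq_maxrate_command(device, link_gbit, streams=1, headroom=0.8):
    """Build a tc command that caps each flow at headroom * link rate / streams."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    rate_gbit = link_gbit * headroom / streams
    return f"/sbin/tc qdisc add dev {device} root fq maxrate {rate_gbit:g}gbit"


if __name__ == "__main__":
    print(fq_maxrate_command("ethN", 10))             # matches the 10G example
    print(fq_maxrate_command("ethN", 10, streams=4))  # matches the 4-stream example
```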


                                in reply to: cannot login to reserved nodes #2329
                                Paul Ruth
                                Keymaster

                                  You are not allowed to ssh to our bastion host directly.  You can only jump through it.  The node.execute call does this for you by creating a “channel” with paramiko.

                                  If you want to replicate this in your own code the example is here: https://github.com/fabric-testbed/fabrictestbed-extensions/blob/30175ec0c5d05d93000443448d8abfb554f99c7c/fabrictestbed_extensions/fablib/node.py#L646
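The general shape of a jump through a bastion with paramiko looks roughly like the sketch below. It is not a copy of the linked fablib code: the function name and every parameter are placeholders for your own bastion address, usernames, key files, and VM management IP, and it assumes keys without passphrases.

```python
def run_via_bastion(command, bastion_host, bastion_user, bastion_key_file,
                    vm_ip, vm_user, vm_key_file):
    """Run one command on a VM by tunnelling an ssh session through a bastion."""
    import paramiko  # third-party package; imported here so the sketch loads without it

    bastion = paramiko.SSHClient()
    bastion.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    bastion.connect(bastion_host, username=bastion_user,
                    key_filename=bastion_key_file)
    try:
        # Ask the bastion to open a TCP channel to the VM's ssh port.
        channel = bastion.get_transport().open_channel(
            "direct-tcpip", dest_addr=(vm_ip, 22), src_addr=("127.0.0.1", 0))

        vm = paramiko.SSHClient()
        vm.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        # The second ssh session runs over the channel instead of a new socket.
        vm.connect(vm_ip, username=vm_user, key_filename=vm_key_file,
                   sock=channel)
        try:
            _stdin, stdout, stderr = vm.exec_command(command)
            return stdout.read().decode(), stderr.read().decode()
        finally:
            vm.close()
    finally:
        bastion.close()
```

The key point is that you never get a shell on the bastion itself; it only forwards the channel to the VM.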

                                  Paul
