Paul Ruth

Forum Replies Created

Viewing 15 posts - 136 through 150 (of 274 total)
  • Paul Ruth
    Keymaster

      Do you always use STAR? I think this might be a problem with one of the RTX6000 GPUs at STAR.  I suspect you only get this error when your VM is placed on star-w2.  The error is probably happening more now because most of the RTX6000’s at STAR are allocated and you are more likely to get the bad one.

      For now try using a different site. I will have someone look at that GPU and see what is wrong with it.

       

       

      in reply to: Fabric nodes Network access #2538
      Paul Ruth
      Keymaster

        We are still in the construction phase of FABRIC. Although there are plans for 30+ sites, currently only 10 of them are deployed on the production testbed. Many more sites are being delivered and/or configured, and they will be released to users as soon as they are ready. There will eventually be a site at Clemson. There are also plans for a site in Los Angeles, but not at UCLA.

        Regarding your question about latency: Every site has a management network that allows you to ssh to your VMs and pull software/data to them. This network is intended for configuring experiments; it is not intended for experimentation. The management network is connected to the public Internet (although protected by a bastion host). If you ping “clemson.edu” or “ucla.edu” from your VMs, you are not pinging anything inside FABRIC. Instead, you are pinging a host owned and managed by a university, and your traffic crosses the management network and the public Internet.

        If you want to test latency across FABRIC, you will need to set up VMs at multiple FABRIC sites and ping between them over a network you set up within FABRIC. Note that on FABRIC you can deploy L2 or L3 networks and might see different latency depending on how you design your experiment. With L2 networks, you could even design multiple networks with different paths between the same pair of sites, and then experiment with dynamically choosing between these paths to optimize your application.
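Once you have VMs at two sites connected by a network inside FABRIC, a small helper like the one below can pull the average round-trip time out of ping's output so you can compare paths. This is only a sketch: it assumes the summary line format printed by Linux (iputils) ping, and the function name and sample numbers are made up for illustration.

```python
import re
from typing import Optional

# Matches the summary line printed by Linux (iputils) ping, for example:
# "rtt min/avg/max/mdev = 31.012/31.105/31.250/0.089 ms"
_RTT_SUMMARY = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def average_rtt_ms(ping_output: str) -> Optional[float]:
    """Return the average RTT in milliseconds, or None if no summary line is found."""
    match = _RTT_SUMMARY.search(ping_output)
    return float(match.group(2)) if match else None


# Illustrative output only; these numbers are not measurements from FABRIC.
sample = """\
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 31.012/31.105/31.250/0.089 ms
"""
print(average_rtt_ms(sample))
```

Run ping against the address of the other VM on your experiment network (not a public hostname), then pass the captured text to the helper.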

        • This reply was modified 2 years, 7 months ago by Paul Ruth.
        in reply to: Default bandwidth for Fabric nodes #2532
        Paul Ruth
        Keymaster

          You get full physical access to the NIC. The connectx-6 cards are 2x100G, connectx-5 are 2x25G, and the Basic NICs are connectx-6 SR-IOV VFs (100G but bandwidth is shared with other Basic NICs).  We don’t artificially limit bandwidth on any NICs.  Eventually, we have plans to have dedicated QoS for bandwidth across WAN connections, but the NICs themselves are physical NICs and have whatever bandwidth they were designed to have.

          You may see different bandwidths between sites. Some of that could be because other users are sharing the links and a couple of sites don’t yet have their permanent physical connections. However, we have not ramped up usage yet and nearly all of our network links are minimally used. I would not expect other users to significantly affect your WAN bandwidth right now, and if they do it will be temporary.

          I do expect that achieving high bandwidths will require tuning of end hosts (and maybe core switches). Soon we will try to do this ourselves and provide suggested tuning parameters, but for now there is nothing artificial that prevents any user from achieving 100G across WAN FABRIC links. We just haven’t looked into the right tuning parameters yet.

          If you are interested in high bandwidths, I suggest starting with pairs of sites that are physically close to each other (UTAH-SALT, or WASH-MAX).  Low latency makes higher bandwidth a lot easier to achieve.
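One way to see why latency matters is the bandwidth-delay product: the amount of data that has to be in flight to keep a link full. The sketch below is just that arithmetic. The RTT values are hypothetical placeholders, not measured FABRIC latencies, and the function name is made up for illustration.

```python
def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to fill a link of the given speed and RTT."""
    # Gbit/s * ms = 1e9 bits/s * 1e-3 s = 1e6 bits
    bits_in_flight = bandwidth_gbps * rtt_ms * 1e6
    return bits_in_flight / 8


# Hypothetical RTTs, chosen only to show how the buffer requirement scales.
for rtt_ms in (2, 10, 60):
    megabytes = bdp_bytes(100, rtt_ms) / 1e6
    print(f"100 Gbps at {rtt_ms} ms RTT needs about {megabytes:.0f} MB in flight")
```

TCP buffers on both end hosts need to be at least this large for a single flow to fill the link, which is why nearby site pairs are the easier place to start.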

           

          Paul Ruth
          Keymaster

            This is a temporary bug from an update we just pushed.  Either wait a bit and a fix will be pushed or type “mkdir /tmp/fablib” and it will work for now.
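If you would rather apply the workaround from inside a notebook, something like this should be equivalent to the mkdir command. It is a minimal sketch using only the Python standard library; the function name is made up for illustration.

```python
import os


def ensure_fablib_dir(path: str = "/tmp/fablib") -> str:
    """Create the directory if it is missing; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# In a notebook cell, call: ensure_fablib_dir()
```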

            in reply to: User is not a member of project: #2397
            Paul Ruth
            Keymaster

              We have simplified the environment configuration process and this is a side effect. There are a couple of ways to remedy this.

              • Get the updated jupyter-examples notebooks and edit/run the configure environment notebook.  This will create the config files you need and put them in fabric_config.
              • Alternatively, you could delete the existing fabric_config folder and stop/start your jupyter container.  This will reset the fabric_config files to sane defaults that allow existing notebooks to work with the new fablib.
              • This reply was modified 2 years, 7 months ago by Paul Ruth.
              Paul Ruth
              Keymaster

                I’m trying to recreate this but cannot seem to intentionally trigger it. My guess is that it is an issue related to the library having a temporary problem creating an ssh connection to the VM.

                One thing you can try is to manually re-run the post_boot_config step when you see this. You can do this by calling slice.post_boot_config(). If this fixes the slice, then this is probably a temporary ssh issue.
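If the failure really is a transient ssh problem, re-running the step a few times with a pause in between may be enough. Here is a rough sketch of a generic retry wrapper. It assumes that slice.post_boot_config() raises an exception when it fails, which I have not verified, and the helper name and default values are made up for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(action: Callable[[], T], attempts: int = 3, delay_s: float = 10.0) -> T:
    """Call action() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # assumption: failures surface as exceptions
            last_error = error
            if attempt < attempts:
                time.sleep(delay_s)  # give the VM's ssh service time to recover
    raise last_error


# Usage with an existing slice object (not run here):
# retry(slice.post_boot_config, attempts=3, delay_s=30)
```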

                Another thing to do is to look at the log file. By default it is at /tmp/fablib/fablib.log. There might be something in there that hints at what is happening. Be warned that fablib retries the ssh connection a few times on failure, so you may see ssh failures that were resolved.

                If you do see this again, could you try to include any relevant section of the log file in the message?
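To pull out the part of the log worth pasting into a message, a small filter like this may help. It is a sketch that makes no assumption about the log format beyond it being plain text; the keywords, the limit, and the function name are all guesses you should adjust.

```python
from typing import Iterable, List


def matching_log_lines(
    path: str,
    keywords: Iterable[str] = ("ssh", "error", "exception"),
    limit: int = 50,
) -> List[str]:
    """Return the last `limit` lines that mention any keyword (case-insensitive)."""
    wanted = [keyword.lower() for keyword in keywords]
    with open(path, "r", errors="replace") as log_file:
        hits = [
            line.rstrip("\n")
            for line in log_file
            if any(keyword in line.lower() for keyword in wanted)
        ]
    return hits[-limit:]


# Usage (not run here):
# for line in matching_log_lines("/tmp/fablib/fablib.log"):
#     print(line)
```

Keep in mind that some of the ssh failures it finds will be ones fablib already retried and resolved.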

                in reply to: modifying device properties #2359
                Paul Ruth
                Keymaster

                  Ah… there is a typo… “get_intefaces()” is missing an ‘r’.

                  It should be:
                  [ifaceRouterC,ifaceRouterS] = nodeRouter.add_component(model="NIC_ConnectX_5", name="cx5_nic").get_interfaces()

                  • This reply was modified 2 years, 7 months ago by Paul Ruth.
                  in reply to: modifying device properties #2357
                  Paul Ruth
                  Keymaster

                    I think the forum markup messed up the quotes in the code snippet I sent. Paste it in and re-type the quotes. It will work.

                    Also, that error is an unnecessary exception that is thrown by the currently deployed version of fablib when you have an interface that is not attached to a network. In your case, it is caused by the second port of the connectx-5’s on your nodes. You can safely ignore the error and it will work fine. Basically, fablib gets confused when it tries to configure an interface that you are not using. This error will be suppressed in the next version of fablib.

                    in reply to: modifying device properties #2352
                    Paul Ruth
                    Keymaster

                      Yeah, that slice request requires 4 connectx-5’s and will need to be aware of their availability.

                      I have a few observations that may help:

                      – All 3 of your nodes are being sent to the same site.  You might try putting them on different sites.  From your perspective it will work about the same.  The main difference will be that the latency between nodes is greater.

                      – If you only need low level configuration for some interfaces, you could mix-and-match NIC_ConnectX_5 with NIC_Basic. For example, maybe your router needs a ConnectX_5 but the nodes can use a NIC_Basic (or the other way around).

                      – Your router node is asking for 2 connectx-5’s.  Each connectx-5 has two ports. If you really only need two ports you can use 1 connectx-5 for your router. The code for that will look something like:

                      connectx5_interfaces = nodeRouter.add_component(model="NIC_ConnectX_5", name="cx5_nic").get_interfaces()
                      ifaceRouterC = connectx5_interfaces[0]
                      ifaceRouterS = connectx5_interfaces[1]

                      or you can shorten it to:
                      [ifaceRouterC,ifaceRouterS] = nodeRouter.add_component(model="NIC_ConnectX_5", name="cx5_nic").get_interfaces()

                       

                      • This reply was modified 2 years, 7 months ago by Paul Ruth.
                      in reply to: modifying device properties #2349
                      Paul Ruth
                      Keymaster

                        Your project did not have permissions required to use smart NICs.  I just added the permissions.  Please try again.

                        in reply to: modifying device properties #2344
                        Paul Ruth
                        Keymaster

                          I think that might be a side effect of NIC_Basic’s being SR-IOV VFs with limited low-level configuration available. You will have a bit more control if you use FABRIC’s dedicated NICs.

                          I tried running that ethtool command on a NIC_ConnectX_5 and it worked. I think the lower level control you need requires NIC_ConnectX_5 or NIC_ConnectX_6 NICs.

                          Let me know if that works for you.

                          Paul

                          in reply to: cannot login to reserved nodes #2341
                          Paul Ruth
                          Keymaster

                            Yeah, that’s interesting. We probably don’t want passphrases stored in notebooks.  There is an update to fablib coming in a week or so that should streamline a lot of these config issues.  Part of it includes creating a fabric_rc file and a more sophisticated ssh config file.  Together, these will remove nearly all of the env vars and other config from the notebooks. I will note that we should probably do something clever with the notebook that creates the config files so that passphrases don’t get stored in them.

                            Also, both paths in the upload call should be complete paths, including the file names. Like this:

                            node1.upload_file("/home/fabric/work/test_file", "/home/rocky/test_file")
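If you only have the destination directory handy, you can build the full remote path from the local file name before calling upload_file. This is a small sketch using the standard library; the helper name is made up, and it assumes the remote VM uses POSIX-style paths.

```python
import os
import posixpath


def remote_file_path(local_path: str, remote_dir: str) -> str:
    """Join the remote directory with the local file's name."""
    return posixpath.join(remote_dir, os.path.basename(local_path))


local = "/home/fabric/work/test_file"
print(remote_file_path(local, "/home/rocky"))

# Usage with an existing node object (not run here):
# node1.upload_file(local, remote_file_path(local, "/home/rocky"))
```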

                             

                            in reply to: Unable to pull github repository #2336
                            Paul Ruth
                            Keymaster

                              Github is one of the few major sites that doesn’t work with IPv6.

                              I would try the nat64.net option. The only real negative is that it is a free public service and could disappear someday.  In practice we have found it to work quite well.

                              One other option is to create a tarball of whatever you need to install and just transfer it in manually.
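For the tarball route, something along these lines can be run wherever you already have a copy of the repository. It is a sketch using Python's standard tarfile module; the function name and the paths in the usage comment are placeholders.

```python
import os
import tarfile


def make_tarball(source_dir: str, tarball_path: str) -> str:
    """Pack source_dir into a gzip-compressed tarball and return its path."""
    # Store files under the directory's own name rather than its full path.
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tarball_path, "w:gz") as archive:
        archive.add(source_dir, arcname=top_level)
    return tarball_path


# Usage (placeholder paths, not run here):
# make_tarball("/home/fabric/work/my_repo", "/home/fabric/work/my_repo.tar.gz")
```

You can then copy the tarball to the VM (for example with the upload_file call shown in other replies) and unpack it there with tar -xzf.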

                              in reply to: cannot login to reserved nodes #2335
                              Paul Ruth
                              Keymaster

                                Oh, I misunderstood.

                                I think I saw in one of your other messages that you were using keys with passphrases.  I suspect that the handling of the passphrase is the issue here. Or maybe one (or both) of your keys doesn’t match our requirements exactly.

                                I’ll need to create some keys with passphrases to check and make sure that still works. I suspect that if passphrases didn’t work I would have heard complaints by now.  While I’m doing that, could you try some simple non-passphrase key pairs?  For the bastion key, can you let the portal generate the key?  This would help narrow down where we need to look.

                                For reference, the key reqs are here: https://learn.fabric-testbed.net/knowledge-base/logging-into-fabric-vms/#ssh-keypair-primer-creating-identifying-fingerprinting-keypairs

                                 

                                in reply to: get_physical_os_interface()[‘ifname’] failed #2330
                                Paul Ruth
                                Keymaster

                                  Those are all great resources.

                                  mtu = 9000 (jumbo frames) is important too.

                                  With 100G NICs this part from fasterdata.es.net is important too:

                                  We also strongly recommend reducing the maximum flow rate to avoid bursts of packets that could overflow switch and receive host buffers.
                                  
                                  For example for a 10G host, add this to a boot script:
                                  
                                  /sbin/tc qdisc add dev ethN root fq maxrate 8gbit
                                  For a host running data transfer tools that use 4 parallel streams, do this:
                                  
                                  /sbin/tc qdisc add dev ethN root fq maxrate 2gbit
                                  Where 'ethN' is the name of the ethernet device on your system.
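To scale those numbers to a different NIC speed or stream count, a helper like this can generate the command. It is a sketch: the 80% headroom factor is my reading of the two quoted examples (8gbit on a 10G host, 2gbit per stream with 4 streams), not an official rule, and the function name is made up.

```python
def pacing_command(device: str, nic_gbps: float, streams: int = 1,
                   headroom: float = 0.8) -> str:
    """Build a tc command that caps each flow at a share of the NIC speed."""
    if streams < 1:
        raise ValueError("streams must be at least 1")
    # Assumption: leave 20% headroom, then split the rest evenly across streams.
    maxrate_mbit = round(nic_gbps * 1000 * headroom / streams)
    return f"/sbin/tc qdisc add dev {device} root fq maxrate {maxrate_mbit}mbit"


print(pacing_command("ethN", 10))              # same rate as the 10G example
print(pacing_command("ethN", 10, streams=4))   # same rate as the 4-stream example
print(pacing_command("ethN", 100, streams=4))  # scaled to a 100G NIC
```

The rates are printed in mbit so the value stays a whole number; replace 'ethN' with the name of your data-plane interface.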

                                   

                                   

                                   

                                   

                                FABRIC invites nominations for four awards recognizing innovative uses of FABRIC resources—Best Published Paper, Best FABRIC Matrix, Best FABRIC Experiment, and Best Classroom Use of FABRIC — submissions due by **Monday, February 24 at 11:59 PM ET**, and winners announced at KNIT10. [>>>Submit Form](https://docs.google.com/forms/d/e/1FAIpQLSeTp3i2iDhB7bHgN8ryMxZci8ya87yjeQd7_JMZImUodNinVA/viewform)

                                KNIT10 Call for Demos Now Open! Submit your demo by **February 24**. [>>>Submit Demo](https://docs.google.com/forms/d/e/1FAIpQLScRIWqHliNP3DFWBCnalYN_fBXJXVM0PpP9YWWJdSebC95TvA/viewform)