Paul Ruth

Forum Replies Created

Viewing 15 posts - 31 through 45 (of 274 total)
  • in reply to: exceptions when adding a node to an existing slice #4206
    Paul Ruth
    Keymaster

      Try doing the following at the beginning of that cell. I think you just need to pull a fresh copy of the slice before you modify it.

      slice=fablib.get_slice(<your_slice_name>)
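A minimal sketch of that refresh-then-modify pattern, assuming FABlib's `get_slice`, `add_node`, and `submit` calls. The helper name and the slice, node, and site names are placeholders for illustration, not taken from the original notebook.

```python
def add_node_to_existing_slice(fablib, slice_name, node_name, site):
    """Fetch a fresh copy of the slice, then add a node and submit the change."""
    # Pull the current state of the slice instead of reusing a stale object
    # left over from an earlier notebook cell.
    my_slice = fablib.get_slice(name=slice_name)

    # Modify the freshly loaded slice and submit the update.
    node = my_slice.add_node(name=node_name, site=site)
    my_slice.submit()
    return node
```

In a notebook, `fablib` would be the usual FABlib manager object, for example `add_node_to_existing_slice(fablib, "MySlice", "node2", "TACC")`.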

       

      in reply to: KeyError When creating a Slice #4051
      Paul Ruth
      Keymaster

        That looks like a bug. We’ll fix it.

        One workaround is to use the NIC by adding it to a network.

        thanks,

        Paul

         

        in reply to: Local NAS storage/VM #4031
        Paul Ruth
        Keymaster

          Thanks. We will update that. It probably won’t be the default until the end of the semester so that we can keep everything stable for educational users.

          Paul Ruth
          Keymaster

            Dedicated NICs are either ConnectX-5 or ConnectX-6 (i.e. anything other than Basic NICs).

             

            in reply to: Network to GPU Hardware Support #4021
            Paul Ruth
            Keymaster

              I’m sure the hardware supports this.  The trick might be in how to handle it in a VM with devices using PCI passthrough.   I’m not sure anyone has tried this yet.

              Do you know how to do this on a non-virtualized machine?

              in reply to: L2Bridge without MAC learning? #4020
              Paul Ruth
              Keymaster

                Ezra and I looked at this a while back and it seemed that the Mellanox cards were not handling these frames the way we expected given the config options that we used. We’ll need to revisit this. I’ll get back to you about this.

                thanks,

                Paul

                in reply to: go does not work with node.execute API #4019
                Paul Ruth
                Keymaster

                  Each of those execute calls creates a separate ssh session that runs the command. This means that each of the calls runs in its own shell, and the result of any previous ‘export’ or ‘source’ calls will not be reflected in the environment of a new call.

                  In this case you are putting everything in the .bashrc file, so I would think that the new call would source the .bashrc file when the shell is created, but maybe there is some other piece of the environment missing.

                  Try putting that all in one call, something like this:

                  node.execute("curl -O -L 'https://golang.org/dl/go1.19.5.linux-amd64.tar.gz' ; "
                               "tar -xf go1.19.5.linux-amd64.tar.gz ; "
                               "sudo mv -v go /usr/local ; "
                               # Single quotes keep $HOME, $PATH and $GOPATH literal, so they
                               # are expanded when .bashrc is sourced, not when echo runs.
                               "echo 'export GOPATH=$HOME/go' >> ~/.bashrc ; "
                               "echo 'export PATH=$PATH:/usr/local/go/bin:$GOPATH/bin' >> ~/.bashrc ; "
                               "source ~/.bashrc ; "
                               "go install github.com/named-data/YaNFD/cmd/yanfd@latest")
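If sourcing .bashrc still does not give the expected environment (some default .bashrc files return early in non-interactive shells), one option is to export the variables directly in the same command. The sketch below only builds the command string. The helper name and its defaults are illustrative assumptions, and the resulting string would be passed to the same `node.execute` call.

```python
def build_go_install_command(go_version="1.19.5",
                             package="github.com/named-data/YaNFD/cmd/yanfd@latest"):
    """Return one shell command line that installs Go and then a Go package."""
    tarball = f"go{go_version}.linux-amd64.tar.gz"
    steps = [
        f"curl -O -L https://golang.org/dl/{tarball}",
        f"tar -xf {tarball}",
        "sudo mv -v go /usr/local",
        # Set the variables in this shell directly; nothing depends on .bashrc.
        "export GOPATH=$HOME/go",
        "export PATH=$PATH:/usr/local/go/bin:$GOPATH/bin",
        f"go install {package}",
    ]
    # '&&' stops the chain at the first failing step.
    return " && ".join(steps)


command = build_go_install_command()
# In a FABlib notebook this would be run as: node.execute(command)
```

Because every step runs in one ssh session, the exported PATH is still in effect when `go install` runs.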
                  
                  
                  in reply to: Error in exchanging ospf protocol routes #4003
                  Paul Ruth
                  Keymaster

                    This is great. Note that I do have FRR/OSPF working without issue using shared/basic NICs.  I’m using the Rocky image and FRR docker image.  I’m not sure what is preventing your config from working.

                    The FRR/OSPF notebooks I am putting together are based on a yet-to-be-released version of FABlib.  I’ll release the example when the new FABlib is released.

                    Paul

                    in reply to: Deploying a containerized application #4002
                    Paul Ruth
                    Keymaster

                      Are you asking to expose services to the public Internet? In general, we don’t want to expose internal FABRIC slices to the Internet. Our experience running testbeds has taught us that it’s too easy for slices to become compromised.

                      This is why we have the bastion host and require you to use ssh tunnels or a proxy to intentionally expose internals. If a reverse proxy works for you then you can keep using it. If you do need to have a service exposed more generally, we can support that too. However, we will need to know more about what you want to do and how you plan to keep it secure.
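As a rough illustration of the ssh tunnel approach, the sketch below builds a standard OpenSSH command that jumps through a bastion (`-J`) and forwards a local port to a service on the VM (`-L`). The helper name, host names, user names, and ports are all hypothetical placeholders.

```python
import shlex


def build_tunnel_command(bastion, vm, local_port, remote_port):
    """Build an OpenSSH command forwarding a local port to a service on the VM."""
    args = [
        "ssh",
        "-N",            # no remote command, tunnel only
        "-J", bastion,   # jump through the bastion host
        "-L", f"{local_port}:localhost:{remote_port}",  # local port -> VM port
        vm,
    ]
    return shlex.join(args)


tunnel = build_tunnel_command("me@bastion.example.org", "ubuntu@vm.example.org", 8080, 5000)
```

With the tunnel running on your laptop, the service listening on port 5000 in the VM would be reachable at `localhost:8080` without being exposed to the public Internet.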

                       

                      Paul Ruth
                      Keymaster

                        I think this might be an issue we have with Basic_NICs.  The physical NICs sometimes filter legal L2 frames that should be allowed through.  We are working on a solution but it might be a firmware issue with the NICs.

                        Can you try this with dedicated NICs?

                         

                        in reply to: Deploying a containerized application #3987
                        Paul Ruth
                        Keymaster

                          You should be able to install just about any software you want in a FABRIC VM.   I don’t know anything about Flask, but you should be able to set it up.

                          Can you describe what you have tried and where you are stuck?

                          in reply to: Error in exchanging ospf protocol routes #3986
                          Paul Ruth
                          Keymaster

                            Sorry for the delayed response.

                            The short answer is that there is no filtering on any of the network services, and I have reliably set up OSPF using Rocky VMs.

                            Are you saying that the VM receives the OSPF packets but does not send a response?  If so, this seems like an issue with the OSPF config in the VM.

                            Can you share the notebook you are using?

                            Paul

                            in reply to: Bandwidth on FABRIC links #3963
                            Paul Ruth
                            Keymaster

                              Yes, that is correct.

                              It might be useful to add that with the SR-IOV NICs the virtual switching in the host is done in hardware. In most other cloud-ish systems the multi-tenant switching is done in software. These software switches use a lot of CPU, which causes performance interference between the computation and networking (even between experiments). The SR-IOV solution should isolate the performance a bit more.

                              Also, it should be rare that there would be enough other active network-heavy experiments for you to see less than 10 or 20Gbps on the Basic NICs (I’m interested in testing this when we have more users). I think you could get a lot of good experiments done by using tc to rate limit your own traffic to some modest amount (10-20Gbps). I suspect a lot of the network contention would be from your own experiment. Then, when you are ready, we could move you to a dedicated NIC and you could try higher bandwidths.
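One way to do that rate limiting is a token bucket filter (`tbf`) qdisc on the experiment interface. The sketch below only builds the `tc` command string. The helper name and interface name are placeholders, and the burst and latency values are illustrative guesses that would need tuning at these speeds.

```python
def build_rate_limit_command(interface, rate_gbit, burst="1mb", latency="50ms"):
    """Return a tc command that caps egress traffic on one interface."""
    # 'replace' installs the qdisc, or swaps out an existing root qdisc.
    return (f"sudo tc qdisc replace dev {interface} root tbf "
            f"rate {rate_gbit}gbit burst {burst} latency {latency}")


limit_cmd = build_rate_limit_command("eth1", 10)
# In a FABlib notebook this could be run as: node.execute(limit_cmd)
```

The limit applies to traffic leaving that interface, so each sending node in the experiment would need its own limit.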

                              in reply to: Bandwidth on FABRIC links #3961
                              Paul Ruth
                              Keymaster

                                The Basic NICs/VFs are all best effort through the NIC itself with a cap of 100Gbps shared between all VFs on a physical host.  It is possible that each would max out at ~780Mbps but it is extremely unlikely.  In practice you will likely see 10s of Gbps.  Currently, your max bandwidth will likely be very close to 100Gbps.
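The ~780Mbps worst case looks like an even split of the shared 100Gbps cap across every VF. The arithmetic can be sketched as below; the count of 128 VFs is inferred from that figure and is an assumption, not something stated in the post.

```python
def fair_share_mbps(total_gbps, active_vfs):
    """Even split of a shared port cap across simultaneously saturated VFs, in Mbps."""
    return total_gbps * 1000 / active_vfs


worst_case = fair_share_mbps(100, 128)  # assumed VF count, see note above
few_active = fair_share_mbps(100, 4)    # only a handful of busy VFs
```

With only a few VFs busy at the same time, the same split gives tens of Gbps each, which matches what you are likely to see in practice.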

                                If you want dedicated bandwidth, you will need to reserve dedicated NICs.   These NICs are dedicated to your experiment and are limited only by their hardware.  When QoS reservations are available they will be on the WAN links between the sites. Most of these links will be 100Gbps that can be divided and allocated to individual experiments.  The “super core” links will be 1200Gbps and can be divided and allocated as well.   One of the main uses of the “super core” will be to allocate many dedicated 100Gbps QoS links to individual experiments.

                                in reply to: Bandwidth on FABRIC links #3956
                                Paul Ruth
                                Keymaster

                                  Bandwidth QoS provisioning of WAN links is still being developed.  Stay tuned.

                                  • Basic NICs: The existing Basic NICs are implemented as SR-IOV virtual functions on a 100Gbps ConnectX-6.  The only limitation is that the bandwidth is shared with the other Basic NICs on that port.
                                  • ConnectX-6/5s: The dedicated ConnectX-6s come with 2 100Gbps ports while the ConnectX-5s have 2 25Gbps ports.  The dedicated ConnectX-6/5s are fully dedicated to a single VM and have full bandwidth to the switch.

                                  Currently, there is little competition for bandwidth and you can see very nearly the full bandwidth in most cases (even with Basic NICs).  This is especially true for connections that stay within a site.

                                  WAN links vary in performance. Eventually, they will nearly all be on 100+ Gbps L1 connections owned by FABRIC. However, they are currently being deployed as fast as we can. In the meantime, many of the links are AL2S or other L2 services while we wait for the real links to be deployed. You will likely see lower bandwidth on these links.

                                  Also, there are some quirks we are trying to work out where some SR-IOV NICs occasionally only get 25-30Gbps. It seems like they are being left in a weird state by a previous experiment. We are trying to figure out how to detect and reset these cases.

                                  Generally, you should expect at least 25Gbps and will often get close to 100Gbps. Note that in order to get these speeds you will need a bit more memory and cores than the default. Also, the app will need to be multi-threaded, and many tools like iperf3 are single-threaded even if you use ‘-P’ (https://fasterdata.es.net/performance-testing/network-troubleshooting-tools/iperf/multi-stream-iperf3/).
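A common workaround for the single-threaded limitation is to run several iperf3 processes in parallel, each on its own port. The sketch below only generates the command lines. The helper name, server address, stream count, and ports are placeholders for illustration.

```python
def build_iperf3_commands(server_ip, streams=4, base_port=5201, seconds=30):
    """Return (server_cmds, client_cmds) with one iperf3 process per port."""
    # One daemonized server per port on the receiving node.
    server_cmds = [f"iperf3 -s -D -p {base_port + i}" for i in range(streams)]
    # One client per port on the sending node.
    client_cmds = [f"iperf3 -c {server_ip} -p {base_port + i} -t {seconds}"
                   for i in range(streams)]
    return server_cmds, client_cmds


server_cmds, client_cmds = build_iperf3_commands("10.0.0.2", streams=2)
# Start all clients in the background at once, then wait for them to finish.
parallel_client = " & ".join(client_cmds) + " & wait"
```

The clients have to run at the same time for this to help, and the aggregate throughput is the sum of what the individual processes report.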

                                   

                                  • This reply was modified 1 year, 10 months ago by Paul Ruth.