Forum Replies Created
Try doing the following at the beginning of that cell. I think you just need to pull a fresh copy of the slice before you modify it.
slice = fablib.get_slice(name="<your_slice_name>")
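For example, a minimal sketch of that cell (the slice and node names are placeholders):

slice = fablib.get_slice(name="MySlice")   # re-fetch a fresh copy of the slice
node = slice.get_node(name="node1")        # hypothetical node name
# ... continue the rest of the cell using this fresh slice object ...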
That looks like a bug. We’ll fix it.
One workaround is to use the NIC by adding it to a network.
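A rough FABlib sketch of that workaround (the component, interface, and network names here are just placeholders):

nic = node.add_component(model="NIC_ConnectX_5", name="nic1")   # or NIC_Basic / NIC_ConnectX_6
iface = nic.get_interfaces()[0]                                 # first port on the NIC
slice.add_l2network(name="net1", interfaces=[iface])            # attach the NIC to an L2 network
slice.submit()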
thanks,
Paul
Thanks. We will update that. It probably won’t be the default until the end of the semester so that we can keep everything stable for educational users.
April 1, 2023 at 4:59 pm, in reply to: P4_bmv2 example is not working when executed on fabrictestbed. #4022
Dedicated NICs are either ConnectX-5 or ConnectX-6 (i.e. anything other than Basic NICs).
I’m sure the hardware supports this. The trick might be in how to handle it in a VM with devices using PCI passthrough. I’m not sure anyone has tried this yet.
Do you know how to do this on a non-virtualized machine?
Ezra and I looked at this a while back, and it seemed that the Mellanox cards were not handling these frames the way we expected given the config options we used. We'll need to revisit this; I'll get back to you about it.
thanks,
Paul
Each of those execute calls creates a separate ssh session that runs the command. This means that each call runs in its own shell, and the result of any previous 'export' or 'source' calls will not be reflected in the environment of a new call.
In this case you are putting everything in the .bashrc file, so I would think that the new call would source .bashrc when the shell is created, but maybe there is some other piece of the environment missing.
Try putting that all in one call, something like this:
node.execute(
    'curl -O -L "https://golang.org/dl/go1.19.5.linux-amd64.tar.gz" ; '
    'tar -xf "go1.19.5.linux-amd64.tar.gz" ; '
    'sudo mv -v go /usr/local ; '
    'echo "export GOPATH=$HOME/go" >> ~/.bashrc ; '
    'echo "export PATH=$PATH:/usr/local/go/bin:$GOPATH/bin" >> ~/.bashrc ; '
    'echo "export PATH=$PATH:/usr/local/go/bin" >> ~/.bashrc ; '
    'source ~/.bashrc ; '
    'go install github.com/named-data/YaNFD/cmd/yanfd@latest'
)
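To illustrate the separate-shell behavior (FOO is just a hypothetical variable):

node.execute('export FOO=bar')                      # runs in its own ssh session/shell
node.execute('echo "FOO=$FOO"')                     # new session: FOO is empty here
node.execute('export FOO=bar; echo "FOO=$FOO"')     # same session: prints FOO=bar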
This is great. Note that I do have FRR/OSPF working without issue using shared/basic NICs. I’m using the Rocky image and FRR docker image. I’m not sure what is preventing your config from working.
The FRR/OSPF notebooks I am putting together are based on a yet-to-be-released version of FABlib. I’ll release the example when the new FABlib is released.
Paul
Are you asking to expose services to the public Internet? In general, we don’t want to expose internal FABRIC slices to the Internet. Our experience running testbeds has taught us that it’s too easy for slices to become compromised.
This is why we have the bastion host and require you to use ssh tunnels or a proxy to intentionally expose internals. If a reverse proxy works for you then you can keep using it. If you do need to have a service exposed more generally, we can support that too; however, we will need to know more about what you want to do and how you plan to keep it secure.
March 24, 2023 at 12:44 pm, in reply to: P4_bmv2 example is not working when executed on fabrictestbed. #3988
I think this might be an issue we have with Basic NICs. The physical NICs sometimes filter legal L2 frames that should be allowed through. We are working on a solution, but it might be a firmware issue with the NICs.
Can you try this with dedicated NICs?
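For reference, a hedged sketch of requesting a dedicated NIC in FABlib (the component name is a placeholder):

node.add_component(model="NIC_ConnectX_6", name="dedicated_nic")   # or model="NIC_ConnectX_5"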
You should be able to install just about any software you want in a FABRIC VM. I don’t know anything about Flask, but you should be able to set it up.
Can you describe what you have tried and where you are stuck?
Sorry for the delayed response.
The short answer is that there is no filtering on any of the network services and I have reliably set up OSPF using rocky VMs.
Are you saying that the VM receives the OSPF packets but does not send a response? If so, this seems like an issue with the OSPF config in the VM.
Can you share the notebook you are using?
Paul
Yes, that is correct.
It might be useful to add that with the SR-IOV NICs the virtual switching in the host is done in hardware. In most other cloud-ish systems the multi-tenant switching is done in software. These software switches use a lot of CPU, which causes performance interference between the computation and networking (even between experiments). The SR-IOV solution should isolate the performance a bit more.
Also, it should be rare that there would be enough other active network-heavy experiments that you would see less than 10 or 20Gbps on the Basic NICs (I’m interested in testing this when we have more users). I think you could get a lot of good experiments done by using tc to rate limit your own traffic to some modest amount (10-20 Gbps); I suspect a lot of the network contention would be from your own experiment. Then, when you are ready, we could move you to a dedicated NIC and you could try higher bandwidths.
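For example, a rough sketch of capping a slice interface with tc from inside the VM (the interface name and numbers are assumptions, not a tested recipe):

node.execute('sudo tc qdisc add dev ens7 root tbf rate 10gbit burst 64mb latency 50ms')   # cap at ~10 Gbps
node.execute('sudo tc qdisc del dev ens7 root')                                           # remove the limit later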
The Basic NICs/VFs are all best effort through the NIC itself with a cap of 100Gbps shared between all VFs on a physical host. It is possible that each would max out at ~780Mbps but it is extremely unlikely. In practice you will likely see 10s of Gbps. Currently, your max bandwidth will likely be very close to 100Gbps.
If you want dedicated bandwidth, you will need to reserve dedicated NICs. These NICs are dedicated to your experiment and are limited only by their hardware. When QoS reservations become available they will apply to the WAN links between the sites. Most of these links will be 100Gbps and can be divided and allocated to individual experiments. The “super core” links will be 1200Gbps and can be divided and allocated as well. One of the main uses of the “super core” will be to allocate many dedicated 100Gbps QoS links to individual experiments.
Bandwidth QoS provisioning of WAN links is still being developed. Stay tuned.
- Basic NICs: The existing Basic NICs are implemented as SR-IOV virtual functions on a 100Gbps ConnectX-6. The only limitation is that the bandwidth is shared with the other Basic NICs on that port.
- ConnectX-6/5s: The dedicated ConnectX-6s come with two 100Gbps ports, while the ConnectX-5s have two 25Gbps ports. The dedicated ConnectX-6/5s are fully dedicated to a single VM and have full bandwidth to the switch.
Currently, there is little competition for bandwidth and you can see very nearly the full bandwidth in most cases (even with Basic NICs). This is especially true for connections that stay within a site.
WAN links vary in performance. Eventually, nearly all of them will be on 100+ Gbps L1 connections owned by FABRIC, but we are deploying them as fast as we can. In the meantime, many of the links are AL2S or other L2 services while we wait for the real links to be deployed. You will likely see lower bandwidth on these links.
Also, there are some quirks we are trying to work out where some SR-IOV NICs occasionally only get 25-30Gbps. It seems like they are being left in a weird state by a previous experiment. We are trying to figure out how to detect and reset these cases.
Generally, you should expect at least 25Gbps and will often get close to 100Gbps. Note that in order to get these speeds you will need a bit more memory and cores than the default. Also, the app will need to be multi-threaded, and many tools like iperf3 are single threaded even if you use ‘-P’ (https://fasterdata.es.net/performance-testing/network-troubleshooting-tools/iperf/multi-stream-iperf3/)
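As a rough illustration of the multi-process approach from that link (the node variables, IP, and ports are placeholders):

server.execute('for p in 5201 5202 5203 5204; do iperf3 -s -p $p -D; done')                                           # one daemonized server per port
client.execute('for p in 5201 5202 5203 5204; do iperf3 -c 10.0.0.1 -p $p -t 30 > iperf_$p.log 2>&1 & done; wait')    # one client process per port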