Assistance completing the “fpga_simple_p4” tutorial notebook
- This topic has 11 replies, 3 voices, and was last updated 2 days, 22 hours ago by Luca Cetino.
October 21, 2024 at 5:37 am #7665
Hello everyone!
I’m getting started with FPGA deployment on the FABRIC testbed so that I can program the FPGAs with custom P4 code and run some experiments.
Since this is new to me, I started by trying to replicate the “fpga_simple_p4” tutorial notebook, but I couldn’t complete it.
I am able to program the FPGA with the pre-built p4_only artifact (an example P4 switching application compiled from the esnet-smartnic-hw/examples git repo) and interact with it through the ESnet-SmartNIC workflow. I then reach the final steps of the notebook: I open and configure the pktgen application as suggested in the tutorial and make it send packets towards the counterpart(s). This is where the issue shows up: I see an initial burst of 2079 packets being transmitted, and then pktgen stops. None of these packets is captured on the counterpart(s). Pktgen also seems a bit buggy, as I can’t quit it either (it gets stuck on the “terminated” message).
I tried almost every US site with FPGAs available. I could only complete the experiment at the DALL site, where this problem didn’t show up. I attach the output from pktgen below. For any additional details about the steps I took, refer to the official notebook tutorial or reach out to me!
Thank you in advance for any help or leads!
Luca
October 22, 2024 at 1:59 pm #7673
Hi Luca,
Could you share the sites where you encountered this issue? I tried CLEM, and it worked fine.
As mentioned here, we collaborate with the experimenter to flash the FPGA with the initial bitstream. We’d like to rule out whether a different bitstream (other than ESnet) was used for flashing the FPGA at the sites where you experienced the problem. Also, if you have the slice up where you see the error, please share your slice ID with us!
Thanks,
Komal
October 23, 2024 at 7:26 am #7674
Hi Komal,
Thank you for your response.
I just created a new slice at SRI and hit the same error there. It is still up, and its ID is 6b02473e-4df5-445b-9dd6-e437a01f78b8.
Previously, I encountered the same issue at KANS, GATECH, FIU, LOSA, and a few other sites.
The only slice that works as expected and lets me see the results of the traffic generation has its FPGA node located in DALL. Its ID is ae3bfdac-dad4-4705-9ad8-fb8e4ab11e30.
On further inspection with the sn-cli tool from the esnet-smartnic-fw image, I can see that no counter in the device’s probe stats is being updated: everything shows 0. This happens both when pktgen is running and when it is not (I used the stack under both profiles, smartnic-mgr-vfio-unlock and smartnic-mgr-dpdk-manual).
I’m available for any other information you need. Meanwhile, thanks again for your support.
Kindly,
Luca
October 23, 2024 at 7:49 am #7675
Thank you, Luca, for sharing the slice information. I will investigate this further and keep you posted.
Could you please extend the slices by at least a week so they don’t expire?
As a first check, the FPGAs at GATECH, FIU, and SRI seem to be flashed with a bitfile compatible with the ESnet workflow. I will check on KANS and LOSA and confirm.
Thanks,
Komal
- This reply was modified 4 weeks, 1 day ago by Komal Thareja.
October 23, 2024 at 7:56 am #7677
Correction: FIU has been flashed with a bitfile compatible with the XDMA shell, so it may not work with the ESnet workflow.
October 23, 2024 at 8:29 am #7678
Thank you, Komal, for your quick response and update.
Would it be possible to have a list of the compatible sites, so I know where it is worth trying to reserve resources?
Also, is it expected that I can proceed through every step of the setup and configuration even if the site’s FPGA has a different bitfile flashed on it? I can’t determine whether that is the issue, since the slice in SRI doesn’t seem to work for me.
Thanks,
Luca
October 23, 2024 at 3:01 pm #7679
Hey Luca,
I was looking at your slice on SRI and noticed that two containers, sn-stack-ubuntu-smartnic-cfg-1 and sn-stack-ubuntu-smartnic-p4-1, are restarting. I suspect that could be the reason for the traffic issue.
Your DALL slice has expired, so I could not check there.
The logs in both containers suggest the FPGA is not ready.
================================================================================
Created self-signed TLS certificate.
issuer=CN = localhost
subject=CN = localhost
notBefore=Oct 23 18:59:34 2024 GMT
notAfter=Oct 23 18:59:34 2025 GMT
X509v3 Subject Alternative Name:
DNS:smartnic-p4, DNS:localhost, DNS:localhost, IP Address:127.0.0.1, DNS:ip6-localhost, IP Address:0:0:0:0:0:0:0:1
================================================================================
Checking for FPGA readiness ... FPGA not ready.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
388c54840920 smartnic-dpdk-docker:ubuntu-dev "/bin/bash -c -e -o …" 3 minutes ago Up 2 minutes sn-stack-ubuntu-smartnic-dpdk-1
76f7a24df81d esnet-smartnic-fw:ubuntu-dev "/bin/bash -c -e -o …" 3 minutes ago Up 2 minutes (healthy) sn-stack-ubuntu-smartnic-devbind-1
b5cca620505d esnet-smartnic-fw:ubuntu-dev "/usr/local/sbin/sn-…" 3 minutes ago Restarting (1) 59 seconds ago sn-stack-ubuntu-smartnic-cfg-1
9dbb7262d5d6 esnet-smartnic-fw:ubuntu-dev "/bin/bash -c -e -o …" 3 minutes ago Up 2 minutes sn-stack-ubuntu-smartnic-fw-1
380fcc8ad614 esnet-smartnic-fw:ubuntu-dev "/usr/local/sbin/sn-…" 3 minutes ago Restarting (1) 59 seconds ago sn-stack-ubuntu-smartnic-p4-1
a3972a1c0ce9 xilinx-labtools-docker:ubuntu-dev "/entrypoint.sh /bin…" 3 minutes ago Up 3 minutes (healthy) sn-stack-ubuntu-smartnic-hw-1
352f70e7da43 xilinx-labtools-docker:ubuntu-dev "/entrypoint.sh /bin…" 3 minutes ago Up 3 minutes (healthy) 3121/tcp sn-stack-ubuntu-xilinx-hwserver-1
4295b131ccd8 esnet-smartnic-fw:ubuntu-dev "/bin/bash -c -e -o …" 3 minutes ago Up 3 minutes sn-stack-ubuntu-smartnic-unpack-1
e9a84041e44e esnet-smartnic-fw:ubuntu-dev "/bin/bash -c -e -o …" 3 minutes ago Up 3 minutes sn-stack-ubuntu-xilinx-sc-console-1
Dev bind is successful:
No 'Regex' devices detected
===========================
+ lspci -D -kvm -s 0000:1f:00.0
+ grep '^Driver: vfio-pci'
Driver: vfio-pci
+ lspci -D -kvm -s 0000:1f:00.1
+ grep '^Driver: vfio-pci'
Driver: vfio-pci
+ touch /status/ok
+ sleep infinity
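For anyone who wants to run the same check on their own slice, here is a minimal Python sketch (not part of the ESnet tooling) that flags containers stuck in a restart loop. It assumes Docker is reachable on the FPGA host and uses the docker ps --format option to get one tab-separated name and status per line. The function names are made up for illustration.

```python
import subprocess
from typing import List


def restarting_containers(ps_output: str) -> List[str]:
    """Return names of containers whose status starts with 'Restarting'.

    Expects one container per line, name and status separated by a tab.
    """
    names = []
    for line in ps_output.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        name, status = line.split("\t", 1)
        if status.strip().startswith("Restarting"):
            names.append(name.strip())
    return names


def docker_ps() -> str:
    """Run docker ps on the local host and return its raw output."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On the host, restarting_containers(docker_ps()) would list the affected containers; with the output above it should report the cfg and p4 containers.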
October 24, 2024 at 5:37 am #7680
Hi Komal, thank you for looking into what the error could be.
Indeed, the restarting containers you pointed out were related to pktgen not working correctly. The cause was that the stack was launched under the “smartnic-mgr-dpdk-manual” profile, which locks the FPGA and prevents any interaction with it unless the pktgen application is running. I solved this (on the slice in SRI) by launching pktgen and then manually restarting those two containers, which can now detect the FPGA as ready. Thanks to this, pktgen is actually working now, but the FPGA behavior is not what I would expect: I’m able to configure and start pktgen, but no packets are received at all on the other node(s).
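The restart step described above can be sketched in Python as follows. This is only an illustration under assumptions: it uses the two container names from the docker ps output earlier in the thread, assumes pktgen has already been started by hand, and defaults to a dry run that just returns the command. The helper names are hypothetical.

```python
import subprocess
from typing import List

# Container names taken from the docker ps output earlier in this thread.
STUCK_CONTAINERS = [
    "sn-stack-ubuntu-smartnic-cfg-1",
    "sn-stack-ubuntu-smartnic-p4-1",
]


def restart_command(containers: List[str]) -> List[str]:
    """Build the docker restart command line for the given containers."""
    if not containers:
        raise ValueError("no containers given")
    return ["docker", "restart", *containers]


def restart(containers: List[str], dry_run: bool = True) -> List[str]:
    """Return the command; execute it only when dry_run is False."""
    cmd = restart_command(containers)
    if not dry_run:
        # Assumes pktgen is already running, so the FPGA is unlocked.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling restart(STUCK_CONTAINERS, dry_run=False) on the host would perform the actual restart.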
I programmed the FPGA internal paths (using the sn-cli tool) as described in the Jupyter notebook, completely bypassing the P4 logic, but the traffic is still not forwarded correctly from host0/host1 to cmac0/cmac1.
I’d like to follow the path taken by the traffic inside the SmartNIC, but the command sn-cli probe stats gives me all 0s. It would be hugely helpful for checking whether and where the packets are dropped by the card, and for getting a better idea of what the solution could be.
I am currently working on a similarly configured slice in LOSA + SEAT (ID: b73c3d3b-86ae-428e-8f4a-40095f8d36ec). There the packets don’t get lost, but the probe is not working either.
Thank you very much for your time and your support!
Luca
October 24, 2024 at 5:08 pm #7683
Hi Luca,
Not much luck with it! I can reproduce what you are observing on CLEM but haven’t found a resolution yet. However, I did notice that when I start pktgen and all the containers are up and running, I keep seeing the following error in the container sn-stack-ubuntu-smartnic-cfg-1. The probe reports a few drops as soon as I start pktgen, but after that it just keeps reporting all 0s.
Checking for FPGA readiness ... FPGA ready.
Starting server: sn-cfg-agent server --tls-cert-chain=/etc/letsencrypt/fullchain.pem --tls-key=/etc/letsencrypt/privkey.pem 0000:1f:00.0
--- PCI bus IDs:
------> 0000:1f:00.0
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
--- UTC start time: 2024-10-24 20:33:02 +0000 [1729801982s.278712702ns]
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
agent_server_run: Serving on [::]:50100
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
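Since the same line repeats many times, a quick tally makes the log easier to compare across runs. Below is a minimal Python sketch, assuming the log text has been captured from the cfg container (for example with docker logs). The regular expression is inferred only from the lines shown above, and summarize_errors is an illustrative name.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
# ERROR(cms_mailbox_post)[5 (Input/output error)]: packet error
ERROR_LINE = re.compile(
    r"^ERROR\((?P<func>\w+)\)\[(?P<errno>\d+) \((?P<text>[^)]*)\)\]: (?P<msg>.*)$"
)


def summarize_errors(lines: Iterable[str]) -> Counter:
    """Count ERROR lines, keyed by (function, errno, message)."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            key = (match["func"], int(match["errno"]), match["msg"])
            counts[key] += 1
    return counts
```

Feeding it the log above would give a single key, ("cms_mailbox_post", 5, "packet error"), with the number of occurrences.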
Thanks,
Komal
October 29, 2024 at 4:32 am #7726
Hello Komal!
Thank you for your last update. I needed some time to try to figure out what is wrong with the Docker stack and why those packets between containers keep getting lost.
I wasn’t able to find the reason. However, I found another useful tool that seems to report the traffic stats on the FPGA correctly: sn-cfg --tls-insecure show switch stats --zeroes, run from inside the smartnic-fw container.
It helped me trace explicitly where the packets were going, and once or twice I was finally able to reconfigure the SmartNIC switch so that the traffic was forwarded correctly.
One last error I’d like to understand before closing this thread: at a certain point the FPGA starts losing every packet that comes from the network. These packets are reported in drops_ovfl_from_cmac_0_pkt_count. The CMACs’ status is UP since they’re enabled, and they have the standard configuration with one queue per physical function (which isn’t related to the CMACs). However, I can’t control the queues for the CMACs or their depths.
That can be observed in my current slice with id b73c3d3b-86ae-428e-8f4a-40095f8d36ec.
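One way to see which counters move while traffic is flowing is to take two snapshots of the stats and diff them. The Python sketch below assumes the counters have already been collected into name-to-value dictionaries; parsing the sn-cfg output is left out on purpose, since its exact format is not shown in this thread. The function name, the numbers, and the second counter name are hypothetical.

```python
from typing import Dict


def counter_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return counters whose value changed between two snapshots."""
    deltas = {}
    for name, new_value in after.items():
        # A counter missing from the first snapshot is treated as 0.
        delta = new_value - before.get(name, 0)
        if delta != 0:
            deltas[name] = delta
    return deltas


# Hypothetical values, for illustration only.
before = {"drops_ovfl_from_cmac_0_pkt_count": 100, "some_other_counter": 7}
after = {"drops_ovfl_from_cmac_0_pkt_count": 2179, "some_other_counter": 7}
changed = counter_deltas(before, after)
```

With these made-up snapshots, only the overflow drop counter appears in the result, which is the pattern described above.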
What could that be due to? Any suggestion would be very helpful!
Thank you so much for your time and your support so far.
Best regards,
Luca.
October 29, 2024 at 5:14 pm #7734
Hello all,
Luca, one thing we’ve noticed is that if you keep the same compiled P4 logic loaded for long periods (even if you’re bypassing it and not using it), it will eventually end up in a “stale” state, and the FPGA will drop all traffic. The simplest fix is to have an alternative logic compiled (for example, the p4_only example and an Ethernet packet forwarder) and use the second as a palate cleanser when the first stops working. All you have to do is bring up the stack with the second logic, then bring it down and bring it up again with the first.
In all the pktgen issues we’ve encountered so far, this has been the cause. In most cases it shows up as the 2079-packet issue; in others, the card drops all packets, or drops them at an increasing rate until 100% are dropped.
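The sequence above can be written down as a small Python sketch. It does not assume any particular command for starting or stopping the stack: bring_up and bring_down are hypothetical hooks that you would fill in with whatever you normally use, and any stack that is currently running is assumed to have been stopped beforehand.

```python
from typing import Callable


def refresh_logic(
    bring_up: Callable[[str], None],
    bring_down: Callable[[], None],
    primary: str,
    alternate: str,
) -> None:
    """Cycle through the alternate logic, then reload the primary one.

    bring_up and bring_down are hypothetical hooks: plug in whatever you
    use to start and stop the stack with a given artifact.
    """
    bring_up(alternate)  # load the "palate cleanser" logic
    bring_down()         # stop the stack again
    bring_up(primary)    # reload the logic that had gone stale
```

Keeping the steps in one function makes it harder to forget the bring-down in the middle when doing this repeatedly.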
November 18, 2024 at 10:27 am #7810
Hello all!
First of all, thank you both for your support and for sharing your tips and knowledge on how best to proceed.
Thanks to you, I was able to make progress and solve the issues that I originally described in this thread.
However, I’d like to follow up on your last response, @Mohammad, to better understand what could be going on in the situation you described.
As of today, I have found at least one case where the solution you proposed (re-flashing the FPGA with a new artifact to get it out of its possible “stale” state) does not work. You can observe it in the last slice I tested my artifacts on (slice ID: b90e6974-0134-4763-ab07-ab67b9a0ff16): the FPGA in SRI is not working even though I recently programmed it with different artifacts. Again, I observe the strange behavior from the pktgen application (it sends only 2079 packets and then stops by crashing, not even letting the user close it). Also, no packets are traced by the FPGA probes, and none are actually forwarded by the switch I programmed it with.
Thank you for any further leads.
Best regards,
Luca