Forum Replies Created
@Yoursunny thank you, but again, running DPDK was never the issue. I was in fact one of the first people to run DPDK in Fabric environments, and at Paul's request I wrote a tutorial, with instructions and a template Jupyter notebook, on setting up and configuring DPDK in Fabric slices for future users.
The issue was something else: it turned out to be a missing feature in the Mellanox NICs/drivers themselves.
Hi Ilya, thanks for the reply. My question was not about using/installing DPDK, but rather about a specific set of capabilities that come with a DPDK-enabled NIC. It turns out that Mellanox does not in fact implement the capabilities I was asking about (QoS traffic management and traffic shaping), hence the error. On my local setup I had Intel SmartNICs, which do have these capabilities.
I'll fill out the request form too if required; thanks for pointing me to it.
It seems I would need the SmartNIC tag to provision a VM with a SmartNIC for my project. Could you kindly help with that? Thank you.

Thanks all, I then opted for the RDMA equivalent of iperf (ib_write_bw) and was able to get close to 95 Gbps.
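For anyone trying the same measurement later, below is a minimal sketch of how ib_write_bw can be driven between two slice nodes from a fablib notebook. The slice/node names, the mlx5_0 device, and the 10.0.0.2 dataplane address are placeholders, and it assumes the perftest package is already installed on both nodes.

from fabrictestbed_extensions.fablib.fablib import FablibManager

# Minimal sketch: RDMA bandwidth test with perftest's ib_write_bw.
# "my_slice", the node names, mlx5_0, and 10.0.0.2 are placeholders.
fablib = FablibManager()
slice = fablib.get_slice(name="my_slice")
server = slice.get_node(name="Node_0")
client = slice.get_node(name="Node_1")

# Assumes perftest is installed on both nodes (e.g. via apt).
# Start the server side in the background, then run the client against
# the server's dataplane IP and print the reported bandwidth.
server.execute("nohup ib_write_bw -d mlx5_0 --report_gbits > ib_server.log 2>&1 &")
stdout, stderr = client.execute("ib_write_bw -d mlx5_0 --report_gbits 10.0.0.2")
print(stdout)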
Hi all, running iperf I see that the maximum bandwidth I can get between VMs is around 19-20 Gbps. Is this a limit set by Fabric, and is there any way to increase it?
Also, is there any way to check the drop statistics etc. on the switch, or any way to directly connect two VMs? I am seeing some packet drops, and from the NIC counters it appears the drops are occurring in the switch… Any suggestions on how to avoid that, other than congestion control? Maybe some configuration on the switch side?
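For what it's worth, this is roughly the kind of check that can be scripted from the notebook: run iperf3 between two nodes and then grep the NIC driver counters for drops. The node names, the slice name, the dataplane IP 10.0.0.2, and the device name ens7 are placeholders for whatever your slice actually uses.

from fabrictestbed_extensions.fablib.fablib import FablibManager

# Sketch: measure VM-to-VM throughput with iperf3 and check driver-level
# drop/discard counters. Node names, 10.0.0.2, and ens7 are placeholders.
fablib = FablibManager()
slice = fablib.get_slice(name="my_slice")
node_a = slice.get_node(name="Node_0")
node_b = slice.get_node(name="Node_1")

# iperf3 server in the background on one node, multi-stream client on the other
node_b.execute("nohup iperf3 -s > iperf3_server.log 2>&1 &")
stdout, stderr = node_a.execute("iperf3 -c 10.0.0.2 -P 8 -t 30")
print(stdout)

# NIC/driver counters on the sender's dataplane interface
stdout, stderr = node_a.execute("ethtool -S ens7 | grep -iE 'drop|discard'")
print(stdout)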
Thanks Ilya, yes, I am not using the management interfaces; I just wanted to be clear about what I am interested in, which is why I mentioned them.
@Paul – Yes, it would be great to know whether the switches at the MAX and TACC sites are 5500s or 5700s, thanks.
Additionally, just to be clear, I am interested in the connectivity and topology of the ConnectX NICs that I assigned IPs to myself and connected using the "slice.add_l2network(name=network_name, interfaces=ifaces)" call, not the management connections. So, in that case, are the nodes simply connected by a VLAN or something similar on top of the Cisco 5500?
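To make the setup concrete, here is a hedged sketch of that kind of wiring with fablib: two nodes, each with a ConnectX NIC, joined by one L2 network, with IPs assigned by hand afterwards. The slice/node names, the NIC_ConnectX_6 model, the 192.168.1.0/24 subnet, and the ens7 device name are all illustrative assumptions rather than my exact values.

from fabrictestbed_extensions.fablib.fablib import FablibManager

# Illustrative sketch only: names, site, NIC model, subnet, and device name
# are assumptions, not the exact values from my slice.
fablib = FablibManager()
slice = fablib.new_slice(name="l2-demo")

node1 = slice.add_node(name="Node_0", site="TACC", image="default_ubuntu_20")
node2 = slice.add_node(name="Node_1", site="TACC", image="default_ubuntu_20")
nic1 = node1.add_component(model="NIC_ConnectX_6", name="nic1")
nic2 = node2.add_component(model="NIC_ConnectX_6", name="nic2")

# Join the two dataplane interfaces with a single L2 network
network_name = "net1"
ifaces = [nic1.get_interfaces()[0], nic2.get_interfaces()[0]]
slice.add_l2network(name=network_name, interfaces=ifaces)
slice.submit()

# Re-fetch handles after submit, then assign addresses by hand ("ens7"
# stands in for whatever device the ConnectX NIC shows up as in the VM).
slice = fablib.get_slice(name="l2-demo")
node1 = slice.get_node(name="Node_0")
node2 = slice.get_node(name="Node_1")
node1.execute("sudo ip addr add 192.168.1.1/24 dev ens7 && sudo ip link set ens7 up")
node2.execute("sudo ip addr add 192.168.1.2/24 dev ens7 && sudo ip link set ens7 up")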
Hi Ilya, where can I print out the slice/sliver details? I can see slice details under "Experiments -> My Slices" and can click on a slice to view them, but they do not show how the nodes are actually physically connected. I also couldn't find the site advertisement; can you help me find that as well, so I can tell whether they are connected by a Cisco 5500 or a 5700? Thanks a ton.
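In the meantime, here is a hedged sketch of pulling sliver placement details straight from fablib instead of the portal; "my_slice" is a placeholder, and the exact getter names may differ slightly between fablib releases.

from fabrictestbed_extensions.fablib.fablib import FablibManager

# Hedged sketch: print where each sliver landed (worker host) plus its
# dataplane interfaces and networks. "my_slice" is a placeholder, and the
# getters shown may differ a little between fablib releases.
fablib = FablibManager()
slice = fablib.get_slice(name="my_slice")

for node in slice.get_nodes():
    # get_host() reports the worker the VM was placed on, e.g. tacc-w1.fabric-testbed.net
    print(node.get_name(), node.get_site(), node.get_host())
    for iface in node.get_interfaces():
        print("  ", iface.get_name())

for net in slice.get_networks():
    print(net.get_name(), net.get_type())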
I get the above error, basically "RTX6000 not available in graph node: 8QPDZC3", although the resources page shows it is available. Anyway, I'll try lowering the other resources, but I think the error is mainly due to GPU availability here.

———– ————————————
Slice Name ultima8
Slice ID d8933582-5702-4aee-ac54-0d63eb7ec74c
Slice State Closing
Lease End 2022-10-07 17:46:15 +0000
———– ————————————

Retry: 0, Time: 33 sec
ID Name Site Host Cores RAM Disk Image Management IP State Error
———————————— —— —— ————————– ——- —– —— —————– ————— ——- —————————————————————————————-
9e1b80b3-8ff7-459a-a7cd-eab8f7a32a51 Node_0 TACC tacc-w1.fabric-testbed.net 16 64 100 default_ubuntu_20 Closed TicketReviewPolicy: Closing reservation due to failure in slice
4ab457f3-1cd0-4808-9c0a-7e8aad63a455 Node_1 TACC tacc-w1.fabric-testbed.net 16 64 100 default_ubuntu_20 Closed TicketReviewPolicy: Closing reservation due to failure in slice
b54cb2ff-6a76-4c60-b8ee-9ca4b728278d Node_2 TACC tacc-w2.fabric-testbed.net 16 64 100 default_ubuntu_20 Closed TicketReviewPolicy: Closing reservation due to failure in slice
1e35ad5d-f3e7-4a0c-8172-e4d3116946aa Node_3 TACC tacc-w2.fabric-testbed.net 16 64 100 default_ubuntu_20 Closed TicketReviewPolicy: Closing reservation due to failure in slice
6a706293-7c20-4c24-9830-76bf7cebc79e Node_4 TACC tacc-w2.fabric-testbed.net 16 64 100 default_ubuntu_20 Closed TicketReviewPolicy: Closing reservation due to failure in slice
e2a3f34c-bafb-4165-85bc-9e937ceda9e7 Node_5 TACC default_ubuntu_20 Closed Insufficient resources : Component of type: RTX6000 not available in graph node: 8QPDZC3
b4afdaf5-e133-4705-acdb-1ff564850718 Node_6 TACC default_ubuntu_20 Closed Insufficient resources : Component of type: RTX6000 not available in graph node: 8QPDZC3
55316f2c-0e02-40f4-8439-cb36d5ddc700 Node_7 TACC default_ubuntu_20 Closed Insufficient resources : Component of type: RTX6000 not available in graph node: 8QPDZC3

Time to stable 33 seconds
Running post_boot_config … FAILURE: Slice is in Closing state; cannot do post boot config
Time to post boot config 33 seconds

All of the above have been stuck in the Closing state for weeks.
c0284334-d0a4-45ea-9d5b-e9c76c9ec2c0
a87026b0-90a8-434f-af30-6c9288e7af04
35ccd9da-67da-4da3-be71-08a33bb6c926
2aa04965-b824-4f65-9093-2396ef58ad48
2988f08f-83a0-430b-a7d0-c70cf46a0b8c
9943eab4-d2b6-4a7b-9d27-5a41d6248b11
891d0cea-d834-4764-84a5-43ed55747170
0fd401cd-550a-42ed-becb-eead76d4a467
8efbccb2-6d59-4509-aa0e-54b6b404add9
767903ac-3af6-4c30-9e55-c4fa0536ab0b
54fb6706-ee71-4154-a430-4dbe4bbe3361
17c0cd6c-465f-4094-ae72-c67c5c5fb162

Yeah, I am able to create new ones with different names, but some slices have been stuck in the "Closing" state for weeks now…
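In case it helps anyone else hitting this, a small hedged sketch of enumerating slices and retrying a delete on anything still reported as Closing; whether that actually clears a wedged slice is ultimately up to the operators, and the default filtering of get_slices() may differ between fablib versions.

from fabrictestbed_extensions.fablib.fablib import FablibManager

# Hedged sketch: list my slices with their states and retry delete() on
# anything still in Closing/StableError. Depending on the fablib version,
# get_slices() may filter out Dead/Closing slices by default, in which case
# something like get_slices(excludes=[]) would be needed instead.
fablib = FablibManager()

for s in fablib.get_slices():
    print(s.get_name(), s.get_state())
    if s.get_state() in ("Closing", "StableError"):
        try:
            s.delete()
        except Exception as e:
            print("  delete failed:", e)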