Forum Replies Created
December 28, 2024 at 1:24 pm in reply to: Trouble with IPv4 Connectivity in a 3-Node Ubuntu 22 Cluster Using Shared NICs #7978
Hi Pinxiang,
Looking at your slice, you have 3 VMs connected to the FabNetv4 service, as you mentioned. However, the IP addresses are not configured on the corresponding interfaces inside the VMs, which is why traffic does not pass.
Could you please try the FABNetv4 example accessible via start_here.ipynb, under "FABNet IPv4 (Layer 3): Connect to FABRIC's IPv4 internet"? It offers three options: auto, manual, and full auto. In the auto and full auto options, the API takes care of configuring the IP addresses and IPv4 traffic should pass; in the manual option, the user is required to configure the IP addresses explicitly.
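For reference, here is a minimal sketch of the manual configuration path, assuming a slice named "MySlice" with a FABNetv4 network named "net1" (both are placeholder names for your own); the auto and full auto options perform the equivalent steps for you:

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
slice = fablib.get_slice(name="MySlice")          # placeholder slice name

network = slice.get_network(name="net1")          # placeholder FABNetv4 network name
subnet = network.get_subnet()
available_ips = network.get_available_ips()

for node in slice.get_nodes():
    iface = node.get_interface(network_name="net1")
    addr = available_ips.pop(0)                   # pick the next free address in the subnet
    iface.ip_addr_add(addr=addr, subnet=subnet)   # assign it on the VM's dataplane interface
    stdout, stderr = node.execute("ip -4 addr show")   # confirm the address is configured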
Please feel free to reach out in case of questions or concerns.
Snippet from your VMs:
root@3bb1005a-6a0f-4b52-9c07-75d453b50813-node1:~# ifconfig -a
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.30.6.153 netmask 255.255.254.0 broadcast 10.30.7.255
inet6 2001:400:a100:3020:f816:3eff:fe23:bd75 prefixlen 64 scopeid 0x0<global>
inet6 fe80::f816:3eff:fe23:bd75 prefixlen 64 scopeid 0x20<link>
ether fa:16:3e:23:bd:75 txqueuelen 1000 (Ethernet)
RX packets 364541 bytes 306898628 (306.8 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 34407 bytes 3474334 (3.4 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp7s0: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 02:50:a9:17:fc:d4 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 662 bytes 115088 (115.0 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 662 bytes 115088 (115.0 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ubuntu@05441f94-5e35-4981-97d3-1ed1dac3381e-node3:~$ ifconfig -a
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.30.6.231 netmask 255.255.254.0 broadcast 10.30.7.255
inet6 fe80::f816:3eff:fec8:e21a prefixlen 64 scopeid 0x20<link>
inet6 2001:400:a100:3020:f816:3eff:fec8:e21a prefixlen 64 scopeid 0x0<global>
ether fa:16:3e:c8:e2:1a txqueuelen 1000 (Ethernet)
RX packets 368197 bytes 307203695 (307.2 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 37436 bytes 3716755 (3.7 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp7s0: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 02:fe:2e:df:af:a7 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 556 bytes 95893 (95.8 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 556 bytes 95893 (95.8 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ubuntu@9ac56841-a123-4efa-9322-af75d3731819-node2:~$ ifconfig -a
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.30.6.23 netmask 255.255.254.0 broadcast 10.30.7.255
inet6 2001:400:a100:3020:f816:3eff:fe62:510a prefixlen 64 scopeid 0x0<global>
inet6 fe80::f816:3eff:fe62:510a prefixlen 64 scopeid 0x20<link>
ether fa:16:3e:62:51:0a txqueuelen 1000 (Ethernet)
RX packets 379700 bytes 308068634 (308.0 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 46612 bytes 4697307 (4.6 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp7s0: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 02:ef:84:b8:fd:09 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 570 bytes 97991 (97.9 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 570 bytes 97991 (97.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Thanks,
Komal
December 28, 2024 at 9:53 am in reply to: Trouble with IPv4 Connectivity in a 3-Node Ubuntu 22 Cluster Using Shared NICs #7976
Hi Pinxiang,
Could you please share your Slice ID?
Thanks,
Komal
Hi Sean,
Could you please share a screenshot of the following screen from the portal?
Experiments -> Manage SSH Keys -> Bastion
Please include the Bastion User name shown on this screen as well.
In addition, could you please try running the notebook jupyter-examples-rel1.7.0/configure_and_validate.ipynb from the JupyterHub? This notebook validates your configuration and creates bastion and sliver keys if they are not present or have expired. Please also try a Hello, FABRIC example afterwards to verify that your keys and configuration are working.
Thanks,
Komal
Hi Nirmala,
We currently do not have an example available for this. But I plan to work on one after the holidays and will share an update with you once I have a working version.
Thanks,
Komal
Hi Prateek,
We recently published the steps to launch a local JH container from your desktop or laptop.
Please consider giving this a try. Also, regarding your existing setup, could you please check that fabric_rc is pointing to the correct token location where you have uploaded the newly generated token? Another thing to verify would be to generate the token via an incognito browser window to rule out any stale cookies.
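If it helps, a quick way to see which configuration (including the token file path) fablib is actually picking up; this only assumes a standard fablib installation:

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
fablib.show_config()   # check that the token location shown here matches the newly uploaded token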
Please let me know if this helps or you still face issues!
Thanks,
Komal
Hi Ilya,
EDC and NCSA are connected to the same switch and share the same /17 subnet pool for FabNetv4 allocation. Between these two sites, EDC (65) and NCSA (63), we currently have 128 active FabNetv4 services provisioned (65 + 63 = 128), which exhausts the pool and leaves no available subnets. The error message returned is not user friendly; I will fix it in the 1.8 update.
Thanks,
Komal
Hi Tanay,
I wanted to check if this issue is still unresolved. I haven’t had a chance to look into it yet, but I plan to review the documentation and experiment with a few approaches. I’ll share any updates or findings here after the holidays.
Thanks,
Komal
Hi Rodrigo,
It’s possible that your bastion keys have expired. Could you please check the expiration of the keys from the Portal via Experiments -> Manage SSH Keys?
Also, please try running the notebook jupyter-examples-rel1.7.1/configure_and_validate.ipynb. This notebook will regenerate the bastion keys if they have expired. Please verify SSH access after that; a minimal check is sketched below.
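For example, assuming an existing slice named "MySlice" (a placeholder for your own slice name), a simple command execution over SSH should succeed once the keys are valid:

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
slice = fablib.get_slice(name="MySlice")        # placeholder slice name
for node in slice.get_nodes():
    stdout, stderr = node.execute("hostname")   # runs over SSH via the bastion
    print(node.get_name(), stdout.strip())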
Please let us know if you still see errors.
Thanks,
Komal
Thank you so much @Fraida! Would you consider uploading this to FABRIC Artifacts so that other FABRIC users can leverage it?
Appreciate your help with this!
Artifact Manager: https://artifacts.fabric-testbed.net/artifacts/
December 9, 2024 at 11:52 am in reply to: Insufficient resources error despite available resources #7924
Hi Jestus,
This error occurs when the host capable of provisioning the requested resource has run out of cores and RAM. While the resource view provides cumulative information for the entire site, checking resource availability at the host level offers more precise insight. This information is available in the portal's per-site resource view and can also be checked via the API, as shown by list_hosts in the example here.
It’s possible that the combination of requested components (such as NICs or GPUs) maps to a host without sufficient cores or RAM, leading to the error you’ve encountered.
We have an example notebook (Additional Options: Validate Slice) available that allows you to validate resource availability beforehand using the API, which can be helpful prior to submitting a slice; a quick host-level check is also sketched below. Additionally, we're working on changes to the allocation policy to better distribute VMs across hosts. This will help ensure that CPUs, RAM, and disk are not fully allocated on a single host that has SmartNICs and GPUs, minimizing such errors. These updates are planned for deployment in the January release and should improve resource allocation.
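As a rough sketch of that host-level check via the API (the availability field names below are assumptions and may differ slightly across fablib releases):

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
# Show what each host still has free; adjust the field names to match your fablib version.
fields = ['name', 'cores_available', 'ram_available', 'disk_available']
fablib.list_hosts(fields=fields)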
Thanks,
Komal
December 5, 2024 at 12:09 pm in reply to: How to delete an interface from a node using the interface name #7910
Please refer to this example for removing interfaces from a network as well as a node; a rough sketch of the relevant calls is below.
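A minimal sketch of the slice-modify flow, assuming a slice "MySlice", a network "net1", and a node "node1" (all placeholder names), and assuming your fablib release supports removing interfaces from a network service:

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
slice = fablib.get_slice(name="MySlice")          # placeholder slice name

node = slice.get_node(name="node1")               # placeholder node name
network = slice.get_network(name="net1")          # placeholder network name

iface = node.get_interface(network_name="net1")   # look up the node's interface on that network
network.remove_interface(iface)                   # detach it from the network service

slice.submit()                                    # apply the modification
# Removing the NIC component itself from the node follows the same modify-and-submit pattern.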
Thanks,
Komal
Possibly you changed the call from list_hosts to list_sites. Please see the snippet below.
None of the hosts on FIU have more than 3 GPUs, and even those 3 can only be requested subject to availability.
The screenshot only shows the full capacity. You can also check this from the portal.
FIU per host information can be seen here: https://portal.fabric-testbed.net/sites/FIU
Thanks,
Komal
Hi Abdulhadi,
The GPU count you are referring to represents the total number of GPUs available at a site.
No single host at a site has more than 3 GPUs. In fact, only a few hosts are equipped with 3 GPUs. To check the per-host resource details, you can use the notebook:
jupyter-examples-main/fabric_examples/fablib_api/sites_and_resources/list_all_resources.ipynb. For convenience, the following code snippet can also be used:
from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
fablib.show_config();

# List per-host GPU capacity; each column corresponds to one GPU model.
fields = ['name', 'tesla_t4_capacity', 'rtx6000_capacity', 'a30_capacity', 'a40_capacity']
output_table = fablib.list_hosts(fields=fields)
Thanks,
Komal
Hi Ilya,
I looked into your slice and found that it was partially renewed, with the VM on STAR not renewing completely.
This appears to be a side effect of the Kafka maintenance we conducted yesterday, which impacted STAR. During this time, renewal messages were not processed because the Kafka consumer had stopped. I’ve resolved the issue, and future renewals should now work as expected.
Thank you for bringing this to our attention and helping us identify and fix the problem.
Best regards,
Komal
P.S.: Another user also ran into this: https://learn.fabric-testbed.net/forums/topic/not-able-to-renew-the-slice/
Hi Sankalpa,
Both your slices were partially renewed. Each slice included a VM on STAR, where the renewal process was stuck.
We use a Kafka messaging bus, and a brief maintenance window yesterday impacted STAR. As a result, renewal messages were not processed because the Kafka consumer had stopped. I have resolved this issue, and all the slivers in your slices have been successfully renewed. Your slices are now in the StableOK state; you can confirm the new lease end as sketched below.
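If you would like to double-check on your side, here is a small sketch (assuming a slice named "MySlice" as a placeholder for your own):

from fabrictestbed_extensions.fablib.fablib import FablibManager as fablib_manager

fablib = fablib_manager()
slice = fablib.get_slice(name="MySlice")          # placeholder slice name
print(slice.get_state(), slice.get_lease_end())   # expect StableOK and the extended lease end
for node in slice.get_nodes():
    # Each sliver should show an Active reservation state after a successful renewal.
    print(node.get_name(), node.get_reservation_state())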
Thank you for reporting this and helping us identify and address the problem.
Best regards,
Komal