When Creating a Slice, Sometimes Fails to Get NIC Components Correctly
- This topic has 6 replies, 2 voices, and was last updated 2 years, 5 months ago by Paul Ruth.
July 18, 2022 at 2:21 pm #2376
I’ve been using a Jupyter notebook to make a 2-node slice with NIC components on each node, and I noticed that, quite frequently, the nodes will be built but the NIC components will have no MAC addresses and be unresponsive.
Here’s the code snippet where I’m building the slice:
try:
    # Create Slice
    slice = fablib.new_slice(name=slice_name)

    # Node1
    node1 = slice.add_node(name=node1_name, site=site, image='default_ubuntu_22')
    node1.add_component(model='GPU_RTX6000', name=node1_gpu_name)
    node1.set_capacities(cores=2, ram=8, disk=10)
    iface1 = node1.add_component(model='NIC_Basic', name=node1_nic_name).get_interfaces()[0]

    # Node2
    node2 = slice.add_node(name=node2_name, site=site, image='default_ubuntu_22')
    node2.set_capacities(cores=2, ram=8, disk=10)
    node2.add_component(model='GPU_RTX6000', name=node2_gpu_name)
    iface2 = node2.add_component(model='NIC_Basic', name=node2_nic_name).get_interfaces()[0]

    # Network
    net1 = slice.add_l2network(name=network_name, interfaces=[iface1, iface2])

    # Submit Slice Request
    slice.submit()
except Exception as e:
    print(f"Exception: {e}")
And here’s the output:
-----------  ------------------------------------
Slice Name   ToyModel
Slice ID     fba93c48-c269-41f0-8ab9-c2a4c727490a
Slice State  StableOK
Lease End    2022-07-19 19:02:44 +0000
-----------  ------------------------------------

Retry: 16, Time: 197 sec

ID                                    Name    Site    Host                          Cores    RAM    Disk  Image              Management IP                           State    Error
------------------------------------  ------  ------  --------------------------  -------  -----  ------  -----------------  --------------------------------------  -------  -------
af87b01f-bce6-44e6-8644-632b24ef5da1  Node1   STAR    star-w1.fabric-testbed.net        2      8      10  default_ubuntu_22  2001:400:a100:3030:f816:3eff:feae:5e3   Active
cb91935f-170d-4ce5-afb3-e97acf52c922  Node2   STAR    star-w2.fabric-testbed.net        2      8      10  default_ubuntu_22  2001:400:a100:3030:f816:3eff:fe83:3c29  Active

Time to stable 197 seconds
Running post_boot_config ...
Time to post boot config 204 seconds

Name           Node    Network      Bandwidth  VLAN    MAC                Physical OS Interface    OS Interface
-------------  ------  ---------  -----------  ------  -----------------  -----------------------  --------------
Node1-nic1-p1  Node1   net1                 0          02:96:1D:40:C6:BB  ens7                     ens7
Node2-nic2-p1  Node2   net1                 0
I’ve also had only the second NIC component get a MAC address, or neither.
Is there something I should be doing to prevent this, or is this a bug? Thanks.
July 19, 2022 at 9:51 am #2378
I’m trying to recreate this but cannot seem to trigger it intentionally. My guess is that it is an issue related to the library having a temporary problem creating an ssh connection to the VM.
One thing you can try is to manually re-run the post_boot_config step when you see this. You can do this by calling slice.post_boot_config(). If this fixes the slice, then this is probably a temporary ssh issue.
Another thing to do is to look at the log file. By default it is at /tmp/fablib/fablib.log. There might be something in there that hints at what is happening. Be warned that fablib retries the ssh connection a few times on failure, so you may see ssh failures that were resolved.
If you do see this again, could you try to include any relevant section of the log file in the message?
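The re-run suggestion above can be sketched as a small check: look for interfaces that came back without a MAC address and, only if some are missing, call post_boot_config again. This is a rough sketch under assumptions, not fablib's own logic. It assumes the slice object exposes get_interfaces() and that each interface offers get_name() and get_mac(), with get_mac() returning None or an empty string when the MAC could not be read. The helper names interfaces_missing_mac and rerun_if_missing_macs are hypothetical.

```python
def interfaces_missing_mac(slice_obj):
    """Return the names of interfaces whose MAC address is empty or None."""
    missing = []
    for iface in slice_obj.get_interfaces():  # assumed accessor on the slice
        mac = iface.get_mac()                 # assumed accessor on the interface
        if not mac:
            missing.append(iface.get_name())
    return missing


def rerun_if_missing_macs(slice_obj, max_attempts=2):
    """Re-run post_boot_config while any interface still lacks a MAC.

    Stops after max_attempts re-runs and returns the names of the
    interfaces that are still missing a MAC (empty list on success).
    """
    missing = interfaces_missing_mac(slice_obj)
    attempts = 0
    while missing and attempts < max_attempts:
        slice_obj.post_boot_config()
        attempts += 1
        missing = interfaces_missing_mac(slice_obj)
    return missing
```

In a notebook this would be called as rerun_if_missing_macs(slice) after slice.submit() returns. If the returned list is not empty, the problem is probably not a transient ssh failure and the log file is the next place to look.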
July 27, 2022 at 1:20 pm #2543
Hi, I know it’s been a while since I posted this but I wanted to update because this problem seems to have gotten worse (or maybe I’m just getting unlucky?) and I finally found the log file. I ran my slice setup and got this output in the notebook:
---------------  ------------------------------------
Slice Name       TestModel
Slice ID         29726f95-fb45-4c94-81a8-01d5e89d32ef
Slice State      StableOK
Lease End (UTC)  2022-07-28 18:10:30 +0000
---------------  ------------------------------------

Retry: 12, Time: 140 sec

ID                                    Name    Site    Host                          Cores    RAM    Disk  Image              Management IP                           State    Error
------------------------------------  ------  ------  --------------------------  -------  -----  ------  -----------------  --------------------------------------  -------  -------
3d40f9a1-0d3c-4e31-b727-883d3331bda9  Node1   STAR    star-w2.fabric-testbed.net        2      8     100  default_ubuntu_20  2001:400:a100:3030:f816:3eff:fe6f:5e32  Active
09f6a983-004e-4239-b27a-8fda35ae7597  Node2   STAR    star-w2.fabric-testbed.net        2      8     100  default_ubuntu_20  2001:400:a100:3030:f816:3eff:feec:63f8  Active

Time to stable 140 seconds
Running post_boot_config ...
Time to post boot config 148 seconds

Name           Node    Network      Bandwidth  VLAN    MAC    Physical OS Interface    OS Interface
-------------  ------  ---------  -----------  ------  -----  -----------------------  --------------
Node1-nic1-p1  Node1   net1                 0
Node2-nic2-p1  Node2   net1                 0

Time to print interfaces 153 seconds
I checked the logs, and here’s what they say from the time I ran my code:
[18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:144} INFO - Adding node: Node1, slice: TestModel, site: STAR
[18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:144} INFO - Adding node: Node2, slice: TestModel, site: STAR
[18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/network_service.py:295} INFO - Create Network Service: Slice: TestModel, Network Name: net1, Type: L2Bridge
[18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/network_service.py:590} WARNING - Failed to get reservation_id: 'NoneType' object has no attribute 'reservation_id'
[18:12:55] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:1120} INFO - post_boot_config: slice_name: TestModel, slice_id 29726f95-fb45-4c94-81a8-01d5e89d32ef
[18:12:55] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:1124} INFO - Starting thread: Node1_network_manager_stop
[18:12:55] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:1124} INFO - Starting thread: Node2_network_manager_stop
[18:12:56] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:1220} INFO - Stopped NetworkManager with 'sudo systemctl stop NetworkManager': stdout: stderr: Failed to stop NetworkManager.service: Unit NetworkManager.service not loaded.
[18:12:56] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:1220} INFO - Stopped NetworkManager with 'sudo systemctl stop NetworkManager': stdout: stderr: Failed to stop NetworkManager.service: Unit NetworkManager.service not loaded.
[18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:163} INFO - Starting get network name thread for iface Node1-nic1-p1
[18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:167} INFO - Starting get node name thread for iface Node1-nic1-p1
[18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:170} INFO - Starting get physical_os_interface_name_threads for iface Node1-nic1-p1
[18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:173} INFO - Starting get get_os_interface_threads for iface Node1-nic1-p1
[18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:163} INFO - Starting get network name thread for iface Node2-nic2-p1
[18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:167} INFO - Starting get node name thread for iface Node2-nic2-p1
[18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:170} INFO - Starting get physical_os_interface_name_threads for iface Node2-nic2-p1
[18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:173} INFO - Starting get get_os_interface_threads for iface Node2-nic2-p1
[18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:182} INFO - Getting results from get network name thread for iface Node1-nic1-p1
[18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:189} INFO - Getting results from get node name thread for iface Node1-nic1-p1
[18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:182} INFO - Getting results from get network name thread for iface Node2-nic2-p1
[18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:189} INFO - Getting results from get node name thread for iface Node2-nic2-p1
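Since most of the log is INFO noise, it can help to filter it down to the entries that matter before posting. The sketch below is one way to do that. The line pattern is inferred only from the excerpt above ([time] {file:line} LEVEL - message), so it is an assumption about the log format rather than something fablib documents. The names LOG_LINE, parse_log_line and find_problems are hypothetical.

```python
import re

# Pattern inferred from the excerpt above: [HH:MM:SS] {file:line} LEVEL - message
LOG_LINE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] "
    r"\{(?P<source>[^}]+):(?P<lineno>\d+)\} "
    r"(?P<level>[A-Z]+) - (?P<message>.*)"
)


def parse_log_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def find_problems(lines, levels=("WARNING", "ERROR", "CRITICAL")):
    """Return the parsed entries whose level is in `levels`.

    Lines that do not match the pattern (for example continuation
    lines of a multi-line message) are skipped.
    """
    problems = []
    for line in lines:
        entry = parse_log_line(line)
        if entry and entry["level"] in levels:
            problems.append(entry)
    return problems
```

To use it on the default log location mentioned earlier in the thread, open /tmp/fablib/fablib.log and pass the file object to find_problems. On the excerpt above it would keep only the network_service.py:590 WARNING about reservation_id.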
July 27, 2022 at 2:31 pm #2544
Do you always use STAR? I think this might be a problem with one of the RTX6000 GPUs at STAR. I suspect you only get this error when your VM is placed on star-w2. The error is probably happening more often now because most of the RTX6000s at STAR are allocated, so you are more likely to get the bad one.
For now try using a different site. I will have someone look at that GPU and see what is wrong with it.
July 27, 2022 at 2:36 pm #2545
I’ll try that, but I just want to make clear that it’s the NICs that are failing to come up correctly. The GPUs work fine.
July 27, 2022 at 2:44 pm #2546
I understand. I was able to reproduce the problem with the NICs, but only when the RTX6000 is added. It has something to do with the GPU. I’m not sure why this happens, but it has been reported to the developers.
Thanks for reporting this.
August 5, 2022 at 10:24 am #2584
Xander,
It took a while to track this down but we found the bug that is causing this. A fix has been pushed to the production sites and we think you won’t see this anymore.
Keep trying these slices and please let us know if you see this error again.
Thanks for reporting this bug in the forums.
Paul