1. When Creating a Slice, Sometimes Fails to Get NIC Components Correctly

  • #2376

    I’ve been using a Jupyter notebook to create a 2-node slice with NIC components on each node, and I noticed that, quite frequently, the nodes are built but the NIC components have no MAC addresses and are unresponsive.

    Here’s the code snippet where I’m building the slice:

    
    try:
        # Create Slice
        slice = fablib.new_slice(name=slice_name)

        # Node1
        node1 = slice.add_node(name=node1_name, site=site, image='default_ubuntu_22')
        node1.add_component(model='GPU_RTX6000', name=node1_gpu_name)
        node1.set_capacities(cores=2, ram=8, disk=10)
        iface1 = node1.add_component(model='NIC_Basic', name=node1_nic_name).get_interfaces()[0]

        # Node2
        node2 = slice.add_node(name=node2_name, site=site, image='default_ubuntu_22')
        node2.set_capacities(cores=2, ram=8, disk=10)
        node2.add_component(model='GPU_RTX6000', name=node2_gpu_name)
        iface2 = node2.add_component(model='NIC_Basic', name=node2_nic_name).get_interfaces()[0]

        # Network
        net1 = slice.add_l2network(name=network_name, interfaces=[iface1, iface2])

        # Submit Slice Request
        slice.submit()
    except Exception as e:
        print(f"Exception: {e}")
    

    And here’s the output:

    
    -----------  ------------------------------------
    Slice Name   ToyModel
    Slice ID     fba93c48-c269-41f0-8ab9-c2a4c727490a
    Slice State  StableOK
    Lease End    2022-07-19 19:02:44 +0000
    -----------  ------------------------------------

    Retry: 16, Time: 197 sec

    ID                                    Name    Site    Host                        Cores    RAM    Disk    Image              Management IP                           State    Error
    ------------------------------------  ------  ------  --------------------------  -------  -----  ------  -----------------  --------------------------------------  -------  -------
    af87b01f-bce6-44e6-8644-632b24ef5da1  Node1   STAR    star-w1.fabric-testbed.net  2        8      10      default_ubuntu_22  2001:400:a100:3030:f816:3eff:feae:5e3   Active
    cb91935f-170d-4ce5-afb3-e97acf52c922  Node2   STAR    star-w2.fabric-testbed.net  2        8      10      default_ubuntu_22  2001:400:a100:3030:f816:3eff:fe83:3c29  Active

    Time to stable 197 seconds
    Running post_boot_config ... Time to post boot config 204 seconds

    Name           Node    Network    Bandwidth    VLAN    MAC                Physical OS Interface    OS Interface
    -------------  ------  ---------  -----------  ------  -----------------  -----------------------  --------------
    Node1-nic1-p1  Node1   net1       0                    02:96:1D:40:C6:BB  ens7                     ens7
    Node2-nic2-p1  Node2   net1       0
    

    I’ve also had only the second NIC component get a MAC address, or neither.

    Is there something I should be doing to prevent this, or is this a bug? Thanks.

    #2378
    Paul Ruth
    Keymaster

      I’m trying to recreate this but cannot seem to trigger it intentionally. My guess is that it is an issue related to the library having a temporary problem creating an ssh connection to the VM.

      One thing you can try is to manually re-run the post_boot_config step when you see this. You can do this by calling slice.post_boot_config(). If this fixes the slice, then this is probably a temporary ssh issue.
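A minimal sketch of that manual retry, assuming the slice object exposes `get_interfaces()` and `post_boot_config()` and each interface exposes `get_name()` and `get_mac()`. The helper names, the attempt count, and the delay are illustrative and not part of fablib, and whether `get_mac()` reflects the new state without reloading the slice is also an assumption.

```python
import time


def interfaces_missing_mac(slice_obj):
    """Return the names of interfaces that report no MAC address."""
    return [
        iface.get_name()
        for iface in slice_obj.get_interfaces()
        if not iface.get_mac()
    ]


def rerun_post_boot_config(slice_obj, max_attempts=3, delay_sec=10):
    """Re-run post_boot_config until every interface has a MAC.

    Stops early once nothing is missing, and gives up after
    max_attempts (a hypothetical limit). Returns the names of the
    interfaces that are still missing a MAC address.
    """
    missing = interfaces_missing_mac(slice_obj)
    for _ in range(max_attempts):
        if not missing:
            break
        slice_obj.post_boot_config()
        time.sleep(delay_sec)  # give the VMs a moment before re-checking
        missing = interfaces_missing_mac(slice_obj)
    return missing
```

In a notebook this could be called as `rerun_post_boot_config(slice)` right after `slice.submit()`. If the returned list is still non-empty after a few attempts, a temporary ssh problem is probably not the cause.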

      Another thing to do is to look at the log file. By default it is at /tmp/fablib/fablib.log. There might be something in there that hints at what is happening. Be warned that fablib retries the ssh connection a few times on failure, so you may see ssh failures that were resolved.

      If you do see this again, could you try to include any relevant section of the log file in the message?
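To pull the relevant section out of the log, something like the following sketch could help. It assumes each log line has the shape `[HH:MM:SS] {source_file:line} LEVEL - message`, which matches the excerpt posted later in this thread. The function name and the choice of levels are illustrative.

```python
import re
from pathlib import Path

# Assumed line shape: [18:10:29] {/path/to/file.py:590} WARNING - message
LOG_LINE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\] "
    r"\{(?P<source>[^}]*)\} "
    r"(?P<level>[A-Z]+) - (?P<message>.*)$"
)


def find_problem_lines(lines, levels=("WARNING", "ERROR", "CRITICAL")):
    """Return (time, level, message) for each log line at one of the given levels."""
    problems = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") in levels:
            problems.append(
                (match.group("time"), match.group("level"), match.group("message"))
            )
    return problems


if __name__ == "__main__":
    log_path = Path("/tmp/fablib/fablib.log")  # default location mentioned above
    if log_path.exists():
        with log_path.open(errors="replace") as log_file:
            for time_str, level, message in find_problem_lines(log_file):
                print(time_str, level, message)
```

Lines that do not match the assumed shape, such as continuation lines carrying stderr output, are skipped, so it is still worth reading the surrounding lines in the file itself.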

      #2543

      Hi, I know it’s been a while since I posted this but I wanted to update because this problem seems to have gotten worse (or maybe I’m just getting unlucky?) and I finally found the log file. I ran my slice setup and got this output in the notebook:

      
      ---------------  ------------------------------------
      Slice Name       TestModel
      Slice ID         29726f95-fb45-4c94-81a8-01d5e89d32ef
      Slice State      StableOK
      Lease End (UTC)  2022-07-28 18:10:30 +0000
      ---------------  ------------------------------------

      Retry: 12, Time: 140 sec

      ID                                    Name    Site    Host                        Cores    RAM    Disk    Image              Management IP                           State    Error
      ------------------------------------  ------  ------  --------------------------  -------  -----  ------  -----------------  --------------------------------------  -------  -------
      3d40f9a1-0d3c-4e31-b727-883d3331bda9  Node1   STAR    star-w2.fabric-testbed.net  2        8      100     default_ubuntu_20  2001:400:a100:3030:f816:3eff:fe6f:5e32  Active
      09f6a983-004e-4239-b27a-8fda35ae7597  Node2   STAR    star-w2.fabric-testbed.net  2        8      100     default_ubuntu_20  2001:400:a100:3030:f816:3eff:feec:63f8  Active

      Time to stable 140 seconds
      Running post_boot_config ... Time to post boot config 148 seconds

      Name           Node    Network    Bandwidth    VLAN    MAC    Physical OS Interface    OS Interface
      -------------  ------  ---------  -----------  ------  -----  -----------------------  --------------
      Node1-nic1-p1  Node1   net1       0
      Node2-nic2-p1  Node2   net1       0

      Time to print interfaces 153 seconds
      

      I checked the logs, and here’s what they say from the time I ran my code:

      
      [18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:144} INFO - Adding node: Node1, slice: TestModel, site: STAR
      [18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:144} INFO - Adding node: Node2, slice: TestModel, site: STAR
      [18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/network_service.py:295} INFO - Create Network Service: Slice: TestModel, Network Name: net1, Type: L2Bridge
      [18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/network_service.py:590} WARNING - Failed to get reservation_id: 'NoneType' object has no attribute 'reservation_id'
      [18:12:55] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:1120} INFO - post_boot_config: slice_name: TestModel, slice_id 29726f95-fb45-4c94-81a8-01d5e89d32ef
      [18:12:55] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:1124} INFO - Starting thread: Node1_network_manager_stop
      [18:12:55] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:1124} INFO - Starting thread: Node2_network_manager_stop
      [18:12:56] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:1220} INFO - Stopped NetworkManager with 'sudo systemctl stop NetworkManager': stdout:
      stderr: Failed to stop NetworkManager.service: Unit NetworkManager.service not loaded.
      
      [18:12:56] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:1220} INFO - Stopped NetworkManager with 'sudo systemctl stop NetworkManager': stdout:
      stderr: Failed to stop NetworkManager.service: Unit NetworkManager.service not loaded.
      
      [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:163} INFO - Starting get network name thread for iface Node1-nic1-p1
      [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:167} INFO - Starting get node name thread for iface Node1-nic1-p1
      [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:170} INFO - Starting get physical_os_interface_name_threads for iface Node1-nic1-p1
      [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:173} INFO - Starting get get_os_interface_threads for iface Node1-nic1-p1
      [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:163} INFO - Starting get network name thread for iface Node2-nic2-p1
      [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:167} INFO - Starting get node name thread for iface Node2-nic2-p1
      [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:170} INFO - Starting get physical_os_interface_name_threads for iface Node2-nic2-p1
      [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:173} INFO - Starting get get_os_interface_threads for iface Node2-nic2-p1
      [18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:182} INFO - Getting results from get network name thread for iface Node1-nic1-p1
      [18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:189} INFO - Getting results from get node name thread for iface Node1-nic1-p1
      [18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:182} INFO - Getting results from get network name thread for iface Node2-nic2-p1
      [18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:189} INFO - Getting results from get node name thread for iface Node2-nic2-p1
      
      #2544
      Paul Ruth
      Keymaster

        Do you always use STAR? I think this might be a problem with one of the RTX6000 GPUs at STAR. I suspect you only get this error when your VM is placed on star-w2. The error is probably happening more often now because most of the RTX6000s at STAR are allocated and you are more likely to get the bad one.

        For now try using a different site. I will have someone look at that GPU and see what is wrong with it.
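To check whether the failures really cluster on one worker, the sketch below groups interfaces without a MAC address by the host their node landed on. It assumes the slice exposes `get_nodes()`, each node exposes `get_host()` and `get_interfaces()`, and each interface exposes `get_name()` and `get_mac()`. The function name is illustrative and not part of fablib.

```python
def missing_macs_by_host(slice_obj):
    """Map each worker host to the interfaces on it that report no MAC address.

    Hosts whose interfaces all have a MAC map to an empty list, so
    healthy and unhealthy workers both appear in the result.
    """
    report = {}
    for node in slice_obj.get_nodes():
        host = node.get_host()
        missing = [
            iface.get_name()
            for iface in node.get_interfaces()
            if not iface.get_mac()
        ]
        report.setdefault(host, []).extend(missing)
    return report
```

Running this on a few slices and comparing the results should show whether the empty MAC entries only appear for nodes placed on one particular host.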

        #2545

        I’ll try that, but I just want to make clear that it’s the NICs that are not coming up correctly. The GPUs work fine.

        #2546
        Paul Ruth
        Keymaster

          I understand. I was able to reproduce the problem with the NICs, but only when the RTX6000 is added. It has something to do with the GPU. I’m not sure why it happens, but it has been reported to the developers.

          Thanks for reporting this.

          #2584
          Paul Ruth
          Keymaster

            Xander,

            It took a while to track this down, but we found the bug that is causing this. A fix has been pushed to the production sites and we think you won’t see this anymore.

            Keep trying these slices and please let us know if you see this error again.

            Thanks for reporting this bug in the forums.

            Paul
