Xander Maddox Weintraut

Forum Replies Created

Viewing 12 posts - 1 through 12 (of 12 total)
  • I’ll try that, but I want to make clear that it’s the NICs that aren’t being retrieved correctly. The GPUs work fine.

    Hi, I know it’s been a while since I posted this, but I wanted to post an update because the problem seems to have gotten worse (or maybe I’m just getting unlucky?), and I finally found the log file. I ran my slice setup and got this output in the notebook:

    
    --------------- ------------------------------------
    Slice Name TestModel
    Slice ID 29726f95-fb45-4c94-81a8-01d5e89d32ef
    Slice State StableOK
    Lease End (UTC) 2022-07-28 18:10:30 +0000
    --------------- ------------------------------------
    
    Retry: 12, Time: 140 sec
    
    ID                                    Name   Site  Host                        Cores  RAM  Disk  Image              Management IP                           State   Error
    ------------------------------------  -----  ----  --------------------------  -----  ---  ----  -----------------  --------------------------------------  ------  -----
    3d40f9a1-0d3c-4e31-b727-883d3331bda9  Node1  STAR  star-w2.fabric-testbed.net  2      8    100   default_ubuntu_20  2001:400:a100:3030:f816:3eff:fe6f:5e32  Active
    09f6a983-004e-4239-b27a-8fda35ae7597  Node2  STAR  star-w2.fabric-testbed.net  2      8    100   default_ubuntu_20  2001:400:a100:3030:f816:3eff:feec:63f8  Active
    
    Time to stable 140 seconds
    Running post_boot_config ... Time to post boot config 148 seconds
    
    Name           Node   Network  Bandwidth  VLAN  MAC  Physical OS Interface  OS Interface
    -------------  -----  -------  ---------  ----  ---  ---------------------  ------------
    Node1-nic1-p1  Node1  net1     0
    Node2-nic2-p1  Node2  net1     0
    
    Time to print interfaces 153 seconds
    

    I checked the logs, and here’s what they say from the time I ran my code:

    
    [18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:144} INFO - Adding node: Node1, slice: TestModel, site: STAR
    [18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:144} INFO - Adding node: Node2, slice: TestModel, site: STAR
    [18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/network_service.py:295} INFO - Create Network Service: Slice: TestModel, Network Name: net1, Type: L2Bridge
    [18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/network_service.py:590} WARNING - Failed to get reservation_id: 'NoneType' object has no attribute 'reservation_id'
    [18:12:55] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:1120} INFO - post_boot_config: slice_name: TestModel, slice_id 29726f95-fb45-4c94-81a8-01d5e89d32ef
    [18:12:55] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:1124} INFO - Starting thread: Node1_network_manager_stop
    [18:12:55] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:1124} INFO - Starting thread: Node2_network_manager_stop
    [18:12:56] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:1220} INFO - Stopped NetworkManager with 'sudo systemctl stop NetworkManager': stdout:
    stderr: Failed to stop NetworkManager.service: Unit NetworkManager.service not loaded.
    
    [18:12:56] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:1220} INFO - Stopped NetworkManager with 'sudo systemctl stop NetworkManager': stdout:
    stderr: Failed to stop NetworkManager.service: Unit NetworkManager.service not loaded.
    
    [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:163} INFO - Starting get network name thread for iface Node1-nic1-p1
    [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:167} INFO - Starting get node name thread for iface Node1-nic1-p1
    [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:170} INFO - Starting get physical_os_interface_name_threads for iface Node1-nic1-p1
    [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:173} INFO - Starting get get_os_interface_threads for iface Node1-nic1-p1
    [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:163} INFO - Starting get network name thread for iface Node2-nic2-p1
    [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:167} INFO - Starting get node name thread for iface Node2-nic2-p1
    [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:170} INFO - Starting get physical_os_interface_name_threads for iface Node2-nic2-p1
    [18:13:06] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:173} INFO - Starting get get_os_interface_threads for iface Node2-nic2-p1
    [18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:182} INFO - Getting results from get network name thread for iface Node1-nic1-p1
    [18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:189} INFO - Getting results from get node name thread for iface Node1-nic1-p1
    [18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:182} INFO - Getting results from get network name thread for iface Node2-nic2-p1
    [18:13:08] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py:189} INFO - Getting results from get node name thread for iface Node2-nic2-p1
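    The only entry above INFO level in this excerpt is the `reservation_id` warning from network_service.py. A minimal sketch for pulling such entries out of a longer log, assuming every entry follows the `[HH:MM:SS] {file:line} LEVEL - message` layout seen above; `find_entries` is a hypothetical helper, not part of fablib:

```python
import re

# Layout seen in the excerpt above:
# [HH:MM:SS] {source_file:line_number} LEVEL - message
LOG_LINE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\] "
    r"\{(?P<source>[^}]+):(?P<lineno>\d+)\} "
    r"(?P<level>[A-Z]+) - (?P<message>.*)$"
)


def find_entries(lines, levels=("WARNING", "ERROR")):
    """Return (time, source, lineno, message) tuples for entries at the given levels."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Continuation lines (for example "stderr: ...") do not match and are skipped.
        if match and match["level"] in levels:
            hits.append(
                (match["time"], match["source"], int(match["lineno"]), match["message"])
            )
    return hits


sample = [
    "[18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py:144} INFO - Adding node: Node1, slice: TestModel, site: STAR",
    "[18:10:29] {/opt/conda/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/network_service.py:590} WARNING - Failed to get reservation_id: 'NoneType' object has no attribute 'reservation_id'",
]

for entry in find_entries(sample):
    print(entry)
```

    Only the header line of each entry is matched, so multi-line entries lose their continuation text in this sketch.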
    
    in reply to: User is not a member of project: #2398

    That worked! Thank you

    Yeah, here you go:

    
    Traceback (most recent call last):
      File "D:\Research\FABRIC\fabric-scripts\hello_fabric.py", line 37, in <module>
        slice.submit(wait=False)
      File "C:\Users\xwein\AppData\Local\Programs\Python\Python39\lib\site-packages\fabrictestbed_extensions\fablib\slice.py", line 1217, in submit
        self.update()
      File "C:\Users\xwein\AppData\Local\Programs\Python\Python39\lib\site-packages\fabrictestbed_extensions\fablib\slice.py", line 325, in update
        self.update_topology()
      File "C:\Users\xwein\AppData\Local\Programs\Python\Python39\lib\site-packages\fabrictestbed_extensions\fablib\slice.py", line 278, in update_topology
        raise Exception("Failed to get slice topology: {}, {}".format(return_status, new_topo))
    Exception: Failed to get slice topology: Status.FAILURE, Error [Unable to read graph C:\Users\xwein\AppData\Local\Temp\tmpw2z0kyuu-graphml] importing graph
    

    I don’t think so. I get a different error. I made sure that I didn’t have an open slice called MySlice, then when I ran it I got this:

    (base) fabric@jupyter-xweintra-40purdue-2eedu:~/work$ python hello.py
    Name  CPUs  Cores    RAM (G)    Disk (G)       Basic (100 Gbps NIC)  ConnectX-6 (100 Gbps x2 NIC)  ConnectX-5 (25 Gbps x2 NIC)  P4510 (NVMe 1TB)  Tesla T4 (GPU)  RTX6000 (GPU)
    ----  ----  -------  ---------  -------------  --------------------  ----------------------------  ---------------------------  ----------------  --------------  -------------
    MICH  6     190/192  1530/1536  60590/60600    381/381               0/2                           2/2                          10/10             2/2             3/3
    UTAH  10    320/320  2560/2560  116400/116400  635/635               2/2                           4/4                          16/16             4/4             5/5
    TACC  10    238/320  2328/2560  115590/116400  632/635               2/2                           4/4                          16/16             4/4             6/6
    WASH  6     188/192  1520/1536  60580/60600    379/381               2/2                           2/2                          10/10             2/2             3/3
    NCSA  6     192/192  1536/1536  60600/60600    381/381               2/2                           2/2                          10/10             2/2             3/3
    DALL  6     192/192  1536/1536  60600/60600    381/381               2/2                           2/2                          10/10             2/2             3/3
    MAX   10    290/320  2452/2560  116190/116400  619/635               1/2                           4/4                          16/16             4/4             6/6
    MASS  4     120/128  992/1024   55700/55800    254/254               1/2                           0/0                          6/6               0/0             3/3
    SALT  6     184/192  1504/1536  60500/60600    380/381               2/2                           2/2                          10/10             2/2             3/3
    STAR  12    368/384  3008/3072  121060/121200  757/762               2/2                           6/6                          20/20             6/6             4/6
    Running post boot config … Exception: node.execute: Management IP Invalid: None
    -----------  ------------------------------------
    Slice Name   MySlice
    Slice ID     c26d5e3b-6e81-48f1-b12d-f68a6fbc1ea6
    Slice State  Configuring
    Lease End    2022-07-16 15:22:29 +0000
    -----------  ------------------------------------
    -----------------  ----------------------------------------------------------------------------------------------
    ID
    Name               Node1
    Cores
    RAM
    Disk
    Image              default_rocky_8
    Image Type         qcow2
    Host
    Site               NCSA
    Management IP
    Reservation State
    Error Message
    SSH Command        ssh -i /home/fabric/.ssh/id_rsa -J xweintra_0000014567@bastion-1.fabric-testbed.net rocky@None
    -----------------  ----------------------------------------------------------------------------------------------
    Exception: node.execute: Management IP Invalid: None
    Exception: Failed to delete slice: Status.FAILURE, (500)
    Reason: INTERNAL SERVER ERROR
    HTTP response headers: HTTPHeaderDict({'Server': 'nginx/1.21.6', 'Date': 'Fri, 15 Jul 2022 15:22:31 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '100', 'Connection': 'keep-alive', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': 'DNT, User-Agent, X-Requested-With, If-Modified-Since, Cache-Control, Content-Type, Range', 'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE, OPTIONS', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Content-Length, Content-Range, X-Error', 'X-Error': 'Unable to delete Slice# c26d5e3b-6e81-48f1-b12d-f68a6fbc1ea6 that is not yet stable, try again later'})
    HTTP response body: Unable to delete Slice# c26d5e3b-6e81-48f1-b12d-f68a6fbc1ea6 that is not yet stable, try again later

    As you can see, the error is “Management IP Invalid: None” just after running post boot config. Does it also work for you if you try to run the script from Jupyter? That’s where I ran it from.
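    In the output above the Management IP field is empty and the SSH command ends in `rocky@None`, so the node has no address yet when the script tries to run commands on it. A minimal sketch of a guard a script could apply before doing so; `has_valid_management_ip` is a hypothetical helper, not part of fablib, and how the address is read from the node is deliberately left out:

```python
import ipaddress


def has_valid_management_ip(value):
    """Return True if value parses as an IPv4 or IPv6 address, False otherwise."""
    if value is None:
        return False
    try:
        ipaddress.ip_address(str(value).strip())
    except ValueError:
        # Covers empty strings and the literal text "None" seen in the SSH command.
        return False
    return True


print(has_valid_management_ip("2001:400:a100:3030:f816:3eff:fe6f:5e32"))  # True
print(has_valid_management_ip(None))    # False
print(has_valid_management_ip("None"))  # False
```

    A check like this only reports the problem earlier and more clearly. It does not explain why the address is missing.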

    I haven’t gotten FABRIC to work properly from my local computer yet. I get the error below, which I suspect might be because I’m running it from Windows, but I have no clue:

    Failed to get slice topology: Status.FAILURE, Error [Unable to read graph C:\Users\xwein\AppData\Local\Temp\tmprkqs64qf-graphml] importing graph
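    One possible explanation, not confirmed against the fablib source, is a documented platform difference in Python's `tempfile.NamedTemporaryFile`: on Unix a named temporary file can be opened a second time by name while it is still open, but on Windows it cannot. The sketch below shows the portable pattern of closing the file before reopening it. The `-graphml` suffix is borrowed from the path in the error message, and `write_then_read` is a hypothetical name used only for illustration:

```python
import os
import tempfile


def write_then_read(text, suffix="-graphml"):
    """Write text to a named temp file, close it, then reopen it by name."""
    # delete=False keeps the file on disk after close() so it can be reopened.
    handle = tempfile.NamedTemporaryFile(
        mode="w", suffix=suffix, delete=False, encoding="utf-8"
    )
    try:
        handle.write(text)
        handle.close()  # close first; Windows refuses a second open otherwise
        with open(handle.name, encoding="utf-8") as reader:
            return reader.read()
    finally:
        handle.close()          # safe to call twice
        os.remove(handle.name)  # clean up manually because delete=False


print(write_then_read("<graphml/>"))
```

    If the library writes the graph to a temporary file and reads it back while the file is still open, that would work on Linux (including the Jupyter environment) and fail on Windows, which would match the symptoms. This remains an assumption.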

    Side note, how do I do the quote segment with overflow? I don’t know how to use this markup very well.

    Let’s try this

    I don’t think I have permissions to upload files

    • This reply was modified 1 year, 9 months ago by Xander Maddox Weintraut. Reason: Not allowed to upload .py files apparently. You'll have to resave this as a .py before you can run it
    • This reply was modified 1 year, 9 months ago by Xander Maddox Weintraut. Reason: Can't upload files

    Right, that’s what I did.

    First I made sure the “Hello, FABRIC” notebook ran correctly.

    Then I made a Python script with all of the code cells copied and pasted directly back-to-back.

    When I ran that script from the terminal, this was the output:

    Name  CPUs  Cores    RAM (G)    Disk (G)       Basic (100 Gbps NIC)  ConnectX-6 (100 Gbps x2 NIC)  ConnectX-5 (25 Gbps x2 NIC)  P4510 (NVMe 1TB)  Tesla T4 (GPU)  RTX6000 (GPU)
    ----  ----  -------  ---------  -------------  --------------------  ----------------------------  ---------------------------  ----------------  --------------  -------------
    MICH  6     188/192  1522/1536  60580/60600    381/381               0/2                           2/2                          10/10             2/2             3/3
    UTAH  10    316/320  2544/2560  116380/116400  634/635               2/2                           4/4                          16/16             4/4             5/5
    TACC  10    220/320  2256/2560  115390/116400  630/635               2/2                           4/4                          16/16             4/4             5/6
    WASH  6     188/192  1520/1536  60580/60600    379/381               2/2                           2/2                          10/10             2/2             3/3
    NCSA  6     192/192  1536/1536  60600/60600    381/381               2/2                           2/2                          10/10             2/2             3/3
    DALL  6     192/192  1536/1536  60600/60600    381/381               2/2                           2/2                          10/10             2/2             3/3
    MAX   10    254/320  2332/2560  115920/116400  594/635               0/2                           2/4                          16/16             4/4             6/6
    MASS  4     118/128  984/1024   55690/55800    253/254               1/2                           0/0                          6/6               0/0             3/3
    SALT  6     192/192  1536/1536  60600/60600    381/381               2/2                           2/2                          10/10             2/2             3/3
    STAR  12    366/384  3000/3072  121090/121200  760/762               2/2                           6/6                          20/20             6/6             6/6
    Running post boot config … Exception: node.execute: Management IP Invalid: None
    -----------  ------------------------------------
    Slice Name   MySlice
    Slice ID     b73f5090-e56a-474f-997a-16f6f7681952
    Slice State  Configuring
    Lease End    2022-07-15 20:03:36 +0000
    -----------  ------------------------------------
    -----------------  ----------------------------------------------------------------------------------------------
    ID
    Name               Node1
    Cores
    RAM
    Disk
    Image              default_rocky_8
    Image Type         qcow2
    Host
    Site               TACC
    Management IP
    Reservation State
    Error Message
    SSH Command        ssh -i /home/fabric/.ssh/id_rsa -J xweintra_0000014567@bastion-1.fabric-testbed.net rocky@None
    -----------------  ----------------------------------------------------------------------------------------------
    Exception: node.execute: Management IP Invalid: None
    Exception: Failed to delete slice: Status.FAILURE, (500)
    Reason: INTERNAL SERVER ERROR
    HTTP response headers: HTTPHeaderDict({'Server': 'nginx/1.21.6', 'Date': 'Thu, 14 Jul 2022 20:03:39 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '100', 'Connection': 'keep-alive', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': 'DNT, User-Agent, X-Requested-With, If-Modified-Since, Cache-Control, Content-Type, Range', 'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE, OPTIONS', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Content-Length, Content-Range, X-Error', 'X-Error': 'Unable to delete Slice# b73f5090-e56a-474f-997a-16f6f7681952 that is not yet stable, try again later'})
    HTTP response body: Unable to delete Slice# b73f5090-e56a-474f-997a-16f6f7681952 that is not yet stable, try again later

    The errors after the “Running post boot config…” line occur because the submit() call throws an exception before it finishes, so the later calls try to act on a slice that is not stable yet.

    The slice does eventually reach StableOK state, but it has no nodes.
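    Since the delete call is rejected while the slice is still in the Configuring state, a script can poll for a stable state before acting on the slice. A minimal sketch of such a loop; `wait_for_state` and its `get_state` callable are hypothetical and stand in for whatever call returns the slice state, and only the two state names that appear in the output above are used:

```python
import time


def wait_for_state(get_state, wanted=("StableOK",), timeout=600, interval=10,
                   sleep=time.sleep):
    """Poll get_state() until it returns one of wanted, else raise TimeoutError."""
    waited = 0
    while True:
        state = get_state()
        if state in wanted:
            return state
        if waited >= timeout:
            raise TimeoutError(
                f"slice still in state {state!r} after {waited} seconds"
            )
        sleep(interval)
        waited += interval


# Stand-in for a real slice: reports Configuring twice, then StableOK.
states = iter(["Configuring", "Configuring", "StableOK"])
print(wait_for_state(lambda: next(states), sleep=lambda seconds: None))  # StableOK
```

    This would avoid the “not yet stable” delete error, but it would not fix the underlying problem that the slice ends up with no nodes. Any other terminal states the service reports would need to be added to `wanted` or handled separately.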

    I am not having any issues running the notebook, only with running .py scripts.

    I’m in ULTIMA. I don’t think we need any more tags at the moment, but will let you know as the need arises. However, all of the Networking examples after “Create a Local Ethernet (Layer 2)” require the Slice.Multisite tag to run.

    Regardless, I’m fairly sure that permissions tags aren’t the issue here.

    The notebook runs just fine. The only notebooks that have failed have been ones that require project tags I don’t have.

    I just went through the list of sites, and was able to reproduce the issue with every site.
