Paul Ruth

Forum Replies Created

Viewing 15 posts - 151 through 165 (of 273 total)
  • Paul Ruth
    Keymaster

      This might be a Windows issue. I’m going to have to have some other people look at it. Is there any way you could reproduce that graphml error and include a full stack trace? That might help us track this down.

      Re: Code in a forum post. Click the “Text” tab next to the “Visual” tab in the top right of the box that you are typing in. Then click the “CODE” button and it will insert a `. Add your code, then click the “/CODE” button to insert another `. Anything between the `s will be in the box that my code was in.

      • This reply was modified 2 years, 4 months ago by Paul Ruth.
      in reply to: get_physical_os_interface()[‘ifname’] failed #2322
      Paul Ruth
      Keymaster

        We are still working on tuning all the links and trying to figure out best practices for achieving very high bandwidths. There are no artificial limitations on that link and, in theory, 100G is possible. This is just going to require a bunch of tuning, both on the edge and probably in the core.

        I know some of our students were looking at this and achieved ~100G between pairs of sites that are closer to each other. I’m not sure what the current best bandwidth achieved is on the longer spans, but I remember seeing them getting at least 30G for some tests. We would be interested in knowing about any successes you have with achieving higher bandwidths.

        What tuning did you perform in your nodes?

        In general, there are a lot of variables that can prevent bandwidths at these rates. You might reduce some of those variables by starting with a pair of sites that are close to each other (maybe UTAH/SALT) and using dedicated ConnectX-6 cards.

        Your UDP test has a 98% loss. Given that the card is a 100G card, it can easily overwhelm an intermediary switch, which can result in huge packet losses like that. You might try a UDP test with lower bandwidths and slowly increase the bandwidth until your packet loss starts to grow. Then try different tuning parameters to see if you can get it higher.
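The "start low and step up until loss grows" approach can be sketched as a small script. This is only an illustration: it assumes iperf3 is the test tool and is installed on both nodes (the thread does not say which tool was used), and the function names, rate list, and 1% loss threshold are made up for the example.

```python
import json
import subprocess


def udp_loss_percent(server, rate_gbps, seconds=10):
    """Run one iperf3 UDP test against `server` and return the reported loss %.

    Assumes an iperf3 server is already listening on `server`.
    """
    cmd = ["iperf3", "-c", server, "-u", "-b", f"{rate_gbps}G",
           "-t", str(seconds), "-J"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)["end"]["sum"]["lost_percent"]


def find_loss_knee(measure, rates_gbps, max_loss_percent=1.0):
    """Try increasing rates until loss exceeds the threshold.

    `measure` is any callable taking a rate in Gbps and returning loss in %.
    Returns (highest rate that stayed under the threshold, all results).
    """
    best = None
    results = []
    for rate in rates_gbps:
        loss = measure(rate)
        results.append((rate, loss))
        if loss > max_loss_percent:
            break  # loss started to grow; stop stepping up
        best = rate
    return best, results
```

A run might look like `find_loss_knee(lambda r: udp_loss_percent(server_ip, r), [1, 5, 10, 20, 40])`, where `server_ip` is the dataplane address of the receiving node. Repeating the sweep after each tuning change shows whether the clean rate moved.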

        I’m going to see if one of our students who was working on this can add more here…

        Paul Ruth
        Keymaster

          It works for me but it didn’t work the first time I tried it. The error I got the first time might be your problem too.

          The first time I ran it I got this:

          pruth@pruth-laptop Desktop % python3 hello_edited.py
          Name      CPUs  Cores    RAM (G)    Disk (G)       Basic (100 Gbps NIC)    ConnectX-6 (100 Gbps x2 NIC)    ConnectX-5 (25 Gbps x2 NIC)    P4510 (NVMe 1TB)    Tesla T4 (GPU)    RTX6000 (GPU)
          ------  ------  -------  ---------  -------------  ----------------------  ------------------------------  -----------------------------  ------------------  ----------------  ---------------
          MICH         6  190/192  1530/1536  60590/60600    381/381                 0/2                             2/2                            10/10               2/2               3/3
          UTAH        10  320/320  2560/2560  116400/116400  635/635                 2/2                             4/4                            16/16               4/4               5/5
          TACC        10  238/320  2328/2560  115590/116400  632/635                 2/2                             4/4                            16/16               4/4               6/6
          WASH         6  188/192  1520/1536  60580/60600    379/381                 2/2                             2/2                            10/10               2/2               3/3
          NCSA         6  192/192  1536/1536  60600/60600    381/381                 2/2                             2/2                            10/10               2/2               3/3
          DALL         6  190/192  1528/1536  60590/60600    381/381                 2/2                             2/2                            10/10               2/2               3/3
          MAX         10  290/320  2452/2560  116190/116400  619/635                 1/2                             4/4                            16/16               4/4               6/6
          MASS         4  120/128  992/1024   55700/55800    254/254                 1/2                             0/0                            6/6                 0/0               3/3
          SALT         6  184/192  1504/1536  60500/60600    380/381                 2/2                             2/2                            10/10               2/2               3/3
          STAR        12  368/384  3008/3072  121060/121200  757/762                 2/2                             6/6                            20/20               6/6               4/6
          Exception: Failed to submit slice: Status.FAILURE, (500)
          Reason: INTERNAL SERVER ERROR
          HTTP response headers: HTTPHeaderDict({'Server': 'nginx/1.21.6', 'Date': 'Fri, 15 Jul 2022 15:08:55 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '28', 'Connection': 'keep-alive', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': 'DNT, User-Agent, X-Requested-With, If-Modified-Since, Cache-Control, Content-Type, Range', 'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE, OPTIONS', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Content-Length, Content-Range, X-Error', 'X-Error': 'Slice MySlice already exists'})
          HTTP response body: Slice MySlice already exists
          
          Exception: 'NoneType' object has no attribute 'slice_name'
          -----------------  --------------------------------------------------------------------------------------------------------------------
          ID
          Name               Node1
          Cores
          RAM
          Disk
          Image              default_rocky_8
          Image Type         qcow2
          Host
          Site               UTAH
          Management IP
          Reservation State
          Error Message
          SSH Command        ssh -i /Users/pruth/work/fabric_config/slice-private-key -J pruth_0031379841@bastion-1.fabric-testbed.net rocky@None
          -----------------  --------------------------------------------------------------------------------------------------------------------
          Exception: node.execute: Management IP Invalid: None
          Exception: Failed to delete slice: Status.INVALID_ARGUMENTS, Invalid arguments
          pruth@pruth-laptop Desktop %
          

          Notice the error in the middle that says “HTTP response body: Slice MySlice already exists”.  This is because I already had a slice called “MySlice”.   I deleted that slice and re-ran your script and it worked.  This was the result:

          pruth@pruth-laptop Desktop % python3 hello_edited.py
          Name      CPUs  Cores    RAM (G)    Disk (G)       Basic (100 Gbps NIC)    ConnectX-6 (100 Gbps x2 NIC)    ConnectX-5 (25 Gbps x2 NIC)    P4510 (NVMe 1TB)    Tesla T4 (GPU)    RTX6000 (GPU)
          ------  ------  -------  ---------  -------------  ----------------------  ------------------------------  -----------------------------  ------------------  ----------------  ---------------
          MICH         6  190/192  1530/1536  60590/60600    381/381                 0/2                             2/2                            10/10               2/2               3/3
          UTAH        10  320/320  2560/2560  116400/116400  635/635                 2/2                             4/4                            16/16               4/4               5/5
          TACC        10  238/320  2328/2560  115590/116400  632/635                 2/2                             4/4                            16/16               4/4               6/6
          WASH         6  188/192  1520/1536  60580/60600    379/381                 2/2                             2/2                            10/10               2/2               3/3
          NCSA         6  192/192  1536/1536  60600/60600    381/381                 2/2                             2/2                            10/10               2/2               3/3
          DALL         6  190/192  1528/1536  60590/60600    381/381                 2/2                             2/2                            10/10               2/2               3/3
          MAX         10  290/320  2452/2560  116190/116400  619/635                 1/2                             4/4                            16/16               4/4               6/6
          MASS         4  120/128  992/1024   55700/55800    254/254                 1/2                             0/0                            6/6                 0/0               3/3
          SALT         6  184/192  1504/1536  60500/60600    380/381                 2/2                             2/2                            10/10               2/2               3/3
          STAR        12  368/384  3008/3072  121060/121200  757/762                 2/2                             6/6                            20/20               6/6               4/6
          
          Waiting for slice ........... Slice state: StableOK
          Waiting for ssh in slice .. ssh successful
          Running post boot config ... Done!
          ---------------  ------------------------------------
          Slice Name       MySlice
          Slice ID         fba02fd7-423e-4309-9954-c3cbff38870a
          Slice State      StableOK
          Lease End (UTC)  2022-07-16 15:11:53 +0000
          ---------------  ------------------------------------
          -----------------  ------------------------------------------------------------------------------------------------------------------------------------------------------
          ID                 59eda82a-b9b7-4670-b830-40cff59e18cc
          Name               Node1
          Cores              2
          RAM                8
          Disk               10
          Image              default_rocky_8
          Image Type         qcow2
          Host               dall-w3.fabric-testbed.net
          Site               DALL
          Management IP      2001:400:a100:3000:f816:3eff:fe7e:5477
          Reservation State  Active
          Error Message
          SSH Command        ssh -i /Users/pruth/work/fabric_config/slice-private-key -J pruth_0031379841@bastion-1.fabric-testbed.net rocky@2001:400:a100:3000:f816:3eff:fe7e:5477
          -----------------  ------------------------------------------------------------------------------------------------------------------------------------------------------
          Hello, FABRIC from node 59eda82a-b9b7-4670-b830-40cff59e18cc-node1
          
          pruth@pruth-laptop Desktop % 
          

          Is this your issue too?
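If the "Slice MySlice already exists" error is the cause, one way to avoid it is to delete any leftover slice with the same name before submitting. The sketch below assumes the fablib manager object exposes `get_slice(name=...)` (raising an exception when no such slice exists) and that slices have a `delete()` method; the helper name is made up for this example.

```python
def delete_slice_if_exists(fablib, slice_name):
    """Delete a leftover slice with this name, if there is one.

    `fablib` is the fablib manager object used in the notebooks (assumed to
    provide get_slice(name=...)). Returns True if a slice was deleted.
    """
    try:
        existing = fablib.get_slice(name=slice_name)
    except Exception:
        # Assumption: a failed lookup means there is no slice with this name.
        return False
    existing.delete()
    return True
```

Calling `delete_slice_if_exists(fablib, "MySlice")` at the top of the script would have the same effect as deleting the old slice by hand before re-running. Deletion may take a moment to complete, so a short wait before resubmitting may be needed.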
           

          • This reply was modified 2 years, 4 months ago by Paul Ruth.
          Paul Ruth
          Keymaster

            I think I fixed it so you can attach .py and .txt files.  Can you try again?

             

            in reply to: cannot login to reserved nodes #2298
            Paul Ruth
            Keymaster

              Look at the example notebook called “Bastion Keypair”.  It sets up an ssh config file that is necessary for ssh’ing from a command line.  You can add the path to your bastion key and your bastion user ID to this notebook. Then run the notebook and it will create the correct ssh config file.

              This is an initial response to a quirk in command line ssh when jumping through a host with -J.  For some reason you cannot pass the bastion host key on the command line.  The only way to do this is to have the bastion private key in a keychain or in the ssh config file.   SSHing from inside a notebook uses paramiko and does not need the ssh config file.
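For illustration, the kind of ssh config entry involved can be generated with a few lines of Python. This is a minimal sketch, not the contents of the “Bastion Keypair” notebook: the function name is made up, and the username and key path passed in are placeholders you would replace with your own values.

```python
def bastion_ssh_config(bastion_user, bastion_key_path,
                       bastion_host="bastion-1.fabric-testbed.net"):
    """Return an ssh config stanza that tells ssh which user and private key
    to use whenever it connects to the bastion host (including via -J)."""
    lines = [
        f"Host {bastion_host}",
        f"    User {bastion_user}",
        f"    IdentityFile {bastion_key_path}",
        "    IdentitiesOnly yes",
        "",
    ]
    return "\n".join(lines)
```

The returned text would be saved where ssh reads its configuration (for example `~/.ssh/config`), after which the `ssh -i <slice key> -J <user>@bastion-1.fabric-testbed.net ...` command printed by fablib can pick up the bastion key from the config file.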

              Very soon we will release a new version of fablib that will streamline a bunch of config including this issue.

              Paul Ruth
              Keymaster

                Can you send me the python file you are using so I can try to recreate this issue?

                Paul Ruth
                Keymaster

                  You have the project ID set to the name of your project.  It should be set to the GUID that can be found on the project’s page in the portal.  For example, the project ID for the FABRIC Tutorials project is circled in the attached image.
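A quick sanity check can catch this mistake before a slice is submitted. The sketch below only tests whether a value has the shape of a GUID; it cannot tell whether the GUID belongs to a real project. The function name is made up for the example.

```python
import uuid


def looks_like_guid(value):
    """Return True if `value` is formatted like a GUID such as
    'fba02fd7-423e-4309-9954-c3cbff38870a' (hyphenated hex)."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False
```

For example, a project name such as `"FABRIC Tutorials"` fails the check, while a value copied from the project’s page in the portal should pass.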

                  Paul

                   

                  • This reply was modified 2 years, 4 months ago by Paul Ruth.
                  Paul Ruth
                  Keymaster

                    I created a note for the developers.

                    thanks,

                    Paul

                     

                    Paul Ruth
                    Keymaster

                      The Jupyter notebooks are just Python, but they allow you to run the code one cell at a time. Can you cut/paste the code from the cells of the “Hello, FABRIC” notebook into a .py script and run it?  As long as your env vars and Python libraries are set up correctly, it should work.
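Instead of copying cells by hand, the code cells can be pulled out of the .ipynb file, which is plain JSON. This is a minimal sketch using only the standard library; the function names and file names are made up, and cells containing notebook-only syntax (lines starting with `%` or `!`) would still need editing by hand.

```python
import json


def notebook_to_script(notebook):
    """Join the source of every code cell in a parsed .ipynb document."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip() + "\n")
    return "\n".join(chunks)


def convert_file(ipynb_path, py_path):
    """Read a notebook file and write its code cells to a .py script."""
    with open(ipynb_path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(py_path, "w", encoding="utf-8") as f:
        f.write(notebook_to_script(notebook))
```

Something like `convert_file("hello_fabric.ipynb", "hello.py")` followed by `python3 hello.py` would then run the same code outside Jupyter. The `jupyter nbconvert --to script` command does a similar job if Jupyter is installed locally.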

                      in reply to: get_physical_os_interface()[‘ifname’] failed #2283
                      Paul Ruth
                      Keymaster

                        I think some of those debugging notebooks are old and maybe don’t work anymore.

                        You can use the 100G networks by just creating a WAN link that connects VMs using 100G NICs.   Any of the regular networking notebooks should work for this.  The only thing to think about is that, for now, dedicated quality of service guarantees are not available.  However, very little bandwidth is currently being used and you should not be limited by other users.
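As a rough sketch of what such a slice could look like in fablib: two nodes, each with a dedicated ConnectX-6 NIC, joined by one L2 network. The calls used here (`new_slice`, `add_node`, `add_component`, `get_interfaces`, `add_l2network`) and the model string `NIC_ConnectX_6` are my understanding of the fablib API and may differ between versions; the node, NIC, and network names are made up.

```python
def build_100g_pair(fablib, slice_name, site_a, site_b):
    """Describe a two-node slice whose nodes are connected over 100G NICs.

    `fablib` is the fablib manager object used in the notebooks. The slice is
    only built here; the caller still has to call submit() on the result.
    """
    new_slice = fablib.new_slice(name=slice_name)
    interfaces = []
    for node_name, site in (("node1", site_a), ("node2", site_b)):
        node = new_slice.add_node(name=node_name, site=site)
        # Assumed model name for the dedicated 100 Gbps ConnectX-6 card.
        nic = node.add_component(model="NIC_ConnectX_6",
                                 name=f"{node_name}-nic")
        interfaces.append(nic.get_interfaces()[0])  # first port of the card
    new_slice.add_l2network(name="wan-link", interfaces=interfaces)
    return new_slice
```

For example, `build_100g_pair(fablib, "MySlice", "UTAH", "SALT").submit()` would request the pair on two nearby sites; the regular networking notebooks remain the reference for the exact calls.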

                        That said, we have only begun testing most of the links and have not confirmed the bandwidth we can achieve.  In theory, most of them should be able to get 100G but I suspect most of them will need some tuning. Please try this and let us know what you can achieve.

                        thanks,

                        Paul

                        • This reply was modified 2 years, 4 months ago by Paul Ruth.
                        Paul Ruth
                        Keymaster

                          Are you still having issues running your notebook?

                          Paul Ruth
                          Keymaster

                            Which tags do you need? Which project?

                            Paul

                            Paul Ruth
                            Keymaster

                              I’m not sure what the problem is. When I try the code you posted it works.   I think this means it has something to do with your configuration.  Are you able to run the “Hello, FABRIC” notebook?   That one is, basically, a test that confirms the configuration is correct.

                              Paul Ruth
                              Keymaster

                                Which project are you working on? A FABRIC admin needs to set the tag.

                                thanks,

                                Paul

                                Paul Ruth
                                Keymaster

                                  We are working on better error messages but for now ‘Management IP Invalid: None’ is a bit of a generic failure message. It means that the VM didn’t get a Management IP assigned to it.  In practice, this is the result of an uncaught VM failure, often related to errors in assigning IPs but sometimes other things.

                                  It is difficult to say what is causing this specific error but we seem to see this occasionally when a site is having issues starting VMs.  You might try to resubmit the slice but on a different site.  In your case you are using a random site so it may be as easy as retrying the same request.  It would also be useful if you let us know which site you are seeing this on when it happens.
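The retry-on-another-site idea, along with keeping track of which site failed, can be sketched generically. The helper below is hypothetical and knows nothing about fablib: `submit_on_site` stands for whatever function in your script builds and submits the slice on a given site and raises an exception on failure.

```python
def submit_with_site_retry(submit_on_site, sites):
    """Try each site in turn until one submission succeeds.

    Returns (site that worked, its result, {failed site: error message}),
    so the failing sites can be reported back.
    """
    failures = {}
    for site in sites:
        try:
            result = submit_on_site(site)
        except Exception as err:
            failures[site] = str(err)  # remember which site failed and why
            continue
        return site, result, failures
    raise RuntimeError(f"all sites failed: {failures}")
```

The `failures` dictionary is the useful part for a forum follow-up: it records the site name next to the error text for each attempt that did not work.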

                                  Paul

                                   
