Paul Ruth

Forum Replies Created

Viewing 15 posts - 151 through 165 (of 275 total)
  • in reply to: get_physical_os_interface()[‘ifname’] failed #2330
    Paul Ruth
    Keymaster

      Those are all great resources.

      mtu = 9000 (jumbo frames) is important too.

      With 100G NICs this part from fasterdata.es.net is important too:

      We also strongly recommend reducing the maximum flow rate to avoid bursts of packets that could overflow switch and receive host buffers.
      
      For example for a 10G host, add this to a boot script:
      
      /sbin/tc qdisc add dev ethN root fq maxrate 8gbit
      For a host running data transfer tools that use 4 parallel streams, do this:
      
      /sbin/tc qdisc add dev ethN root fq maxrate 2gbit
      Where 'ethN' is the name of the ethernet device on your system.
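As a rough illustration of the arithmetic in that quote, here is a minimal Python sketch that builds the two tuning commands for a given NIC. The helper name `tuning_commands` and the 0.8 headroom factor are assumptions inferred from the 10G example (8gbit for one stream, 2gbit each for four streams); the right cap for a 100G NIC may differ, so check fasterdata.es.net before using it.

```python
def tuning_commands(dev, link_gbit, streams=1, headroom=0.8, mtu=9000):
    """Build the jumbo-frame and fq pacing commands for one interface.

    headroom=0.8 is an assumption taken from the 10G example above
    (10G link -> 8gbit cap); it is not an official recommendation.
    """
    if streams < 1:
        raise ValueError("streams must be at least 1")
    # Split the capped rate evenly across the parallel streams.
    per_flow_gbit = link_gbit * headroom / streams
    return [
        f"ip link set dev {dev} mtu {mtu}",
        f"/sbin/tc qdisc add dev {dev} root fq maxrate {per_flow_gbit:g}gbit",
    ]


if __name__ == "__main__":
    for cmd in tuning_commands("ethN", link_gbit=10, streams=4):
        print(cmd)
```

The function only prints the commands; they still have to be run as root on the node, with 'ethN' replaced by the real device name.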

      in reply to: cannot login to reserved nodes #2329
      Paul Ruth
      Keymaster

        You are not allowed to ssh to our bastion host directly; you can only jump through it. The node.execute call does this for you by using paramiko to open a “channel” through the bastion to the node.

        If you want to replicate this in your own code the example is here: https://github.com/fabric-testbed/fabrictestbed-extensions/blob/30175ec0c5d05d93000443448d8abfb554f99c7c/fabrictestbed_extensions/fablib/node.py#L646
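For readers who do not want to dig through the linked file, the general pattern looks roughly like the sketch below. It is not a copy of the fablib code: the function names, the use of `key_filename`, and the `AutoAddPolicy` host-key handling are my assumptions, and every host, user, and key path is supplied by the caller.

```python
import ipaddress


def channel_addrs(node_ip, port=22):
    """Return (dest_addr, src_addr) for a direct-tcpip channel to node_ip."""
    # Use the wildcard source address that matches the node's IP version.
    is_v6 = ipaddress.ip_address(node_ip).version == 6
    return (node_ip, port), ("::" if is_v6 else "0.0.0.0", port)


def run_via_bastion(command, node_ip, node_user, node_key_file,
                    bastion_host, bastion_user, bastion_key_file):
    """Run one command on a node by tunnelling through the bastion."""
    import paramiko  # third-party (pip install paramiko), imported on first use

    dest_addr, src_addr = channel_addrs(node_ip)

    bastion = paramiko.SSHClient()
    bastion.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    bastion.connect(bastion_host, username=bastion_user,
                    key_filename=bastion_key_file)
    try:
        # The "channel": a TCP stream opened by the bastion towards the node.
        channel = bastion.get_transport().open_channel(
            "direct-tcpip", dest_addr, src_addr)

        node = paramiko.SSHClient()
        node.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        # The second SSH session runs over the channel instead of a new socket.
        node.connect(node_ip, username=node_user,
                     key_filename=node_key_file, sock=channel)
        try:
            _, stdout, stderr = node.exec_command(command)
            return stdout.read().decode(), stderr.read().decode()
        finally:
            node.close()
    finally:
        bastion.close()
```

The bastion never gets a login shell here; it only forwards the TCP stream, which is why a direct ssh to the bastion is refused while this still works.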

        Paul

        Paul Ruth
        Keymaster

          This might be a Windows issue. I’m going to have to have some other people look at it. Is there any way you could reproduce that graphml error and include a full stack trace? That might help us track this down.

          Re: Code in a forum post. Click the “Text” tab next to the “Visual” tab in the top right of the box that you are typing in. Then click the “CODE” button and it will insert a `. Add your code, then click the “/CODE” button to insert another `. Anything between the `s will be in the box that my code was in.

          • This reply was modified 3 years, 6 months ago by Paul Ruth.
          in reply to: get_physical_os_interface()[‘ifname’] failed #2322
          Paul Ruth
          Keymaster

            We are still working on tuning all the links and trying to figure out best practices for achieving very high bandwidths. There are no artificial limitations on that link and, in theory, 100G is possible. This is just going to require a bunch of tuning, both at the edge and probably in the core.

            I know some of our students were looking at this and achieved ~100G between pairs of sites that are closer to each other. I’m not sure what the current best bandwidth achieved is on the longer spans, but I remember seeing them getting at least 30G for some tests. We would be interested in knowing about any successes you have with achieving higher bandwidths.

            What tuning did you perform in your nodes?

            In general, there are a lot of variables that can prevent bandwidths at these rates. You might reduce some of those variables by starting with a pair of sites that are close to each other (maybe UTAH/SALT) and using dedicated ConnectX-6 cards.

            Your UDP test has a 98% loss. Given that the card is a 100G card, it can easily overwhelm an intermediary switch, which can result in huge packet losses like that. You might try a UDP test with lower bandwidths and slowly increase the bandwidth until your packet loss starts to grow. Then try different tuning parameters to see if you can get it higher.
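One way to organize that ramp-up is sketched below, assuming iperf3 is the test tool (the thread does not say which tool was used). The helper names, the doubling step, and the 1% loss threshold are all hypothetical; the loss figures must come from your own runs.

```python
def udp_sweep_commands(server, start_gbit=1, stop_gbit=100, factor=2, seconds=10):
    """Return iperf3 UDP client commands at geometrically increasing rates."""
    if start_gbit <= 0 or factor <= 1:
        raise ValueError("start_gbit must be > 0 and factor must be > 1")
    commands = []
    rate = start_gbit
    while rate <= stop_gbit:
        # -u selects UDP, -b sets the target bandwidth, -t the duration.
        commands.append(f"iperf3 -c {server} -u -b {rate:g}G -t {seconds}")
        rate *= factor
    return commands


def highest_clean_rate(results, max_loss_pct=1.0):
    """Return the last rate before loss exceeds max_loss_pct, or None.

    results is a list of (rate_gbit, loss_percent) pairs that you measured.
    """
    best = None
    for rate, loss in sorted(results):
        if loss > max_loss_pct:
            break  # loss has started to grow; stop at the previous rate
        best = rate
    return best


if __name__ == "__main__":
    for cmd in udp_sweep_commands("<server address>"):
        print(cmd)
```

Run each command against an `iperf3 -s` server on the other node, record the reported loss, and re-run the sweep after each tuning change to see whether the clean rate moves.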

            I’m going to see if one of our students who was working on this can add more here…

            Paul Ruth
            Keymaster

              It works for me but it didn’t work the first time I tried it. The error I got the first time might be your problem too.

              The first time I ran it I got this:

              pruth@pruth-laptop Desktop % python3 hello_edited.py
              Name      CPUs  Cores    RAM (G)    Disk (G)       Basic (100 Gbps NIC)    ConnectX-6 (100 Gbps x2 NIC)    ConnectX-5 (25 Gbps x2 NIC)    P4510 (NVMe 1TB)    Tesla T4 (GPU)    RTX6000 (GPU)
              ------  ------  -------  ---------  -------------  ----------------------  ------------------------------  -----------------------------  ------------------  ----------------  ---------------
              MICH         6  190/192  1530/1536  60590/60600    381/381                 0/2                             2/2                            10/10               2/2               3/3
              UTAH        10  320/320  2560/2560  116400/116400  635/635                 2/2                             4/4                            16/16               4/4               5/5
              TACC        10  238/320  2328/2560  115590/116400  632/635                 2/2                             4/4                            16/16               4/4               6/6
              WASH         6  188/192  1520/1536  60580/60600    379/381                 2/2                             2/2                            10/10               2/2               3/3
              NCSA         6  192/192  1536/1536  60600/60600    381/381                 2/2                             2/2                            10/10               2/2               3/3
              DALL         6  190/192  1528/1536  60590/60600    381/381                 2/2                             2/2                            10/10               2/2               3/3
              MAX         10  290/320  2452/2560  116190/116400  619/635                 1/2                             4/4                            16/16               4/4               6/6
              MASS         4  120/128  992/1024   55700/55800    254/254                 1/2                             0/0                            6/6                 0/0               3/3
              SALT         6  184/192  1504/1536  60500/60600    380/381                 2/2                             2/2                            10/10               2/2               3/3
              STAR        12  368/384  3008/3072  121060/121200  757/762                 2/2                             6/6                            20/20               6/6               4/6
              Exception: Failed to submit slice: Status.FAILURE, (500)
              Reason: INTERNAL SERVER ERROR
              HTTP response headers: HTTPHeaderDict({'Server': 'nginx/1.21.6', 'Date': 'Fri, 15 Jul 2022 15:08:55 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '28', 'Connection': 'keep-alive', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': 'DNT, User-Agent, X-Requested-With, If-Modified-Since, Cache-Control, Content-Type, Range', 'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE, OPTIONS', 'Access-Control-Allow-Origin': '*', 'Access-Control-Expose-Headers': 'Content-Length, Content-Range, X-Error', 'X-Error': 'Slice MySlice already exists'})
              HTTP response body: Slice MySlice already exists
              
              Exception: 'NoneType' object has no attribute 'slice_name'
              -----------------  --------------------------------------------------------------------------------------------------------------------
              ID
              Name               Node1
              Cores
              RAM
              Disk
              Image              default_rocky_8
              Image Type         qcow2
              Host
              Site               UTAH
              Management IP
              Reservation State
              Error Message
              SSH Command        ssh -i /Users/pruth/work/fabric_config/slice-private-key -J pruth_0031379841@bastion-1.fabric-testbed.net rocky@None
              -----------------  --------------------------------------------------------------------------------------------------------------------
              Exception: node.execute: Management IP Invalid: None
              Exception: Failed to delete slice: Status.INVALID_ARGUMENTS, Invalid arguments
              pruth@pruth-laptop Desktop %
              

              Notice the error in the middle that says “HTTP response body: Slice MySlice already exists”.  This is because I already had a slice called “MySlice”.   I deleted that slice and re-ran your script and it worked.  This was the result:

              pruth@pruth-laptop Desktop % python3 hello_edited.py
              Name      CPUs  Cores    RAM (G)    Disk (G)       Basic (100 Gbps NIC)    ConnectX-6 (100 Gbps x2 NIC)    ConnectX-5 (25 Gbps x2 NIC)    P4510 (NVMe 1TB)    Tesla T4 (GPU)    RTX6000 (GPU)
              ------  ------  -------  ---------  -------------  ----------------------  ------------------------------  -----------------------------  ------------------  ----------------  ---------------
              MICH         6  190/192  1530/1536  60590/60600    381/381                 0/2                             2/2                            10/10               2/2               3/3
              UTAH        10  320/320  2560/2560  116400/116400  635/635                 2/2                             4/4                            16/16               4/4               5/5
              TACC        10  238/320  2328/2560  115590/116400  632/635                 2/2                             4/4                            16/16               4/4               6/6
              WASH         6  188/192  1520/1536  60580/60600    379/381                 2/2                             2/2                            10/10               2/2               3/3
              NCSA         6  192/192  1536/1536  60600/60600    381/381                 2/2                             2/2                            10/10               2/2               3/3
              DALL         6  190/192  1528/1536  60590/60600    381/381                 2/2                             2/2                            10/10               2/2               3/3
              MAX         10  290/320  2452/2560  116190/116400  619/635                 1/2                             4/4                            16/16               4/4               6/6
              MASS         4  120/128  992/1024   55700/55800    254/254                 1/2                             0/0                            6/6                 0/0               3/3
              SALT         6  184/192  1504/1536  60500/60600    380/381                 2/2                             2/2                            10/10               2/2               3/3
              STAR        12  368/384  3008/3072  121060/121200  757/762                 2/2                             6/6                            20/20               6/6               4/6
              
              Waiting for slice ........... Slice state: StableOK
              Waiting for ssh in slice .. ssh successful
              Running post boot config ... Done!
              ---------------  ------------------------------------
              Slice Name       MySlice
              Slice ID         fba02fd7-423e-4309-9954-c3cbff38870a
              Slice State      StableOK
              Lease End (UTC)  2022-07-16 15:11:53 +0000
              ---------------  ------------------------------------
              -----------------  ------------------------------------------------------------------------------------------------------------------------------------------------------
              ID                 59eda82a-b9b7-4670-b830-40cff59e18cc
              Name               Node1
              Cores              2
              RAM                8
              Disk               10
              Image              default_rocky_8
              Image Type         qcow2
              Host               dall-w3.fabric-testbed.net
              Site               DALL
              Management IP      2001:400:a100:3000:f816:3eff:fe7e:5477
              Reservation State  Active
              Error Message
              SSH Command        ssh -i /Users/pruth/work/fabric_config/slice-private-key -J pruth_0031379841@bastion-1.fabric-testbed.net rocky@2001:400:a100:3000:f816:3eff:fe7e:5477
              -----------------  ------------------------------------------------------------------------------------------------------------------------------------------------------
              Hello, FABRIC from node 59eda82a-b9b7-4670-b830-40cff59e18cc-node1
              
              pruth@pruth-laptop Desktop % 
              

              Is this your issue too?
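If the “already exists” failure keeps coming back between runs, a script can clear the old slice before creating a new one. This is a minimal sketch, assuming the fablib object exposes `get_slice(name=...)` that raises when no such slice exists and that slices have a `delete()` method, as in the fablib versions I have used; the helper name `delete_slice_if_exists` is hypothetical.

```python
def delete_slice_if_exists(fablib, name):
    """Delete the slice called name if there is one; return True if deleted."""
    try:
        existing = fablib.get_slice(name=name)
    except Exception:
        # Assumption: get_slice raises when no slice with this name exists.
        return False
    existing.delete()
    return True


# Usage inside a script (assumes fablib is installed and configured):
#   from fabrictestbed_extensions.fablib.fablib import fablib
#   delete_slice_if_exists(fablib, "MySlice")
#   slice = fablib.new_slice(name="MySlice")
```

Passing the fablib object in as an argument keeps the helper independent of how your fablib version is imported.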
               

              • This reply was modified 3 years, 6 months ago by Paul Ruth.
              Paul Ruth
              Keymaster

                I think I fixed it so you can attach .py and .txt files.  Can you try again?

                 

                in reply to: cannot login to reserved nodes #2298
                Paul Ruth
                Keymaster

                  Look at the example notebook called “Bastion Keypair”. It sets up an ssh config file that is necessary for ssh’ing from a command line. You can add the path to your bastion key and your bastion user ID to this notebook. Then run the notebook and it will create the correct ssh config file.

                  This is an initial response to a quirk in command line ssh when jumping through a host with -J.  For some reason you cannot pass the bastion host key on the command line.  The only way to do this is to have the bastion private key in a keychain or in the ssh config file.   SSHing from inside a notebook uses paramiko and does not need the ssh config file.
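For anyone who wants to see what such a config file can look like without opening the notebook, here is a minimal sketch that renders one. It is my own approximation, not the notebook's output: the function name, the default bastion hostname (taken from the SSH command shown elsewhere in this thread), and the exact set of options are assumptions.

```python
def bastion_ssh_config(bastion_user, bastion_key_file,
                       bastion_host="bastion-1.fabric-testbed.net"):
    """Render an ssh config that jumps through the bastion with its own key."""
    lines = [
        # Settings used when ssh connects to the bastion itself.
        f"Host {bastion_host}",
        f"    User {bastion_user}",
        f"    IdentityFile {bastion_key_file}",
        "    IdentitiesOnly yes",
        "",
        # Every other host is reached by jumping through the bastion.
        f"Host * !{bastion_host}",
        f"    ProxyJump {bastion_user}@{bastion_host}:22",
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(bastion_ssh_config("<bastion username>", "<path to bastion key>"))
```

Save the output to a file such as `ssh_config`, then connect with `ssh -F ssh_config -i <slice private key> rocky@<management IP>`. The bastion key comes from the config file and the slice key from `-i`, which sidesteps the `-J` limitation described above.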

                  Very soon we will release a new version of fablib that will streamline a bunch of config including this issue.

                  Paul Ruth
                  Keymaster

                    Can you send me the python file you are using so I can try to recreate this issue?

                    Paul Ruth
                    Keymaster

                      You have the project ID set to the name of your project. It should be set to the GUID that can be found on the project’s page in the portal. For example, the project ID for the FABRIC Tutorials project is circled in the attached image.
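A quick way to catch this mistake before submitting anything is to check that the configured value parses as a GUID. This is a minimal sketch using only the standard library; the helper name `is_guid` is hypothetical, and reading the value from a `FABRIC_PROJECT_ID` environment variable is an assumption about how your configuration is set up.

```python
import os
import uuid


def is_guid(value):
    """Return True if value parses as a GUID/UUID, False otherwise."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


if __name__ == "__main__":
    # Assumption: the project ID is exported as FABRIC_PROJECT_ID.
    project_id = os.environ.get("FABRIC_PROJECT_ID", "")
    if not is_guid(project_id):
        print(f"Project ID {project_id!r} is not a GUID; copy it from the portal.")
```

A project name such as "FABRIC Tutorials" fails this check, while a value in the 8-4-4-4-12 hex format passes.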

                      Paul

                       

                      • This reply was modified 3 years, 6 months ago by Paul Ruth.
                      Paul Ruth
                      Keymaster

                        I created a note for the developers.

                        thanks,

                        Paul

                         

                        Paul Ruth
                        Keymaster

                          The Jupyter notebooks are just Python, but they let you run the code one cell at a time. Can you cut/paste the code from the cells of the “Hello, FABRIC” notebook into a .py script and run it? As long as your env vars and Python libraries are set up correctly, it should work.
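Before running the pasted code as a script, it can help to confirm that the environment variables the notebook relies on are visible to the script as well. The sketch below is a plain presence check; the variable names in `ASSUMED_VARS` are my recollection of what fablib read at the time and should be compared against your own notebook or rc file.

```python
import os

# Assumption: these are the settings your fablib version reads from the
# environment. Adjust the list to match your own configuration.
ASSUMED_VARS = (
    "FABRIC_CREDMGR_HOST",
    "FABRIC_ORCHESTRATOR_HOST",
    "FABRIC_TOKEN_LOCATION",
    "FABRIC_PROJECT_ID",
    "FABRIC_BASTION_HOST",
    "FABRIC_BASTION_USERNAME",
    "FABRIC_BASTION_KEY_LOCATION",
    "FABRIC_SLICE_PRIVATE_KEY_FILE",
    "FABRIC_SLICE_PUBLIC_KEY_FILE",
)


def missing_settings(names=ASSUMED_VARS, environ=None):
    """Return the names that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
```

Put the check at the top of the .py file so a missing setting is reported before any slice is requested.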

                          in reply to: get_physical_os_interface()[‘ifname’] failed #2283
                          Paul Ruth
                          Keymaster

                            I think some of those debugging notebooks are old and maybe don’t work anymore.

                            You can use the 100G networks by just creating a WAN link that connects VMs using 100G NICs.   Any of the regular networking notebooks should work for this.  The only thing to think about is that, for now, dedicated quality of service guarantees are not available.  However, very little bandwidth is currently being used and you should not be limited by other users.

                            That said, we have only begun testing most of the links and have not confirmed the bandwidth we can achieve.  In theory, most of them should be able to get 100G but I suspect most of them will need some tuning. Please try this and let us know what you can achieve.

                            thanks,

                            Paul

                            • This reply was modified 3 years, 6 months ago by Paul Ruth.
                            Paul Ruth
                            Keymaster

                              Are you still having issues running your notebook?

                              Paul Ruth
                              Keymaster

                                Which tags do you need? Which project?

                                Paul

                                Paul Ruth
                                Keymaster

                                  I’m not sure what the problem is. When I try the code you posted it works.   I think this means it has something to do with your configuration.  Are you able to run the “Hello, FABRIC” notebook?   That one is, basically, a test that confirms the configuration is correct.
