Perhaps one of the bastion hosts is out

Viewing 15 posts - 1 through 15 (of 17 total)
  • #9391
    Ilya Baldin
    Participant

      Today I tried to log in to my VMs, and every once in a while the connection hangs while connecting to the bastion. When I try again, it works. Maybe a dead bastion host? I haven’t spent any time debugging.

      #9392
      Hussam Nasir
      Participant

        Hello Ilya,

        We have added two new bastions recently and modified one of the pre-existing ones. Can you please post the result of

        “nslookup bastion.fabric-testbed.net” from the machine where this failed?

        #9393
        Ilya Baldin
        Participant

          This is what I see (this is from my home on Google Fiber):

          nslookup bastion.fabric-testbed.net
          Server: 192.168.1.1
          Address: 192.168.1.1#53

          Non-authoritative answer:
          Name: bastion.fabric-testbed.net
          Address: 128.163.180.149
          Name: bastion.fabric-testbed.net
          Address: 23.134.235.242
          Name: bastion.fabric-testbed.net
          Address: 141.142.140.10
          Name: bastion.fabric-testbed.net
          Address: 152.54.15.12

          I also noticed that some commands sent to VMs over SSH from my laptop-local notebook either never run or are very delayed, which I suspect is part of the same issue. Strangely, all of these are reachable via SSH.

          #9396
          Hussam Nasir
          Participant

            One thing that stands out is that I don’t see any IPv6 addresses for the bastion host in your name lookup. We had been seeing issues with IPv6 from home networks, but we believe the workaround we put in place has worked, as reported by other users. I would also like to see the fablib.log file. Also, could you provide your source IP? It’s possible that one of the bastions has banned it.

            #9398
            Ilya Baldin
            Participant

              My IP is 136.61.60.222.

              I do not have any IPv6 on my home network, so that isn’t surprising. I’m using a DNS proxy, but even if I ask 8.8.8.8 directly I get:

              $ dig @8.8.8.8 bastion.fabric-testbed.net 
              
              ; <<>> DiG 9.10.6 <<>> @8.8.8.8 bastion.fabric-testbed.net
              ; (1 server found)
              ;; global options: +cmd
              ;; Got answer:
              ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15505
              ;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
              
              ;; OPT PSEUDOSECTION:
              ; EDNS: version: 0, flags:; udp: 512
              ;; QUESTION SECTION:
              ;bastion.fabric-testbed.net. IN A
              
              ;; ANSWER SECTION:
              bastion.fabric-testbed.net. 3600 IN A 23.134.235.242
              bastion.fabric-testbed.net. 3600 IN A 128.163.180.149
              bastion.fabric-testbed.net. 3600 IN A 141.142.140.10
              bastion.fabric-testbed.net. 3600 IN A 152.54.15.12

              The log is full of the following messages:

              [21:02:48] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:02:48] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:02:48] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:02:48] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:09:13] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [21:09:13] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:14:12] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [21:14:12] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:43:49] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [21:43:49] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [21:43:49] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:43:49] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:47:35] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [21:47:35] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [21:47:35] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [21:47:35] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [21:47:35] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:47:35] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:47:35] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [21:47:35] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/paramiko/transport.py:1944} ERROR - Secsh channel 0 open FAILED: Connection timed out: Connect failed
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
              [22:11:24] {/Users/baldin/venv/fabric/lib/python3.12/site-packages/fabrictestbed_extensions/fablib/node.py:1600} WARNING - Attempt 1 failed: ChannelException(2, 'Connect failed')
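A log like the one above is easier to read once repeated records are grouped. The following is a minimal sketch, assuming the line layout seen in this excerpt (`[HH:MM:SS] {path:line} LEVEL - message`); `tally_failures` is a hypothetical helper name, not part of fablib.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt in this thread:
# [21:02:48] {/path/to/node.py:1600} WARNING - Attempt 1 failed: ...
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] \{([^}]+):(\d+)\} (\w+) - (.*)$")


def tally_failures(lines):
    """Count log records per (timestamp, source file name, level)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip wrapped fragments or lines in another format
        timestamp, path, _lineno, level, _message = match.groups()
        source = path.rsplit("/", 1)[-1]  # keep only the file name
        counts[(timestamp, source, level)] += 1
    return counts
```

Passing the lines of fablib.log to `tally_failures` and printing the sorted items would show how many channel failures occurred at each timestamp, which makes bursts such as the one at 22:11:24 stand out.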
              #9399
              Hussam Nasir
              Participant

                The fablib logs do not say which bastion was tried. Enabling verbose debug logging might help. The other thing we can try is to SSH directly to each bastion, one by one, to see which one fails to connect (all of them will reject the login, since direct SSH is not allowed, but a healthy bastion should still answer).

                Here is the list of all the bastions https://learn.fabric-testbed.net/knowledge-base/frequently-asked-starter-questions/ (last question in the FAQ)
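The one-by-one check suggested above can be approximated with a short script. This is only a sketch: it resolves the round-robin name and tests whether TCP port 22 opens on each address, and it does not attempt SSH authentication. The function names and the injectable `resolver` and `connect` parameters are illustrative assumptions, added so the logic can be exercised without network access.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses for hostname."""
    infos = resolver(hostname, 22, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(addresses, port=22, timeout=5.0, connect=socket.create_connection):
    """Try a TCP connection to each address; map address -> True/False."""
    results = {}
    for address in addresses:
        try:
            with connect((address, port), timeout=timeout):
                results[address] = True
        except OSError:  # covers timeouts and refused connections
            results[address] = False
    return results


def main():
    # Hostname taken from this thread; running this needs network access.
    addresses = resolve_ipv4("bastion.fabric-testbed.net")
    for address, ok in probe(addresses).items():
        print(f"{address}: {'reachable' if ok else 'NOT reachable'}")
```

Calling `main()` several times in a row may help catch an intermittent failure. A successful TCP connection only shows that the bastion accepts connections; it says nothing about the path from the bastion to the VM.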

                #9400
                Ilya Baldin
                Participant

                  Yeah, strangely I can connect to all of them right now, so it must be intermittent. I may change my FABRIC SSH config to use a specific bastion and see if that changes how things work. I’ll also raise the debug level to see if I can catch it in the act.

                  #9404
                  Ilya Baldin
                  Participant

                    I experimentally determined (by manually specifying which bastion to use) that it is bastion-star-1 that is hanging for me.

                    #9406
                    Hussam Nasir
                    Participant

                      I do see successful logins from your ID in the STAR bastion logs, even from early this morning.

                      #9407
                      Ilya Baldin
                      Participant

                        That’s suspect. (a) I was not doing anything this morning, and (b) if I configure bastion-star-1 as my bastion host, I (still) cannot log in to my slice; it works if I configure, e.g., bastion-renc-1.

                        #9408
                        Hussam Nasir
                        Participant

                          Can you post the full SSH command? The issue may be the connection between STAR and the destination rack.

                          #9409
                          Ilya Baldin
                          Participant

                            For one of the nodes I do something like this:

                            ssh -i /path/to/slice_key -F ~/path/to/fabric_config ubuntu@2001:400:a100:3030:f816:3eff:fe07:665e

                            and my fabric_config looks something like this:

                            UserKnownHostsFile /dev/null
                            StrictHostKeyChecking no
                            ServerAliveInterval 120

                            Host bastion-star-1.fabric-testbed.net
                                User username
                                ForwardAgent yes
                                Hostname %h
                                IdentityFile ~/.ssh/mykey
                                IdentitiesOnly yes

                            Host * !bastion-star-1.fabric-testbed.net
                                ProxyJump username@bastion-star-1.fabric-testbed.net:22

                            #9412
                            Hussam Nasir
                            Participant

                              The good news is that I have narrowed the issue down to your VM at STAR. Does the issue also occur with VMs at other sites?

                              #9413
                              Ilya Baldin
                              Participant

                                Fascinating. I do not have slices at other sites. I have one slice and all of it is in STAR and as far as I can tell all nodes have this problem.

                                Slice ID is 16c49677-636b-4d3c-b71d-7fff7a75db09

                                #9415
                                Hussam Nasir
                                Participant

                                  I see the issue on your VM. I believe you are using FABNETv4 EXT and FABNETv6 EXT. During their configuration, you may have accidentally set up the NIC so that it is also used for a default route. As a result, the system has two default routes going out of two different interfaces.

                                  2: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc fq_codel state UP group default qlen 1000
                                      link/ether fa:16:3e:07:66:5e brd ff:ff:ff:ff:ff:ff
                                      inet 10.30.6.168/23 metric 100 brd 10.30.7.255 scope global dynamic enp3s0
                                         valid_lft 58217sec preferred_lft 58217sec
                                      inet6 2001:400:a100:3030:f816:3eff:fe07:665e/64 scope global dynamic mngtmpaddr noprefixroute
                                         valid_lft 86383sec preferred_lft 14383sec
                                      inet6 fe80::f816:3eff:fe07:665e/64 scope link
                                         valid_lft forever preferred_lft forever
                                  3: enp7s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
                                      link/ether 2a:77:c8:60:d3:bf brd ff:ff:ff:ff:ff:ff
                                      inet 10.129.130.253/24 scope global enp7s0
                                         valid_lft forever preferred_lft forever
                                      inet 23.134.235.195/28 scope global enp7s0
                                         valid_lft forever preferred_lft forever
                                      inet6 2602:fcfb:101::3/28 scope global
                                         valid_lft forever preferred_lft forever
                                      inet6 2602:fcfb:101:0:2877:c8ff:fe60:d3bf/64 scope global dynamic mngtmpaddr
                                         valid_lft 2591807sec preferred_lft 604607sec
                                      inet6 fe80::2877:c8ff:fe60:d3bf/64 scope link
                                         valid_lft forever preferred_lft forever
                                  4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
                                      link/ether ca:03:f6:60:d5:34 brd ff:ff:ff:ff:ff:ff
                                      inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
                                         valid_lft forever preferred_lft forever

                                  root@SenderSTAR:~# ip -6 route show
                                  ::1 dev lo proto kernel metric 256 pref medium
                                  2001:400:a100:3030::/64 dev enp3s0 proto ra metric 100 expires 86371sec pref medium
                                  2001:400:a300::/48 via 2602:fcfb:101::1 dev enp7s0 metric 1024 pref medium
                                  2602:fcfb:101::/64 dev enp7s0 proto kernel metric 256 expires 2591914sec pref medium
                                  2602:fcf0::/28 dev enp7s0 proto kernel metric 256 pref medium
                                  fe80::a9fe:a9fe via fe80::f816:3eff:fe79:edec dev enp3s0 proto ra metric 100 expires 271sec pref medium
                                  fe80::/64 dev enp3s0 proto kernel metric 256 pref medium
                                  fe80::/64 dev enp7s0 proto kernel metric 256 pref medium
                                  default via fe80::f816:3eff:fe79:edec dev enp3s0 proto ra metric 100 expires 271sec mtu 9000 pref medium
                                  default via fe80::c28b:2aff:fe82:6d02 dev enp7s0 proto ra metric 1024 expires 1714sec hoplimit 64 pref medium

                                  As soon as I disabled the enp7s0 NIC with ip link set dev enp7s0 down, the VM started working via SSH.

                                  It is possible that there is a routing issue when FABNETv6 is used at STAR together with the STAR bastion. I will ask for this use case to be investigated.
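The duplicate-default-route condition described above can be spotted automatically. The following is a minimal sketch that parses text in the format printed by `ip -6 route show` (or `ip route show`) and reports default routes leaving through more than one interface. The function names are hypothetical, and having several default routes is not always a fault, so treat a positive result as a hint to look closer.

```python
def _value_after(fields, key):
    """Return the token following key, or None if key is absent."""
    return fields[fields.index(key) + 1] if key in fields[:-1] else None


def default_routes(route_output):
    """Extract gateway, device and metric for each default route line."""
    routes = []
    for line in route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default":
            continue
        metric = _value_after(fields, "metric")
        routes.append({
            "via": _value_after(fields, "via"),
            "dev": _value_after(fields, "dev"),
            "metric": int(metric) if metric is not None else None,
        })
    return routes


def has_conflicting_defaults(route_output):
    """True when default routes leave through more than one interface."""
    return len({route["dev"] for route in default_routes(route_output)}) > 1
```

Fed the routing table from this post, `default_routes` would list one entry for enp3s0 (metric 100) and one for enp7s0 (metric 1024), and `has_conflicting_defaults` would return True.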
