L2 networks between sites often have nodes that cannot reach one another

  #2593
    Gregory Daues
    Participant

      I have been creating L2 networks between a pair of FABRIC sites and adding IP addresses, using the code snippets from
      jupyter-examples/fabric_examples/fablib_api/create_l2network_wide_area
      e.g.,
      # Network
      net1 = slice.add_l2network(name=network_name, interfaces=[iface1, iface2])
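
      For context, the fuller flow from that notebook looks roughly like this (a sketch only; the site names, node names, NIC device name eth1, and IPv6 addresses below are placeholders rather than my exact values):

      from fabrictestbed_extensions.fablib.fablib import FablibManager

      fablib = FablibManager()
      slice = fablib.new_slice(name="MySlice")

      # One node per site, each with a basic NIC whose first interface joins the L2 network
      node1 = slice.add_node(name="node1", site="STAR")
      iface1 = node1.add_component(model="NIC_Basic", name="nic1").get_interfaces()[0]
      node2 = slice.add_node(name="node2", site="NCSA")
      iface2 = node2.add_component(model="NIC_Basic", name="nic1").get_interfaces()[0]

      # Wide-area L2 network connecting the two interfaces
      net1 = slice.add_l2network(name="net1", interfaces=[iface1, iface2])
      slice.submit()

      # Manually add an IPv6 address on each end (assuming the dataplane NIC comes up as eth1)
      slice.get_node(name="node1").execute("sudo ip addr add 1235:5679::1/64 dev eth1")
      slice.get_node(name="node2").execute("sudo ip addr add 1235:5679::2/64 dev eth1")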

      I have observed that the two nodes on such a network can reach one another (ping, ssh) for a bit less than half of the Slices created, and that for a bit more than half of the Slices the nodes cannot reach one another. The addition of the IP addresses seems to have succeeded, as each node can ping its own address.

      [rocky@666fe764-bb4e-4302-b8af-1f1657809e2d-cmbs4node-star2 ~]$ ping 1235:5679::2
      PING 1235:5679::2(1235:5679::2) 56 data bytes
      64 bytes from 1235:5679::2: icmp_seq=1 ttl=64 time=0.040 ms
      64 bytes from 1235:5679::2: icmp_seq=2 ttl=64 time=0.014 ms
      64 bytes from 1235:5679::2: icmp_seq=3 ttl=64 time=0.009 ms
      64 bytes from 1235:5679::2: icmp_seq=4 ttl=64 time=0.008 ms

      [rocky@666fe764-bb4e-4302-b8af-1f1657809e2d-cmbs4node-star2 ~]$ ping 1235:5679::1
      PING 1235:5679::1(1235:5679::1) 56 data bytes
      From 1235:5679::2: icmp_seq=1 Destination unreachable: Address unreachable
      From 1235:5679::2: icmp_seq=2 Destination unreachable: Address unreachable
      From 1235:5679::2: icmp_seq=3 Destination unreachable: Address unreachable

      (ping -6 gives the same.)

      Just from the sample of my attempts, it seems that success or failure may depend on the pair of sites: e.g., NCSA-STAR has failed twice and SALT-UTAH has succeeded twice. A particular site does not necessarily point to failure, though, as NCSA-MICH and DALL-STAR both succeeded.

      Any ideas on how to dig into the sporadic behavior? Could this be some kind of firewall/security group issue? If so, would it be visible in cloud-init logging on a node, or is such firewall/security handled in FABRIC services?
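
      In case it helps me or others dig in, these are the node-side checks I can run from the notebook (a sketch that assumes the fablib slice/node objects from the snippet above and that the L2 NIC shows up as eth1; none of it is FABRIC-specific):

      node = slice.get_node(name="node1")   # hypothetical node name
      for cmd in [
          "ip addr show dev eth1",                           # is the address actually configured?
          "ip -6 neigh show dev eth1",                       # did IPv6 neighbor discovery resolve the peer?
          "sudo ip6tables -L -n",                            # any local firewall rules on the node itself (if ip6tables is installed)?
          "sudo tail -n 50 /var/log/cloud-init-output.log",  # anything odd during boot/cloud-init?
      ]:
          stdout, stderr = node.execute(cmd)
          print(stdout)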

      Greg

      #2594
      Gregory Daues
      Participant

        I am having difficulty reproducing the previous failures, so perhaps something has been fixed up(?). I’ll continue to check for issues.

        #2616
        Paul Ruth
        Keymaster

          I suspect this was an intermittent issue with some underlying infrastructure, or possibly a link that was slow to be instantiated. It might be that if you wait a few minutes the link will become active.
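
          If you want to check for that automatically, a polling loop along these lines (a rough sketch using fablib's node.execute; the node name and address are just the ones from this thread) will tell you whether the link eventually comes up:

          import time

          def wait_for_link(node, target_ip, timeout=600, interval=30):
              """Ping target_ip from node every `interval` seconds until it answers or `timeout` expires."""
              deadline = time.time() + timeout
              while time.time() < deadline:
                  stdout, stderr = node.execute(f"ping -6 -c 3 -W 2 {target_ip}")
                  if " 0% packet loss" in stdout:   # leading space avoids matching "100% packet loss"
                      return True
                  time.sleep(interval)
              return False

          # e.g., from node1's side of the L2 network:
          # wait_for_link(slice.get_node(name="node1"), "1235:5679::2")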

          If you see this again, can you respond to this forum thread and include your slice ID? If the right developers are available, they can look at the underlying code/infrastructure and see if this is a bug.

          Also, let us know if you figure out how to consistently recreate it.

          Paul

          #2617
          Gregory Daues
          Participant

            Yes, I saw this failure maybe 7 times over a stretch of 2-3 days last week, but it has not arisen recently. I will keep monitoring and see if I can catch an as-it-happens example.

            #2636
            Gregory Daues
            Participant

              I was able to reproduce a case where the nodes in the Slice with the L2 network cannot reach one another. The Slice I just made this morning is MySliceAug12A.

              “ip addr list eth1” on the nodes shows, respectively:

              3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
                  link/ether 02:df:f2:32:8d:ad brd ff:ff:ff:ff:ff:ff
                  inet6 1244:5679::1/64 scope global
                     valid_lft forever preferred_lft forever

              3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
                  link/ether 02:4b:b8:69:ef:db brd ff:ff:ff:ff:ff:ff
                  inet6 1244:5679::2/64 scope global
                     valid_lft forever preferred_lft forever

              but on “node1”:

              [rocky@75af8f33-0a4f-4941-824c-4a010d40f20f-cmbs4node-wash1 ~]$ ping 1244:5679::1
              PING 1244:5679::1(1244:5679::1) 56 data bytes
              64 bytes from 1244:5679::1: icmp_seq=1 ttl=64 time=0.028 ms
              64 bytes from 1244:5679::1: icmp_seq=2 ttl=64 time=0.007 ms
              64 bytes from 1244:5679::1: icmp_seq=3 ttl=64 time=0.005 ms
              64 bytes from 1244:5679::1: icmp_seq=4 ttl=64 time=0.005 ms
              ^C
              --- 1244:5679::1 ping statistics ---
              4 packets transmitted, 4 received, 0% packet loss, time 3089ms
              rtt min/avg/max/mdev = 0.005/0.011/0.028/0.010 ms

              [rocky@75af8f33-0a4f-4941-824c-4a010d40f20f-cmbs4node-wash1 ~]$ ping 1244:5679::2
              PING 1244:5679::2(1244:5679::2) 56 data bytes
              From 1244:5679::1: icmp_seq=1 Destination unreachable: Address unreachable
              From 1244:5679::1: icmp_seq=2 Destination unreachable: Address unreachable
              From 1244:5679::1: icmp_seq=3 Destination unreachable: Address unreachable
              ^C
              --- 1244:5679::2 ping statistics ---
              6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 5105ms
              pipe 3

              [rocky@75af8f33-0a4f-4941-824c-4a010d40f20f-cmbs4node-wash1 ~]$ curl -v telnet://[1244:5679::2]:22
              * Rebuilt URL to: telnet://[1244:5679::2]:22/
              *   Trying 1244:5679::2...
              * TCP_NODELAY set
              * connect to 1244:5679::2 port 22 failed: No route to host
              * Failed to connect to 1244:5679::2 port 22: No route to host
              * Closing connection 0
              curl: (7) Failed to connect to 1244:5679::2 port 22: No route to host

              [rocky@75af8f33-0a4f-4941-824c-4a010d40f20f-cmbs4node-wash1 ~]$ curl -6 -v telnet://[1244:5679::2]:22
              * Rebuilt URL to: telnet://[1244:5679::2]:22/
              *   Trying 1244:5679::2...
              * TCP_NODELAY set
              * connect to 1244:5679::2 port 22 failed: No route to host
              * Failed to connect to 1244:5679::2 port 22: No route to host
              * Closing connection 0
              curl: (7) Failed to connect to 1244:5679::2 port 22: No route to host
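
              The “Destination unreachable: Address unreachable” replies coming back from the node's own address are what IPv6 reports when neighbor discovery for the peer fails on the local link, so dumping the neighbor cache on each end should show whether NDP is resolving at all (a sketch, assuming the fablib slice object, hypothetical node names, and eth1 as the L2 NIC on both nodes):

              # A FAILED or INCOMPLETE entry for the peer means neighbor solicitations go unanswered.
              for name in ["node1", "node2"]:
                  stdout, stderr = slice.get_node(name=name).execute("ip -6 neigh show dev eth1")
                  print(name, stdout)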

              Greg

              #2638
              Paul Ruth
              Keymaster

                Is this slice still up? I’m seeing if someone can look at it.

                #2639
                Gregory Daues
                Participant

                  Yes, that slice MySliceAug12A is still up; there is another active slice MySliceAug12B which did not exhibit the issue (different sites).  Project is “CMB-S4 Phase one”.

                  #2640
                  Komal Thareja
                  Participant

                    Greg, could you please delete this slice and recreate it?

                    We had some leftover layer3 connections from testing which were causing the issue. We were able to identify and clear them. It should work now. Please let us know if you still face this issue.
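
                    Deleting and resubmitting from the notebook should just be something like this (a sketch, assuming the fablib manager object and the slice name from above):

                    # Delete the affected slice, then re-run the creation cells to submit a fresh one
                    fablib.get_slice(name="MySliceAug12A").delete()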

                    #2641
                    Gregory Daues
                    Participant

                      I have created a new slice MySliceAug12C with the same attributes and the issue has not occurred; the nodes/ports are reachable. I’ll watch to see whether it ever occurs again, but it looks like it should be resolved!
