Opened 4 years ago

Closed 3 years ago

Last modified 3 years ago

#3183 closed defect (fixed)

fix network "spottiness" on new infra

Reported by: dustin Owned by: verm
Priority: critical Milestone: sys - other
Version: Keywords:
Cc:

Description

We're seeing a lot of this sort of thing:

ansible-pull run on on syslog.int.buildbot.net failed.

Starting ansible-pull at 2015-02-05 08:27:29
localhost | FAILED >> {
    "cmd": "/usr/local/bin/git ls-remote origin -h refs/heads/master", 
    "failed": true, 
    "msg": "fatal: Unable to look up github.com (port 9418) (hostname nor servname provided, or not known)", 
    "rc": 128, 
    "stderr": "fatal: Unable to look up github.com (port 9418) (hostname nor servname provided, or not known)\n", 
    "stdout": ""
}
ansible-pull run on on mysql.int.buildbot.net failed.

Starting ansible-pull at 2015-02-03 21:10:00
localhost | FAILED >> {
    "failed": true,
    "msg": "Failed to download remote objects and refs"
}
ansible-pull run on on bslave1.int.buildbot.net failed.

Starting ansible-pull at 2015-01-30 10:00:00
Sleeping for 81 seconds...
localhost | FAILED >> {
    "cmd": "/usr/local/bin/git ls-remote origin -h refs/heads/master",
    "failed": true,
    "msg": "fatal: Unable to look up github.com (port 9418) (hostname nor servname provided, or not known)",
    "rc": 128,
    "stderr": "fatal: Unable to look up github.com (port 9418) (hostname nor servname provided, or not known)\n",
    "stdout": ""
}

Those are all fairly recent. On January 22, a bunch of hosts (not all of them *.int.bb.net) hit a similar problem, but maybe that was a fluke.
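
For context, these failures come from cron-driven ansible-pull runs that fetch the config repo over the git protocol (port 9418, which matches the lookup error above). The invocation is roughly like the sketch below; the repository URL, checkout directory, schedule, and sleep value are assumptions for illustration, and the "Sleeping for NN seconds" lines come from the --sleep splay option.

# hypothetical crontab entry; URL, paths, and values are assumptions
*/15 * * * * root ansible-pull -U git://github.com/buildbot/buildbot-infra.git -d /var/ansible -s 120 local.yml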

I also see this quite often from Trac:

Warning: The change has been saved, but an error occurred while sending notifications: SMTP server connection error ([Errno 60] Operation timed out). Please modify [notification] smtp_server or [notification] smtp_port in your configuration.

The latter means that we're probably losing some Trac notifications, which is never good.

Change History (49)

comment:1 Changed 4 years ago by dustin

Over the last two days, while I was working on categorizing nine bugs, I saw a lot of hung connections - either taking forever to load a page, or eventually giving up and returning an empty document. I don't know if that's related. I'm going to try posting to this bug to see if I can replicate it, so apologies for the spam.

comment:2 Changed 4 years ago by dustin

On that post, I once again got "Warning: the change has been saved ..", and the connection hung for quite a while. I put the tcpdumps from both trac and my laptop in https://gist.github.com/djmitche/4dad36e0fdbfcc5cc215, with some editing to make them more useful. Overall, the capture shows a 78-second TCP connection. The "trac" host is using TSO, which means that tcpdump on that host captures the oversized segments that the kernel hands to the NIC, while tcpdump on the laptop captures the smaller segments into which the NIC breaks them. This all seems to work OK, although it makes it difficult to match up packets.

Things start off well enough:

4:03 IP laptop.60061 > trac.http: Flags [S], seq 833594151, win 29200, options [mss 1460,sackOK,TS val 17271196 ecr 0,nop,wscale 7], length 0
4:03 IP trac.http > laptop.60061: Flags [S.], seq 1018352583, ack 833594152, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 4013412119 ecr 17271196], length 0
4:03 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 17271280 ecr 4013412119], length 0
4:03 IP laptop.60061 > trac.http: Flags [.], seq 1:1449, ack 1, win 229, options [nop,nop,TS val 17271280 ecr 4013412119], length 1448
4:03 IP laptop.60061 > trac.http: Flags [.], seq 1449:2897, ack 1, win 229, options [nop,nop,TS val 17271280 ecr 4013412119], length 1448
4:03 IP laptop.60061 > trac.http: Flags [P.], seq 2897:3681, ack 1, win 229, options [nop,nop,TS val 17271282 ecr 4013412119], length 784
4:04 IP trac.http > laptop.60061: Flags [.], ack 2897, win 1018, options [nop,nop,TS val 4013412205 ecr 17271280], length 0
4:04 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options [nop,nop,TS val 4013412312 ecr 17271282], length 0

laptop sends a 3681-byte HTTP request, and trac acks receipt.

10 seconds later, and every 10 seconds thereafter, laptop pings and trac pongs:

4:14 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 17281473 ecr 4013412312], length 0
4:14 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options [nop,nop,TS val 4013422396 ecr 17271282], length 0
4:24 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 17291584 ecr 4013422396], length 0
4:24 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options [nop,nop,TS val 4013432507 ecr 17271282], length 0
4:34 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 17301696 ecr 4013432507], length 0
4:34 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options [nop,nop,TS val 4013442619 ecr 17271282], length 0
4:44 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 17311808 ecr 4013442619], length 0
4:44 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options [nop,nop,TS val 4013452730 ecr 17271282], length 0
4:54 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 17321920 ecr 4013452730], length 0
4:54 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options [nop,nop,TS val 4013462842 ecr 17271282], length 0
5:04 IP laptop.60061 > trac.http: Flags [.], ack 1, win 229, options [nop,nop,TS val 17332032 ecr 4013462842], length 0
5:04 IP trac.http > laptop.60061: Flags [.], ack 3681, win 1040, options [nop,nop,TS val 4013472954 ecr 17271282], length 0

I don't recognize this pattern, but it looks like the hosts do, so OK.

A full 76 seconds after the connection was initiated, trac begins replying:

5:19 IP trac.http > laptop.60061: Flags [P.], seq 1:399, ack 3681, win 1040, options [nop,nop,TS val 4013487830 ecr 17271282], length 398

This goes on for a while, seemingly without any issue.

Finally, after 47419 bytes of response, the laptop sends a FIN:

5:21 IP trac.http > laptop.60061: Flags [.], seq 45971:47419, ack 4762, win 1040, options [nop,nop,TS val 4013490058 ecr 17349135], length 1448
5:21 IP laptop.60061 > trac.http: Flags [.], ack 47419, win 984, options [nop,nop,TS val 17349220 ecr 4013490058], length 0
5:21 IP laptop.60061 > trac.http: Flags [F.], seq 4762, ack 47419, win 984, options [nop,nop,TS val 17349220 ecr 4013490058], length 0

But trac isn't done yet, and keeps sending data, to which the laptop sends RST packets:

5:21 IP trac.http > laptop.60061: Flags [.], seq 47419:48867, ack 4762, win 1040, options [nop,nop,TS val 4013490059 ecr 17349135], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length 0
5:21 IP trac.http > laptop.60061: Flags [.], seq 48867:50315, ack 4762, win 1040, options [nop,nop,TS val 4013490060 ecr 17349136], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length 0
5:21 IP trac.http > laptop.60061: Flags [.], seq 50315:50351, ack 4762, win 1040, options [nop,nop,TS val 4013490060 ecr 17349136], length 36
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length 0
5:21 IP trac.http > laptop.60061: Flags [.], seq 50351:51799, ack 4762, win 1040, options [nop,nop,TS val 4013490143 ecr 17349219], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length 0
5:21 IP trac.http > laptop.60061: Flags [.], seq 51799:53247, ack 4762, win 1040, options [nop,nop,TS val 4013490144 ecr 17349220], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length 0
5:21 IP trac.http > laptop.60061: Flags [.], seq 53247:54695, ack 4762, win 1040, options [nop,nop,TS val 4013490144 ecr 17349220], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598913, win 0, length 0
5:21 IP trac.http > laptop.60061: Flags [.], seq 54695:56143, ack 4763, win 1040, options [nop,nop,TS val 4013490144 ecr 17349220], length 1448
5:21 IP laptop.60061 > trac.http: Flags [R], seq 833598914, win 0, length 0

So, two funny things here. First, it took well over a minute for Trac to respond -- which is probably associated with the failed attempt to send an email. That's probably the bug we're after here.

Second, trac's sending more data than the browser expects. This might be due to a missing or incorrect Content-Length header. I'm not going to worry about that one right now.

On a whim, I disabled TSO on service2's em0 and em1 interfaces. Let's see if that helps!
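
For the record, that's just the standard ifconfig toggle; roughly the following, reconstructed from memory, so treat it as a sketch rather than the exact commands:

# disable TSO on both lagg members
ifconfig em0 -tso
ifconfig em1 -tso
# and to re-enable it later
ifconfig em0 tso
ifconfig em1 tso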

comment:3 Changed 4 years ago by dustin

Well, that one certainly posted quickly.

comment:4 Changed 4 years ago by dustin

That one too.

comment:5 Changed 4 years ago by dustin

Still going!

comment:6 Changed 4 years ago by dustin

OK, one more and I'll revert the TSO changes.

comment:7 Changed 4 years ago by dustin

Indeed, that was fast too. Now, with TSO re-enabled..

comment:8 Changed 4 years ago by dustin

Still fast

comment:9 Changed 4 years ago by dustin

Slower

comment:10 Changed 4 years ago by dustin

Fast.. hmm, maybe not the smoking gun.

comment:11 Changed 4 years ago by dustin

Sorry

comment:12 Changed 4 years ago by dustin

That one was much slower, and again showed the email warning banner. Interesting fact: I clicked "Submit changes" before the last preview had rendered. So when that last preview *did* render, it found the comment in the DB already and warned me that "someone else" had already commented. So things are sticking *after* the DB insert, which again points to the email submission being the issue.

Note that the interfaces are arranged as follows: {em0, em1} aggregated in lagg0, where the default VLAN is the external subnet; vlan1 uses lagg0 with vlan tag 80 for the internal subnet. So adjusting TSO for em0/em1 affects all communication, not just external.
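
Roughly, the relevant rc.conf on service2 looks something like the following; this is a sketch from memory, and the exact addresses and variable spellings may differ:

ifconfig_em0="up"
ifconfig_em1="up"
cloned_interfaces="lagg0 vlan1"
ifconfig_lagg0="laggproto lacp laggport em0 laggport em1 inet <external address>"
ifconfig_vlan1="inet <internal address> vlan 80 vlandev lagg0"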

Let's try it again!!

comment:13 Changed 4 years ago by dustin

1 .. 2 .. 3

comment:14 Changed 4 years ago by dustin

test test test

comment:15 Changed 4 years ago by dustin

mail's going over the internal network:

smtp_server = 192.168.80.235
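
(That's from the [notification] section of trac.ini. For reference, the relevant block looks something like this; the port and enabled flag shown here are assumptions:)

[notification]
smtp_enabled = true
smtp_server = 192.168.80.235
smtp_port = 25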

comment:16 Changed 4 years ago by dustin

The last one was fast, and I see a relatively normal-looking TCP transaction with mx's internal interface.

On mx, comment 11 doesn't even appear in the logs - there's nothing from around that timestamp in syslog.

comment:17 Changed 4 years ago by dustin

That one was quick, and I see an email sent successfully, but immediately after that:

21:53:05.328047 IP 192.168.80.241.32849 > 192.168.80.235.25: Flags [S], seq 1991036704, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1306763181 ecr 0], length 0
21:53:08.342195 IP 192.168.80.241.32849 > 192.168.80.235.25: Flags [S], seq 1991036704, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1306766195 ecr 0], length 0
21:53:11.565735 IP 192.168.80.241.32849 > 192.168.80.235.25: Flags [S], seq 1991036704, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1306769419 ecr 0], length 0
21:53:14.767984 IP 192.168.80.241.32849 > 192.168.80.235.25: Flags [S], seq 1991036704, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1306772621 ecr 0], length 0
21:53:17.976227 IP 192.168.80.241.32849 > 192.168.80.235.25: Flags [S], seq 1991036704, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1306775829 ecr 0], length 0
21:53:21.213997 IP 192.168.80.241.32849 > 192.168.80.235.25: Flags [S], seq 1991036704, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1306779067 ecr 0], length 0
21:53:27.456504 IP 192.168.80.241.32849 > 192.168.80.235.25: Flags [S], seq 1991036704, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1306785310 ecr 0], length 0

In other words, a hanging connection attempt to mx. So we're narrowing down the bug here!

comment:18 Changed 4 years ago by dustin

Indeed, another email warning, and a bunch of unanswered SYNs trying to connect. I don't see the SYNs in tcpdump -i vlan0 -n on mx. I'll try running the tcpdumps on the hosts themselves, in case jailing has an effect (although it doesn't seem to - I see traffic to other jails).

comment:19 Changed 4 years ago by dustin

I see lists.buildbot.net having trouble connecting to mx, too. I didn't quite capture a tcpdump, though.

comment:20 Changed 4 years ago by dustin

Got it! On service2:

# tcpdump -i vlan0 -n -p ip host 192.168.80.235
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan0, link-type EN10MB (Ethernet), capture size 65535 bytes
capability mode sandbox enabled
22:15:52.421511 IP 192.168.80.240.33466 > 192.168.80.235.25: Flags [S], seq 3465709852, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1308130274 ecr 0], length 0
22:15:55.443641 IP 192.168.80.240.33466 > 192.168.80.235.25: Flags [S], seq 3465709852, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1308133297 ecr 0], length 0
22:15:58.690641 IP 192.168.80.240.33466 > 192.168.80.235.25: Flags [S], seq 3465709852, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1308136544 ecr 0], length 0
22:16:01.901640 IP 192.168.80.240.33466 > 192.168.80.235.25: Flags [S], seq 3465709852, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1308139755 ecr 0], length 0

while on service1:

# tcpdump -i vlan0 -n -p ip host 192.168.80.240 or ip host 192.168.80.241
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan0, link-type EN10MB (Ethernet), capture size 65535 bytes
capability mode sandbox enabled

(nothing).

comment:21 Changed 4 years ago by dustin

test

comment:22 Changed 4 years ago by dustin

test

comment:23 Changed 4 years ago by dustin

test

comment:24 Changed 4 years ago by dustin

now it won't fail :(

comment:25 Changed 4 years ago by dustin

test

comment:26 Changed 4 years ago by dustin

test

comment:27 Changed 4 years ago by dustin

y u no fail

comment:28 Changed 4 years ago by dustin

Ah! The switch is logging:

Feb 26 20:38:35.767: %SW_MATM-4-MACFLAP_NOTIF: Host 0025.90d6.538c in vlan 80 is flapping between port Po7 and port Po33

comment:29 Changed 4 years ago by dustin

That MAC belongs to service4.rtems.org, which should be in port-group 7 *only*, not port-group 33. Port-group 33 is vm1. So our best guess is that they are accidentally cross-wired. But it's hard to see why that would cause the symptoms we're seeing here.

comment:30 Changed 4 years ago by anonymous

Ah, now:

Feb 26 20:51:10.381: %SW_MATM-4-MACFLAP_NOTIF: Host 0025.90d5.1d40 in vlan 1 is flapping between port Po33 and port Po30

po30 is service1, so at least it's reasonable that the issue would impact service1. Still, we've now seen three MACs across po33, and there are only two ports in the group, so something doesn't quite add up.

comment:31 Changed 4 years ago by dustin

  • Keywords verm added

Here's a cute little summary of what we know:

                      assigned  flapping
                      --------  --------
1d40 (service1.bb)      po30    po30 po33
53f6 (service2.bb)      po31    po31 po33
1d48 (service3.bb)      po32    po32 po33
6ae2 (vm1.bb)           po33    
ddfe (firewall1)        po1     po1  po33
1d36 (service2.rtems)   po5     po5  po33
538c (service4.rtems)   po7     po7  po33

Basically, every host except vm1 is "flapping between" its own assigned port-group and port-group 33. vm1 itself is assigned to port-group 33, but LACP on both vm1 and the switch indicates that only one port of that pair is working.

Amar's going to pore over the config tomorrow looking for issues, and file a bug with OSUOSL to get them to check the wiring.

comment:32 Changed 4 years ago by verm

  • Owner set to verm
  • Status changed from new to accepted

comment:33 Changed 4 years ago by dustin

  • Keywords verm removed

comment:34 Changed 4 years ago by verm

I had some emergency work come in. I will be on this today, just not as early as I had hoped.

comment:35 Changed 4 years ago by verm

I dumped the switch config and carefully compared it to my cable plan. I also checked the MAC address tables for each port group. I don't see anything obviously wrong. I'll get the OSL to verify the cabling anyway, just to be thorough.
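
For reference, those checks amount to something like the following on the switch; it's Cisco IOS, and the exact command syntax varies a bit between versions:

show etherchannel summary
show mac address-table dynamic vlan 80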

After this I will look for bugs in the OS version on each switch and possibly upgrade.

comment:36 Changed 4 years ago by dustin

OK, so let's reason through this. All of these hosts are showing up on Po33, but there are only two physical ports in Po33. So there's a bridge somewhere.

The only thing plugged into Po33 is vm1. Does vm1 have a bridge? Yup.

bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 02:fe:61:83:ab:00
        nd6 options=1<PERFORMNUD>
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: lagg0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 6 priority 128 path cost 20000
        member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 9 priority 128 path cost 2000000
        member: vlan0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 7 priority 128 path cost 2000000

I'll wager that putting a lagg port into a bridge allows traffic to bridge in one port and out the other.

I took igb1 down on that host, just to be sure. From what I can tell it was already disabled by the switch. Even before doing so, I wasn't able to replicate the network error, although I've gotten various emails today indicating failure.

I'm not sure how to fix this. I know that LACP is, in part, intended to replace STP, so it doesn't make sense to run STP across those two links. Maybe this is a FreeBSD bug?

comment:37 Changed 4 years ago by verm

lagg0 was never supposed to be in the bridge. When we were working on it, the plan was for it to be an internal-only VM. I do not know how lagg0 got added to the bridge. Also, adding vlan0 and lagg0 to the same bridge is a very bad idea; this must be why we're seeing traffic across VLANs.

comment:38 Changed 4 years ago by dustin

Of course, that's absolutely it. The bridge isn't bridging the component interfaces of the lagg -- it's bridging VLAN 80 (vlan0) and VLAN 1 (lagg0). So every broadcast packet on VLAN 80 ends up bridged to the switch on Po33, which causes the flaps.

There's no actual *loop*, but definitely a misconfiguration.

We deleted the bridge0 interface, which should fix the issue. Going forward, we'll need to bridge bhyve's tap interfaces to lagg0 and vlan0 individually.
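
Roughly, the cleanup amounts to the following; this is a sketch, and the second bridge/tap pairing is hypothetical:

# tear down the misconfigured bridge that spanned both VLANs
ifconfig bridge0 destroy
# going forward, one bridge per network, each with only one uplink:
ifconfig bridge0 create
ifconfig bridge0 addm vlan0 addm tap0 up    # internal guest: vlan0 plus its tap
ifconfig bridge1 create
ifconfig bridge1 addm lagg0 addm tap1 up    # external guest: lagg0 plus its own tap (tap1 is hypothetical)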

comment:39 Changed 4 years ago by dustin

  • Resolution set to fixed
  • Status changed from accepted to closed

comment:40 Changed 4 years ago by dustin

Hm, I did just get another

     "msg": "fatal: Unable to look up github.com (port 9418) (hostname nor servname provided, or not known)",

from bslave1. That's not promising, but if it's the last one I see I'll be willing to chalk it up to carryover state.

comment:41 Changed 4 years ago by verm

Yeah, I wasn't convinced yet that the issue is gone. I think we're dealing with two separate issues.

I want to leave it for a bit to see what the switch logs, whether it's more of the same or something else. I brought the interface for the 2nd port on vm1 back up; igb1 itself is still not up, though.

I want to reboot this machine.

comment:42 Changed 4 years ago by verm

The flapping has stopped; however, the lagg connection to vm1 is still broken. I'm wondering if that card is failing in a strange way. I will get the OSL to swap the two cables tomorrow and see if it comes up in the same way.

comment:43 Changed 4 years ago by verm

Well, I figured out why LACP wasn't working on vm1:

 ifconfig_igb0="up"
-ifconfig_igb="up"
+ifconfig_igb1="up"

The 2nd interface wasn't being brought up... sigh. FreeBSD gave no indication that the misspelled rc.conf line was being ignored, so there was no way to know.
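
The quick sanity check after fixing rc.conf is just the following (flag names from memory, so treat it as a sketch):

# bring the interface up and confirm both lagg members are active
service netif restart igb1
ifconfig lagg0    # both laggport lines should now show ACTIVE,COLLECTING,DISTRIBUTING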

comment:44 Changed 4 years ago by dustin

  • Resolution fixed deleted
  • Status changed from closed to reopened

Still (5h after the last comment) getting errors.

comment:45 Changed 4 years ago by verm

I'm sick today, so I will be working on this on and off. I forgot to re-open this last night, as it was clear the problem was not resolved. The network flapping was most likely not causing any issues other than annoying log messages.

comment:46 Changed 4 years ago by verm

I think we were seeing symptoms of two different issues:

  • Internal connection problems caused by the bridging of two VLANs
  • External issues caused by one of the firewalls or switches

I'll be doing some tests to see if I can unwedge whatever is going wrong.

comment:47 Changed 4 years ago by verm

It seems that LACP was the cause of the internal connection issues. I have not been able to replicate those -- yet.

I'm leaning towards the issue being firewall2, something in its config or hardware setup. I will be running some tests this weekend, taking it offline and seeing if I can pin down where the failure might be.

firewall1 works just fine. However, 100% of the time, if I connect to firewall2 my ssh session freezes with no chance of recovery. Connecting to firewall2 using 'the back way' works fine. Since this is a redundant setup, I can take firewall2 offline without losing any connectivity.

Since I'm battling a sore throat and am very lethargic, I will wait until the weekend to do this, when I'm more clear-headed.

comment:48 Changed 3 years ago by dustin

  • Resolution set to fixed
  • Status changed from reopened to closed

From email:

---

The upgrade appears to have gone off without a hitch. I eventually rebooted firewall1, which fixed the problems with firewall2.

I don't know what the problem was, exactly. My best guess is that it was a bug in the OS that wasn't expressing itself in a detectable way. I looked at every aspect of the network and saw nothing wrong. Unfortunately, some of the debugging is pointless if the OS isn't providing any indication of an issue.

The bug may appear again; if it does, I will continue my investigation. For now, all our network troubles 'should' be solved.

comment:49 Changed 3 years ago by verm

There have been no issues yet in my testing. I'm going to stop for now and re-start if anything comes up.
