07 October, 2013

CoPP copp-s-glean and copp-s-arp dropping packets, ping loss and delay

The copp-s-glean and copp-s-arp packet drop counters on Cisco Nexus 3000 series switches were rising significantly, causing extra delay and packet loss in the network.

This suggested a significant rise in ARP requests/replies hitting the core routers in the network.
It also suggested a significant rise in packets originating from some network (directly connected or not, it makes no difference) and destined to hosts in a directly connected network for which no record exists in the adjacency table. A similar problem in the past had been caused by a host in subnet X flooding the router with UDP packets destined to the entire address range of the router's directly connected subnet Y. This time that was not the case.

Observing the CoPP counters showed that the drops occurred roughly every 85 seconds (glean more than arp). Something in the network had to be triggering them, since the 85-second interval was consistent. Wireshark captures confirmed that during the CoPP drops there was a surge in ARP traffic: the core routers were generating a massive number of ARP requests toward their connected subnets and, of course, an equally massive number of ARP replies came back from hosts in those subnets. All the ARP requests were destined to existing, live hosts and every request was answered, so this was not a case of traffic flooding toward nonexistent hosts. Yet that is exactly what glean drops are about: there must be a moment when the core switch has no valid adjacency-table entry to forward the packets to.
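For reference, the counters in question are the drop statistics of the CoPP service policy on the control plane. A quick way to watch just the two classes of interest is something along these lines (the exact class names, counter labels and pipe filter syntax may vary slightly between NX-OS releases, so treat this as a sketch):

  show policy-map interface control-plane
  show policy-map interface control-plane | egrep "copp-s-glean|copp-s-arp|Drop"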

Suppose a massive amount of data is flowing from network X to a directly connected network Y, and suddenly the adjacency table is cleared of MAC addresses. Packets for many different hosts in different subnets keep arriving while the router has an empty adjacency/ARP table, so every distinct Layer 3 destination address has to be resolved by an ARP request generated by the router. While that ARP request is sent out and the router waits for a reply, all subsequent packets to the same host address are dropped by the CoPP glean policy to avoid generating even more ARP requests for the exact same Layer 3 address. There is no point in generating more ARP requests once one has already been sent: it would not speed up resolution and would only put extra stress on the network. This was a scenario that could explain the copp-s-glean drops.

The ARP replies coming back then put a lot of stress on the router's CPU, and the problem was that they all arrived at practically the same time. This burst was most likely what caused the copp-s-arp packet drops.

So the actual problem I had to resolve was the copp-s-glean packet drops, since their cause was also the main cause of the copp-s-arp drops.

Looking at sh mac address-table showed that the MAC addresses' age was reset every 80-90 seconds, and sh ip arp confirmed that the ARP entries' age was reset every 85 seconds.
The MAC address-table aging time was at its default of 1800 seconds (sh mac address-table aging-time), and the ARP timeout was at its default of 1500 seconds.
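To rule out non-default values or per-interface overrides, the timers can be double-checked with something like the commands below. On NX-OS the ARP timeout is configured per SVI with ip arp timeout, so the running config of the affected VLAN interface is worth a look; the placeholder vlan-id and exact filter syntax are a sketch, not guaranteed to match every release:

  show mac address-table aging-time
  show ip arp summary
  show running-config interface vlan <vlan-id> | include timeout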

So, for some reason the core router was flushing its adjacency table every 85 seconds.
Googling around, I found a hint that MAC address flushing could be related to spanning tree topology changes.

Then I started investigating the spanning-tree event history (sh spanning-tree internal event-history all brief).

On the core router I noticed the following log entries:
2013:10:06 19h:55m:04s:199111us T_EV_M_FLUSH_R   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po23 P Po23]
2013:10:06 19h:55m:04s:909371us T_EV_M_FLUSH_R   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po23 P Po23]
2013:10:06 19h:56m:30s:955590us T_EV_M_FLUSH_R   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po23 P Po23]
2013:10:06 19h:56m:32s:955538us T_EV_M_FLUSH_R   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po23 P Po23]
2013:10:06 19h:57m:58s:951940us T_EV_M_FLUSH_R   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po23 P Po23]
2013:10:06 19h:58m:00s:952390us T_EV_M_FLUSH_R   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po23 P Po23]
2013:10:06 19h:59m:28s:888439us T_EV_M_FLUSH_R   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po23 P Po23]

The interesting part is T_EV_M_FLUSH_R, indicating that the topology change message was received from a Remote switch. So I went through all the switches connected to the core router until I found a switch with the following log:
2013:10:06 19h:55m:02s:930146us T_EV_M_FLUSH_L   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po153 P none]
2013:10:06 19h:56m:30s:178937us T_EV_M_FLUSH_L   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po153 P none]
2013:10:06 19h:57m:57s:691687us T_EV_M_FLUSH_L   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po153 P none]
2013:10:06 19h:59m:27s:938702us T_EV_M_FLUSH_L   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po153 P none]
2013:10:06 20h:00m:52s:450443us T_EV_M_FLUSH_L   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po153 P none]
2013:10:06 20h:02m:19s:703042us T_EV_M_FLUSH_L   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po153 P none]
2013:10:06 20h:03m:46s:962400us T_EV_M_FLUSH_L   VLAN0145 [6091.547f.eee8.8741 C 1 A  1 R Po153 P none]

T_EV_M_FLUSH_L indicates that the topology change message was generated Locally, so this had to be the switch causing the spanning tree topology changes. Digging further, I ran sh spanning-tree detail | sec 0145 and got the following output:

VLAN0145 is executing the rstp compatible Spanning Tree protocol
  Bridge Identifier has priority 32768, sysid 239, address 547f.eef9.8c81
  Configured hello time 2, max age 20, forward delay 15
  Current root has priority 24815, address 547f.eee8.8741
  Root port is 4248 (port-channel153), cost of root path is 1
  Topology change flag set, detected flag not set
  Number of topology changes 3558 last change occurred 0:00:14 ago
          from Ethernet1/24

I can see that the last topology change was 14 seconds ago, and this value never got past roughly 85 seconds. There had also been a lot of topology changes: 3558, all apparently originating from port Eth1/24. An ESX server connected to that port was causing constant STP topology changes, which in turn caused the MAC flushes that led to the massive ARP requests and replies slowing down the network. After disabling switch port Eth1/24 the entire network cooled down: no more MAC flushes, no more spanning tree topology changes. The final solution was to configure the ESX-connected interfaces as edge ports (portfast), as shown below.
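For reference, the edge-port fix on NX-OS looks roughly like this. Ethernet1/24 is the port from this incident, and the same command would be applied to every other ESX/server-facing interface, assuming no downstream switches hang off those ports (use spanning-tree port type edge trunk instead if the port is a trunk):

  configure terminal
  interface Ethernet1/24
    spanning-tree port type edge

An edge port goes straight to forwarding and, more importantly here, a link flap on it no longer generates topology change notifications, so a misbehaving server NIC can no longer trigger network-wide MAC flushes.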
