Friday, March 31, 2017

Cisco Eating Multicast Fragments???


UPDATE: after upgrading the IOS our "MDF" switch, this problem went away.  None of my readers (all 2 of them?) have reported seeing this problem with their switches.  So I think this issue is closed.


I think we've discovered a bug in our Cisco switch related to UDP multicast and IP fragmentation.  Dave Zabel (of Windows corrupting UDP fame) did the initial detective work, and I did most of the analysis.  And I'm not quite ready to declare victory yet, but I'm pretty sure we know roughly what is going on.


BOTTOM LINE:

It appears that Cisco is not paying proper attention to whether a packet is fragmented when checking the UDP destination port for the BFD protocol.  The result is that it eats user packets that it misidentifies as being part of that protocol.


THE SETUP:

We have 4 Catalyst 3560 "LAB" switches (48 port) trunked to a Catalyst 4507 "MDF" switch.  Our lab test machines are distributed across the LAB switches.

Our messaging software multicasts UDP datagrams.  One of our regression tests involves sending messages of varying sizes with randomized data.  We saw that occasionally, one of the messages would be lost.  Doing packet captures showed that the missing datagram is NAKed and retransmitted multiple times, but the subscribing host never saw the datagram, even though it saw all the previous and subsequent datagrams.  (This particular test does not send at a particularly stressful rate.)

Further investigation showed that some hosts always got the message in question, while others never got the message.  Turns out that the hosts that got the message were on the same LAB switch as the sender.  The hosts that didn't get the message were on a different switch.

I narrowed it down to a minimal test datagram of 1476 bytes.  The first 1474 bytes can be any arbitrary values, but the last two bytes had to be either "0e c8" or "0e c9".  Any datagram with either of those two problematic byte pairs at that offset will be lost.  Note that the datagram will be split into 2 packets (IP fragments) by the sending host's IP stack.  Strategically placed tcpdumps indicated that the first IP fragment always makes it to the receiver, but the second one seems to be eaten by our "MDF" switch.

There's nothing magic about the size 1476 - it can be larger and the problem still happens.  1476 is just the smallest datagram which demonstrates the problem.


IP FRAGMENTATION:

IP fragmentation happens when UDP hands to IP a datagram that doesn't fit into a single MTU-sized Ethernet packet (1500 bytes).  A UDP datagram consists of an 8-byte header, followed by up to 65,527 bytes of UDP payload.  IP splits a large datagram up into fragments of 1480 bytes each and prepends its own 20-byte IP header to each fragment.  But note that only the first fragment will contain the UDP header.  So IP fragment #1 will hold the 8-byte UDP header and the first 1472 bytes of my datagram.

Since my test datagram is 1476 bytes long, IP fragment #2 will contain a 20-byte IP header followed by the last 4 bytes of my datagram.

I won't show you the first fragment of my test datagram because it's long and boring.  And it is successfully handled by Cisco, so it's also not relevant.

Here's a tcpdump of the second fragment of my test datagram (test datagram bytes highlighted).  Note that tcpdump includes a 14-byte Ethernet header in front of the 20-byte IP header, then the last 4 bytes of my test datagram, and finally 22 padding nulls to make up a minimum-size packet (those nulls are not counted as part of the IP payload).

07:56:38.518614 00:1e:c9:4e:a1:92 (oui Unknown) > 01:00:5e:65:03:01 (oui Unknown), ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl   2, id 2132, offset 1480, flags [none], proto: UDP (17), length: 24) 10.29.3.88 > 239.101.3.1: udp
        0x0000:  0100 5e65 0301 001e c94e a192 0800 4500  ..^e.....N....E.
        0x0010:  0018 0854 00b9 0211 afed 0a1d 0358 ef65  ...T.........X.e
        0x0020:  0301 0000 0ec8 0000 0000 0000 0000 0000  ................
        0x0030:  0000 0000 0000 0000 0000 0000            ............

This is the packet which is successfully received by hosts on the same switch as the sender, but is never received by hosts on a different switch.  Change the "0e c8" byte pair to, for example, "1e c8" or "0e c7" and everything works fine - the packet is properly forwarded.


A CASE OF MISTAKEN IDENTITY?

In my problematic datagram, the last 4 bytes occupy the same packet position in fragment #2 as the UDP header in a non-fragmented packet.  In particular, the byte pair "0e c8" occupies the same packet position as the UDP destination port in a non-fragmented packet.  Those byte values correspond to port 3784, which is used by the BFD protocol.  BFD is used to quickly detect failures in the path between adjacent forwarding switches and routers, so it is of special interest to our switches.  (The other problematic byte pair "0e c9" corresponds to port 3785, which is also used by BFD.)

So, when a LAB switch sends fragment #2 to the MDF, it looks like MDF is checking the UDP port WITHOUT looking at the IP header's "Fragment Offset" field.  It should only look for UDP port if the fragment offset is zero.  Here's that packet again with the fragment offset highlighted:

07:56:38.518614 00:1e:c9:4e:a1:92 (oui Unknown) > 01:00:5e:65:03:01 (oui Unknown), ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl   2, id 2132, offset 1480, flags [none], proto: UDP (17), length: 24) 10.29.3.88 > 239.101.3.1: udp
        0x0000:  0100 5e65 0301 001e c94e a192 0800 4500  ..^e.....N....E.
        0x0010:  0018 0854 00b9 0211 afed 0a1d 0358 ef65  ...T.........X.e
        0x0020:  0301 0000 0ec8 0000 0000 0000 0000 0000  ................
        0x0030:  0000 0000 0000 0000 0000 0000            ............

For most (non-fragmented) packets, that byte will be zero, and the UDP header will be present, in which case the 0ec8 would be the port number.  The highlighted fragment offset of b9 hex is 185 decimal, and IP fragment offset is measured in units of 8-byte blocks, so the actual offset is 8*185=1480, which is tcpdump has for "offset".

It also seems strange to me that the switch ignores which multicast group I'm sending to.  I can send to any valid multicast group, and the problematic packet will be eaten by the "MDF" switch.  Shouldn't there be a specific multicast group for BFD?  Maybe I found 2 bugs?

My employer has a support contract with Cisco, and I'm working with the internal network group to get a Cisco ticket opened.  I'll update as I learn more, but it's slow climbing through the various levels of internal and external tech support, each one of whom starts out with, "are you sure it's plugged in?"  It may take weeks to find somebody who even knows what IP fragmentation is.


TRY IT YOURSELF

I would love to hear from others who can try this out on their own networks.  Grab the source files:


To build on Linux do:
gcc -o msend msend.c
gcc -o mdump mdump.c

Note that I've tried other operating systems (Widows and Solaris), with the same test results.  This is not an OS issue.

For this test, the main purpose of mdump is to get the host to join the multicast group.

Choose three hosts: A, B, and C.  Make sure A and B are on the same switch, and C is on a different switch.  In my case, all three hosts are on the same VLAN; I don't know if that is significant.  For this example, let's assume that the three hosts' IP addresses are 10.29.1.1, 10.29.1.2, and 10.29.1.3 respectively, and that all NICs are named "eth0".

Choose a multicast group and UDP port that aren't being used in your network.  I chose 239.101.3.1 and 12000.  I've tried others as well, with the same test results.

Note that the msend and mdump commands require you to put the hosts's primary IP address as the 3rd command-line parameter.  This is because multicast needs to be told explicitly which interface to use (normal IP routing doesn't know the "right" interface to use).

Open a window to A, and two windows each for B and C.  Enter the following commands:

B1: ./mdump  239.101.3.1 12000 10.29.1.2

B2: tcpdump -i eth0 -s2000 -vvv -XX -e host 239.101.3.1

C1: ./mdump  239.101.3.1 12000 10.29.1.3

C2: tcpdump -i eth0 -s2000 -vvv -XX -e host 239.101.3.1

A: ./msend 239.101.3.1 12000 10.29.1.1

The "msend" command sends two datagrams.  The first one is small and gives the sending host's name.  The second one is the 1476-byte datagram, whose second fragment gets eaten by the Cisco "MDF" switch.

Window B1 should show both datagrams fully received.

B2 should show 3 packets:
1. The short packet with the host name.
2. Fragment #1 of the long packet
3. Fragment #2 of the long packet

C1 should only show the first datagram.

C2 should show 2 packets:
1. The short packet with the host name.
2. Fragment #1 of the long packet.

Fragment #2 is missing from C2, presumably eaten by the "MDF" switch.

Note that the two "tcpdump" windows might show additional packets, which are for the "igmp" protocol, and are unrelated to the test.  If I had more time, I would figure out how to get "tcpdump" to ignore them.