The Meditative Coder: fragmentation

UPDATE: after upgrading the IOS on our "MDF" switch, this problem went away. None of my readers (all 2 of them?) have reported seeing this problem with their switches. So I think this issue is closed.

I think we've discovered a bug in our Cisco switch related to UDP multicast and IP fragmentation. Dave Zabel (of Windows corrupting UDP fame) did the initial detective work, and I did most of the analysis. And I'm not quite ready to declare victory yet, but I'm pretty sure we know roughly what is going on.

BOTTOM LINE:

It appears that Cisco is not paying proper attention to whether a packet is fragmented when checking the UDP destination port for the BFD protocol. The result is that it eats user packets that it misidentifies as being part of that protocol.

THE SETUP:

We have 4 Catalyst 3560 "LAB" switches (48 port) trunked to a Catalyst 4507 "MDF" switch. Our lab test machines are distributed across the LAB switches.

Our messaging software multicasts UDP datagrams. One of our regression tests involves sending messages of varying sizes with randomized data. We saw that occasionally, one of the messages would be lost. Doing packet captures showed that the missing datagram is NAKed and retransmitted multiple times, but the subscribing host never saw the datagram, even though it saw all the previous and subsequent datagrams. (This particular test does not send at a particularly stressful rate.)

Further investigation showed that some hosts always got the message in question, while others never got the message. Turns out that the hosts that got the message were on the same LAB switch as the sender. The hosts that didn't get the message were on a different switch.

I narrowed it down to a minimal test datagram of 1476 bytes. The first 1474 bytes can be any arbitrary values, but the last two bytes had to be either "0e c8" or "0e c9". Any datagram with either of those two problematic byte pairs at that offset will be lost. Note that the datagram will be split into 2 packets (IP fragments) by the sending host's IP stack. Strategically placed tcpdumps indicated that the first IP fragment always makes it to the receiver, but the second one seems to be eaten by our "MDF" switch.

There's nothing magic about the size 1476 - it can be larger and the problem still happens. 1476 is just the smallest datagram which demonstrates the problem.
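For anyone who wants to build such a datagram by hand, here is a minimal Python sketch. It is not the msend source; the function name and constants are made up for illustration. It fills the payload with random bytes and then plants the problematic byte pair at payload offset 1474, which is where it has to sit to land on the UDP destination port position of fragment #2.

```python
import os

# Payload offset of the problematic byte pair (bytes 1474 and 1475).
MAGIC_OFFSET = 1474

def make_test_datagram(size=1476, port=3784):
    """Return `size` bytes of UDP payload with `port` (network byte order)
    planted at MAGIC_OFFSET; all other bytes are random."""
    if size < MAGIC_OFFSET + 2:
        raise ValueError("datagram too small to hold the byte pair")
    payload = bytearray(os.urandom(size))
    payload[MAGIC_OFFSET:MAGIC_OFFSET + 2] = port.to_bytes(2, "big")
    return bytes(payload)
```

Calling make_test_datagram() gives the minimal 1476-byte case ending in "0e c8"; passing port=3785 gives the "0e c9" variant.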

IP FRAGMENTATION:

IP fragmentation happens when UDP hands to IP a datagram that doesn't fit into a single MTU-sized Ethernet packet (1500 bytes). A UDP datagram consists of an 8-byte header, followed by up to 65,527 bytes of UDP payload. IP splits a large datagram up into fragments of 1480 bytes each and prepends its own 20-byte IP header to each fragment. But note that only the first fragment will contain the UDP header. So IP fragment #1 will hold the 8-byte UDP header and the first 1472 bytes of my datagram.

Since my test datagram is 1476 bytes long, IP fragment #2 will contain a 20-byte IP header followed by the last 4 bytes of my datagram.
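The arithmetic above can be sketched in a few lines of Python. This is a simplified model, assuming a 20-byte IP header with no options and an 8-byte UDP header; the function name is made up for illustration.

```python
IP_HEADER = 20   # assumes no IP options
UDP_HEADER = 8

def fragment_layout(udp_payload_len, mtu=1500):
    """Return a list of (offset, length) pairs, one per IP fragment.
    Both values are in bytes of IP payload (UDP header plus data)."""
    total = UDP_HEADER + udp_payload_len
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    layout = []
    offset = 0
    while offset < total:
        length = min(per_fragment, total - offset)
        layout.append((offset, length))
        offset += length
    return layout

print(fragment_layout(1476))  # my test datagram
print(fragment_layout(1472))  # largest payload that still fits in one packet
```

For the 1476-byte test datagram this gives two fragments: 1480 bytes at offset 0 and 4 bytes at offset 1480.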

I won't show you the first fragment of my test datagram because it's long and boring. And it is successfully handled by Cisco, so it's also not relevant.

Here's a tcpdump of the second fragment of my test datagram (the test datagram bytes are the "0000 0ec8" at offset 0x22 in the hex dump). Note that tcpdump includes a 14-byte Ethernet header in front of the 20-byte IP header, then the last 4 bytes of my test datagram, and finally 22 padding nulls to make up a minimum-size packet (those nulls are not counted as part of the IP payload).

07:56:38.518614 00:1e:c9:4e:a1:92 (oui Unknown) > 01:00:5e:65:03:01 (oui Unknown), ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 2, id 2132, offset 1480, flags [none], proto: UDP (17), length: 24) 10.29.3.88 > 239.101.3.1: udp

0x0000: 0100 5e65 0301 001e c94e a192 0800 4500 ..^e.....N....E.

0x0010: 0018 0854 00b9 0211 afed 0a1d 0358 ef65 ...T.........X.e

0x0020: 0301 0000 0ec8 0000 0000 0000 0000 0000 ................

0x0030: 0000 0000 0000 0000 0000 0000 ............

This is the packet which is successfully received by hosts on the same switch as the sender, but is never received by hosts on a different switch. Change the "0e c8" byte pair to, for example, "1e c8" or "0e c7" and everything works fine - the packet is properly forwarded.

A CASE OF MISTAKEN IDENTITY?

In my problematic datagram, the last 4 bytes occupy the same packet position in fragment #2 as the UDP header in a non-fragmented packet. In particular, the byte pair "0e c8" occupies the same packet position as the UDP destination port in a non-fragmented packet. Those byte values correspond to port 3784, which is used by the BFD protocol. BFD is used to quickly detect failures in the path between adjacent forwarding switches and routers, so it is of special interest to our switches. (The other problematic byte pair "0e c9" corresponds to port 3785, which is also used by BFD.)
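Here's a minimal Python sketch of the mistake I suspect. To be clear, this is a guess at the logic, not Cisco's actual code, and the function names are made up for illustration. FRAME is the captured fragment #2 from the tcpdump above. The naive check reads the two bytes where a UDP destination port would normally sit; the careful check first makes sure the packet is not a non-initial fragment.

```python
import struct

ETH_HEADER = 14
BFD_PORTS = {3784, 3785}

# The captured second fragment, copied from the hex dump.
FRAME = bytes.fromhex(
    "0100 5e65 0301 001e c94e a192 0800 4500"
    "0018 0854 00b9 0211 afed 0a1d 0358 ef65"
    "0301 0000 0ec8 0000 0000 0000 0000 0000"
    "0000 0000 0000 0000 0000 0000"
)

def ip_fields(frame):
    """Return (IP header length in bytes, fragment offset in 8-byte units, protocol)."""
    ihl = (frame[ETH_HEADER] & 0x0F) * 4
    flags_frag, = struct.unpack_from("!H", frame, ETH_HEADER + 6)
    proto = frame[ETH_HEADER + 9]
    return ihl, flags_frag & 0x1FFF, proto

def naive_is_bfd(frame):
    """Hypothetical buggy check: ignores the fragment offset."""
    ihl, _, proto = ip_fields(frame)
    dport, = struct.unpack_from("!H", frame, ETH_HEADER + ihl + 2)
    return proto == 17 and dport in BFD_PORTS

def careful_is_bfd(frame):
    """Only trust the port when this packet actually carries a UDP header."""
    _, frag_offset, _ = ip_fields(frame)
    return frag_offset == 0 and naive_is_bfd(frame)

print(naive_is_bfd(FRAME), careful_is_bfd(FRAME))
```

The naive check claims my fragment is BFD; the careful one does not.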

So, when a LAB switch sends fragment #2 to the MDF, it looks like the MDF is checking the UDP port WITHOUT looking at the IP header's "Fragment Offset" field. It should only look for a UDP port if the fragment offset is zero. Here's that packet again; the fragment offset is the "00b9" at offset 0x14:

0x0000: 0100 5e65 0301 001e c94e a192 0800 4500 ..^e.....N....E.

0x0010: 0018 0854 00b9 0211 afed 0a1d 0358 ef65 ...T.........X.e

0x0020: 0301 0000 0ec8 0000 0000 0000 0000 0000 ................

0x0030: 0000 0000 0000 0000 0000 0000 ............

For most (non-fragmented) packets, that field will be zero, and the UDP header will be present, in which case the 0ec8 would be the port number. The fragment offset of b9 hex is 185 decimal, and IP fragment offset is measured in units of 8-byte blocks, so the actual offset is 8*185=1480, which is what tcpdump shows for "offset".
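The same decoding as a short Python sketch (the function name is made up for illustration). The 16-bit field at bytes 6 and 7 of the IP header holds three flag bits followed by a 13-bit fragment offset.

```python
def decode_frag_field(field):
    """Split the 16-bit IP flags/fragment-offset field into
    (dont_fragment, more_fragments, offset_in_bytes)."""
    dont_fragment = bool(field & 0x4000)
    more_fragments = bool(field & 0x2000)
    offset_bytes = (field & 0x1FFF) * 8   # offset is stored in 8-byte units
    return dont_fragment, more_fragments, offset_bytes

print(decode_frag_field(0x00B9))  # the field from my second fragment
```

For 0x00b9 no flags are set and the offset comes out as 1480, matching tcpdump's "offset 1480, flags [none]".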

It also seems strange to me that the switch ignores which multicast group I'm sending to. I can send to any valid multicast group, and the problematic packet will be eaten by the "MDF" switch. Shouldn't there be a specific multicast group for BFD? Maybe I found 2 bugs?

My employer has a support contract with Cisco, and I'm working with the internal network group to get a Cisco ticket opened. I'll update as I learn more, but it's slow climbing through the various levels of internal and external tech support, each one of whom starts out with, "are you sure it's plugged in?" It may take weeks to find somebody who even knows what IP fragmentation is.

TRY IT YOURSELF

I would love to hear from others who can try this out on their own networks. Grab the source files:

To build on Linux do:
gcc -o msend msend.c
gcc -o mdump mdump.c

Note that I've tried other operating systems (Windows and Solaris), with the same test results. This is not an OS issue.

For this test, the main purpose of mdump is to get the host to join the multicast group.

Choose three hosts: A, B, and C. Make sure A and B are on the same switch, and C is on a different switch. In my case, all three hosts are on the same VLAN; I don't know if that is significant. For this example, let's assume that the three hosts' IP addresses are 10.29.1.1, 10.29.1.2, and 10.29.1.3 respectively, and that all NICs are named "eth0".

Choose a multicast group and UDP port that aren't being used in your network. I chose 239.101.3.1 and 12000. I've tried others as well, with the same test results.

Note that the msend and mdump commands require you to put the host's primary IP address as the 3rd command-line parameter. This is because multicast needs to be told explicitly which interface to use (normal IP routing doesn't know the "right" interface to use).
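For the curious, here is roughly what "telling multicast which interface to use" looks like at the socket level. msend and mdump are C programs and this is not their source; it is a minimal Python sketch with made-up function names, using the standard IP_ADD_MEMBERSHIP and IP_MULTICAST_IF socket options.

```python
import socket

def make_mreq(group, interface_ip):
    """Build the ip_mreq structure: multicast group address followed by
    the address of the local interface to join on."""
    return socket.inet_aton(group) + socket.inet_aton(interface_ip)

def join_group(group, port, interface_ip):
    """Receiver side: bind to the port and join the group on one interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    make_mreq(group, interface_ip))
    return sock

def open_sender(interface_ip, ttl=2):
    """Sender side: pick the outgoing interface for multicast datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(interface_ip))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock
```

On host B of the example that would be join_group("239.101.3.1", 12000, "10.29.1.2").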

Open a window to A, and two windows each for B and C. Enter the following commands:

B1: ./mdump 239.101.3.1 12000 10.29.1.2

B2: tcpdump -i eth0 -s2000 -vvv -XX -e host 239.101.3.1

C1: ./mdump 239.101.3.1 12000 10.29.1.3

C2: tcpdump -i eth0 -s2000 -vvv -XX -e host 239.101.3.1

A: ./msend 239.101.3.1 12000 10.29.1.1

The "msend" command sends two datagrams. The first one is small and gives the sending host's name. The second one is the 1476-byte datagram, whose second fragment gets eaten by the Cisco "MDF" switch.

Window B1 should show both datagrams fully received.

B2 should show 3 packets:

1. The short packet with the host name.

2. Fragment #1 of the long packet

3. Fragment #2 of the long packet

C1 should only show the first datagram.

C2 should show 2 packets:

1. The short packet with the host name.

2. Fragment #1 of the long packet.

Fragment #2 is missing from C2, presumably eaten by the "MDF" switch.

Note that the two "tcpdump" windows might show additional packets, which are for the "igmp" protocol, and are unrelated to the test. If I had more time, I would figure out how to get "tcpdump" to ignore them.

We just discovered that under a somewhat unlikely set of circumstances, Microsoft's Windows 7 (SP 1) will corrupt outgoing UDP datagrams. I have a simple demonstration program ("rsend") which reliably reproduces the bug. (I'll be pointing my Microsoft contact at this blog post.)

This bug was discovered by a customer, and we were able to reproduce it locally. I wish I could take the credit, but my friend and colleague, Dave Zabel, did most of the detective work. And amazing detective work it was! But I'll leave that description for another day. Let's concentrate on the bug.

CIRCUMSTANCES FOR THE BUG

1. UDP protocol. (Duh!)
2. Multicast sends. (Does not happen with unicast UDP.)
3. A process on the same machine must be joined to the same multicast group that is being sent to.
4. Windows' IP MTU size set to smaller than the default 1500 (I tested 1300).
5. Sending datagrams large enough to require fragmentation by the reduced MTU, but still small enough *not* to require fragmentation with a 1500-byte MTU.

With that mix, you stand a good chance of the outgoing data having two bytes changed. It seems to be somewhat dependent on the content of the datagram. For example, a datagram consisting mostly of zeros doesn't seem to get corrupted. But it's not that hard to find datagram content that *is* consistently corrupted, so my "rsend" demonstration program has one such datagram hard-coded.

Regarding #3, for convenience the rsend program contains code to join the multicast group, but I've also reproduced it without rsend joining, and instead running "mdump" in a different window.

Finally, be aware that I have not done a bunch of sensitivity testing. I.e. I haven't tried different datagram sizes, different multicast groups, different MTU settings, jumbo frames, etc. Nor did I try different versions of Windows (only 7), different NICs, etc. Sorry, I don't have time to experiment.

BUG DEMONSTRATION

This procedure assumes that you have a Windows machine with its MTU at its default of 1500. (You'll change it below.)

1. Build "rsend.c" on Windows with VS 2005. Here's how I build it (from a Visual Studio command prompt):

cl -D_MT -MD -DWIN32_LEAN_AND_MEAN -I. /Oi -Forsend.obj -c rsend.c

link /OUT:rsend.exe ws2_32.lib mswsock.lib /MACHINE:I386 /SUBSYSTEM:console /NODEFAULTLIB:LIBCMT rsend.obj

mt -manifest rsend.exe.manifest -outputresource:rsend.exe;1

2. Run the command, giving it the IP address of the Windows machine's interface that you want the multicast to go out of. For single-homed hosts, just give it the IP address of the machine. For example:
rsend 10.1.2.3
To make the tool easy to use, it hard codes the multicast group 239.196.2.128 and destination port 12000.

3. On a separate machine, do a packet capture for that multicast group. Note that the packet capture utilities I know of (wireshark, tcpdump) do *not* tell the kernel to actually join the multicast group. I generally deal with this using the "mdump" tool. Run it in a separate window. For example:
mdump 239.196.2.128 12000

In the packet capture, look at the 1278th and 1279th bytes of the UDP datagram data: they should both be zero. Here they are, with a few bytes preceding them:
0x75,0x34,0x34,0xa4,0xc5,0xb4,0x00,0x00
NOTE: at this point, the datagram will fit in a single ethernet frame, so no IP fragmentation happens.

4. While rsend is running, open a command prompt with administrator privilege (right-click on "command prompt" icon and select "run as administrator") and enter:
netsh interface ipv4 set subinterface "Local Area Connection" mtu=1300 store=persistent

Like magic, bytes 1278 and 1279 of the outgoing UDP datagrams change their values! Note that with an MTU of 1300, this UDP datagram now needs to be fragmented. If using wireshark, you'll need to examine the *second* packet to see the entire UDP datagram and get to byte 1278. I consistently see 0x62,0x27, but that seems to be dependent on datagram content as well.

5. Undo the MTU change:
netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent

Magically, the bytes go back to their correct values of 0x00,0x00.

Note: if you comment out the setsockopt of IP_ADD_MEMBERSHIP, the corruption will not happen. The multicast datagrams will still go out, but they will be undamaged when the MTU is reduced. The obvious suspect is the internal loopback.

SOLUTION

The only solution I know of is to leave the Windows IP MTU at its default of 1500.

WHY SET MTU 1300???

I don't know why our customer set it on one of his systems. But he said that he would just set it back to 1500, so it must not have been important.

If you google "windows set mtu size" you'll find people asking about it. In many cases, the user is trying to reach a web site which is across a VPN or some other private WAN link which does not have an MTU of 1500. The way TCP works is that it tries to send segments (a TCP segment is basically an IP datagram) as large as possible while avoiding IP fragmentation. So a TCP instance sending data might start with a 1500-byte segment size. If a network hop in-transit cannot handle a segment that large, it has a choice: either fragment it or reject it. TCP explicitly sets an option to say, "do not fragment," so the network hop drops the segment. It is supposed to return an ICMP error, which the sender's TCP instance will use to reduce its segment size. This algorithm is known as TCP's "path MTU discovery".

But many network components either do not generate ICMP errors, or do not forward them. This is supposedly done in the name of "security" (don't get me started). This breaks path MTU discovery. But the segments are still being dropped, so eventually the TCP sender times out and the web site doesn't work. Apparently this is fairly rare, but it does happen. Hence the "set MTU" questions. If the user reduces IP's MTU setting, it artificially reduces the maximum segment size used by TCP. Do a bit of experimenting to find the right value, et Voila! (French for "finally, I can download porn!")

So, how could Microsoft possibly not find this during their extensive testing? Well first of all, UDP use is rare compared to TCP. Multicast UDP is even more rare. Sending UDP multicast datagrams larger than MTU is getting close to unicorn rare. And doing all that with the IP MTU set to a non-standard value? Heck, I consider myself to be a pretty rigorous tester, and I would never have tried that.

UPDATE:

Thanks to Mr. Anonymous for asking the question about NIC offloading. We had considered the question previously (see my response in the comments), but in composing my response, I got to thinking about the offset of the corruption.

It's always in the second packet of the fragmented datagram, and always at 1278. But that offset is with respect to the start of the UDP payload. What is the offset with respect to the start of the second packet? I didn't look at this before since Wireshark's ability to reassemble fragmented datagrams is so handy. But I went ahead and clicked the "Frame" tab and saw that the corruption happens at offset 40 from the start of the packet.

Guess where the UDP checksum belongs in an UNfragmented datagram! Yep, offset 40. Something decided to take the second packet of the fragmented datagram and insert a separate UDP checksum where it *should* go if it were not a fragment.
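The two offsets line up if you work through the header sizes. Here is a small Python sketch of that arithmetic, assuming a 14-byte Ethernet header, a 20-byte IP header with no options, zero-based byte numbering, and a second fragment that starts at IP payload offset 1280 (what an MTU of 1300 gives you). The names are made up for illustration.

```python
ETH_HEADER = 14
IP_HEADER = 20   # assumes no IP options
UDP_HEADER = 8

# The checksum is the 4th 16-bit field of the UDP header (byte 6 of the header).
UDP_CHECKSUM_FRAME_OFFSET = ETH_HEADER + IP_HEADER + 6

def payload_index(frame_offset, fragment_offset_bytes):
    """Map a byte offset inside a captured frame to an index into the
    reassembled UDP payload."""
    ip_payload_pos = fragment_offset_bytes + (frame_offset - ETH_HEADER - IP_HEADER)
    return ip_payload_pos - UDP_HEADER

print(UDP_CHECKSUM_FRAME_OFFSET)   # where the checksum sits in an unfragmented frame
print(payload_index(40, 1280))     # frame offset 40 in the second fragment
```

Frame offset 40 in the second fragment maps to payload byte 1278, which is exactly where the corruption shows up.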

This still seems like a software bug in Windows. Sure, maybe the NIC is doing the actual calculation. Maybe it's not. But it only happens when IP is configured for a non-standard MTU. If I have MTU=1500 and I send a fragmented datagram, there is no corruption.

UPDATE 2:

I did some experimenting with datagram size and verified something that I suspected. When the MTU is set to 1300, the corruption only happens when the datagram size is such that a 1500-byte MTU would *not* fragment but a 1300-byte MTU does. I.e. there is a size range of 200 bytes (the difference between 1300 and 1500). This is another reason Microsoft's testers apparently didn't discover this. Even if they tested fragmentation with non-standard MTUs, would they think to test a size in that specific range? With the benefit of hindsight, sure, it's "obvious". But if you're just testing combinations of configurations, you would just pick the "send fragments" combination, which is probably chosen to fragment with MTU 1500. (FYI: I've updated the original post to refine the conditions of the bug.)
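That size range is easy to pin down. A minimal Python sketch, assuming a 20-byte IP header and an 8-byte UDP header (function names made up for illustration):

```python
def max_unfragmented_payload(mtu):
    """Largest UDP payload that fits in one packet: MTU minus the
    20-byte IP header and the 8-byte UDP header."""
    return mtu - 20 - 8

def in_affected_range(payload_len, reduced_mtu=1300, default_mtu=1500):
    """True if the payload fragments at the reduced MTU but would not
    fragment at the default MTU."""
    return (max_unfragmented_payload(reduced_mtu)
            < payload_len
            <= max_unfragmented_payload(default_mtu))

affected = [n for n in range(1, 2000) if in_affected_range(n)]
print(affected[0], affected[-1], len(affected))
```

With MTU 1300 that works out to payload sizes 1273 through 1472, i.e. 200 different sizes.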

I'm normally not a Microsoft cheerleader, so it feels weird to be defending them on this bug. :-)

UPDATE 3:

Since we noticed that the corruption always happens at offset 40 in the second packet, I decreased the size of the datagram to only include half of the corrupted pair. Sure enough, the last byte of the datagram got corrupted. And the second corrupted byte? Who knows. I kind of hoped it would corrupt something in Windows and maybe blue-screen it, but no such luck. I didn't "see" any misbehavior.

Does that mean there *was* no misbehavior? NO! The outgoing datagrams suddenly had bad checksums! Meaning that the mdump tool stopped receiving them since Linux discards datagrams with bad checksums. But tcpdump captures the packets *before* UDP discards them, so you can see the bad checksums.
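If you want to check a captured datagram's checksum yourself, here is a minimal Python sketch of the standard UDP-over-IPv4 checksum (one's complement sum over a pseudo-header plus the UDP header and data). It is an illustration, not what Linux or tcpdump literally runs, and the function names are made up.

```python
import socket
import struct

def ones_complement_sum(data):
    """16-bit one's complement sum of `data`, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def pseudo_header(src_ip, dst_ip, udp_len):
    """IPv4 pseudo-header: source, destination, zero, protocol 17, UDP length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, udp_len))

def udp_checksum(src_ip, dst_ip, segment):
    """Checksum for a UDP segment whose checksum field is currently zero."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    csum = ~ones_complement_sum(data) & 0xFFFF
    return csum or 0xFFFF   # a computed zero is transmitted as 0xFFFF

def checksum_ok(src_ip, dst_ip, segment):
    """Verify a received segment (checksum field filled in)."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF
```

Feed checksum_ok the reassembled UDP header plus payload from the capture, along with the source and destination addresses from the IP header. A receiver that finds it false throws the datagram away.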

I kept decreasing the size of the datagram till it was 1273 bytes. That still triggers fragmentation when MTU=1300. The outgoing datagrams had no visible corruption but had bad checksums. Reduce one more byte, and the datagram fits in one packet. Suddenly the checksums are OK.

I tried a few things, like sending packets hard, and varying their sizes, but other than the bad checksums I could not see any obvious Windows misbehavior.

I guess my days as a white-hat hacker are over before they started. (Did I get the tenses right on that sentence?)

Well, I think I'm done experimenting. If anybody else reproduces it, please let me know your Windows version.

UPDATE 4:

I heard back from my contact at MS. He said:

"We've looked into this, and see what is happening. If the customer needs to pursue this rather than using a work around (e.g. not setting the MTU size on the loopback path to a different size than the non-loopback interface, etc.) they will need to open a support ticket. Thank you for letting me know about this."

Which I suspect translates to, "We'll fix it in a future version, but not urgently. If you need urgency, pay for support." :-)

REDDIT:

Finally, Hi Reddit users! Thanks for pushing the hits on this post to many times the total hit count for the whole rest of the blog. :-) I read the comments and saw that my first update had already been noticed by somebody else.

Also, something a lot of Reddit comments have fixated on is my claim that UDP multicast is rare. I meant that the number of programs (and programmers) that use it is very small compared to all software, not that multicast is hardly ever used. As pointed out, there are several areas of network infrastructure which are multicast-based, so it gets used all the time. My point is that the number of programmer-hours spent *writing and testing* multicast-based software is very small compared to the overall networking software field. And as such, it tends not to be as burned-in as, say, TCP.

Also, in most multicast software that I have learned the guts of, the programmer makes sure that datagram sizes are kept small so as to avoid fragmentation. This seems to be due to the commonly-held idea that you should *never* let IP fragment, which I think comes from the fact that, at least historically, router performance is hurt if it has to perform fragmentation while a datagram is in transit. I'm not sure if this is still true for modern routers, but historically fragmentation needed to be handled by the supervisory processor. For the odd packet every now and then, no problem. For high-rate data flows, it can kill a router.

That seems to be the basis on which a lot of multicast software avoids fragmentation, preferring instead to split large messages into multiple datagrams. But this reasoning is often not applicable. Our software is intended primarily to be used within a single data center. When we send a 2K datagram, no router needs to worry about fragmenting it; the sending host's IP stack splits the datagram into packets before they hit the wire. The intermediate switches and routers all have 1500 MTU, allowing the packets to traverse unmolested. The final receiving host(s) reassemble and pass the datagram to user space. This has a noticeable advantage for high-performance applications since the same amount of user data is passed with fewer system calls (the overhead of switching between user and kernel space is significant).

So while I'm sure our software is not alone in sending fragmented multicast datagrams, I stand behind my claim that sending fragmented multicast is relatively rare.

Friday, March 31, 2017

Cisco Eating Multicast Fragments???

Thursday, October 15, 2015

Windows corrupting UDP datagrams
