Thursday, October 15, 2015

Windows corrupting UDP datagrams


We just discovered that under a somewhat unlikely set of circumstances, Microsoft's Windows 7 (SP 1) will corrupt outgoing UDP datagrams.  I have a simple demonstration program ("rsend") which reliably reproduces the bug.  (I'll be pointing my Microsoft contact at this blog post.)

This bug was discovered by a customer, and we were able to reproduce it locally.  I wish I could take the credit, but my friend and colleague, Dave Zabel, did most of the detective work.  And amazing detective work it was!  But I'll leave that description for another day.  Let's concentrate on the bug.


CIRCUMSTANCES FOR THE BUG

1. UDP protocol.  (Duh!)
2. Multicast sends.  (Does not happen with unicast UDP.)
3. A process on the same machine must be joined to the same multicast group that is being sent to.
4. Windows' IP MTU size set smaller than the default of 1500 (I tested 1300).
5. Sending datagrams large enough to require fragmentation by the reduced MTU, but still small enough *not* to require fragmentation with a 1500-byte MTU.

With that mix, you stand a good chance of the outgoing data having two bytes changed.  It seems to be somewhat dependent on the content of the datagram.  For example, a datagram consisting mostly of zeros doesn't seem to get corrupted.  But it's not that hard to find datagram content that *is* consistently corrupted, so my "rsend" demonstration program has one such datagram hard-coded.

Regarding #3: for convenience, the rsend program contains code to join the multicast group, but I've also reproduced the bug without rsend joining, by instead running "mdump" in a different window.
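
Purely for illustration, here is a minimal sketch of a sender that hits all of the conditions above.  This is *not* rsend's actual source: the group, port, and the join match what rsend hard-codes, but the payload, names, and everything else are illustrative, and error handling is omitted.

    #include <stdio.h>
    #include <string.h>
    #include <winsock2.h>
    #include <ws2tcpip.h>

    int main(int argc, char **argv)
    {
        WSADATA wsa;
        SOCKET s;
        struct sockaddr_in dest;
        struct ip_mreq mreq;
        struct in_addr iface;
        char payload[1400];      /* condition 5: fragments at MTU 1300, not at 1500 */
        int i;

        if (argc < 2) { printf("usage: sender <interface-ip>\n"); return 1; }

        WSAStartup(MAKEWORD(2, 2), &wsa);
        s = socket(AF_INET, SOCK_DGRAM, 0);    /* condition 1: UDP */

        /* Condition 2: multicast sends, out the interface given on the command line. */
        iface.s_addr = inet_addr(argv[1]);
        setsockopt(s, IPPROTO_IP, IP_MULTICAST_IF, (char *)&iface, sizeof(iface));

        /* Condition 3: a process on this machine is joined to the group being sent to.
         * (rsend does the join itself for convenience; running mdump locally works too.) */
        mreq.imr_multiaddr.s_addr = inet_addr("239.196.2.128");
        mreq.imr_interface.s_addr = iface.s_addr;
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, (char *)&mreq, sizeof(mreq));

        /* Content matters: an all-zero datagram doesn't seem to get corrupted, so fill
         * the payload with something non-trivial.  (rsend hard-codes a datagram that
         * corrupts reliably; this fill pattern is just an illustration.) */
        for (i = 0; i < (int)sizeof(payload); i++)
            payload[i] = (char)(i * 37);

        memset(&dest, 0, sizeof(dest));
        dest.sin_family = AF_INET;
        dest.sin_addr.s_addr = inet_addr("239.196.2.128");
        dest.sin_port = htons(12000);

        while (1) {
            sendto(s, payload, sizeof(payload), 0,
                   (struct sockaddr *)&dest, sizeof(dest));
            Sleep(100);  /* keep sending so you can watch the capture while changing the MTU */
        }
    }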

Finally, be aware that I have not done a bunch of sensitivity testing.  I.e. I haven't tried different datagram sizes, different multicast groups, different MTU settings, jumbo frames, etc.  Nor did I try different versions of Windows (only 7), different NICs, etc.  Sorry, I don't have time to experiment.


BUG DEMONSTRATION

This procedure assumes that you have a Windows machine with its MTU at its default of 1500.  (You change it below.)

1. Build "rsend.c" on Windows with VS 2005.  Here's how I build it (from a Visual Studio command prompt):

  cl -D_MT -MD -DWIN32_LEAN_AND_MEAN -I. /Oi -Forsend.obj -c rsend.c

  link /OUT:rsend.exe ws2_32.lib mswsock.lib /MACHINE:I386 /SUBSYSTEM:console /NODEFAULTLIB:LIBCMT rsend.obj

  mt -manifest rsend.exe.manifest -outputresource:rsend.exe;1

2. Run the command, giving it the IP address of the Windows machine's interface that you want the multicast to go out of.  For single-homed hosts, just give it the IP address of the machine.  For example:
    rsend 10.1.2.3
To make the tool easy to use, it hard codes the multicast group 239.196.2.128 and destination port 12000.

3. On a separate machine, do a packet capture for that multicast group.  Note that the packet capture utilities I know of (wireshark, tcpdump) do *not* tell the kernel to actually join the multicast group.  I generally deal with this using the "mdump" tool.  Run it in a separate window.  For example:
    mdump 239.196.2.128 12000

In the packet capture, look at bytes 1278 and 1279 of the UDP datagram data (offsets from the start of the UDP payload): they should both be zero.  Here they are, with a few bytes preceding them:
    0x75,0x34,0x34,0xa4,0xc5,0xb4,0x00,0x00
NOTE: at this point, the datagram will fit in a single ethernet frame, so no IP fragmentation happens.
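
If you don't have mdump handy, any program that joins the group and listens will do, since the point is just to make the capture machine's kernel actually subscribe to the group.  Here's a hedged sketch of such a join-and-listen receiver for a Linux capture box (illustrative only, error handling omitted):

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in local;
        struct ip_mreq mreq;
        char buf[65536];

        /* Bind to the destination port rsend uses. */
        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons(12000);
        bind(s, (struct sockaddr *)&local, sizeof(local));

        /* Join the group so the kernel (and NIC) actually accept the datagrams. */
        mreq.imr_multiaddr.s_addr = inet_addr("239.196.2.128");
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

        for (;;) {
            ssize_t n = recvfrom(s, buf, sizeof(buf), 0, NULL, NULL);
            printf("received %zd bytes\n", n);
        }
    }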

4.  While rsend is running, open a command prompt with administrator privilege (right-click on "command prompt" icon and select "run as administrator") and enter:
    netsh interface ipv4 set subinterface "Local Area Connection" mtu=1300 store=persistent

Like magic, bytes 1278 and 1279 of the outgoing UDP datagrams change their values!  Note that with an MTU of 1300, this UDP datagram now needs to be fragmented.  If using wireshark, you'll need to examine the *second* packet to see the entire UDP datagram and get to byte 1278.  I consistently see 0x62,0x27, but that seems to be dependent on datagram content as well.
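
(To spell out why it's the second packet: with a 1300-byte MTU, the first fragment carries 1300 - 20 = 1280 bytes of IP payload, and after the 8-byte UDP header that's UDP payload bytes 0 through 1271.  Bytes 1278 and 1279 can therefore only be in the second fragment.)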

5. Undo the MTU change:
    netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent

Magically, the bytes go back to their correct values of 0x00,0x00.

Note: if you comment out the setsockopt of IP_ADD_MEMBERSHIP, the corruption will not happen.  The multicast datagrams will still go out, but they will be undamaged when the MTU is reduced.  The obvious suspect is the internal loopback.


SOLUTION

The only solution I know of is to leave the Windows IP MTU at its default of 1500.


WHY SET MTU 1300???

I don't know why our customer set it on one of his systems.  But he said that he would just set it back to 1500, so it must not have been important.

If you google "windows set mtu size" you'll find people asking about it.  In many cases, the user is trying to reach a web site which is across a VPN or some other private WAN link which does not have an MTU of 1500.  The way TCP works is that it tries to send segments (a TCP segment is basically an IP datagram) as large as possible while avoiding IP fragmentation.  So a TCP instance sending data might start with a 1500-byte segment size.  If a network hop in-transit cannot handle a segment that large, it has a choice: either fragment it or reject it.  TCP explicitly sets an option to say, "do not fragment," so the network hop drops the segment.  It is supposed to return an ICMP error, which the sender's TCP instance will use to reduce its segment size.  This algorithm is known as TCP's "path MTU discovery".
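
As an aside, and purely as an illustration (this is not part of rsend or of the bug), here is what asking IP not to fragment looks like from an application's point of view on Windows.  IP_DONTFRAGMENT is the Winsock option that sets the same DF bit TCP relies on for path MTU discovery; the function name is mine and error handling is omitted.

    #include <winsock2.h>
    #include <ws2tcpip.h>    /* IP_DONTFRAGMENT */

    /* Set the DF bit on everything sent through socket s.  With DF set, a router
     * that would otherwise have to fragment the packet must drop it instead and
     * (in theory) send back an ICMP "fragmentation needed" error. */
    static void set_dont_fragment(SOCKET s)
    {
        DWORD on = 1;
        setsockopt(s, IPPROTO_IP, IP_DONTFRAGMENT, (const char *)&on, sizeof(on));
    }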

But many network components either do not generate ICMP errors, or do not forward them.  This is supposedly done in the name of "security" (don't get me started).  This breaks path MTU discovery.  But the segments are still being dropped, so eventually the TCP sender times out and the web site doesn't work.  Apparently this is fairly rare, but it does happen.  Hence the  "set MTU" questions.  If the user reduces IP's MTU setting, it artificially reduces the maximum segment size used by TCP.  Do a bit of experimenting to find the right value, et Voila!  (French for "finally, I can download porn!")
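
For example (numbers purely illustrative): if the path's real MTU is 1400 and the ICMP errors never make it back, dropping Windows' IP MTU to 1300 caps TCP's segments at roughly 1260 bytes of payload (1300 minus 20 bytes of IP header and 20 bytes of TCP header), which then fit through the 1400-byte path without being dropped.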

So, how could Microsoft possibly not find this during their extensive testing?  Well first of all, UDP use is rare compared to TCP.  Multicast UDP is even more rare.  Sending UDP multicast datagrams larger than MTU is getting close to unicorn rare.  And doing all that with the IP MTU set to a non-standard value?  Heck, I consider myself to be a pretty rigorous tester, and I would never have tried that.


UPDATE:

Thanks to Mr. Anonymous for asking the question about NIC offloading.  We had considered the question previously (see my response in the comments), but in composing my response, I got to thinking about the offset of the corruption.

It's always in the second packet of the fragmented datagram, and always at 1278.  But that offset is with respect to the start of the UDP payload.  What is the offset with respect to the start of the second packet?  I didn't look at this before since Wireshark's ability to reassemble fragmented datagrams is so handy.  But I went ahead and clicked the "Frame" tab and saw that the corruption happens at offset 40 from the start of the packet.

Guess where the UDP checksum belongs in an UNfragmented datagram!  Yep, offset 40.  Something decided to take the second packet of the fragmented datagram and insert a separate UDP checksum where it *should* go if it were not a fragment.
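
The arithmetic lines up on both counts.  In the second fragment, frame offset 40 is 14 bytes of Ethernet header, plus 20 bytes of IP header, plus 6 bytes into the fragment's data; and byte 6 of the second fragment's data is UDP payload byte 1278, because the first fragment carried 1280 - 8 = 1272 payload bytes.  In an unfragmented datagram, frame offset 40 is 14 + 20 + 6, and offset 6 of the UDP header is exactly where the checksum field lives.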

This still seems like a software bug in Windows.  Sure, maybe the NIC is doing the actual calculation.  Maybe it's not.  But it only happens when IP is configured for a non-standard MTU.  If I have MTU=1500 and I send a fragmented datagram, there is no corruption.


UPDATE 2:

I did some experimenting with datagram size and verified something that I suspected.  When the MTU is set to 1300, the corruption only happens when the datagram size is such that a 1500-byte MTU would *not* fragment but a 1300-byte MTU does.  I.e. there is a size range of 200 bytes (the difference between 1300 and 1500).   This is another reason Microsoft's testers apparently didn't discover this.  Even if they tested fragmentation with non-standard MTUs, would they think to test a size in that specific range?  With the benefit of hindsight, sure, it's "obvious".  But if you're just testing combinations of configurations, you would just pick the "send fragments" combination, which is probably chosen to fragment with MTU 1500.  (FYI: I've updated the original post to refine the conditions of the bug.)
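
Concretely: with a 20-byte IP header and an 8-byte UDP header, the largest UDP payload that avoids fragmentation is 1500 - 28 = 1472 bytes at MTU 1500 and 1300 - 28 = 1272 bytes at MTU 1300, so the corrupting payload sizes are 1273 through 1472 bytes, a 200-byte window.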

I'm normally not a Microsoft cheerleader, so it feels weird to be defending them on this bug.  :-)


UPDATE 3:

Since we noticed that the corruption always happens at offset 40 in the second packet, I decreased the size of the datagram to include only half of the corrupted pair.  Sure enough, the last byte of the datagram got corrupted.  And the second corrupted byte?  Who knows.  I kind of hoped it would corrupt something in Windows and maybe blue-screen it, but no such luck.  I didn't "see" any misbehavior.

Does that mean there *was* no misbehavior?  NO!  The outgoing datagrams suddenly had bad checksums!  Meaning that the mdump tool stopped receiving them since Linux discards datagrams with bad checksums.  But tcpdump captures the packets *before* UDP discards them, so you can see the bad checksums.
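
If you want to see the bad checksums for yourself, something like the following on the capture box should do it (the interface name is whatever your box uses; with -vv, tcpdump verifies UDP checksums and flags the bad ones):
    tcpdump -vv -i eth0 udp port 12000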

I kept decreasing the size of the datagram till it was 1273 bytes.  That still triggers fragmentation when MTU=1300.  The outgoing datagrams had no visible corruption but had bad checksums.  Reduce one more byte, and the datagram fits in one packet.  Suddenly the checksums are OK.

I tried a few things, like sending packets hard, and varying their sizes, but other than the bad checksums I could not see any obvious Windows misbehavior.

I guess my days as a white-hat hacker are over before they started.  (Did I get the tenses right on that sentence?)

Well, I think I'm done experimenting.  If anybody else reproduces it, please let me know your Windows version.


UPDATE 4:

I heard back from my contact at MS.  He said:

"We've looked into this, and see what is happening.  If the customer needs to pursue this rather than using a work around (e.g. not setting the MTU size on the loopback path to a different size than the non-loopback interface, etc.) they will need to open a support ticket.  Thank you for letting me know about this."
Which I suspect translates to, "We'll fix it in a future version, but not urgently.  If you need urgency, pay for support."  :-)


REDDIT:

Finally, Hi Reddit users!  Thanks for pushing the hits on this post to many times the total hit count for the whole rest of the blog.  :-)  I read the comments and saw that my first update had already been noticed by somebody else.

Also, something a lot of Reddit comments have fixated on is my claim that UDP multicast is rare.  I meant that the number of programs (and programmers) that use it is very small compared to all software, not that multicast is hardly ever used.  As pointed out, there are several areas of network infrastructure which are multicast-based, so it gets used all the time.  My point is that the number of programmer-hours spent *writing and testing* multicast-based software is very small compared to the overall networking software field.  And as such, it tends not to be as burned-in as, say, TCP.

Also, in most multicast software that I have learned the guts of, the programmer makes sure that datagram sizes are kept small so as to avoid fragmentation.  This seems to be due to the commonly-held idea that you should *never* let IP fragment, which I think comes from the fact that, at least historically, router performance suffered if the router had to fragment a datagram in transit.  I'm not sure whether this is still true for modern routers, but historically fragmentation needed to be handled by the supervisory processor.  For the odd packet every now and then, no problem.  For high-rate data flows, it can kill a router.

That seems to be the basis on which a lot of multicast software avoids fragmentation, preferring instead to split large messages into multiple datagrams.  But this reasoning is often not applicable.  Our software is intended primarily to be used within a single data center.  When we send a 2K datagram, no router needs to worry about fragmenting it; the sending host's IP stack splits the datagram into packets before they hit the wire.  The intermediate switches and routers all have a 1500-byte MTU, allowing the packets to traverse unmolested.  The final receiving host(s) reassemble and pass the datagram to user space.  This has a noticeable advantage for high-performance applications, since the same amount of user data is passed with fewer system calls (the overhead of switching between user and kernel space is significant).
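
For example (with the usual 1500-byte MTU and 20-byte IP header): a single 2048-byte datagram plus its 8-byte UDP header leaves the sender as two fragments, roughly 1480 and 576 bytes of IP payload, all from one send call; splitting it at the application layer into two datagrams produces the same two packets on the wire but costs two system calls on the sender and two receives on every receiving host.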

So while I'm sure our software is not alone in sending fragmented multicast datagrams, I stand behind my claim that sending fragmented multicast is relatively rare.