Thursday, October 15, 2015

Windows corrupting UDP datagrams

We just discovered that under a somewhat unlikely set of circumstances, Microsoft's Windows 7 (SP 1) will corrupt outgoing UDP datagrams.  I have a simple demonstration program ("rsend") which reliably reproduces the bug.  (I'll be pointing my Microsoft contact at this blog post.)

This bug was discovered by a customer, and we were able to reproduce it locally.  I wish I could take the credit, but my friend and colleague, Dave Zabel, did most of the detective work.  And amazing detective work it was!  But I'll leave that description for another day.  Let's concentrate on the bug.


1. UDP protocol.  (Duh!)
2. Multicast sends.  (Does not happen with unicast UDP.)
3. A process on the same machine must be joined to the same multicast group as being sent.
4. Window's IP MTU size set to smaller than the default 1500 (I tested 1300).
5. Sending datagrams large enough to require fragmentation by the reduced MTU, but still small enough *not* to require fragmentation with a 1500-byte MTU.

With that mix, you stand a good chance of the outgoing data having two bytes changed.  It seems to be somewhat dependent on the content of the datagram.  For example, a datagram consisting mostly of zeros doesn't seem to get corrupted.  But it's not that hard to find datagram content that *is* consistently corrupted, so my "rsend" demonstration program has one such datagram hard-coded.

Regarding #3, for convenience the rsend program contains code to join the multicast group, but I've also reproduced it without rsend joining, and instead running "mdump" in a different window.

Finally, be aware that I have not done a bunch of sensitivity testing.  I.e. I haven't tried different datagram sizes, different multicast groups, different MTU settings, jumbo frames, etc.  Nor did I try different versions of Windows (only 7), different NICs, etc.  Sorry, I don't have time to experiment.


This procedure assumes that you have a Windows machine with its MTU at its default of 1500.  (You change it below.)

1. Build "rsend.c" on Windows with VS 2005.  Here's how I build it (from a Visual Studio command prompt):

  cl -D_MT -MD -DWIN32_LEAN_AND_MEAN -I. /Oi -Forsend.obj -c rsend.c

  link /OUT:rsend.exe ws2_32.lib mswsock.lib /MACHINE:I386 /SUBSYSTEM:console /NODEFAULTLIB:LIBCMT rsend.obj

  mt -manifest rsend.exe.manifest -outputresource:rsend.exe;1

2. Run the command, giving it the ip address of the windows machine's interface that you want the multicast to go out of.  For single-homed hosts, just give it the IP address of the machine.  For example:
To make the tool easy to use, it hard codes the multicast group and destination port 12000.

3. On a separate machine, do a packet capture for that multicast group.  Note that the packet capture utilities I know of (wireshark, tcpdump) do *not* tell the kernel to actually join the multicast group.  I generally deal with this using the "mdump" tool.  Run it in a separate window.  For example:
    mdump 12000

In the packet capture, look at the 1278th and 1279th bytes of the UDP datagram data: they should both be zero.  Here they are, with a few bytes preceding them:
NOTE: at this point, the datagram will fit in a single ethernet frame, so no IP fragmentation happens.

4.  While rsend is running, open a command prompt with administrator privilege (right-click on "command prompt" icon and select "run as administrator") and enter:
    netsh interface ipv4 set subinterface "Local Area Connection" mtu=1300 store=persistent

Like magic, bytes 1278 and 1279 of the outgoing UDP datagrams change their values!  Note that with an MTU of 1300, this UDP datagram now needs to be fragmented.  If using wireshark, you'll need to examine the *second* packet to see the entire UDP datagram and get to byte 1278.  I consistently see 0x62,0x27, but that seems to be dependent on datagram content as well.

5. Undo the MTU change:
    netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent

Magically, the bytes go back to their correct values of 0x00,0x00.

Note: if you comment out the setsockopt of IP_ADD_MEMBERSHIP, the corruption will not happen.  The multicast datagrams will still go out, but they will be undamaged when the MTU is reduced.  The obvious suspect is the internal loopback.


The only solution I know of is to leave the Windows IP MTU at its default of 1500.

WHY SET MTU 1300???

I don't know why our customer set it on one of his systems.  But he said that he would just set it back to 1500, so it must not have been importat.

If you google "windows set mtu size" you'll find people asking about it.  In many cases, the user is trying to reach a web site which is across a VPN or some other private WAN link which does not have an MTU of 1500.  The way TCP works is that it tries to send segments (a TCP segment is basically an IP datagram) as large as possible while avoiding IP fragmentation.  So a TCP instance sending data might start with a 1500-byte segment size.  If a network hop in-transit cannot handle a segment that large, it has a choice: either fragment it or reject it.  TCP explicitly sets an option to say, "do not fragment," so the network hop drops the segment.  It is supposed to return an ICMP error, which the sender's TCP instance will use to reduce its segment size.  This algorithm is known as TCP's "path MTU discovery".

But many network components either do not generate ICMP errors, or do not forward them.  This is supposedly done in the name of "security" (don't get me started).  This breaks path MTU discovery.  But the segments are still being dropped, so eventually the TCP sender times out and the web site doesn't work.  Apparently this is fairly rare, but it does happen.  Hence the  "set MTU" questions.  If the user reduces IP's MTU setting, it artificially reduces the maximum segment size used by TCP.  Do a bit of experimenting to find the right value, et Voila!  (French for "finally, I can download porn!")

So, how could Microsoft possibly not find this during their extensive testing?  Well first of all, UDP use is rare compared to TCP.  Multicast UDP is even more rare.  Sending UDP multicast datagrams larger than MTU is getting close to unicorn rare.  And doing all that with the IP MTU set to a non-standard value?  Heck, I consider myself to be a pretty rigorous tester, and I would never have tried that.


Thanks to Mr. Anonymous for asking the question about NIC offloading.  We had considered the question previously (see my response in the comments), but in composing my response, I got to thinking about the offset of the corruption.

It's always in the second packet of the fragmented datagram, and always at 1278.  But that offset is with respect to the start of the UDP payload.  What is the offset with respect to the start of the second packet?  I didn't look at this before since Wireshark's ability to reassemble fragmented datagrams is so handy.  But I went ahead and clicked the "Frame" tab and saw that the corruption happens at offset 40 from the start of the packet.

Guess where the UDP checksum belongs in an UNfragmented datagram!  Yep, offset 40.  Something decided to take the second packet of the fragmented datagram and insert a separate UDP checksum where it *should* go if it were not a fragment.

This still seems like a software bug in Windows.  Sure, maybe the NIC is doing the actual calculation.  Maybe it's not.  But it only happens when IP is configured for a non-standard MTU.  If I have MTU=1500 and I send a fragmented datagram, there is no corruption.


I did some experimenting with datagram size and verified something that I suspected.  When the MTU is set to 1300, the corruption only happens when the datagram size is such that a 1500-byte MTU would *not* fragment but a 1300-byte MTU does.  I.e. there is a size range of 200 bytes (the difference between 1300 and 1500).   This is another reason Microsoft's testers apparently didn't discover this.  Even if they tested fragmentation with non-standard MTUs, would they think to test a size in that specific range?  With the benefit of hindsight, sure, it's "obvious".  But if you're just testing combinations of configurations, you would just pick the "send fragments" combination, which is probably chosen to fragment with MTU 1500.  (FYI: I've updated the original post to refine the conditions of the bug.)

I'm normally not a Microsoft cheerleader, so it feels weird to be defending them on this bug.  :-)


Since we noticed that the corruption always happens at offset 40 in the second packet, I decreased the size of the datagram to only include half of the corrupted pair.  Sure enough, the last byte of the datagram got corrupted.  An the second corrupted byte?  Who knows.  I kind of hoped it would corrupt something in Windows and maybe blue-screen it, but no such luck.  I didn't "see" any misbehavior.

Does that mean there *was* no misbehavior?  NO!  The outgoing datagrams suddenly had bad checksums!  Meaning that the mdump tool stopped receiving them since Linux discards datagrams with bad checksums.  But tcpdump captures the packets *before* UDP discards them, so you can see the bad checksums.

I kept decreasing the size of the datagram till it was 1273 bytes.  That still triggers fragmentation when MTU=1300.  The outgoing datagrams had no visible corruption but had bad checksums.  Reduce one more byte, and the datagram fits in one packet.  Suddenly the checksums are OK.

I tried a few things, like sending packets hard, and varying their sizes, but other than the bad checksums I could not see any obvious Windows misbehavior.

I guess my days as a white-hat hacker are over before they started.  (Did I get the tenses right on that sentence?)

Well, I think I'm done experimenting.  If anybody else reproduces it, please let me know your Windows version.


I heard back from my contact at MS.  He said:

We've looked into this, and see what is happening.  If the customer needs to pursue this rather than using a work around (e.g. not setting the MTU size on the loopback path to a different size than the non-loopback interface, etc.) they will need to open a support ticket.  Thank you for letting me know about this."
Which I suspect translates to, "We'll fix it in a future version, but not urgently.  If you need urgency, pay for support."  :-)


Finally, Hi Reddit users!  Thanks for pushing the hits on this post to many times the total hit count for the whole rest of the blog.  :-)  I read the comments and saw that my first update had already been noticed by somebody else.

Also, something a lot of Reddit comments have fixated on is my claim that UDP multicast is rare.  I meant that the number of programs (and programmers) that use it is very small compared to all software, not that multicast is hardly ever used.  As pointed out, there are several areas of network infrastructure which are multicast-based, so it gets used all the time.  My point is that the number of programmer-hours spent *writing and testing* multicast-based software is very small compared to the overall networking software field.  And as such, it tends not to be as burned-in as, say, TCP.

Also, in most multicast software that I have learned the guts of, the programmer makes sure that datagram sizes are kept small so as to avoid fragmentation.  This seems to be due to the commonly-held idea that you should *never* let IP fragment, which I think comes from the fact that, at least historically, router performance is hurt if it has to perform fragmentation while a datagram is in transit.  I'm not sure if this is still true for modern routers, but historically fragmentation needed to be handled by the supervisory processor.  For the odd packet every now and then, no problem.  For high-rate data flows, it can kill a router.

That seems to be the basis on which a lot of multicast software avoids fragmentation, preferring instead to split large messages into multiple datagrams.  But this reasoning is often not applicable.  Our software intended primarily to be used within a single data center.  When we send a 2K datagram, no router needs to worry about fragmenting it; the sending host's IP stack splits the datagram into packets before they hit the wire.  The intermediate switches and routers all have 1500 MTU, allowing the packets to traverse unmolested.  The final receiving host(s) reassemble and pass the datagram to user space.  This has a noticeable advantage for high-performance applications since the same amount of user data is passed with fewer system calls (the overhead of switching between user and kernel space is significant).

So while I'm sure our software is not alone in sending fragmented multicast datagrams, I stand behind my claim that sending fragmented multicast is relatively rare.


Trampek said...

VS 2005? Seriously?
Did you try to reproduce it using different machine, network card/network driver, VS etc?
It is always mind-blowing when you get such an error. Things designed, developed, deployed, and tested many years ago still have bugs which pop up from time to time and it is really hard to trace them. Kudos guys!

Steve Ford said...

The customer didn't tell us much about his machine; don't even now the OS version, although most of our customers are on 7. Most customers also install high-performance NICs.

Our reproduction machine is a bare-bones with minimal crapware installed. Mother board's NIC. PuTTY, Chrome, mtools, not much else. Didn't even install VS (did the build on a different machine).

But no, I didn't try it on a variety of Windows versions. We spent almost 2 weeks tracking this bugger down, it's time to move on. :-)

Our product software running on the customer's machine bears no resemblance whatsoever to the minimal reproduction tool. Even uses sockets differently (non-blocking, Windows completion ports).

Anonymous said...

Could it be a checksum offloading issue in the NIC driver? Might be worth manually disabling that feature to see if it still happens?

Steve Ford said...

Regarding checksum offloading, tried it. It *is* an interesting theory since the corruption is always exactly two bytes at the same offset. Kinda smells like a checksum, doesn't it?

Windows delegates configuration of "offloading" to the GUI associated with the NIC driver, accessed via interface "properties->configure". Remember that this NIC is the native interface on the machine's mother board, an Intel 82579LM. The configure GUI has no setting associated with checksum offloading. Does that mean the NIC doesn't support it? Or just that the stock configure GUI doesn't let you control it? I believe the latter. When I run wireshark directly on the test machine and capture the packets "on the way out", the UDP checksums are wrong. This is even with MTU=1500 and non-fragmented datagrams. However, when the packet capture is done on a remote machine, it shows the checksums correct. This strongly suggests that the NIC *is* inserting the checksum on the way out.

I will also note that running Wireshark on the test machine does not capture the *second* packet of a fragmented packet! But a remote system is able to capture them. Perhaps the NIC is doing the entire fragmentation operation? And maybe doing it wrong? Again, all I can say is that when the packets are received remotely, the UDP checksum is correct and includes the corrupted bytes. So unless the NIC is doing the checksum twice, it seems unlikely to be the culprit.

metadings said...

I don't know why you are reducing the MTU... I do know that I've set it to 65535; now my machine doesn't split them up, it just moves 64KB per packet instead of 1.5KB...

Steve Ford said...

Hi metadings. If you look in my post under the section, "WHY SET MTU 1300???" you'll see one reason why some people change the MTU.

Also, I think you may be confused what MTU is. It stands for "maximum transmission unit" and represents the maximum sized packet the medium will support. Ethernet is normally 1500, but can be configured to a bit over 9000 (a.k.a. "Jumbo Frames"). I do not know of any common network fabric that supports an MTU of 65535, and neither does Wikipedia. :-)

The UDP protocol *does* support datagrams with 64K bytes of user data. But the IP stack in the operating system splits (fragments - verb) that datagram into MTU-sized chunks (fragments - noun) for transmission. The receiving IP stack reassembles the original 64K datagram and delivers it to the receiving application.

So datagram size is independent of MTU. I earlier made the claim that most software I know the guts of limit their datagrams to fit in one MTU, and gave one reason (not wanting routers to have to do the work of fragmentation). There is another reason why one might avoid large datagrams.

Sending datagrams 64K in length is fine for very reliable networks, or in applications where a moderate rate of lost datagrams is acceptable. Just be aware that whatever the probability of packet loss, large datagrams greatly increase the probability of losing a datagram. A 64K datagram normally requires 45 packets; if any one is lost, the entire datagram is lost. If the probability of losing one packet is 1%, the probability of losing *at least* one of a 45-packete sequence is ... uh ... my statistician wife is out of town today, so I'll just say that it's a LOT higher than 1%. :-)