Monday, February 3, 2014

Socket buffers: more complicated than you think

If you are receiving UDP datagrams (multicast or unicast, no difference), how much socket buffer does a datagram consume?  I.e. how many datagrams of a particular size can you fit in a socket buffer configured for a given size?

Well ... it's complicated.

I've tried some experiments on two of our Linux systems, and encountered some surprises.  Note that my experiments were performed with modified versions of the msend and mdump tools, i.e. simple UDP with no higher-level protocol on top.  The modified mdump command sets up the socket, prints a prompt, and waits for the user to hit return before entering the receive loop.  I had msend send 500 messages with 10 ms between sends (nice and slow so as not to overrun the NIC).  Since mdump is not yet in its receive loop, the datagrams are stored in the socket buffer.  When the send finishes, I hit return on mdump, which enters the receive loop and empties the socket buffer, collecting statistics.  Then I hit control-c on mdump, and it reports the number of messages and bytes received.  Finally, I ran the experiments with both unicast and multicast; the results were the same.

Here are some results for a two-system test, sending from host "orion", receiving on host "saturn".  The message sizes and bytes received shown are for UDP payload.  Receive socket buffer configured for 100,000 bytes.  Note that 1472 is the largest UDP payload which can be sent in a single ethernet frame (i.e. no IP fragmentation).

message size    messages received    bytes received
1472            61                   89792
215             61                   13115
214             157                  33598
1               157                  157

Interesting.  The number of messages seems to not depend on message size, except for a discontinuity at 215 bytes.  I checked a lot of other message sizes, and they all follow the pattern: 61 messages for sizes >= 215, 157 messages for sizes <= 214.


Now let's double the receiver socket to 200,000 bytes:

message size    messages received    bytes received
1472            121                  178112
215             121                  26015
214             313                  66982
1               313                  313

The messages received are approximately doubled, with the discontinuity at the exact same message size.  Cutting the original socket buffer in half to 50,000 approximately cuts the message counts in half, with the discontinuity at the same place (I won't bother including the table).


Now let's switch the roles: send from saturn, receive on orion.  Socket buffer back to 100,000 bytes.

message size    messages received    bytes received
1472            77                   113344
215             77                   16555
214             363                  77682
1               363                  363

The discontinuity is at the same place, but different numbers of messages are received.  The Linux kernel versions are very close - saturn is 2.6.32-358.6.1.el6.x86_64 and orion is 2.6.32-431.1.2.0.1.el6.x86_64.  Both systems have 32 gig of memory and are using Intel 82576 NICs.  Saturn has 2 physical CPUs with 6 cores each, and orion has 2 physical CPUs with 4 cores each and hyperthreading turned on.  I don't know why they hold different numbers of messages in the same-sized socket buffer.


These machines also have 10G Solarflare NICs in them, so let's give that a try.  Send from saturn, receive on orion, socket buffer 100,000 bytes.

message size    messages received    bytes received
1472            110                  161920
1               110                  110

Whoa! That's right - when using the Solarflare card, the socket buffer held more bytes of data than the configured socket buffer size!  But this isn't necessarily unexpected; the man page for socket(7) says this about setting the receive socket buffer: "The kernel doubles this value (to allow space for bookkeeping overhead)". Finally, it's interesting that there is no discontinuity - 110 messages, regardless of size.


Let's stick with the Solarflare cards, and go back to orion sending, saturn receiving (still 100,000 byte socket buffer):

message size    messages received    bytes received
1472            87                   128064
1               87                   87

Fewer messages, but still exceeds 100,000 bytes worth with large messages.


Now let's put both sender and receiver on saturn (loopback), with 100,000 byte socket buffer:

message size    messages received    bytes received
1472            87                   128064
582             87                   50634
581             157                  91217
70              157                  10990
69              261                  18009
1               261                  261

Lookie there! Two discontinuities.


Someday maybe I'll try this on other OSes (our lab has Windows, Linux, Solaris, HP-UX, AIX, FreeBSD, MacOS).  Don't hold your breath.  :-)


I did try a bit with TCP instead of UDP.  It's a little trickier since instead of generating loss, TCP flow-controls.  You also have to take into account the send-side socket buffer.  And I wanted to force small segments (packets), so I set the TCP_NODELAY socket option (to disable Nagle's algorithm). The results were much more what one might expect - the amount buffered depended very little on the segment size. With 1400-byte messages, it buffered 141,400 bytes. With 100-byte messages, it buffered 139,400 bytes. I suspect the reduction is due to more overhead bytes.  (I didn't try it with different NICs or different hosts.)


The moral of the story is: the socket buffer won't hold as much UDP data as you think it will, especially when using small messages.

UPDATE: on a colleague's suggestion, I looked at the "recv-Q" values reported by netstat.  On Linux, I sent a single UDP datagram with one payload byte.  The "recv-Q" value reported was 1280 for an Intel NIC, and 2304 for a Solarflare NIC.  When I set the socket buffer to 100,000 bytes and fill it with UDP datagrams, "recv-Q" reports a bit over 200,000 bytes - double the socket buffer size I specified.  (Remember that socket(7) says that the kernel doubles the buffer size to allow space for bookkeeping overhead.)

UPDATE2: see http://www.unixguide.net/network/socketfaq/5.9.shtml 
