The Meditative Coder: February 2014

Wednesday, February 19, 2014

Type "n" for Safety! or Strncpy() Considered Risky

Everybody knows that "strncpy()" is the "safe" version of "strcpy()", right? No chance of going past the destination buffer, right?

Um ...

Here's a nice code fragment I found in some old code (names changed to protect the innocent):

char foo[MAX_LINE_LEN] = "";
...
case 'F':
strncpy(foo, optarg, sizeof(foo));
foo_str = foo;
foo_len = strlen(foo);

The "n" in "strncpy()" makes sure that if the "optarg" is longer than "foo", it only copies "MAX_LINE_LEN" characters. No buffer overrun. Nice and safe, right?

But look at what the man page for "strncpy()" says:

Warning: If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.

If optarg has "MAX_LINE_LEN" or more non-null characters, "foo" ends up without a null termination! What happens to the "strlen(foo)"? It runs right off the end and sooner or later will almost certainly result in a seg fault. (I heard from a co-worker that we had to fix a lot of instances of this same bug in our code base.)

---

You could fix this bug by adding one line below the "strncpy()":
foo[sizeof(foo)-1] = '\0';
But I don't like this either. It leaves intact a different objection I have with the original code: it silently truncates something that the user might not want truncated. Good tools should not silently do anything different than what the user requested.

So, how about this:
case 'F':
foo[sizeof(foo)-1] = '\0';
strncpy(foo, optarg, sizeof(foo));
if (foo[sizeof(foo)-1] != '\0') {
fprintf(stderr, "Warning message!");
foo[sizeof(foo)-1] = '\0';
}
GAH! I've seen code like this and I don't like it. That code is non-obvious.

Here's a tip for writing good code. How would you verbally describe the basic high-level approach to somebody? Would you say:

First set a null at the end of the buffer. We do this because we want to find out if the strncpy function truncates the string. If so, then that null will be overwritten with the last character copied. So we after copying, we check that last character; if the null was overwritten, then we warn the user that it was truncated, and then we add the final null (thus truncating it even further).

No, you wouldn't. You would say something more like:

If the string fits, copy it. If it doesn't, warn the user and copy in as much as fits, and null-terminate it.

How about this:
case 'F':
if (strlen(optarg) < sizeof(foo))
strcpy(foo, optarg);
else {
fprintf(stderr, "Warning message!");
memcpy(foo, optarg, sizeof(foo)-1);
foo[sizeof(foo)-1] = '\0';
}
...

I like this better; it flows almost exactly as the description does.

And yes, there is an irrational part of me that wants to use "strncpy()" inside the "if" statement, even though it isn't needed. You're just "supposed to". But "strcpy()" IS more efficient, and it is possible to go overboard with the whole "belt and braces" thing. However, I would not object to somebody using "strncpy()" there.

I guess the moral of the story is: just because the function has an "n" in it doesn't mean that people are going to use it safely.

Friday, February 7, 2014

Configure a Cron Job with a Wiki

I have some periodic cron jobs that need extra configuration. For example, one of them generates a report on bug statistics on a code branch basis. So I need to tell it which code branches to process. I could just put the list of branch tags on the command line of the report generator, and just use "crontab -e" while logged in to modify it. However, I want anybody to be able to maintain the list, without having to know my password or the syntax for crontab.

It turns out that we installed Mediawiki locally for our own internal wiki. So I created a wiki page with a table listing the code branches that are active. Then I wrote a script which uses "curl" to fetch that wiki page and parse out the branches. This gives me a nice web-based GUI interface to the tool that everybody is already familiar with. Everybody here knows how to use Wikipedia, so anybody can go in and change the list of branches.

After doing some additional development, I wanted to be able to include additional configuration for the cron job which I didn't particularly want displayed on the wiki page. You can use  with Mediawiki and it won't display it on the page. Unfortunately, it completely withholds the comment from the html of the page. I.e. you can't see it even if you display source of the page. You only see the comment when you edit the page.

So here's what I ended up with:

#!/bin/sh

# all.sh

# Generate QA report for all active releases.  Runs via cron nightly.

date

# Read wiki page and select rows in the release table ("|" in col 1).

# Uses "action=edit" so that  are included (for option processing).

curl 'http://localwiki/index.php?title=page_title&action=edit' | egrep "^\|" >all.list

# read the contents of all.list, line at a time.  Each line is an entry in the table of active releases.

while read ILINE; do :

    # Extract target milestone (link text of web-based report page).

    TARGET_MILESTONE=`echo "$ILINE" | sed -n 's/^.*_report.html \(.*\)\].*$/\1/p'`

    # Extract the (optional) set of command-line options.

    OPTS=`echo "$ILINE" | sed -n 's/^.*--OPTS:\([^:]*\):.*$/\1/p'`

    eval ./qa_report $OPTS \"$TARGET_MILESTONE\" 2>&1

done <all.list

The "sed" commands use "-n" to suppress printing of lines to stdout. Adding a "p" suffix to the end of a sed command forces a print, if the command is successful. So, for example, the line:
OPTS=`echo "$ILINE" | sed -n 's/^.*--OPTS:$[^:]*$:.*$/\1/p'`
If the contents of $ILINE does not contain match the pattern (i.e. does not have an option string), the "s" command is not successful and therefore doesn't print, leaving OPTS empty.

One final interesting note: the use of "eval" to run the qa_report.sh script. Why couldn't you just use this?
./qa_report $OPTS "$TARGET_MILESTONE" 2>&1

Let's say that $TARGET_MILESTONE is "milestone" and the contents of $OPT is:

    -a "b c" -d "e f"

If you omit the "eval", you would expect the resulting command line to be:

./qa_report -a "b c" -d "e f" "milestone" 2>&1

I.e. the qa_report tool will see "b c" as the value for the "-a" option, and "e f" as the value for the "-d" option. But the shell doesn't work this way. The line:

./qa_report $OPTS \"$TARGET_MILESTONE\" 2>&1

will expand $OPTS, but it won't group "b c" as a single entity for -a. Without "eval", the "-a" option will only see the two-character value "b (with the quote mark). I found a good explanation for this; the short version is that the shell does quote processing before it does symbol expansion. So essentially, the thing you need to do is have the shell parse the command line twice.

The "eval" form of the command works like this:
eval ./qa_report $OPTS \"$TARGET_MILESTONE\" 2>&1
First the shell looks at this command line and parses it with "eval" as the command and the rest as "eval"s parameters. It does the symbol substitution. Thus, the thing that gets passed to "eval" is:
./qa_report -a "b c" -d "e f" "milestone"
What does the eval command do with that? It passes it to the shell for parsing! In this pass, "./qa_report" is the command, and the rest are the parameters. Since the shell is parsing it from scratch, it will group "b c" as a single entity, letting the "-a" option pick it up as a single string.

Monday, February 3, 2014

Syns, Syn Cookies, TCP Listen Backlog: More Complicated than You Think

No, syncookies don't have anything to do with dieting. But they did come up as I learned that the TCP listen backlog is more complicated than I thought. This article should help those of you trying to support TCP servers with lots of clients, especially if large numbers of clients can try to connect at the same time. (For example, a popular web server.) This is Linux-oriented; I'm not sure how applicable the info is for other OSes.

---

Here is an article which talks about the TCP listen backlog. Here are some quotes:

The backlog has an effect on the maximum rate at which a server can accept new TCP connections on a socket. ... Many systems (particularly BSD-derived or influenced) silently truncate this value (the backlog parameter to the listen() system call) to 5 — version 1.2.13 of the Linux kernel [really old - SF] does this ... Using small values for the listen backlog was one of the major causes of poor web server performance with many operating systems up until recently. ... The backlog parameter is silently truncated to SOMAXCONN ... defined as 128 in /usr/src/linux/socket.h for 2.x kernels. ---

Here is a brilliant writeup that taught me about "syncookies", and how they can lead to hung clients. Basically, if the listen backlog (a.k.a. the SYN queue) fills up and more client connection requests (SYNs) come in, the server will *act* like it is accepting them by responding with syncookies. But the kernel won't actually set up state for those connections or inform the app of the new connection. Instead, the server waits for the client to respond with the ACK (the third step of the 3-way handshake). That ACK contains enough information for the server to reconstruct the initial SYN, and the kernel proceeds to open the connection as normal. HOWEVER, if the client's ACK gets lost in the switch or the NIC or whatever, then the client will be left thinking the connection was accepted and is ready, and the server will have no memory of it.

This leads to a genuine hang if the application protocol depends on the server sending the first message, like SMTP or MySQL. In these cases, the client app will hang forever waiting for the server to send its message.

---

Here is an article which gives advice on how to set up systems that can accept lots of TCP connections. Here's a quote:

/proc/sys/fs/file-max: The maximum number of concurrently open files. We recommend a limit of at least 32,832.
/proc/sys/net/ipv4/tcp_max_syn_backlog: Maximum number of remembered connection requests, which are still did not receive an acknowledgment from connecting client. The default value is 1024 for systems with more than 128Mb of memory, and 128 for low memory machines. If server suffers of overload, try to increase this number.
/proc/sys/net/core/somaxconn: Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to 128. The value should be raised substantially to support bursts of request. For example, to support a burst of 1024 requests, set somaxconn to 1024.

Here are some commands I entered on our host Saturn:

   sford@Saturn$ cat /proc/sys/net/core/somaxconn
   128
   sford@Saturn$ cat /proc/sys/net/ipv4/tcp_max_syn_backlog
   2048
   sford@Saturn$ cat /proc/sys/fs/file-max
   3263962
   sford@Saturn$

Looks like the main thing we need to do is increase somaxconn, and maybe tcp_max_syn_backlog as well.

---

One small concern. I saw various references to SOMAXCONN as being a constant in a system include file. It apparently lives in different places, depending on the OS flavor/version; I found it here:

   sford@Saturn$ find /usr/include | xargs egrep SOMAXCONN
   /usr/include/bits/socket.h:#define SOMAXCONN 128

So now the question becomes, if we update the tuning parameter, do we also have to modify the include file? My gut says no. I'm thinking that maybe if you built the kernel from source, you would perhaps change the default via that include, but on a running system you simply override that default and you can magically use larger numbers in the listen() call.

Socket buffers: more complicated than you think

If you are receiving UDP datagrams (multicast or unicast, no difference), how much socket buffer does a datagram consume? I.e. how many datagrams of a particular size can you fit in a socket buffer configured for a given size?

Well ... it's complicated.

I've tried some experiments on two of our Linux systems, and encountered some surprises. Note that my experiments were performed with modified versions of the msend and mdump tools, i.e. simple UDP with no higher-level protocol on top of it. (See my Github project for my modified versions.) The modified mdump command sets up the socket, prints a prompt, and waits for the user to hit return before entering the receive loop. I had msend sending 500 messages with 10 ms between sends (nice and slow so as not to overrun the NIC). Since the mdump is not yet in its receive loop, the datagrams are stored in the socket buffer. When the send finishes, I hit return on mdump, which enters the receive loop and empties the socket buffer, collecting statistics. Then I hit control-c on mdump, and it reports the number of messages and bytes received. Finally, I did experiments on both unicast and multicast; the results are the same.

Here are some results for a two-system test, sending from host "orion", receiving on host "saturn". The message sizes and bytes received shown are for UDP payload. Receive socket buffer configured for 100,000 bytes. Note that 1472 is the largest UDP payload which can be sent in a single ethernet frame (i.e. no IP fragmentation).

message size	messages received	bytes received
1472	61	89792
215	61	13115
214	157	33598
1	157	157

Interesting. The number of messages seems to not depend on message size, except for a discontinuity at 215 bytes. I checked a lot of other message sizes, and they all follow the pattern: 61 messages for sizes >= 215, 157 messages for sizes <= 214.

Now let's double the receiver socket to 200,000 bytes:

message size	messages received	bytes received
1472	121	178112
215	121	26015
214	313	66982
1	313	313

The messages received are approximately doubled, with the discontinuity at the exact same message size. Cutting the original socket buffer in half to 50,000 approximately cuts the message counts in half, with the discontinuity at the same place (I won't bother including the table).

Now lets switch the roles: send from saturn, receive on orion. Socket buffer back to 100,000 bytes.

message size	messages received	bytes received
1472	77	113344
215	77	16555
214	363	77682
1	363	363

The discontinuity is at the same place, but different numbers of messages are received. The linux kernel versions are very close to the same - Saturn is 2.6.32-358.6.1.el6.x86_64 and orion is 2.6.32-431.1.2.0.1.el6.x86_64. Both systems have 32 gig of memory and are using Intel 82576 NICs. Saturn has 2 physical CPUs with 6 cores each, and orion has 2 physical CPUs with 4 cores each and hyperthreading turned on. I'm don't know why they hold different numbers of messages in the same-sized socket buffer.

These machines also have 10G Solarflare NICs in them, so let's give that a try. Send from saturn, receive on orion, socket buffer 100,000 bytes.

message size	messages received	bytes received
1472	110	161920
1	110	110

Whoa! That's right - when using the Solarflare card, the socket buffer held more bytes of data than the configured socket buffer size! But this isn't necessarily unexpected; the man page for socket(7) says this about setting the receive socket buffer: "The kernel doubles this value (to allow space for bookkeeping overhead)". Finally, it's interesting that there is no discontinuity - 110 messages, regardless of size.

Let's stick with the Solarflare cards, and go back to orion sending, saturn receiving (still 100,000 byte socket buffer):

message size	messages received	bytes received
1472	87	128064
1	87	87

Fewer messages, but still exceeds 100,000 bytes worth with large messages.

Now let's put both sender and receiver on saturn (loopback), with 100,000 byte socket buffer:

message size	messages received	bytes received
1472	87	128064
582	87	50634
581	157	91217
70	157	10990
69	261	18009
1	261	261

Lookie there! Two discontinuities.

Someday maybe I'll try this on other OSes (our lab has Windows, Linux, Solaris, HP-UX, AIX, FreeBSD, MacOS). Don't hold your breath. :-)

I did try a bit with TCP instead of UDP. It's a little trickier since instead of generating loss, TCP flow controls. And you have to take into account the send-side socket buffer. And I wanted to force small segments (packets), so I set the TCP_NODELAY socket option (to disable Nagle's algorithm). The results were much more what one might expect - the amount buffered depended very little on the segment size. With 1400-byte messages, it buffered 141,400 bytes. With 100-byte messages, it buffered 139,400 messages. I suspect the reduction is due to more overhead bytes. (I didn't try it with different NICs or different hosts.)

The moral of the story is: the socket buffer won't hold as much UDP data as you think it will, especially when using small messages.

UPDATE: on a colleague's suggestion, I looked at the "recv-Q" values reported by netstat. On Linux, I sent a single UDP datagram with one payload byte. The "recv-Q" value reported was 1280 for an Intel NIC, and 2304 for a Solarflare NIC. When I set the socket buffer to 100,000 bytes and fill it with UDP datagrams, "recv-Q" reports a bit over 200,000 bytes - double the socket buffer size I specified. (Remember that socket(7) says that the kernel doubles the buffer size to allow space for bookkeeping overhead.)

UPDATE2:I'm not the first one to wonder about this. See https://www.unixguide.net/network/socketfaq/5.9 (that info is for BSD, not Linux).

The Meditative Coder