The Meditative Coder: Linux

Thursday, June 27, 2024

SIGINT(2) vs SIGTERM(15)

This is another of those things that I've always known I should just sit down and learn but never did: what's the difference between SIGINT and SIGTERM? I knew that one of them corresponded to control-c, and the other corresponded to the kill command's default signal, but I always treated them the same, so I never learned which was which.

SIGINT (2) - User interrupt signal, typically sent by typing control-C. The receiving program should stop performing its current operation and return as quickly as appropriate. For programs that maintain some kind of persistent state (e.g. data files), those programs should catch SIGINT and do enough cleanup to maintain consistency of state. For interactive programs, control-C might not exit the program, but instead return to the program's internal command prompt.
SIGTERM (15) - Graceful termination signal. For example, when the OS gracefully shuts down, it will send SIGTERM to all processes. It's also the default signal sent by the "kill" command. It is not considered an emergency and so does not expect the fastest possible exit; rather a program might allow the current operation to complete before exiting, so long as it doesn't take "too long" (whatever that is). Interactive programs should typically NOT return to their internal command prompt and should instead clean up (if necessary) and exit.
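
A minimal Python sketch of that convention, for illustration only. The class and method names (InteractiveApp, run_operation, main_loop) are invented for this example; only the signal module calls are real. SIGINT abandons the current operation and returns to the prompt, while SIGTERM lets the current step finish, then cleans up and exits.

```python
import signal


class InteractiveApp:
    """Toy interactive program (hypothetical) that treats SIGINT and SIGTERM differently."""

    def __init__(self):
        self.interrupted = False   # SIGINT seen: abandon the current operation
        self.terminating = False   # SIGTERM seen: clean up and exit

    def on_sigint(self, signum, frame):
        self.interrupted = True

    def on_sigterm(self, signum, frame):
        self.terminating = True

    def install_handlers(self):
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGINT, self.on_sigint)
        signal.signal(signal.SIGTERM, self.on_sigterm)

    def run_operation(self, steps):
        """Run one operation (a list of callables), checking the flags after each step."""
        self.interrupted = False
        for step in steps:
            step()
            if self.terminating:
                return "terminating"
            if self.interrupted:
                return "interrupted"
        return "done"

    def main_loop(self, operations):
        """Stand-in for the program's internal command prompt."""
        log = []
        for steps in operations:
            outcome = self.run_operation(steps)
            log.append(outcome)
            if outcome == "terminating":
                log.append("cleanup")   # e.g. flush data files, release locks
                break                   # exit instead of prompting again
        return log
```

A real program would call install_handlers() once at startup and do its actual cleanup where the "cleanup" entry is logged.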

This differentiation was developed when the Unix system had many users and a system operator. If the operator initiated a shutdown, the expectation was that interactive programs would NOT just return to the command prompt, but instead would respect the convention of cleaning up and exiting.

However, I've seen that convention not followed by "personal computer" Unix systems, like MacOS. With a personal computer, you have a single user who is also the operator. If you, the user and operator, initiate a shutdown on a Mac, there can be interactive programs that will pause the shutdown and ask the user whether to save their work. It still represents a difference in behavior between SIGINT and SIGTERM - SIGINT returns to normal operation while SIGTERM usually brings up a separate dialogue box warning the user of data loss - but the old expectation of always exiting is no longer universal.

Friday, July 9, 2021

More Perl "grep" performance

In an earlier post, I discovered that a simple Perl program can outperform grep by about double. Today I discovered that some patterns can cause the execution time to balloon tremendously.

I have a new big log file, this time with about 70 million lines. I'm running it on my newly-updated Mac, whose "time" command has slightly different output.

Let's start with this:

time grep 'asdf' cetasfit05.txt

... 39.388 total

time grep.pl 'asdf' cetasfit05.txt

... 21.388 total

About twice as fast.

Now let's change the pattern:

time grep 'XBT|XBM' cetasfit05.txt

... 24.787 total

time grep.pl 'XBT|XBM' cetasfit05.txt

... 18.940 total

Still faster, but nowhere near twice as fast. I don't know why.

Now let's add an anchor:

time grep '^XBT|^XBM' cetasfit05.txt

... 25.580 total

time grep.pl '^XBT|^XBM' cetasfit05.txt

... 3:08.25 total

WHOA! Perl, what happened????? 3 MINUTES???

My only explanation is that Perl tries to implement a very general regular expression algorithm, and grep implements a subset, and that might cause Perl to be slow in some circumstances. For example, maybe the use of alternation with anchors introduces the need for "backtracking" under some circumstances, and maybe grep doesn't support backtracking. In this simple example, backtracking is probably not necessary, but to be general, Perl might do it "just in case". (Note: I'm not a regular expression expert, and don't really know when "backtracking" is needed; I'm speculating without bothering to learn about it.)

Anyway, let's make a small adjustment:

time grep.pl '^(XBT|XBM)' cetasfit05.txt

... 17.910 total

There, that got back to "normal".

I guess multiple anchors in a pattern is a bad idea.
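
For what it's worth, the two patterns are logically equivalent, which a quick check can confirm. This sketch uses Python's re module rather than Perl, so it says nothing about Perl's timing; the sample lines and the grep() helper are invented for the example.

```python
import re

two_anchors = re.compile(r"^XBT|^XBM")   # the form that was slow in the Perl test
one_anchor = re.compile(r"^(XBT|XBM)")   # the form that was fast


def grep(pattern, lines):
    """Return the lines where the pattern matches somewhere, as grep would."""
    return [line for line in lines if pattern.search(line)]


# Invented sample input covering matches, near misses and an empty line.
sample = ["XBT first", "XBM second", " XBT indented", "foo XBM", "XB", ""]
same = grep(two_anchors, sample) == grep(one_anchor, sample)
```

Both forms select only the lines that begin with XBT or XBM, so switching to the single-anchor form should not change which lines are reported.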

P.S. - even though this post is about Perl, I tried one more test with grep:

time grep 'ASDF' cetasfit05.txt

... 26.132 total

Whaaa...? I tried multiple times, and lower-case 'asdf' always takes about 40 seconds, and upper-case 'ASDF' always takes about 27 seconds. I DON'T UNDERSTAND COMPUTERS!!! (sob)

Sunday, November 29, 2020

Using sed "in place" (gnu vs bsd)

I'm not crazy after all!

Well, ok, I guess figuring out a difference in "sed" between gnu and bsd is not a sign of sanity.

(TL;DR this works on both: sed -i.bak -e "s/x/y/" x.txt)

I use sed a fair amount in my shell scripts. Recently, I've been using "-i" a lot to edit files "in-place". The "-i" option takes a value that is interpreted as a file name suffix to save the pre-edited form of the file. You know, in case you mess up your sed commands, you can get back your original file.

But for some scripts, the file being edited is itself generated, so there is no need to save a backup. So, just pass a null string in as the suffix. Right?

[ update: useful page: https://riptutorial.com/sed/topic/9436/bsd-macos-sed-vs--gnu-sed-vs--the-posix-sed-specification ]

BSD SED (FreeBSD and Mac)

$ echo "x" >x.txt
$ sed -i '' -e "s/x/y/" x.txt
$ cat x.txt
y
$ ls
x.txt

Looks good. Let's try Linux.

GNU SED (Linux and Cygwin)

$ echo "x" >x.txt
$ sed -i '' -e "s/x/y/" x.txt
sed: can't read : No such file or directory
$ cat x.txt
y
$ ls
x.txt

Hmm ... that's odd. It "worked", which is to say the file was properly edited. But what's with that "no such file" error? Man page to the rescue:

$ man sed
...
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)

Interesting, you can omit the suffix. And you mustn't supply a space between the "-i" and the suffix; it thinks you've omitted it and treats the empty string as an input file. Here's an example with a non-empty suffix:

$ echo "x" >x.txt
$ sed -i .bak -e "s/x/y/" x.txt
sed: can't read .bak: No such file or directory
$ cat x.txt
y
$ ls
x.txt

See? With the space, it thinks ".bak" is an input file. But we don't want a backup file, so let's try just omitting the suffix, like the man page says.

$ echo "x" >x.txt
$ sed -i -e "s/x/y/" x.txt
$ cat x.txt
y
$ ls
x.txt

Works. Let's try it on BSD.

BSD SED (FreeBSD and Mac)

$ echo "x" >x.txt
$ sed -i -e "s/x/y/" x.txt
$ cat x.txt
y
$ ls
x.txt x.txt-e

Wait, what? Again, the file was edited properly, but what's with that file "x.txt-e"? Oh, BSD sed doesn't support a missing suffix. You can supply an empty one, but you can't just omit it. So sed looked at the above command line and thought "-e" was my desired suffix. And the "-e" option is optional in front of an in-line sed program.

ARGH!

I use both Mac and Linux, and want scripts that work on both!

THE SOLUTION

There is no portable way to tell both seds that you want in-place editing but don't want a backup suffix. So just go ahead and always generate a backup file. And remember, GNU doesn't like a space after the "-i". This works on both:

$ echo "x" >x.txt
$ sed -i.bak -e "s/x/y/" x.txt
$ cat x.txt
y
$ ls
x.txt x.txt.bak

Works on Mac and Linux. Just delete the .bak file.
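
If the editing is driven from a script, the "always make a backup, then delete it" approach can be wrapped up. Here is a sketch in Python; the function names are invented, and it assumes a sed on the PATH and that a file named path + ".bak" is safe to delete afterwards.

```python
import os
import subprocess


def sed_inplace_argv(script, path, suffix=".bak"):
    """Build a sed command line accepted by both GNU and BSD sed.

    The suffix is attached to -i with no space, and is never empty.
    """
    if not suffix:
        raise ValueError("a non-empty suffix is needed for portability")
    return ["sed", "-i" + suffix, "-e", script, path]


def sed_inplace(script, path, suffix=".bak"):
    """Edit path in place, then delete the backup file that sed left behind."""
    subprocess.run(sed_inplace_argv(script, path, suffix), check=True)
    os.remove(path + suffix)
```

For example, sed_inplace("s/x/y/", "x.txt") runs the same command as the transcript above and then removes x.txt.bak.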

It took a long time to figure all this out, largely because the incorrect usages basically worked. I.e. the intended file did get edited in place, but with undesired side effects. So I didn't notice there was a problem until I really looked at things and saw the error or the "x.txt-e" file.

Corner cases: the bane of programmers everywhere.

Friday, October 30, 2020

Software Sucks

Sorry, I had to say it. Software really does suck.

We just installed a new CentOS, and I wanted to do some apache work. I don't do that kind of thing very often, so I don't just remember how to do it. Thank goodness for search engines!

A quick google for "apache shutdown" led me to https://httpd.apache.org/docs/2.4/stopping.html which tells me to do an "apachectl -k graceful-stop". Cool. Enter that command.

Passing arguments to httpd using apachectl is no longer supported.
You can only start/stop/restart httpd using this script.
If you want to pass extra arguments to httpd, edit the
/etc/sysconfig/httpd config file.

Um ... stopping httpd is exactly what I was trying to do. So I guessed that 2.4 must be old doc. Rather than trying to find new doc, I just entered

apachectl -h

It responded with:

Usage: /usr/sbin/httpd [-D name] [-d directory] [-f file]
                       [-C "directive"] [-c "directive"]
                       [-k start|restart|graceful|graceful-stop|stop]
                       [-v] [-V] [-h] [-l] [-L] [-t] [-T] [-S] [-X]
Options:
...

There's the "-k graceful-stop" all right. What's the problem? Well, except of course, for the stupid fact that the Usage line claims the command is "httpd", not "apachectl". Some newbie must have written the help screen for apachectl.

Another search for "Passing arguments to httpd using apachectl is no longer supported" wasn't very helpful either, but did suggest "man apachectl". Which says:

When acting in pass-through mode, apachectl can take all the arguments available for the httpd binary.
...
When acting in SysV init mode, apachectl takes simple, one-word commands, defined below.
...

How might I know which mode it's working in? Dunno. But a RedHat site gave an example of:

apachectl graceful

which matches the SysV mode. So apparently the right command is "apachectl graceful-stop" without the "-k". Which worked.

So why did "apachectl -h" give bad help? I think it just passed the "-h" to httpd (passthrough), so the help screen was printed by httpd. But shouldn't apachectl have complained about "-h"? GAH!

Software sucks.

Saturday, July 11, 2020

Perl Faster than Grep

So, I've been crawling through a debug log file that is 195 million lines long. I've been using a lot of "grep | wc" to count numbers of various log messages. Here are some timings for my Macbook Pro:

$ time cat dbglog.txt >/dev/null
real 0m35.423s

$ time wc dbglog.txt
195177935 1177117603 28533284864 dbglog.txt
real 1m44.560s

$ time egrep '999999' dbglog.txt
real 7m39.737s

(For this timing, I chose a pattern that would *NOT* be found.)

On the Macbook, the man page for fgrep claims that it is faster than grep. Let's see:

$ time fgrep '999999' dbglog.txt

real 7m11.365s

Well, I guess it's a little faster, but nothing to brag about.

Then I wanted to create a histogram of some findings, so I wrote a perl script to scan the file and create the histogram. Since it performed regular expression matching on every line, I assumed it would be a little slower than grep, since Perl is an interpreted language.

$ time ./count.pl dbglog.txt >count.out

real 3m9.427s

WOW! Less than half the time!

So I created a simple grep replacement: grep.pl. It doesn't do any histogramming, so it should be even faster.

$ time grep.pl '999999' dbglog.txt

real 2m8.341s

Amazing. Perl grep runs in less than a third the time of grep.

For small files, I bet Perl grep is slower starting up. Let's see.

$ time echo "hi" | grep 9999

real 0m0.051s

$ time echo "hi" | grep.pl 9999

real 0m0.113s

Yep. Grep saves you about 60 milliseconds. So if you had thousands of small files to grep, it might be faster to use grep.

See https://github.com/fordsfords/grep.pl

UPDATE:

I got another big log file today (70 million lines) and saw something pretty surprising given my initial findings.

See https://blog.geeky-boy.com/2021/07/more-perl-grep-performance.html

Thursday, January 24, 2019

Volatile considered harmful

I happened on this today. The article is narrowly-focused on Linux kernel work, but in my mind it helps to clarify a lot of "volatile" debate I've seen over the years.

https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

I will note that when Corbet (the author) says, "the 'volatile' type class should not be used", what he really means is that you should not declare variables with volatile (or rather, almost never). Corbet says, "the kernel primitives which make concurrent access to data safe ... If they are used properly, there will be no need to use volatile as well." Some of those kernel primitives use volatile, but not in variable declarations. Instead they use volatile in carefully-selected casts.

For example, in another article, Corbet describes another kernel primitive, "ACCESS_ONCE()". It is defined as:

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

The variable being accessed is temporarily cast to volatile to allow code to be written that violates threading assumptions made by the compiler's optimizer. Like here, for example. :-)

I only bring this up to point out that Corbet is not arguing that volatile should never be used at all. Rather he is arguing that programmers (specifically kernel programmers) should not declare variables to be volatile. Generally, programmers should use threading primitives to ensure correct code, and if the code's requirements prevent the use of the usual threading primitives, then lower-level primitives (like "ACCESS_ONCE()") should be used to precisely target volatile's use.

Monday, May 22, 2017

Some multicast programming tips

Never too old to learn. :-)

There are lots of multicast example programs out there, so I won't try to compete with them. But I did run across several things that weren't explained very well.

Single Socket, Multiple Groups

Yes, you can create a single socket and have it receive datagrams from multiple multicast groups. Just include multiple calls to:
setsockopt(recv_sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, ...

Multiple Sockets, One Group per Socket

This is another common use case, where you create multiple sockets for receiving, with each socket joined to a different multicast group.

Binding the Receive Socket

Since a socket needs to be bound to a port to receive any kind of UDP datagram, multicast or unicast, you need to include a call to bind(). You pass in a sockaddr_in with the sin_port set as desired (remember to pass it in network order). But what about the sin_addr? What do you set that to?

Many people set it to INADDR_ANY, which is what I did in a recent program. But in the multiple sockets, different group per socket case, it had an unexpected side effect. All of my sockets were bound to the same destination port, but joined to different multicast groups. With sin_addr set to INADDR_ANY, the kernel took each received datagram, replicated it, and delivered a copy to *every* socket, even if the datagram's destination group was different from the one joined to the socket! I.e. simply doing the IP_ADD_MEMBERSHIP on a socket didn't filter datagrams based on the desired group. When a multicast datagram was received, the kernel just used the destination port and delivered a copy to every UDP socket bound to that port and INADDR_ANY.

I had to do some extra searching to find out that you can set the bind's sin_addr to the multicast group. I have some reason to suspect that this is not portable across all operating systems, but at least it works on Linux. Now I can have 10 sockets, each bound to the same port (don't forget SO_REUSEADDR) but different multicast groups. When a multicast datagram is received, it is delivered *only* to the socket which is bound to the right port/multicast group pair.
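
Here is a sketch of the one-group-per-socket setup, written in Python rather than C for brevity. The function names are invented for this example; as noted above, binding to the group address works on Linux and may not be portable.

```python
import socket
import struct


def build_mreq(group, interface="0.0.0.0"):
    """Pack a struct ip_mreq: the group address followed by the local interface address."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def open_group_socket(group, port):
    """Create a UDP socket bound to (group, port) and joined to that group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        # Needed so that several sockets can share the same port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Binding to the group address instead of INADDR_ANY is what keeps
        # datagrams for other groups on the same port away from this socket.
        sock.bind((group, port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, build_mreq(group))
    except OSError:
        sock.close()
        raise
    return sock
```

With made-up group addresses and port, the ten-socket case would look something like [open_group_socket("239.101.3.%d" % i, 12000) for i in range(1, 11)].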

Single Socket, Multiple Groups, reprise

So, what about the case where you have a single socket joined to multiple groups? In that case, you *do* want to use INADDR_ANY in the bind.

Mix and Match?

I guess this poses a restriction. You can't have, say, 2 sockets that you distribute 4 multicast groups across, with two groups each. Why would you want to do that? Maybe to load-balance across threads. But assuming they all want to bind to the same port, you can't do it. Setting the sin_addr to INADDR_ANY prevents filtering, and will mean that both sockets will receive a copy of every datagram sent. But you can't set sin_addr to multiple multicast groups.

So if you want to have multiple sockets, multiple groups, and the same destination port, you need to have one group per socket, and bind that socket to the group.

Friday, November 18, 2016

Linux network stack tuning

Found a nice blog post that talks about tuning the Linux network stack:

http://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/

I notice that it doesn't talk about pinning interrupts to the NUMA zone that the NIC is physically connected to (or at least "NUMA" doesn't appear in the post), so it doesn't have *everything*. :-)

And it also doesn't mention kernel bypass libraries, like OpenOnload.

But it has a lot of stuff in it.

Friday, February 7, 2014

Configure a Cron Job with a Wiki

I have some periodic cron jobs that need extra configuration. For example, one of them generates a report on bug statistics on a code branch basis. So I need to tell it which code branches to process. I could just put the list of branch tags on the command line of the report generator, and just use "crontab -e" while logged in to modify it. However, I want anybody to be able to maintain the list, without having to know my password or the syntax for crontab.

It turns out that we installed Mediawiki locally for our own internal wiki. So I created a wiki page with a table listing the code branches that are active. Then I wrote a script which uses "curl" to fetch that wiki page and parse out the branches. This gives me a nice web-based GUI interface to the tool that everybody is already familiar with. Everybody here knows how to use Wikipedia, so anybody can go in and change the list of branches.

After doing some additional development, I wanted to be able to include additional configuration for the cron job which I didn't particularly want displayed on the wiki page. You can use HTML comments (<!-- like this -->) with Mediawiki and they won't be displayed on the page. Unfortunately, Mediawiki completely withholds the comment from the html of the page. I.e. you can't see it even if you display the source of the page. You only see the comment when you edit the page.

So here's what I ended up with:

#!/bin/sh
# all.sh
# Generate QA report for all active releases.  Runs via cron nightly.

date

# Read wiki page and select rows in the release table ("|" in col 1).
# Uses "action=edit" so that HTML comments are included (for option processing).
curl 'http://localwiki/index.php?title=page_title&action=edit' | egrep "^\|" >all.list

# Read the contents of all.list, a line at a time.  Each line is an entry in the table of active releases.
while read ILINE; do :
    # Extract target milestone (link text of web-based report page).
    TARGET_MILESTONE=`echo "$ILINE" | sed -n 's/^.*_report.html \(.*\)\].*$/\1/p'`

    # Extract the (optional) set of command-line options.
    OPTS=`echo "$ILINE" | sed -n 's/^.*--OPTS:\([^:]*\):.*$/\1/p'`

    eval ./qa_report $OPTS \"$TARGET_MILESTONE\" 2>&1
done <all.list

The "sed" commands use "-n" to suppress printing of lines to stdout. Adding a "p" suffix to the end of a sed command forces a print, if the command is successful. So, for example, in the line:
OPTS=`echo "$ILINE" | sed -n 's/^.*--OPTS:\([^:]*\):.*$/\1/p'`
if the contents of $ILINE do not match the pattern (i.e. the line does not have an option string), the "s" command is not successful and therefore doesn't print, leaving OPTS empty.
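
The same two extractions can be sketched with Python regular expressions, which makes the "empty when there is no match" behavior easy to see. The table rows below are invented (the real wiki markup isn't shown here), and parse_row is a hypothetical helper. Unlike the sed version, the dot in "_report.html" is escaped, and search() picks the first occurrence rather than the last; for rows with one link and one option string that makes no difference.

```python
import re

# Counterparts of the two sed patterns (dot escaped here; sed treats it as a wildcard).
MILESTONE_RE = re.compile(r"_report\.html (.*)\]")
OPTS_RE = re.compile(r"--OPTS:([^:]*):")


def parse_row(line):
    """Return (target_milestone, opts); either is '' when its pattern doesn't match."""
    milestone = MILESTONE_RE.search(line)
    opts = OPTS_RE.search(line)
    return (milestone.group(1) if milestone else "",
            opts.group(1) if opts else "")


# Hypothetical rows: one with an option string hidden in a comment, one without.
row_with_opts = '| [http://localwiki/rel1_report.html Release 1.0] <!--OPTS:-a "b c":-->'
row_plain = '| [http://localwiki/rel2_report.html Release 2.0]'
```

Here parse_row(row_plain) yields an empty option string, just as "sed -n ... p" prints nothing and leaves OPTS empty.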

One final interesting note: the use of "eval" to run the qa_report script. Why couldn't you just use this?
./qa_report $OPTS "$TARGET_MILESTONE" 2>&1

Let's say that $TARGET_MILESTONE is "milestone" and the contents of $OPTS is:

    -a "b c" -d "e f"

If you omit the "eval", you would expect the resulting command line to be:

./qa_report -a "b c" -d "e f" "milestone" 2>&1

I.e. the qa_report tool will see "b c" as the value for the "-a" option, and "e f" as the value for the "-d" option. But the shell doesn't work this way. The line:

./qa_report $OPTS "$TARGET_MILESTONE" 2>&1

will expand $OPTS, but it won't group "b c" as a single entity for -a. Without "eval", the "-a" option will only see the two-character value "b (with the quote mark). I found a good explanation for this; the short version is that the shell does quote processing before it does symbol expansion. So essentially, the thing you need to do is have the shell parse the command line twice.

The "eval" form of the command works like this:
eval ./qa_report $OPTS \"$TARGET_MILESTONE\" 2>&1
First the shell looks at this command line and parses it with "eval" as the command and the rest as its parameters. It does the symbol substitution. Thus, the thing that gets passed to "eval" is:
./qa_report -a "b c" -d "e f" "milestone"
What does the eval command do with that? It passes it to the shell for parsing! In this pass, "./qa_report" is the command, and the rest are the parameters. Since the shell is parsing it from scratch, it will group "b c" as a single entity, letting the "-a" option pick it up as a single string.
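
The difference between the two parses can be imitated in Python. This is only an approximation of the shell: str.split() stands in for the whitespace splitting applied to an unquoted $OPTS (assuming the default IFS and no wildcard characters), and shlex.split() stands in for the full re-parse that "eval" triggers.

```python
import shlex

opts = '-a "b c" -d "e f"'

# Without eval: the expanded text is only split on whitespace, so the
# quote characters stay in the words as ordinary characters.
without_eval = ["./qa_report"] + opts.split() + ["milestone"]

# With eval: the expanded text is parsed again from scratch, so the
# quotes group "b c" and "e f" into single arguments.
with_eval = shlex.split('./qa_report ' + opts + ' "milestone"')
```

In the first list the "-a" option is followed by the two-character word "b (quote included); in the second it is followed by the single argument b c.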

Monday, February 3, 2014

Syns, Syn Cookies, TCP Listen Backlog: More Complicated than You Think

No, syncookies don't have anything to do with dieting. But they did come up as I learned that the TCP listen backlog is more complicated than I thought. This article should help those of you trying to support TCP servers with lots of clients, especially if large numbers of clients can try to connect at the same time. (For example, a popular web server.) This is Linux-oriented; I'm not sure how applicable the info is for other OSes.

---

Here is an article which talks about the TCP listen backlog. Here are some quotes:

The backlog has an effect on the maximum rate at which a server can accept new TCP connections on a socket. ... Many systems (particularly BSD-derived or influenced) silently truncate this value (the backlog parameter to the listen() system call) to 5 — version 1.2.13 of the Linux kernel [really old - SF] does this ... Using small values for the listen backlog was one of the major causes of poor web server performance with many operating systems up until recently. ... The backlog parameter is silently truncated to SOMAXCONN ... defined as 128 in /usr/src/linux/socket.h for 2.x kernels.

---

Here is a brilliant writeup that taught me about "syncookies", and how they can lead to hung clients. Basically, if the listen backlog (a.k.a. the SYN queue) fills up and more client connection requests (SYNs) come in, the server will *act* like it is accepting them by responding with syncookies. But the kernel won't actually set up state for those connections or inform the app of the new connection. Instead, the server waits for the client to respond with the ACK (the third step of the 3-way handshake). That ACK contains enough information for the server to reconstruct the initial SYN, and the kernel proceeds to open the connection as normal. HOWEVER, if the client's ACK gets lost in the switch or the NIC or whatever, then the client will be left thinking the connection was accepted and is ready, and the server will have no memory of it.

This leads to a genuine hang if the application protocol depends on the server sending the first message, like SMTP or MySQL. In these cases, the client app will hang forever waiting for the server to send its message.

---

Here is an article which gives advice on how to set up systems that can accept lots of TCP connections. Here's a quote:

/proc/sys/fs/file-max: The maximum number of concurrently open files. We recommend a limit of at least 32,832.
/proc/sys/net/ipv4/tcp_max_syn_backlog: Maximum number of remembered connection requests, which are still did not receive an acknowledgment from connecting client. The default value is 1024 for systems with more than 128Mb of memory, and 128 for low memory machines. If server suffers of overload, try to increase this number.
/proc/sys/net/core/somaxconn: Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to 128. The value should be raised substantially to support bursts of request. For example, to support a burst of 1024 requests, set somaxconn to 1024.

Here are some commands I entered on our host Saturn:

   sford@Saturn$ cat /proc/sys/net/core/somaxconn
   128
   sford@Saturn$ cat /proc/sys/net/ipv4/tcp_max_syn_backlog
   2048
   sford@Saturn$ cat /proc/sys/fs/file-max
   3263962
   sford@Saturn$

Looks like the main thing we need to do is increase somaxconn, and maybe tcp_max_syn_backlog as well.
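
A small Python sketch of the same check, plus the truncation rule from the quotes above. The helper names are invented; the /proc paths are the ones listed in the quoted article.

```python
TUNABLES = {
    "somaxconn": "/proc/sys/net/core/somaxconn",
    "tcp_max_syn_backlog": "/proc/sys/net/ipv4/tcp_max_syn_backlog",
    "file-max": "/proc/sys/fs/file-max",
}


def read_int(path):
    """Return the integer stored in a /proc tunable, or None if it can't be read."""
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def report():
    """Current value of each tunable (None where unavailable, e.g. not on Linux)."""
    return {name: read_int(path) for name, path in TUNABLES.items()}


def effective_backlog(requested, somaxconn):
    """Backlog that listen() ends up with once it is silently capped at somaxconn."""
    return min(requested, somaxconn)
```

With the value shown above, effective_backlog(1024, 128) comes out as 128, which is why asking listen() for a bigger backlog does nothing until somaxconn is raised.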

---

One small concern. I saw various references to SOMAXCONN as being a constant in a system include file. It apparently lives in different places, depending on the OS flavor/version; I found it here:

   sford@Saturn$ find /usr/include | xargs egrep SOMAXCONN
   /usr/include/bits/socket.h:#define SOMAXCONN 128

So now the question becomes, if we update the tuning parameter, do we also have to modify the include file? My gut says no. I'm thinking that maybe if you built the kernel from source, you would perhaps change the default via that include, but on a running system you simply override that default and you can magically use larger numbers in the listen() call.

Socket buffers: more complicated than you think

If you are receiving UDP datagrams (multicast or unicast, no difference), how much socket buffer does a datagram consume? I.e. how many datagrams of a particular size can you fit in a socket buffer configured for a given size?

Well ... it's complicated.

I've tried some experiments on two of our Linux systems, and encountered some surprises. Note that my experiments were performed with modified versions of the msend and mdump tools, i.e. simple UDP with no higher-level protocol on top of it. (See my Github project for my modified versions.) The modified mdump command sets up the socket, prints a prompt, and waits for the user to hit return before entering the receive loop. I had msend sending 500 messages with 10 ms between sends (nice and slow so as not to overrun the NIC). Since the mdump is not yet in its receive loop, the datagrams are stored in the socket buffer. When the send finishes, I hit return on mdump, which enters the receive loop and empties the socket buffer, collecting statistics. Then I hit control-c on mdump, and it reports the number of messages and bytes received. Finally, I did experiments on both unicast and multicast; the results are the same.

Here are some results for a two-system test, sending from host "orion", receiving on host "saturn". The message sizes and bytes received shown are for UDP payload. Receive socket buffer configured for 100,000 bytes. Note that 1472 is the largest UDP payload which can be sent in a single ethernet frame (i.e. no IP fragmentation).

message size	messages received	bytes received
1472	61	89792
215	61	13115
214	157	33598
1	157	157

Interesting. The number of messages seems to not depend on message size, except for a discontinuity at 215 bytes. I checked a lot of other message sizes, and they all follow the pattern: 61 messages for sizes >= 215, 157 messages for sizes <= 214.

Now let's double the receiver socket to 200,000 bytes:

message size	messages received	bytes received
1472	121	178112
215	121	26015
214	313	66982
1	313	313

The messages received are approximately doubled, with the discontinuity at the exact same message size. Cutting the original socket buffer in half to 50,000 approximately cuts the message counts in half, with the discontinuity at the same place (I won't bother including the table).

Now let's switch the roles: send from saturn, receive on orion. Socket buffer back to 100,000 bytes.

message size	messages received	bytes received
1472	77	113344
215	77	16555
214	363	77682
1	363	363

The discontinuity is at the same place, but different numbers of messages are received. The Linux kernel versions are very close to the same - Saturn is 2.6.32-358.6.1.el6.x86_64 and orion is 2.6.32-431.1.2.0.1.el6.x86_64. Both systems have 32 gig of memory and are using Intel 82576 NICs. Saturn has 2 physical CPUs with 6 cores each, and orion has 2 physical CPUs with 4 cores each and hyperthreading turned on. I don't know why they hold different numbers of messages in the same-sized socket buffer.

These machines also have 10G Solarflare NICs in them, so let's give that a try. Send from saturn, receive on orion, socket buffer 100,000 bytes.

message size	messages received	bytes received
1472	110	161920
1	110	110

Whoa! That's right - when using the Solarflare card, the socket buffer held more bytes of data than the configured socket buffer size! But this isn't necessarily unexpected; the man page for socket(7) says this about setting the receive socket buffer: "The kernel doubles this value (to allow space for bookkeeping overhead)". Finally, it's interesting that there is no discontinuity - 110 messages, regardless of size.
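
The doubling is easy to observe by asking the kernel what it actually granted. A minimal Python sketch (granted_rcvbuf is an invented name):

```python
import socket


def granted_rcvbuf(requested):
    """Request a receive buffer size on a UDP socket and return what the kernel reports."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    finally:
        sock.close()
```

On Linux, granted_rcvbuf(100000) would be expected to report twice the request, provided the request is within the system's net.core.rmem_max limit. Other operating systems may report something different.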

Let's stick with the Solarflare cards, and go back to orion sending, saturn receiving (still 100,000 byte socket buffer):

message size	messages received	bytes received
1472	87	128064
1	87	87

Fewer messages, but still exceeds 100,000 bytes worth with large messages.

Now let's put both sender and receiver on saturn (loopback), with 100,000 byte socket buffer:

message size	messages received	bytes received
1472	87	128064
582	87	50634
581	157	91217
70	157	10990
69	261	18009
1	261	261

Lookie there! Two discontinuities.

Someday maybe I'll try this on other OSes (our lab has Windows, Linux, Solaris, HP-UX, AIX, FreeBSD, MacOS). Don't hold your breath. :-)

I did try a bit with TCP instead of UDP. It's a little trickier since instead of generating loss, TCP flow controls. And you have to take into account the send-side socket buffer. And I wanted to force small segments (packets), so I set the TCP_NODELAY socket option (to disable Nagle's algorithm). The results were much more what one might expect - the amount buffered depended very little on the segment size. With 1400-byte messages, it buffered 141,400 bytes. With 100-byte messages, it buffered 139,400 bytes. I suspect the reduction is due to more overhead bytes. (I didn't try it with different NICs or different hosts.)

The moral of the story is: the socket buffer won't hold as much UDP data as you think it will, especially when using small messages.

UPDATE: on a colleague's suggestion, I looked at the "recv-Q" values reported by netstat. On Linux, I sent a single UDP datagram with one payload byte. The "recv-Q" value reported was 1280 for an Intel NIC, and 2304 for a Solarflare NIC. When I set the socket buffer to 100,000 bytes and fill it with UDP datagrams, "recv-Q" reports a bit over 200,000 bytes - double the socket buffer size I specified. (Remember that socket(7) says that the kernel doubles the buffer size to allow space for bookkeeping overhead.)
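
The recv-Q figures line up with some of the message counts in the tables above. The arithmetic below is a sketch under three assumptions that the measurements suggest but don't prove: the effective limit is double the configured size, every small datagram is charged the recv-Q amount seen for a one-byte datagram, and a datagram is accepted whenever the queue is still below the limit (hence the rounding up).

```python
import math


def datagrams_that_fit(sockbuf_bytes, charge_per_datagram):
    """Estimate how many datagrams are queued before the socket buffer is full.

    Assumes the kernel doubles the configured size and accepts a datagram
    as long as the bytes already queued are below that doubled limit.
    """
    effective_limit = 2 * sockbuf_bytes
    return math.ceil(effective_limit / charge_per_datagram)


intel_small = datagrams_that_fit(100000, 1280)      # Intel NIC charge from recv-Q
solarflare = datagrams_that_fit(100000, 2304)       # Solarflare NIC charge from recv-Q
intel_small_200k = datagrams_that_fit(200000, 1280)
```

These come out to 157, 87 and 313, which match the small-message counts received on saturn with the Intel NIC (100,000 and 200,000 byte buffers) and the Solarflare count on saturn. The orion numbers don't fit these charges, so whatever is charged per datagram there is presumably different.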

UPDATE2: I'm not the first one to wonder about this. See https://www.unixguide.net/network/socketfaq/5.9 (that info is for BSD, not Linux).
