Wednesday, September 27, 2017

Is a lot of spam our own damn fault?

I got an unsolicited sales inquiry from a major company the other day.  Each day, 10 to 20 junk emails make it through our spam filter.  Usually, I can delete them after only a second or two, but this one sounded like I might already have a business relationship with them.  I don't want to risk insulting a customer or vendor, so I responded, asking what it was about.  The salesman was honest; he said that he thought somebody with my title would be interested.  I wasn't.  Not even close.

I've been on the Internet since it became easy to get on it.  When did it become acceptable to send blind solicitations?  When did the word "spam" come to mean only Nigerian princes and phishing schemes?  It used to be only desperate, borderline-ethical, fly-by-night companies that sent junk email.  Now it's Box, Oracle, Microsoft, hell, I'm pretty sure my own employer does it!  Why have mainstream companies sunk so low as to send solicitations based on title?

Think back (if you're old enough) 20 years.  There were trade magazines that you could get "for free".  All you had to do was fill out a sheet that indicated in fair detail what your interests were, what industry you worked in, and the kinds of products over which you had purchase influence.  Vendors got very precisely-targeted lists, and we all knew that we would be getting solicitations.  We valued the magazine, so we didn't resent the ads.  Heck, although I don't remember specifically, I suspect I responded positively to one or two solicitations; the advertiser got their money's worth and I got a product that I wanted.

Those magazines don't exist any more, or at least not in my field.  We've all stopped reading the paper versions and instead look to the web for the information we're interested in.  We subscribe to blogs, podcasts, Slashdot, LinkedIn groups, and any number of other curated content providers.  But the Internet evolved from an early non-commercial birth.  Early adopters resented the commercialization of the Internet, and refused to give information about themselves.  We create throw-away email addresses to subscribe.  We want to remain anonymous.  So the information curators never established the model of "you tell me about yourself for marketing purposes, and I'll give you information you want."  Some companies tried to get that going, but the Internet "culture" prevented it from catching on.

So guess what?  I and my fellow-junk-email-haters are suffering from the unintended consequences of our own behavior.  Vendors no longer have precisely-targeted lists available to them.  So they substitute quantity for quality; send a million emails, and you're sure to find some prospects.  It's the new normal.

Idealists like me want a total paradigm change.  We want unsolicited advertisements to go away completely.  Back in the day, if I knew I wanted a C compiler, what did I do?  Open the yellow pages?  Sorry, no entries in the Yellow Pages for C compilers.  No, I *depended* on those trade magazines' advertisers to give me access to vendors of C compilers.  But now that search engines exist, we can do away with outgoing advertisements.  Instead of push marketing, go with pull marketing.  If I want a C compiler, I won't open my "junk" folder to find an unsolicited ad, I'll do a web search.  And this model *does* work!  We put some useful information on our web site, and attracted more than one customer who came for that information and stayed for our product.

And yet, the realist in me knows that human nature is what it is.  Research has proven again and again that advertising works.  I suspect modern email campaigns generate a lot of "unsubscribe me" responses, some of which may be less than polite, but I also suspect that they generate at least some interest.  Cast a wide-enough net and you'll catch some fish.

So if I have an emotional response to junk mail that is out of proportion to its actual cost to me, that's my problem, not the advertisers'.  I guess I need to get over it.

Thursday, September 21, 2017

Solaris Multicast Deafness Bug

Once again, the mighty Dave Zabel (of two different fames) has found another Multicast-related bug, this time in Solaris.  I think that recent versions of Solaris fix it, and I don't have the energy to track down *when* they fixed it, but if you have Solaris servers that you haven't kept updated for a while, you might have this bug.


You'll need Informatica's "mtools" package for Solaris.  These are great tools offered for free in both binary and source form at

And you'll need two hosts: A and B.  Host B should be Solaris 5.10 that hasn't been updated in a long time.  Host A can be anything.

1. On host A, run this:
    msend 12000 15

2. On host B, open two windows.  In the first, enter this:
    mdump 12000

Admire the printouts of the multicast packets for a while.  Isn't technology wonderful?  :-)

3. In a second host B window, enter:
    mdump 12000

Note that the first window continues to print, but the second window is silent.  No surprise; it is listening to a different and unused multicast group!  Of course it is silent.

4. Kill that second mdump.

WHOA!  The first mdump stops printing!  It went deaf to its multicast group!  When trying this same experiment on Linux, or on our Solaris 5.11 machines, it does not go deaf.  But we have several old, non-updated 5.10 machines where the first mdump does go deaf on this step.

5. Enter:
    netstat -g

The OS still thinks it is listening to the multicast group.

6. Enter:
    snoop -P host

The packets are still being received!  But they aren't being delivered to the first mdump.

7. Enter:
    mdump 12000

WHOA!  The first mdump starts printing again!  The second mdump is still silent since there still isn't any traffic on its multicast group.


Maybe PEBKAC?  Or a bug in mtools?

Nope.  Let's start over and try it again with a small change in step 3:

1. On host A, run this:
    msend 12000 15

2. On host B, open two windows.  In the first, enter this:
    mdump 12000

3. In a second host B window, enter:
    mdump 12000

See what I did there?  I changed the 128 to 64.  As before, the first window continues to print, but the second window is silent.

4. Kill that second mdump.

Lookie there!  The first mdump continues to print the messages.  No deafness.


Well, I'm not sure, but I think it's got to be related to multicast group aliasing.  Remember that there are 2**28 different IP multicast groups.  But what about Ethernet?  There are only 2**23 Ethernet multicast MAC addresses allocated for use by IP multicast.  It turns out that the two groups used in the first experiment map to the same Ethernet multicast MAC address: 01-00-5E-00-03-13.

The IGMP protocol doesn't care about that; host B still tells the switch which multicast groups are subscribed, and it treats the two groups as different.  But when the IP layer interfaces with Ethernet, it needs to program the NIC with the same multicast MAC address for those two IP groups.  And apparently older versions of Solaris didn't do the bookkeeping right.
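
To make the aliasing concrete, here is a minimal Python sketch of the IP-to-Ethernet mapping described above: keep the low 23 bits of the group address and put them behind the fixed 01-00-5E prefix.  The function name and the three example groups are my own illustrations (chosen so that two of them land on the MAC address quoted above); they are not necessarily the groups used in the experiment.

```python
import ipaddress

def multicast_mac(group: str) -> str:
    """Return the Ethernet MAC address that an IPv4 multicast group maps to."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF       # only the low 23 bits of the group survive
    mac = 0x01005E000000 | low23       # fixed 01-00-5E prefix, next bit always 0
    return "-".join(f"{b:02X}" for b in mac.to_bytes(6, "big"))

# Hypothetical example groups, for illustration only.
for group in ("224.0.3.19", "239.128.3.19", "239.64.3.19"):
    print(group, "->", multicast_mac(group))
```

The first two groups differ only in the bits that get thrown away, so they share a MAC address.  The third differs in a bit that is kept, so it gets its own.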

I've tried this experiment on other OSes and they all work as you would expect (no deafness).  Our Solaris 5.11 machine does it right.  And even a recently-installed 5.10 system works right.  But older systems that haven't been updated in a while all have this problem.


The obvious moral is to update your systems.

But even then, you should avoid using multicast groups that alias on top of each other.  The whole point of multicast is that you don't receive packets that you aren't interested in.  But if you have traffic published to two groups that alias to the same MAC address, a host subscribing to only one of them will get data for both.  The IP layer will do the right thing (discard the undesired packets), but it still produces an unnecessary load.


Sure.  Watch out for well-known and ad-hoc multicast protocols in the range -  Are any of those in use anywhere on your network?  No?  Are you sure they never will be?

Look at the multicast group we tested with:  That aliases on top of, which is in an ad-hoc1 range labeled "RFE Generic Service".  I don't know what that is (and Google doesn't seem to know either), but I'm thinking I want to avoid aliasing, even if low probability.

You should be fine if you use multicast groups between -

Oh, and update your systems too.  Good hygiene and all that.


We've upgraded one of our "problem servers" to the latest Solaris 5.11 and it fixed the deafness problem.

I'm not interested in figuring out exactly which minor release they fixed it in.

Saturday, July 8, 2017

Most Random Password Generators are Bad

Good for you!  You're taking the advice of experts and clicking "Generate Password", resulting in 10 characters of gibberish.  There!  Now your password will take thousands of years to crack.

Um ... not necessarily.  Try a few days.


When using the pseudo-random number generator supplied by most language libraries, the entropy of the resulting password is limited to 32 bits!

Let's take XKCD's algorithm: ~2000 word dictionary, randomly select 4 words, produces 2000**4 different possible passwords, which is 16 trillion.  Log base 2 gives 43.9 bits of entropy.

But using a pseudo-random number generator with a 32-bit initial seed means that it will only generate 2**32 different sequences, or 4 billion.  That's .027% of the total!  In other words, 99.97% of the possible XKCD-style passwords CANNOT BE GENERATED by that program!

Normally, you can add more bits of entropy by either expanding the dictionary size (the number of words to choose from), or increasing the number of words in the password.  But because of the pseudo-random number generator, you are STUCK at 32 bits of entropy.  An attacker could even pre-generate the 4 billion possible XKCD-style passwords that a standard Linux rand() produces.

My point is not that 32 bits of entropy aren't enough, it's that you aren't necessarily going to get what you think you're getting if you use the stock pseudo-random number generator.

So if you're running somebody's application and you click "generate random password" and you see a string of gibberish that claims to have a crack time of thousands of years, it is probably wrong.  32 bits of entropy at 1000 guesses per second has a brute force crack time of under 50 days.  (And modern crackers go MUCH faster than 1000/sec.)
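
The arithmetic above is easy to check.  This is a small Python sketch, not part of my password program; the constant names are made up for the example, and the numbers (2000 words, 4 words per password, 32-bit seed, 1000 guesses per second) are the ones used in the text.

```python
import math

DICT_SIZE = 2000         # approximate XKCD dictionary size
WORDS = 4                # words per password
SEED_BITS = 32           # seed width assumed for the stock PRNG
GUESSES_PER_SEC = 1000   # a deliberately slow attacker

def seed_limit_report():
    """Compare the full password space with what a 32-bit seed can reach."""
    total = DICT_SIZE ** WORDS        # every possible 4-word password
    reachable = 2 ** SEED_BITS        # one password sequence per seed value
    return {
        "password_bits": math.log2(total),
        "reachable_percent": 100 * reachable / total,
        # Time to try every seed, i.e. the worst case for the attacker.
        "crack_days": reachable / GUESSES_PER_SEC / 86400,
    }

for name, value in seed_limit_report().items():
    print(f"{name}: {value:.3f}")
```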


For my program, I offer the "-r" option, which reaches out to an external random-number service to get random numbers.  It doesn't need very many -- you only need 4 random numbers to generate an XKCD-style password -- but the important feature is that the service is truly random.  There is no seed.  Each number is uniformly random and independent from the previous number.  (Or at least, so claims the owner of the service.)  I'm pretty sure this removes any artificial limit on entropy, so you can get as much entropy as you want by increasing the dictionary size and/or the number of words in the password.

Using an external service is not the only way to solve this problem.  Have you ever generated an SSL certificate?  It can take several seconds while the software "generates" enough entropy for long key lengths.  I'm not personally familiar with how that is done, but I've heard that the OS uses external physical events, like keystrokes, network interrupts, etc.  I think I've heard that it also uses disk interrupts, which makes me wonder if SSD drives make it harder for kernels to generate entropy.

If you're going to be demanding a lot of entropy for your application, you should not abuse an external service.  Instead, learn how to use locally-generated entropy.
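
As one example of locally-generated entropy, here is a minimal Python sketch that picks words using the standard library's `secrets` module, which draws from the operating system's random source instead of a seeded generator.  This is not how my program works (it calls out to an external service); the function name and the tiny word list are placeholders for illustration.

```python
import secrets

def xkcd_password(words, count=4):
    """Pick `count` words uniformly at random using the OS entropy source."""
    if not words:
        raise ValueError("word list is empty")
    # secrets.choice() does not depend on a user-visible 32-bit seed.
    return "".join(secrets.choice(words).capitalize() for _ in range(count))

# Placeholder list for illustration; a real list needs thousands of words.
SAMPLE_WORDS = ["rock", "index", "well", "float", "pocket", "survive", "four", "lab"]

print(xkcd_password(SAMPLE_WORDS))
```

With a list this small the password is weak no matter how good the random source is; the point is only where the random numbers come from.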

The vast majority of on-line password generators are written in JavaScript.  I'm not sure how to get truly random numbers (i.e. entropy) in JavaScript, but this might be a good starting point.

(By the way, a bit of reading on my part shows me that I have a lot to learn.  But my reading to date does reinforce my primary point: simply using rand() or a similar/derived function does not produce passwords that take thousands of years to crack.  At best they rely on "security through obscurity".)


I'm a little worried about the "random password generators" included in password managers.  The idea is that you should have a different password for every on-line account, and you let the password manager deal with the hundreds of passwords you end up with.  Since you don't need to remember, or even type those passwords, you might as well make them be random character gibberish.  Only the password manager's master password needs to be memorized.

However, if your password manager just uses the normal pseudo-random number generator in the system, that sequence of random characters will not have as much entropy as you think.  I can tell you that LastPass's online password generator just uses JavaScript's get_random() function, which only has 32 bits of entropy.  Now maybe their laptop application uses /dev/random, but also maybe the fact that their on-line generator uses built-in random indicates they didn't give the issue much thought.

I haven't done an exhaustive search, but I would wager that 90% of "generate password" functions just use the language's default random number generator, which has a 32-bit seed (or less!).

My suggestion is to use to generate your gibberish passwords, or my program to generate XKCD-style passwords.


The rand() man page says that Linux rand() uses the same algorithm as random().  And srandom()'s man page says that the seed is an unsigned int, which is 32 bits.  It also says:
The period of this random number generator is very large,
approximately 16 * ((2^31) - 1).
I.e. the period is approximately 34 billion, which is about 35 bits.  But the seed is 32 bits.  This means that you cannot start the random number generator at any arbitrary point in its period.  Even if you figure out a way to fully-leverage all 35 bits of random()'s period, that still gives you a crack time of 397 days, at 1000 guesses/sec.  And by the way, modern password crackers go much faster than 1000/sec.

XKCD-style Password Generator

I got to thinking about passwords again today.

I wrote my own program to produce XKCD-style passwords from a list of 2126 common words, and calculated some stats.  I reproduced XKCD's calculation of 44 bits of entropy for 4 randomly-selected words.  And I made a few mildly-interesting discoveries, and one more-interesting realization.


My average XKCD-style password length is 20.8 letters (over a large sample), which is a lot of typing.  So I decided to limit word size.  By filtering my list of 2126 words to nothing longer than 4 letters, I ended up with 709 words.  That's not many, and 4 of them together only gives 37 bits of entropy.  Not so hot.  But if you string 5 of 709 words together, you get 47 bits of entropy, which is better than XKCD!  And the average password drops to 18.3 characters.

I find that interesting: shorter passwords which produce more bits of entropy than longer passwords.  Seems counter-intuitive, until you realize that opening it up to the full 2126 words increases average word length more than it increases entropy.  (See below for the math.)


So, what do these passwords look like?  Here's 10 of the XKCD-style: 4 words from the 2126 word set:

password: MostlyRelativeSpinAdvanced
length: 26
password: ForBasicallyThinkingExplain
length: 27
password: CookieArmyMysteryConference
length: 27
password: ExpectConvertQuarterbackPresentation
length: 36
password: EverybodyProductHotDemonstrate
length: 30
password: RockIndexWellFloat
length: 18
password: BehaviorNearlyPromotePercentage
length: 31
password: PocketSurviveFourLab
length: 20
password: MuchWeekWillAnd
length: 15
password: DivideMorePeakSeveral
length: 21

So, how easy are those to remember?  Memorizing an XKCD-style password is about creating a mental picture or story around it.  Use some imagination.  It usually helps to make it amusing.  How about the first one: "MostlyRelativeSpinAdvanced"?  Well, I'm a bit of a science geek, so this one makes perfect sense.  You have a particle stream, and most of the particles are moving at relativistic speeds.  So measuring each particle's spin is a pretty advanced thing to do.  Hmm ... what's the amusing part of that?  Oh maybe that an actual physicist would roll his eyes at my explanation and say that I don't know the first thing about particle physics.  But basically, I was able to imagine a mini-story or mental picture for each of those passwords, so while I might not be able to memorize all 10, I could easily memorize one of them.

What about the shorter passwords consisting of 5 words from the 709 words of 4 letters or less?

password: BuyWingSadRideSeed
length: 18
password: SeedPairTankJailDo
length: 18
password: PanBuryDenyDataOld
length: 18
password: GeneRiceTeaYetSin
length: 17
password: WallJailLabNextTent
length: 19
password: HallSnapCashRichRead
length: 20
password: WarmUsKeepRoseLess
length: 18
password: PortMarkSirYouLeaf
length: 18
password: HiAgoHipAnyBe
length: 13
password: EaseSkyRealTossFate
length: 19

Even though those are shorter (and more secure) passwords, I guess I find them more difficult to remember.  It's about creating a mental picture or story around those words.  Since the words are random, they don't come out in any conceptually correlated way.  So you stretch your imagination to encompass them.  The more words in the password, the more you have to stretch.

Take the last password up there, "EaseSkyRealTossFate", and drop that last word to make it 4 words: "Ease sky real toss".  My first thought is that "toss" is the children's game "ring toss".  Sky and ease kind of fit since the game is usually pretty easy and you toss things towards the sky.  The word "real" is kind of left out, but I imagine throwing something "real", like a laptop or a dinner plate, instead of a game piece.  So I imagine the ease of tossing a real laptop into the sky.  Yeah, that's stretching the imagination a bit, but maybe not too much.  I could pretty easily remember EaseSkyRealToss.

But now throw "fate" in there and my whole mental picture falls apart.  I guess I could say that when the laptop lands, its fate will be sealed, but ... not sure why ... but I would have much more trouble remembering it.

So I'll be sticking to 4 words and more typing.


Passwords are basically taking a set of N things, and taking L of them out with replacement.  For example, a 4-digit PIN consists of a set of 10 digits (N=10), and you take 4 digits out (L=4) with replacement.  The "with replacement" simply means that you might take the same digit out more than once (e.g. 2338).  So the number of possible 4-digit PINs is 10**4 (10 to the power of 4), which is 10,000.  To get that in terms of bits, take the log base 2 of it to get 13.3.  So about 13 bits of entropy.

Another example: 8 randomly-selected letters for a password.  Let's assume lower-case only, and no digits or special characters.  The set of N things is the letters of the alphabet, so N=26.  By taking 8 characters, L=8.  26**8 = 208 billion.  Log base 2 of that is 37.6 bits of entropy.  Cool.  Now let's do random upper/lower case.  N=52, and 52**8 = 53 trillion, giving 45.6 bits of entropy.  Add in 0-9: N=62, 62**8 = 218 trillion, giving 47.6 bits of entropy.

So, back to my XKCD-style passwords.  My original set of 2126 words, taking 4 at a time, gives 2126**4 = 20 trillion, which is 44.2 bits.  My reduced set of 709 short words, taking 5 at a time gives 709**5 = 179 trillion, which is 47.3 bits.
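
All of these numbers come from the same formula: N**L possible passwords, so L * log2(N) bits.  Here is a minimal Python sketch that reproduces them; the function name is mine, and the N and L values are the ones from the examples above.

```python
import math

def entropy_bits(n: int, l: int) -> float:
    """Bits of entropy when drawing l items, with replacement, from n choices."""
    return l * math.log2(n)   # same as log2(n ** l)

CASES = [
    ("4-digit PIN", 10, 4),
    ("8 lower-case letters", 26, 8),
    ("8 mixed-case letters", 52, 8),
    ("8 letters and digits", 62, 8),
    ("4 words from 2126", 2126, 4),
    ("5 words from 709", 709, 5),
]

for label, n, l in CASES:
    print(f"{label}: {n ** l:,} combinations, {entropy_bits(n, l):.1f} bits")
```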

However, see my next blog post for an observation about random password generators and entropy.


My list of 2126 words actually comes from a list of 3000 words from Education First.  I filtered it to limit word length to 7 or fewer characters, resulting in my 2126 words.  Note to the rigorous: you'll find that I'm 2 words short; it made my code easier to ignore the first and last words.

So how about if I remove that filter and pick 4 words from the entire set of 2998?

2998 ** 4 = 80 trillion, which is 46 bits.  I.e. going from 2126 words to 2998 increases the entropy by 2 bits.  My average password length jumps to 25.6, which is 5 more characters.  I tried a few other word length limits and decided that 7 is best.


See Be sure to use "-r" if generating an actual password you want to use.

Monday, May 22, 2017

Some multicast programming tips

Never too old to learn.  :-)

There are lots of multicast example programs out there, so I won't try to compete with them.  But I did run across several things that weren't explained very well.

Single Socket, Multiple Groups

Yes, you can create a single socket and have it receive datagrams from multiple multicast groups.  Just include multiple calls to:
  setsockopt(recv_sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, ...

Multiple Sockets, One Group per Socket

This is another common use case, where you create multiple sockets for receiving, with each socket joined to a different multicast group.

Binding the Receive Socket

Since a socket needs to be bound to a port to receive any kind of UDP datagram, multicast or unicast, you need to include a call to bind().  You pass in a sockaddr_in with the sin_port set as desired (remember to pass it in network order).  But what about the sin_addr?  What do you set that to?

Many people set it to INADDR_ANY, which is what I did in a recent program.  But in the multiple sockets, one group per socket case, it had an unexpected side effect.  All of my sockets were bound to the same destination port, but joined to different multicast groups.  With sin_addr set to INADDR_ANY, the kernel replicated the received datagrams and delivered a copy to *every* socket! I.e. simply doing the IP_ADD_MEMBERSHIP didn't do any filtering.  When a multicast datagram was received, the kernel just used the destination port and delivered a copy to every socket.

I had to do some extra searching to find out that you can set the sin_addr to the multicast group.  I have some reason to suspect that this is not portable across all operating systems, but at least it works on Linux.  Now I can have 10 sockets, each bound to the same port (don't forget SO_REUSEADDR) but different multicast groups.  When a multicast datagram is received, it is delivered *only* to the socket which is bound to the right port/multicast group pair.
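
Here is roughly what that looks like, as a minimal Python sketch (Python's socket module passes these calls straight through to the same system calls).  The function names are mine, and the group addresses in the usage note are made-up examples.  As noted above, binding to the group address is an assumption that holds on Linux and may not hold elsewhere.

```python
import socket

def make_mreq(group: str, interface: str = "0.0.0.0") -> bytes:
    """Pack a struct ip_mreq: the group address, then the local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(interface)

def open_group_socket(group: str, port: int) -> socket.socket:
    """Create a UDP socket that receives only datagrams sent to group:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Needed so several sockets can share the same port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Bind to the group address instead of INADDR_ANY so the kernel filters
    # by destination address as well as port (Linux behavior; may not be portable).
    sock.bind((group, port))
    # The join is still required; binding alone does not subscribe to the group.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, make_mreq(group))
    return sock
```

Usage would be something like `socks = [open_group_socket(g, 12000) for g in ("239.1.2.3", "239.1.2.4")]`, with each socket then seeing only its own group's traffic.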

Single Socket, Multiple Groups, reprise

So, what about the case where you have a single socket joined to multiple groups?  In that case, you *do* want to use INADDR_ANY in the bind.

Mix and Match?

I guess this poses a restriction.  You can't have, say, 2 sockets that you distribute 4 multicast groups across, with two groups each.  Why would you want to do that?  Maybe to load-balance across threads.  But assuming they all want to bind to the same port, you can't do it.  Setting the sin_addr to INADDR_ANY will mean that both sockets will receive a copy of each datagram sent. But you can't set sin_addr to multiple multicast groups.

So if you want to have multiple sockets, you need to have one group per socket, and bind that socket to the group.

Monday, May 15, 2017

WannaCrypt / WannaCry ransomware

I'm not a security researcher, and I don't follow the subject very closely.  But here is an interesting read by the person who slowed the spread of the recent WannaCrypt / WannaCry ransomware outbreak.

Sunday, April 30, 2017

Fraudulent spam email claiming to be Netflix

I got a phishing email.  So what?  I get lots of phishing emails.  Why blog about this one?

Well, it's at least a *little* different.

Most of them direct the victim to an existing web site which has been compromised.  I.e. the web site's real owner has no idea that his own site is being used for fraudulent purposes.

In this one, the victim is directed to the domain name "", which the scammer obtained properly.  Unfortunately, the scammer wasn't stupid enough to include his own contact information in the registry, instead choosing to hide behind

Now there's nothing wrong with using to hide one's identity.  If anything, it removed any doubt in my mind (as if there were any) that the page isn't owned by Netflix.  So it reinforced that it is a phishing site.  I sent a complaint email to anyway.

Next up, the domain's registrar:  Never heard of them.  Malaysian.  Sent them a complaint email too, asking them to suspend the registration.

Next, the IP address that resolves to:  A whois lookup shows the block is owned by Quasi Networks LTD.  Abuse email to it as well.

Now to another nice site:, a site that evaluates how likely a site is to be fraudulent.  It actually goes to the site and analyzes it.  So I went there and plugged in "", and sure enough, it says that it is probably a phishing site (no surprise there).  But on that page is a tab named "resources", which shows details of the access to the site ... and well lookie there, "" redirects to "".  Which resolves to the same IP as "", and is registered in the same ways ( and  So what's the point of that?  Oh well, another set of complaint emails for the new domain name.

Finally, let's see if it is a compromised web site.  I would like to see what other domain names resolve to the same IP address.  Unfortunately, this appears not to be an exact science.  The few sites there are that claim to do this find *no* domains resolving to that IP.  However, a simple google search for "" (*with* the double quotes) does find the names "" and a new one: "".

Yep.  Another phishing site, leveraging Apple instead of Netflix.  Let's do the drill, starting with whois.  WHOA!!!  Did we hit paydirt?

Registrant Contact
Name: Jamie Wilson
Mailing Address: 22 Madisson Road, London London SE12 8DH GB
Phone: +44.07873394485
Fax Ext:

Now, don't be too hasty.  The *real* registrant is a scammer.  What are the chances he would list his own real contact info?  The only thing that might be valid is the email address, since I think he needs that to fully set up the domain, and even then it might have been a single-use throwaway.

Hmm ... not totally throw-away.  A google of "" has 6 hits, including "" and "", both of which have Jamie as the registrant, but neither of which resolve to valid IP addresses.  So not sure there's anything actionable (i.e. complainable) there.

But just in case, I googled the phone number, and found this additional hit: "", which doesn't appear to resolve to a valid IP.

Well, much as I hate to, let's skate over to "", which wants my money in a bad way.  It tells me that is associated with ~38 domains, but of course won't tell me what any of them are without paying them $99.  And even though I would love to send complaints regarding all 38, I wouldn't love it $99 worth.

Ok, one more thing. says that the owner of that email address is Adam Stormont, and that the email is associated with a few other sites (but not 37), including "", which doesn't resolve to an IP.  And by the way, a whois of another domain, "", says that the registrant is David Hassleman.  So yeah, ignore the Jamie Wilson contact.  He wasn't that stupid.  :-)

And now I've run out of gas.  Maybe those domain names will be disabled in the next few days.  Or maybe I've just wasted a half hour of my life.  (Well, I've learned a few things, so not totally wasted.)

Friday, March 31, 2017

Cisco Eating Multicast Fragments???

UPDATE: after upgrading the IOS on our "MDF" switch, this problem went away.  None of my readers (all 2 of them?) have reported seeing this problem with their switches.  So I think this issue is closed.

I think we've discovered a bug in our Cisco switch related to UDP multicast and IP fragmentation.  Dave Zabel (of Windows corrupting UDP fame) did the initial detective work, and I did most of the analysis.  And I'm not quite ready to declare victory yet, but I'm pretty sure we know roughly what is going on.


It appears that Cisco is not paying proper attention to whether a packet is fragmented when checking the UDP destination port for the BFD protocol.  The result is that it eats user packets that it misidentifies as being part of that protocol.


We have 4 Catalyst 3560 "LAB" switches (48 port) trunked to a Catalyst 4507 "MDF" switch.  Our lab test machines are distributed across the LAB switches.

Our messaging software multicasts UDP datagrams.  One of our regression tests involves sending messages of varying sizes with randomized data.  We saw that occasionally, one of the messages would be lost.  Doing packet captures showed that the missing datagram is NAKed and retransmitted multiple times, but the subscribing host never saw the datagram, even though it saw all the previous and subsequent datagrams.  (This particular test does not send at a particularly stressful rate.)

Further investigation showed that some hosts always got the message in question, while others never got the message.  Turns out that the hosts that got the message were on the same LAB switch as the sender.  The hosts that didn't get the message were on a different switch.

I narrowed it down to a minimal test datagram of 1476 bytes.  The first 1474 bytes can be any arbitrary values, but the last two bytes had to be either "0e c8" or "0e c9".  Any datagram with either of those two problematic byte pairs at that offset will be lost.  Note that the datagram will be split into 2 packets (IP fragments) by the sending host's IP stack.  Strategically placed tcpdumps indicated that the first IP fragment always makes it to the receiver, but the second one seems to be eaten by our "MDF" switch.

There's nothing magic about the size 1476 - it can be larger and the problem still happens.  1476 is just the smallest datagram which demonstrates the problem.


IP fragmentation happens when UDP hands to IP a datagram that doesn't fit into a single MTU-sized Ethernet packet (1500 bytes).  A UDP datagram consists of an 8-byte header, followed by up to 65,527 bytes of UDP payload.  IP splits a large datagram up into fragments of 1480 bytes each and prepends its own 20-byte IP header to each fragment.  But note that only the first fragment will contain the UDP header.  So IP fragment #1 will hold the 8-byte UDP header and the first 1472 bytes of my datagram.

Since my test datagram is 1476 bytes long, IP fragment #2 will contain a 20-byte IP header followed by the last 4 bytes of my datagram.
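
The split can be worked out mechanically.  This is a minimal Python sketch of the arithmetic, assuming a 1500-byte MTU and a 20-byte IP header with no options; the function and constant names are mine.

```python
MTU = 1500        # Ethernet payload size
IP_HEADER = 20    # assumes no IP options
UDP_HEADER = 8

def fragment_sizes(udp_payload_len, mtu=MTU):
    """Return (offset_bytes, ip_payload_bytes) for each fragment of a UDP datagram."""
    total = UDP_HEADER + udp_payload_len       # what UDP hands to IP
    # Every fragment except the last must carry a multiple of 8 bytes.
    max_chunk = (mtu - IP_HEADER) // 8 * 8
    frags = []
    offset = 0
    while offset < total:
        size = min(max_chunk, total - offset)
        frags.append((offset, size))
        offset += size
    return frags

for offset, size in fragment_sizes(1476):
    print(f"offset {offset} bytes ({offset // 8} 8-byte units), {size} payload bytes")
```

For the 1476-byte test datagram, the second fragment starts at byte offset 1480 and carries 4 bytes.  A 1472-byte datagram still fits in one packet.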

I won't show you the first fragment of my test datagram because it's long and boring.  And it is successfully handled by Cisco, so it's also not relevant.

Here's a tcpdump of the second fragment of my test datagram (test datagram bytes highlighted).  Note that tcpdump includes a 14-byte Ethernet header in front of the 20-byte IP header, then the last 4 bytes of my test datagram, and finally 22 padding nulls to make up a minimum-size packet (those nulls are not counted as part of the IP payload).

07:56:38.518614 00:1e:c9:4e:a1:92 (oui Unknown) > 01:00:5e:65:03:01 (oui Unknown), ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl   2, id 2132, offset 1480, flags [none], proto: UDP (17), length: 24) > udp
        0x0000:  0100 5e65 0301 001e c94e a192 0800 4500  ..^e.....N....E.
        0x0010:  0018 0854 00b9 0211 afed 0a1d 0358 ef65  ...T.........X.e
        0x0020:  0301 0000 0ec8 0000 0000 0000 0000 0000  ................
        0x0030:  0000 0000 0000 0000 0000 0000            ............

This is the packet which is successfully received by hosts on the same switch as the sender, but is never received by hosts on a different switch.  Change the "0e c8" byte pair to, for example, "1e c8" or "0e c7" and everything works fine - the packet is properly forwarded.


In my problematic datagram, the last 4 bytes occupy the same packet position in fragment #2 as the UDP header in a non-fragmented packet.  In particular, the byte pair "0e c8" occupies the same packet position as the UDP destination port in a non-fragmented packet.  Those byte values correspond to port 3784, which is used by the BFD protocol.  BFD is used to quickly detect failures in the path between adjacent forwarding switches and routers, so it is of special interest to our switches.  (The other problematic byte pair "0e c9" corresponds to port 3785, which is also used by BFD.)

So, when a LAB switch sends fragment #2 to the MDF, it looks like MDF is checking the UDP port WITHOUT looking at the IP header's "Fragment Offset" field.  It should only look for a UDP port if the fragment offset is zero.  Here's that packet again; the fragment offset is the "00b9" at offset 0x0014:

07:56:38.518614 00:1e:c9:4e:a1:92 (oui Unknown) > 01:00:5e:65:03:01 (oui Unknown), ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl   2, id 2132, offset 1480, flags [none], proto: UDP (17), length: 24) > udp
        0x0000:  0100 5e65 0301 001e c94e a192 0800 4500  ..^e.....N....E.
        0x0010:  0018 0854 00b9 0211 afed 0a1d 0358 ef65  ...T.........X.e
        0x0020:  0301 0000 0ec8 0000 0000 0000 0000 0000  ................
        0x0030:  0000 0000 0000 0000 0000 0000            ............

For most (non-fragmented) packets, that field will be zero, and the UDP header will be present, in which case the 0ec8 would be the destination port number.  The fragment offset of b9 hex is 185 decimal, and IP fragment offset is measured in units of 8-byte blocks, so the actual offset is 8*185=1480, which is what tcpdump shows for "offset".
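To make that arithmetic concrete, here is a minimal C sketch of the check I would expect a correct classifier to make.  The function names are made up for illustration, and this is certainly not Cisco's code; the only real data is the 20-byte IP header plus 4 payload bytes copied from the dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fragment offset in bytes: the low 13 bits of IP header bytes 6-7,
 * which count 8-byte blocks. */
unsigned frag_offset_bytes(const uint8_t *ip_hdr)
{
  assert(ip_hdr != NULL);
  unsigned field = ((unsigned)ip_hdr[6] << 8) | ip_hdr[7];
  return (field & 0x1fffu) * 8u;
}

/* Suspected buggy behavior: read the "UDP destination port" position
 * without checking whether a UDP header can be there at all. */
int udp_dest_port_naive(const uint8_t *ip_hdr)
{
  unsigned ihl = (ip_hdr[0] & 0x0fu) * 4u;  /* IP header length in bytes */
  return (ip_hdr[ihl + 2] << 8) | ip_hdr[ihl + 3];
}

/* Expected behavior: only the first fragment (offset 0) of a UDP packet
 * carries a UDP header.  Returns the port, or -1 if there is none. */
int udp_dest_port(const uint8_t *ip_hdr)
{
  if (ip_hdr[9] != 17) return -1;                 /* protocol is not UDP */
  if (frag_offset_bytes(ip_hdr) != 0) return -1;  /* not the first fragment */
  return udp_dest_port_naive(ip_hdr);
}

/* IP header and payload of fragment #2, copied from the tcpdump above. */
const uint8_t frag2[24] = {
  0x45, 0x00, 0x00, 0x18, 0x08, 0x54, 0x00, 0xb9,
  0x02, 0x11, 0xaf, 0xed, 0x0a, 0x1d, 0x03, 0x58,
  0xef, 0x65, 0x03, 0x01, 0x00, 0x00, 0x0e, 0xc8
};
```

With these bytes, frag_offset_bytes(frag2) is 1480, the naive lookup sees port 3784 (BFD), and the offset-aware lookup reports that there is no UDP port to look at.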

It also seems strange to me that the switch ignores which multicast group I'm sending to.  I can send to any valid multicast group, and the problematic packet will be eaten by the "MDF" switch.  Shouldn't there be a specific multicast group for BFD?  Maybe I found 2 bugs?

My employer has a support contract with Cisco, and I'm working with the internal network group to get a Cisco ticket opened.  I'll update as I learn more, but it's slow climbing through the various levels of internal and external tech support, each one of whom starts out with, "are you sure it's plugged in?"  It may take weeks to find somebody who even knows what IP fragmentation is.


I would love to hear from others who can try this out on their own networks.  Grab the source files:

To build on Linux do:
gcc -o msend msend.c
gcc -o mdump mdump.c

Note that I've tried other operating systems (Windows and Solaris), with the same test results.  This is not an OS issue.

For this test, the main purpose of mdump is to get the host to join the multicast group.

Choose three hosts: A, B, and C.  Make sure A and B are on the same switch, and C is on a different switch.  In my case, all three hosts are on the same VLAN; I don't know if that is significant.  For this example, let's assume that the three hosts' IP addresses are,, and respectively, and that all NICs are named "eth0".

Choose a multicast group and UDP port that aren't being used in your network.  I chose and 12000.  I've tried others as well, with the same test results.

Note that the msend and mdump commands require you to put the host's primary IP address as the 3rd command-line parameter.  This is because multicast needs to be told explicitly which interface to use (normal IP routing doesn't know the "right" interface to use).

Open a window to A, and two windows each for B and C.  Enter the following commands:

B1: ./mdump 12000

B2: tcpdump -i eth0 -s2000 -vvv -XX -e host

C1: ./mdump 12000

C2: tcpdump -i eth0 -s2000 -vvv -XX -e host

A: ./msend 12000

The "msend" command sends two datagrams.  The first one is small and gives the sending host's name.  The second one is the 1476-byte datagram, whose second fragment gets eaten by the Cisco "MDF" switch.

Window B1 should show both datagrams fully received.

B2 should show 3 packets:
1. The short packet with the host name.
2. Fragment #1 of the long packet
3. Fragment #2 of the long packet

C1 should only show the first datagram.

C2 should show 2 packets:
1. The short packet with the host name.
2. Fragment #1 of the long packet.

Fragment #2 is missing from C2, presumably eaten by the "MDF" switch.

Note that the two "tcpdump" windows might show additional packets, which are for the "igmp" protocol, and are unrelated to the test.  If I had more time, I would figure out how to get "tcpdump" to ignore them.

Friday, November 18, 2016

Linux network stack tuning

Found a nice blog post that talks about tuning the Linux network stack:

I notice that it doesn't talk about pinning interrupts to the NUMA zone that the NIC is physically connected to (or at least "NUMA" doesn't appear in the post), so it doesn't have *everything*.  :-)

And it also doesn't mention kernel bypass libraries, like OpenOnload.

But it has a lot of stuff in it.

Wednesday, September 21, 2016

Review: Prairie Burn (Jazz)

I realize that this is a technical blog without many followers, but I'm really getting into a new album and wanted to share.  If you're not interested in Jazz music, you may stop reading.

Prairie Burn is a new CD by the Mara Rosenbloom Trio.  It is modern Jazz, so don't expect anything that sounds like Tommy Dorsey, or Dixieland.  Unfortunately, I don't have the background or the vocabulary to be able to tell you what it *does* sound like.  In other words, this is the worst Jazz review ever.

But since when has that ever stopped me?  :-)

Prairie Burn is great.  Listening to it takes me on an emotional journey including stops at agitation, surprise, excitement, dreaming, and satisfaction.  This music draws me in effortlessly.

So, why am I flogging it in my blog?  For the same reason I flogged Mad and Grace: I like them and I want them to reach their goals.  Yes, Prairie Burn has an Indiegogo campaign to raise money for a publicist so that Mara can get more of the attention she deserves.

Jazz has a strange following, few in number but passionate in their dedication.  I've read people bemoan the lack of young talent in the genre.  Most of the well-known artists are getting on in years and won't be around forever.  We've got to find new talent and support it.

I've done the finding for you.  Now it's your turn to help with the supporting.  :-)

Not sure you'll like the music?  The two pieces from Prairie Burn are pretty different from each other, but the second (Turbulence) is probably more representative of the album as a whole.

Full disclosure: Mara is my daughter-in-law.  That certainly influenced me in terms of giving the music a try.  I believe it is not influencing my evaluation of its quality.  This is her third album, and while I like them all, this is the one that I feel passionate about.

Sunday, July 31, 2016

Beginner Shell Script Examples

As I've mentioned, I am the proud father of a C.H.I.P. single-board computer.  I've been playing with it for a while, and have also been participating in the community message board.  I've noticed that there are a lot of beginners there, just learning about Linux.  This collection of techniques assumes you know the basics of shell scripting with BASH.

One of the useful tools I've written is a startup script called "".  Basically, this script blinks CHIP's on-board LED, and also monitors a button to initiate a graceful shutdown.  (It does a bit more too.)  I realized that this script demonstrates several techniques that CHIP beginners might like to see.

The "" script can be found here:  For instructions on how to install and use blink, see

The code fragments included below are largely extracted from the script, with some simplifications.

NOTE: many of the commands shown below require root privilege to work.  It is assumed that the "" script is run as root.

1. Systemd service, which automatically starts at boot, and can be manually started and stopped via simple commands.

I'm not an expert in all things Linux, but I've been told that in Debian-derived Linuxes, "systemd" is how all the cool kids implement services and startup scripts.  No more "rc.local", no run levels, etc.

Fortunately, systemd services are easy to implement.  The program itself doesn't need to do anything special, although you might want to implement a kill signal handler to clean up when the service is stopped.

You do need a definition file which specifies the command line and dependencies.  It is stored in the /etc/systemd/system directory, named "<service_name>.service".  For example, here's blink's definition file:

$ cat /etc/systemd/system/blink.service 
# blink.service -- version 24-Jul-2016
# See
Description=start blink after boot



When that file is created, you can tell the system to read it with:

sudo systemctl enable /etc/systemd/system/blink.service

Now you can start the service manually with:

sudo service blink start

You can manually stop it with:

sudo service blink stop

Given the way it is defined, it will automatically start at system boot.

2. Shell script which catches kill signals to clean itself up, including the signal that is generated when the service is stopped manually.

The blink script wants to do some cleanup when it is stopped (unexport GPIOs).

trap "blink_stop" 1 2 3 15

where "blink_stop" is a Bash function:

  echo "blink: stopped" `date` >>/var/log/blink.log

where "blink_cleanup" is another Bash function.

This code snippet works if the script is used interactively and stopped with control-C, and also works if the "kill" command is used (but not "kill -9"), and also works when the "service blink stop" command is used.

3. Shell script with simple configuration mechanism.

This technique uses the following code in the main script:

export MON_RESET=
export MON_GPIO=
export MON_GPIO_VALUE=0  # if MON_GPIO supplied, default to active-0.
export BLINK_GPIO=
export DEBUG=

if [ -f /usr/local/etc/blink.cfg ]; then :
  source /usr/local/etc/blink.cfg
else :
fi

The initial export commands define environment variables with default values.  The use of the "source" command causes the /usr/local/etc/blink.cfg to be read by the shell, allowing that file to define shell variables.  In other words, the config file is just another shell script that gets included by blink.  What does that file contain?  Here are its installed defaults:

MON_RESET=1       # Monitor reset button for short press.
#MON_GPIO=XIO_P7   # Which GPIO to monitor.
#MON_GPIO_VALUE=0  # Indicates which value read from MON_GPIO initiates shutdown.
MON_BATTERY=10    # When battery percentage is below this, shut down.
BLINK_STATUS=1    # Blink CHIP's status LED.

4. Shell script that controls CHIP's status LED.

Here's how to turn off CHIP's status LED:

i2cset -f -y 0 0x34 0x93 0

Turn it back on:

i2cset -f -y 0 0x34 0x93 1

This obviously requires that the i2c-tools package is installed:

sudo apt-get install i2c-tools

5. Shell script that controls an external LED connected to a GPIO.

The blink program makes use of the "gpio_sh" package.  Without that package, most programmers refer to gpio port numbers explicitly.  For example, on CHIP the "CSID0" port is assigned the port number 132.  However, this is dangerous because GPIO port numbers can change with new versions of CHIP OS.  In fact, the XIO port numbers DID change between version 4.3 and 4.4, and they may well change again with the next version.

The "gpio_sh" package allows a script to reference GPIO ports symbolically.  So instead of using "132", your script can use "CSID0".  Or, if using an XIO port, use "XIO_P0", which should work for any version of CHIP OS.

Here's how to set up "XIO_P6" as an output and control whatever is connected to it (perhaps an LED):

gpio_export $BLINK_GPIO; ST=$?
if [ $ST -ne 0 ]; then :
  echo "blink: cannot export $BLINK_GPIO"
fi
gpio_direction $BLINK_GPIO out
gpio_output $BLINK_GPIO 1    # turn LED on
gpio_output $BLINK_GPIO 0    # turn LED off
gpio_unexport $BLINK_GPIO    # done with GPIO, clean it up

6. Shell script that monitors CHIP's reset button for a "short press" and reacts to it.

The small reset button on CHIP is monitored by the AXP209 power controller.  It uses internal hardware timers to determine how long the button is pressed, and can perform different tasks.  When CHIP is turned on, the AXP differentiates between a "short" press (typically a second or less) vs. a "long" press (typically more than 8 seconds).  A "long" press triggers a "force off" function, which abruptly cuts power to the rest of CHIP.  A "short" press simply turns on a bit in a status register, which can be monitored from software.

REG4AH=`i2cget -f -y 0 0x34 0x4a`  # Read AXP209 register 4AH
BUTTON=$((REG4AH & 0x02))  # mask off the short press bit
if [ $BUTTON -eq 2 ]; then :
  echo "Button pressed!"
fi

Note that I have not figured out how to turn off that bit.  The "" program does not need to turn it off since it responds to it by shutting CHIP down gracefully.  But if you want to use it for some other function, you'll have to figure out how to clear it.

7. Shell script that monitors a GPIO line, presumably a button but could be something else, and reacts to it.

gpio_export $MON_GPIO; ST=$?
if [ $ST -ne 0 ]; then :
  echo "blink: cannot export $MON_GPIO"
fi
gpio_direction $MON_GPIO in
gpio_input $MON_GPIO; VAL=$?
if [ $VAL -eq 0 ]; then :
  echo "GPIO input is grounded (0)"
fi
gpio_unexport $MON_GPIO      # done with GPIO, clean it up

8. Shell script that monitors the battery charge level, and if it drops below a configured threshold, reacts to it.

This is a bit more subtle than it may seem at first.  Checking the percent charge of the battery is easy:

REGB9H=`i2cget -f -y 0 0x34 0xb9`  # Read AXP209 register B9H
PERC_CHG=$(($REGB9H))  # convert to decimal

But what if no battery is connected?  It reads 0.  How do you differentiate that from having a battery which is discharged?  I don't know of a way to tell the difference.  Another issue is what if a battery is connected and has low charge, but it doesn't matter because CHIP is connected to a power supply and is therefore not at risk of losing power?  Basically, "" only wants to shut down on low battery charge if the battery is actively being used to power CHIP.  So in addition to reading the charge percentage (above), it also checks the battery discharge current:

BAT_IDISCHG_MSB=$(i2cget -y -f 0 0x34 0x7C)
BAT_IDISCHG_LSB=$(i2cget -y -f 0 0x34 0x7D)
BAT_DISCHG_MA=$(( ( ($BAT_IDISCHG_MSB << 5) | ($BAT_IDISCHG_LSB & 0x1F) ) / 2 ))

CHIP draws over 100 mA from the battery, so I check it against 50 mA.  If it is lower than that, then either there is no battery or the battery is not running CHIP:

BAT_IDISCHG_MSB=$(i2cget -y -f 0 0x34 0x7C)
BAT_IDISCHG_LSB=$(i2cget -y -f 0 0x34 0x7D)
BAT_DISCHG_MA=$(( ( ($BAT_IDISCHG_MSB << 5) | ($BAT_IDISCHG_LSB & 0x1F) ) / 2 ))
if [ $BAT_DISCHG_MA -gt 50 ]; then :
  REGB9H=`i2cget -f -y 0 0x34 0xb9`  # Read AXP209 register B9H
  PERC_CHG=$(($REGB9H))  # convert to decimal
  if [ $PERC_CHG -lt 10 ]; then :
    echo "Battery charge level is below 10%"
  fi
fi

Sunday, June 26, 2016

snprintf: bug detector or bug preventer?

Pop quiz time!

When you use snprintf() instead of sprintf(), are you:
   A. Writing code that proactively detects bugs.
   B. Writing code that proactively prevents bugs.

Did you answer "B"?  TRICK QUESTION!  The correct answer is:
  C. Writing code that proactively hides bugs.

Here's a short program that takes a directory name as an argument and prints the first line of the file "tst.c" in that directory:
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
  char path[20];
  char iline[5];
  snprintf(path, sizeof(path), "%s/tst.c", argv[1]);
  FILE *fp = fopen(path, "r");
  fgets(iline, sizeof(iline), fp);
  printf("iline='%s'\n", iline);
  return 0;
}
Nice and safe, right?  Both snprintf() and fgets() do a great job of not overflowing their buffers.  Let's run it:

$ ./tst .

Hmm ... didn't get the full input line.  I guess my iline array was too small.  But hey, at least it didn't seg fault, like it might have if I had just used scanf() or something dangerous like that!  No seg faults for me.

$ ./tst ././././././././.
Segmentation fault: 11

Um ... oh, silly me.  My path array was too small.  fopen() failed, and I didn't check its return status.

So I could, and should, check fopen()'s return status.  But that just gives me a more user-friendly error message.  It doesn't tell me *why* the file name is wrong.  Imagine the snprintf() being in a completely different area of the code.  Yes, you discover there's a bug by checking fopen(), but it's nowhere near where the bug actually is.  Same thing, by the way, with the fgets() not reading the entire line.  Who knows how much more code is going to be executed before the program misbehaves because it didn't get the entire line?

And that is my point.  Most of these "safe" functions work the same way: you pass in the size of your buffer, and the functions guarantee that they won't overrun your buffer, but they truncate *silently*.  I.e. unless you explicitly check for it (snprintf()'s return value, or a missing newline after fgets()), nothing tells you that your buffer is too small.  It's not until later that something visibly misbehaves, and that wastes time and effort working your way back to the root cause.

Now I'm not suggesting that we throw away snprintf() in favor of sprintf().  I'm suggesting that using snprintf() is only half the job.  How about this:

#include <stdio.h>
#include <string.h>
#include <assert.h>
#define BUF2SMALL(b) do {\
  assert(strnlen(b, sizeof(b)) < sizeof(b)-1);\
} while (0)

int main(int argc, char **argv)
{
  char path[20];
  char iline[5];
  snprintf(path, sizeof(path), "%s/tst.c", argv[1]); BUF2SMALL(path);
  FILE *fp = fopen(path, "r");  assert(fp != NULL);
  fgets(iline, sizeof(iline), fp); BUF2SMALL(iline);
  printf("iline='%s'\n", iline);
  return 0;
}

Now let's run it:

$ ./tst ./.
Assertion failed: (strnlen(iline, sizeof(iline)) < sizeof(iline)-1), function main, file tst.c, line 15.
Abort trap: 6
$ ./tst ././././././././.
Assertion failed: (strnlen(path, sizeof(path)) < sizeof(path)-1), function main, file tst.c, line 13.
Abort trap: 6

There.  My bugs are reported *much* closer to where they really are.

The essence of the BUF2SMALL() macro is that you should use a buffer which is at least one character larger than the maximum size you think you need.  So if you want an answer string to be able to hold either "yes" or "no", don't make it "char ans[4]", make it at least "char ans[5]".  BUF2SMALL() asserts an error if the string consumes the whole array.

One final warning.  Note that in BUF2SMALL() I use "strnlen()" instead of "strlen()".   I wrote BUF2SMALL() to be a general-purpose error checker after a variety of "safe" functions.  Look at what the man page for "strncpy()" says:
Warning:  If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.
If you use "strncpy()" to copy a string, and use my macro to error check it, the string might not be null-terminated, and  strlen() has a good chance of segfaulting.  These "safe" functions only make the fuse a little longer on the stick of dynamite in your program.
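
For snprintf() specifically, there is also a way to catch truncation without the spare byte.  C99 snprintf() returns the length the complete string would have had, so a return value that is negative, or greater than or equal to the buffer size, means the output did not fit.  Here is a minimal sketch; build_path() is a hypothetical helper name, not something from the program above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Build "<dir>/tst.c" into out.
 * Returns 0 on success, -1 if the result did not fit (or snprintf failed). */
int build_path(char *out, size_t out_size, const char *dir)
{
  assert(out != NULL && out_size > 0 && dir != NULL);
  int n = snprintf(out, out_size, "%s/tst.c", dir);
  if (n < 0 || (size_t)n >= out_size) {
    return -1;  /* truncated: report it here, right where the bug is */
  }
  return 0;
}
```

This does not help with fgets() or strncpy(), which have no such return value, so a check in the spirit of BUF2SMALL() is still needed for those.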

Saturday, June 25, 2016

Of compiler warnings and asserts in a throw-away society

Many people despair at today's "throw away" society.  If you don't want it, just throw it away.

Programmers know this is not a recent phenomenon; they've been throwing stuff away since the dawn of high-level languages.

Actual line from code I'm doing some work on:
    write(fd, str_gpio, len);

The "write" function returns a value, which the programmer threw away.  And I know why without even asking him.  If you were to challenge him, he would probably say, "I don't need the return value, and as for prudent error checking, this program has been running without a glitch for years."

That's fine and good, but I don't like compiler warnings:
warning: ignoring return value of 'write', declared with attribute warn_unused_result [-Wunused-result]
     write(fd, str_gpio, len);

Well, I didn't feel like analyzing the code to see how errors *should* be handled, so I just cast "write" to void to get rid of the compile warning:
    (void)write(fd, str_gpio, len);

Hmm ... still same warning.  Apparently over 10 years ago, glibc decided to make a whole lot of functions have an attribute that makes them throw that warning if the return value is ignored, and GCC decided that functions with that attribute will throw the warning *even if cast to void*.  If you like reading flame wars, the Interwebs are chock full of arguments over this.

And you know what?  Even though I'm not sure I agree with it, it did cause me to re-visit the code and add some actual error checking:
    s = write(fd, str_gpio, len);  assert(s == len);

Hmm ... different warning:
warning: unused variable 's' [-Wunused-variable]
     s = write(fd, str_gpio, len);  assert(s == len);

Huh?  I'm using it right there!  Back to Google.  Apparently, you can define a preprocessor macro (NDEBUG) to inhibit the assert code.  Some programmers like to have their asserts enabled during testing, but disabled for production for improved efficiency.  The compiler sees that the condition testing code is conditionally compiled, and decides to play it safe and throw the warning that "s" isn't used, even if the condition code is compiled in.  And yes, this also featured in the same flame wars over void casting.  I wasn't the first person to use exactly this technique to try to get rid of warnings.


So I ended up doing what lots of the flame war participants bemoaned having to do: writing my own assert:
#define ASSRT(cond_expr) do {\
  if (!(cond_expr)) {\
    fprintf(stderr, "ASSRT failed at %s:%d (%s)\n", __FILE__, __LINE__, #cond_expr);\
    abort();\
} } while (0)
    s = write(fd, str_gpio, len);  ASSRT(s == len);

Finally, no warnings!  And better code too.  I just don't like creating my own assert. :-(
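
For code where aborting on a short write is too blunt, the return value can be acted on instead.  The following is only a sketch of one common approach, not what the program I was working on does; write_fully() is a made-up name, and it assumes a blocking file descriptor:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes to fd, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error (errno says why). */
int write_fully(int fd, const void *buf, size_t len)
{
  const char *p = buf;
  assert(buf != NULL);
  while (len > 0) {
    ssize_t n = write(fd, p, len);
    if (n < 0) {
      if (errno == EINTR) continue;  /* interrupted before writing; retry */
      return -1;
    }
    if (n == 0) {  /* no progress; give up rather than spin forever */
      errno = EIO;
      return -1;
    }
    p += n;            /* skip the bytes that were accepted */
    len -= (size_t)n;
  }
  return 0;
}
```

The caller still has to look at the result, but now it is a single "did it work" status rather than a byte count to compare.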

Tuesday, May 24, 2016

TCP flow control with non-blocking sends: EAGAIN

So, let's say you're sending data on a TCP socket faster than the receiver can unload it. The socket buffers fill up. Then what happens? The send call returns fewer bytes sent than were requested. Everybody knows that. (Interestingly, does not mention this behavior, but I certainly see it during testing.)

But what if the previous send exactly filled the buffer so that your next send can't put *any* bytes in? Does send return zero? Apparently not. It returns -1 with an errno of EAGAIN or EWOULDBLOCK (also verified by testing).  If I ever knew this, I forgot it till today.

Finally, something I did already know, but rarely include in my code: see this excerpt from
The socket is marked nonblocking and the requested operation would block. POSIX.1-2001 allows either error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.
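
Putting those three observations together, a single non-blocking send attempt might be wrapped like this.  It is a sketch under the assumption of a connected POSIX stream socket; set_nonblocking() and try_send() are illustrative names, not part of any library:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Put a descriptor into non-blocking mode.  Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
  int flags = fcntl(fd, F_GETFL, 0);
  if (flags < 0) return -1;
  return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) ? -1 : 0;
}

/* One non-blocking send attempt.
 * Returns: >0  bytes accepted (possibly fewer than len),
 *           0  nothing accepted because the socket buffer is full,
 *          -1  a real error (see errno). */
ssize_t try_send(int sock, const void *buf, size_t len)
{
  assert(len > 0);  /* keeps a return of 0 unambiguous */
  ssize_t n = send(sock, buf, len, 0);
  if (n >= 0) return n;
  /* POSIX allows either value, and they need not be equal: check both. */
  if (errno == EAGAIN || errno == EWOULDBLOCK) return 0;
  return -1;
}
```

A caller would typically keep the unsent remainder and retry once select() or poll() reports the socket writable again.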

Sunday, January 10, 2016

Saying goodbye to a bit of personal history

Ever since I was *very* young, I've been interested in science and technology.  At some point in my teens, maybe 40 years ago, I wanted a better VOM (Volt-Ohm-Milliamp meter) than the junky one I had picked up, so I did some research and spent precious funds on a high-impedance FET meter:

It saw pretty heavy use for about 5 years, but as I transitioned from electronics to digital logic, and from that to software, my need for it dropped.  I've probably used it twice in the past 15 years, probably for checking if an electrical outlet is live.

As my previous post indicates, I've just gotten a single-board computer, and I was trying to indirectly measure the value of the pull-up resistor on an open-collector output.  I needed a reasonably accurate, high impedance meter, so I got out my old FET.

Alas, the two small selector switches were frozen.  Not sure why or how -- it's a *switch* for goodness sake -- but I can't use it if I can't turn it on.  I'll take it apart, but I don't have high hopes.

Its passing is a sad event for me, but why?  Is it just nostalgia?  Longing for a simpler time?  Missing my childhood?  I think it's more than that.  There are certain things that have come to represent turning points in my life.  The meter may not have *caused* a significant shift in my life path, but it had come to represent it.  And maybe it's a mortality thing too, like a piece of me died.

Oh well, I'll probably get a cheap DVM.