Thursday, December 31, 2015

C.H.I.P. - a small, cheap computer

UPDATE: C.H.I.P. is dead

Or rather, the company is dead.  I will keep my CHIP content lying around, but I won't be doing much with it.  Eventually, I'll probably move to a Raspberry Pi.


I just received the C.H.I.P. single-board computer whose Kickstarter campaign I supported.

I suspect that anybody reading this blog knows what a Raspberry Pi is.  CHIP is like that, with its own spin, which I found appealing.  Its goals:
  • Low cost.  The base price is currently $9 (although see the $5 Pi-zero).
  • Ease and speed of getting started.  It's literally ready to surf the net right out of the box.
  • Small size.  It's smaller around than a business card (although obviously it is thicker).
Accessories most people will use to try it out:
  • Power supply.  It has a micro USB connector, so you might already have a charger that will work.  But be careful: even though the doc says it draws 300 mA peak, get one that can handle over an amp.  My 700 mA supply craps out.  My 1.5 amp supply does fine.
  • Audio/video cable.  If you have a video camera, you might already have the right one (although I've heard some are wired differently).  If not, you can get one from the CHIP store.
  • TV with 3 "RCA" composite audio/video jacks.  Most TVs have them.
  • USB keyboard.
  • USB mouse.
  • Powered USB hub to power the keyboard and mouse.  Maybe you already have one.  But note that this hub will *not* power the CHIP.  You'll still need the micro USB supply.
  • Case.  Completely optional, but at $2, I suggest getting it.
I did go that route for my first bootup.  Being packrats, my wife and I had all the above items, so we were able to get the GUI operational in about 15 minutes.  Not being very familiar with Linux GUIs, it took another 15 minutes to be surfing the web.  Yep, it really is a general-purpose computer.  But being composite video, it has very low resolution.  And drawing a couple of watts, it isn't very fast (compared to modern laptops).  And being Linux, it doesn't have much PC-like software out of the box.  But really, who would want to use one of these as a replacement for your laptop?  ... Oh.  Um.  Never mind.  :-)


If you aren't interested in the GUI, you don't need any of the above accessories.  All you need is a laptop and a micro USB cable (I had 3 lying around).  Again, be careful: we had a couple of cables that were wired for power only and didn't pass signals.

Here's a minimal "getting started" procedure:
    http://www.geeky-boy.com/w/CHIP_startup.html

Thursday, October 15, 2015

Windows corrupting UDP datagrams


We just discovered that under a somewhat unlikely set of circumstances, Microsoft's Windows 7 (SP 1) will corrupt outgoing UDP datagrams.  I have a simple demonstration program ("rsend") which reliably reproduces the bug.  (I'll be pointing my Microsoft contact at this blog post.)

This bug was discovered by a customer, and we were able to reproduce it locally.  I wish I could take the credit, but my friend and colleague, Dave Zabel, did most of the detective work.  And amazing detective work it was!  But I'll leave that description for another day.  Let's concentrate on the bug.


CIRCUMSTANCES FOR THE BUG

1. UDP protocol.  (Duh!)
2. Multicast sends.  (Does not happen with unicast UDP.)
3. A process on the same machine must be joined to the same multicast group that is being sent to.
4. Windows' IP MTU size set to smaller than the default of 1500 (I tested 1300).
5. Sending datagrams large enough to require fragmentation by the reduced MTU, but still small enough *not* to require fragmentation with a 1500-byte MTU.

With that mix, you stand a good chance of the outgoing data having two bytes changed.  It seems to be somewhat dependent on the content of the datagram.  For example, a datagram consisting mostly of zeros doesn't seem to get corrupted.  But it's not that hard to find datagram content that *is* consistently corrupted, so my "rsend" demonstration program has one such datagram hard-coded.

Regarding #3, for convenience the rsend program contains code to join the multicast group, but I've also reproduced it without rsend joining, and instead running "mdump" in a different window.
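
For anyone who wants the flavor of it without downloading rsend, here is a minimal POSIX-style sketch of the pieces that matter (this is *not* the actual rsend source; the names and the 1400-byte payload are mine, and on Windows you'd add WSAStartup() and friends, but the socket options are the same): send to a multicast group from a socket that has also joined that group.

    /* Sketch: send to a multicast group while also joined to it. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s interface_ip\n", argv[0]); exit(1); }

        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        if (sock < 0) { perror("socket"); exit(1); }

        /* Send multicast out the specified interface. */
        struct in_addr iface;
        iface.s_addr = inet_addr(argv[1]);
        if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF, &iface, sizeof(iface)) < 0) {
            perror("IP_MULTICAST_IF"); exit(1);
        }

        /* Condition 3: the sending machine is also joined to the group. */
        struct ip_mreq mreq;
        mreq.imr_multiaddr.s_addr = inet_addr("239.196.2.128");
        mreq.imr_interface.s_addr = iface.s_addr;
        if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
            perror("IP_ADD_MEMBERSHIP"); exit(1);
        }

        struct sockaddr_in dest;
        memset(&dest, 0, sizeof(dest));
        dest.sin_family = AF_INET;
        dest.sin_addr.s_addr = inet_addr("239.196.2.128");
        dest.sin_port = htons(12000);

        /* Condition 5: large enough to fragment at MTU 1300 but not at 1500
         * (1400 payload + 8 UDP + 20 IP = 1428-byte IP datagram). */
        char buf[1400];
        memset(buf, 0x35, sizeof(buf));    /* all-zero payloads don't seem to corrupt */

        for (;;) {
            if (sendto(sock, buf, sizeof(buf), 0,
                       (struct sockaddr *)&dest, sizeof(dest)) < 0) {
                perror("sendto"); exit(1);
            }
            sleep(1);
        }
    }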

Finally, be aware that I have not done a bunch of sensitivity testing.  I.e. I haven't tried different datagram sizes, different multicast groups, different MTU settings, jumbo frames, etc.  Nor did I try different versions of Windows (only 7), different NICs, etc.  Sorry, I don't have time to experiment.


BUG DEMONSTRATION

This procedure assumes that you have a Windows machine with its MTU at its default of 1500.  (You change it below.)

1. Build "rsend.c" on Windows with VS 2005.  Here's how I build it (from a Visual Studio command prompt):

  cl -D_MT -MD -DWIN32_LEAN_AND_MEAN -I. /Oi -Forsend.obj -c rsend.c

  link /OUT:rsend.exe ws2_32.lib mswsock.lib /MACHINE:I386 /SUBSYSTEM:console /NODEFAULTLIB:LIBCMT rsend.obj

  mt -manifest rsend.exe.manifest -outputresource:rsend.exe;1

2. Run the command, giving it the IP address of the Windows machine's interface that you want the multicast to go out of.  For single-homed hosts, just give it the IP address of the machine.  For example:
    rsend 10.1.2.3
To make the tool easy to use, it hard codes the multicast group 239.196.2.128 and destination port 12000.

3. On a separate machine, do a packet capture for that multicast group.  Note that the packet capture utilities I know of (wireshark, tcpdump) do *not* tell the kernel to actually join the multicast group.  I generally deal with this using the "mdump" tool.  Run it in a separate window.  For example:
    mdump 239.196.2.128 12000

In the packet capture, look at the 1278th and 1279th bytes of the UDP datagram data: they should both be zero.  Here they are, with a few bytes preceding them:
    0x75,0x34,0x34,0xa4,0xc5,0xb4,0x00,0x00
NOTE: at this point, the datagram will fit in a single ethernet frame, so no IP fragmentation happens.

4.  While rsend is running, open a command prompt with administrator privilege (right-click on "command prompt" icon and select "run as administrator") and enter:
    netsh interface ipv4 set subinterface "Local Area Connection" mtu=1300 store=persistent

Like magic, bytes 1278 and 1279 of the outgoing UDP datagrams change their values!  Note that with an MTU of 1300, this UDP datagram now needs to be fragmented.  If using wireshark, you'll need to examine the *second* packet to see the entire UDP datagram and get to byte 1278.  I consistently see 0x62,0x27, but that seems to be dependent on datagram content as well.

5. Undo the MTU change:
    netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent

Magically, the bytes go back to their correct values of 0x00,0x00.

Note: if you comment out the setsockopt of IP_ADD_MEMBERSHIP, the corruption will not happen.  The multicast datagrams will still go out, but they will be undamaged when the MTU is reduced.  The obvious suspect is the internal loopback.


SOLUTION

The only solution I know of is to leave the Windows IP MTU at its default of 1500.


WHY SET MTU 1300???

I don't know why our customer set it on one of his systems.  But he said that he would just set it back to 1500, so it must not have been important.

If you google "windows set mtu size" you'll find people asking about it.  In many cases, the user is trying to reach a web site which is across a VPN or some other private WAN link which does not have an MTU of 1500.  The way TCP works is that it tries to send segments (a TCP segment is basically an IP datagram) as large as possible while avoiding IP fragmentation.  So a TCP instance sending data might start with a 1500-byte segment size.  If a network hop in-transit cannot handle a segment that large, it has a choice: either fragment it or reject it.  TCP explicitly sets an option to say, "do not fragment," so the network hop drops the segment.  It is supposed to return an ICMP error, which the sender's TCP instance will use to reduce its segment size.  This algorithm is known as TCP's "path MTU discovery".

But many network components either do not generate ICMP errors, or do not forward them.  This is supposedly done in the name of "security" (don't get me started).  This breaks path MTU discovery.  But the segments are still being dropped, so eventually the TCP sender times out and the web site doesn't work.  Apparently this is fairly rare, but it does happen.  Hence the  "set MTU" questions.  If the user reduces IP's MTU setting, it artificially reduces the maximum segment size used by TCP.  Do a bit of experimenting to find the right value, et Voila!  (French for "finally, I can download porn!")
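
Just to illustrate that "don't fragment" mechanism (this is Linux-specific and has nothing to do with the Windows bug; the options shown are documented in the Linux ip(7) man page, and Windows exposes this differently), here is a sketch of asking the kernel to set the DF bit on a UDP socket and report the path MTU it currently believes in:

    /* Linux-only sketch of the "don't fragment" half of path MTU discovery. */
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>    /* IP_MTU_DISCOVER, IP_PMTUDISC_DO, IP_MTU */
    #include <sys/socket.h>

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);

        int pmtud = IP_PMTUDISC_DO;    /* set DF on outgoing packets */
        setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER, &pmtud, sizeof(pmtud));

        struct sockaddr_in dest;
        memset(&dest, 0, sizeof(dest));
        dest.sin_family = AF_INET;
        dest.sin_port = htons(12000);
        inet_pton(AF_INET, "10.1.2.3", &dest.sin_addr);          /* example address */
        connect(sock, (struct sockaddr *)&dest, sizeof(dest));   /* IP_MTU needs a connected socket */

        int mtu = 0;
        socklen_t len = sizeof(mtu);
        getsockopt(sock, IPPROTO_IP, IP_MTU, &mtu, &len);
        printf("current path MTU estimate: %d\n", mtu);
        return 0;
    }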

So, how could Microsoft possibly not find this during their extensive testing?  Well first of all, UDP use is rare compared to TCP.  Multicast UDP is even more rare.  Sending UDP multicast datagrams larger than MTU is getting close to unicorn rare.  And doing all that with the IP MTU set to a non-standard value?  Heck, I consider myself to be a pretty rigorous tester, and I would never have tried that.


UPDATE:

Thanks to Mr. Anonymous for asking the question about NIC offloading.  We had considered the question previously (see my response in the comments), but in composing my response, I got to thinking about the offset of the corruption.

It's always in the second packet of the fragmented datagram, and always at 1278.  But that offset is with respect to the start of the UDP payload.  What is the offset with respect to the start of the second packet?  I didn't look at this before since Wireshark's ability to reassemble fragmented datagrams is so handy.  But I went ahead and clicked the "Frame" tab and saw that the corruption happens at offset 40 from the start of the packet.

Guess where the UDP checksum belongs in an UNfragmented datagram!  Yep, offset 40: a 14-byte Ethernet header plus a 20-byte IP header puts the UDP header at offset 34, and the checksum sits 6 bytes into the UDP header.  Something decided to take the second packet of the fragmented datagram and insert a separate UDP checksum where it *should* go if it were not a fragment.

This still seems like a software bug in Windows.  Sure, maybe the NIC is doing the actual calculation.  Maybe it's not.  But it only happens when IP is configured for a non-standard MTU.  If I have MTU=1500 and I send a fragmented datagram, there is no corruption.


UPDATE 2:

I did some experimenting with datagram size and verified something that I suspected.  When the MTU is set to 1300, the corruption only happens when the datagram size is such that a 1500-byte MTU would *not* fragment but a 1300-byte MTU does.  I.e. there is a size range of 200 bytes (the difference between 1300 and 1500).   This is another reason Microsoft's testers apparently didn't discover this.  Even if they tested fragmentation with non-standard MTUs, would they think to test a size in that specific range?  With the benefit of hindsight, sure, it's "obvious".  But if you're just testing combinations of configurations, you would just pick the "send fragments" combination, which is probably chosen to fragment with MTU 1500.  (FYI: I've updated the original post to refine the conditions of the bug.)

I'm normally not a Microsoft cheerleader, so it feels weird to be defending them on this bug.  :-)


UPDATE 3:

Since we noticed that the corruption always happens at offset 40 in the second packet, I decreased the size of the datagram to only include half of the corrupted pair.  Sure enough, the last byte of the datagram got corrupted.  And the second corrupted byte?  Who knows.  I kind of hoped it would corrupt something in Windows and maybe blue-screen it, but no such luck.  I didn't "see" any misbehavior.

Does that mean there *was* no misbehavior?  NO!  The outgoing datagrams suddenly had bad checksums!  Meaning that the mdump tool stopped receiving them since Linux discards datagrams with bad checksums.  But tcpdump captures the packets *before* UDP discards them, so you can see the bad checksums.

I kept decreasing the size of the datagram till it was 1273 bytes.  That still triggers fragmentation when MTU=1300.  The outgoing datagrams had no visible corruption but had bad checksums.  Reduce one more byte, and the datagram fits in one packet.  Suddenly the checksums are OK.

I tried a few things, like sending packets hard, and varying their sizes, but other than the bad checksums I could not see any obvious Windows misbehavior.

I guess my days as a white-hat hacker are over before they started.  (Did I get the tenses right on that sentence?)

Well, I think I'm done experimenting.  If anybody else reproduces it, please let me know your Windows version.


UPDATE 4:

I heard back from my contact at MS.  He said:

"We've looked into this, and see what is happening.  If the customer needs to pursue this rather than using a work around (e.g. not setting the MTU size on the loopback path to a different size than the non-loopback interface, etc.) they will need to open a support ticket.  Thank you for letting me know about this."
Which I suspect translates to, "We'll fix it in a future version, but not urgently.  If you need urgency, pay for support."  :-)


REDDIT:

Finally, Hi Reddit users!  Thanks for pushing the hits on this post to many times the total hit count for the whole rest of the blog.  :-)  I read the comments and saw that my first update had already been noticed by somebody else.

Also, something a lot of Reddit comments have fixated on is my claim that UDP multicast is rare.  I meant that the number of programs (and programmers) that use it is very small compared to all software, not that multicast is hardly ever used.  As pointed out, there are several areas of network infrastructure which are multicast-based, so it gets used all the time.  My point is that the number of programmer-hours spent *writing and testing* multicast-based software is very small compared to the overall networking software field.  And as such, it tends not to be as burned-in as, say, TCP.

Also, in most multicast software that I have learned the guts of, the programmer makes sure that datagram sizes are kept small so as to avoid fragmentation.  This seems to be due to the commonly-held idea that you should *never* let IP fragment, which I think comes from the fact that, at least historically, router performance is hurt if it has to perform fragmentation while a datagram is in transit.  I'm not sure if this is still true for modern routers, but historically fragmentation needed to be handled by the supervisory processor.  For the odd packet every now and then, no problem.  For high-rate data flows, it can kill a router.

That seems to be the basis on which a lot of multicast software avoids fragmentation, preferring instead to split large messages into multiple datagrams.  But this reasoning is often not applicable.  Our software is intended primarily to be used within a single data center.  When we send a 2K datagram, no router needs to worry about fragmenting it; the sending host's IP stack splits the datagram into packets before they hit the wire.  The intermediate switches and routers all have 1500 MTU, allowing the packets to traverse unmolested.  The final receiving host(s) reassemble and pass the datagram to user space.  This has a noticeable advantage for high-performance applications since the same amount of user data is passed with fewer system calls (the overhead of switching between user and kernel space is significant).

So while I'm sure our software is not alone in sending fragmented multicast datagrams, I stand behind my claim that sending fragmented multicast is relatively rare.

Wednesday, September 30, 2015

Coding Pet Peeves

Ok, these things are not the end of the world.  During a code inspection, I might not even bring them up.  Or maybe I would - it would depend on my mood.


INT OR BOOLEAN

    if (use_tls != 0) {
        /* Code having to do with TLS */
    }

So, what is "use_tls"?  It's obviously somewhat of a flag - if it's non-zero then we need to do TLS stuff - but is it a count?  If it's a count, then it should have been named "use_tls_cnt", or maybe just "tls_cnt".  If it's being used as a boolean, then it should have been tested as a boolean:

    if (use_tls) {
        /* Code having to do with TLS */
    }

And yes, this is actual code that I've worked on.  It was being used as a boolean, and should have been tested as one.  Is this a big deal?  No, like I said, I might not have brought it up in a review.  But I am a believer that the code should communicate information to the reader.  The boolean usage tells the reader more than the numeric version (and is easier on the eyes as well).


MALLOC AND FREE

    state_info = OUR_MALLOC_1(sizeof(state_info_t));
    /* Code using state_info */
    free(state_info);

Again, actual code I've worked on.  We have a code macro for malloc which does some useful things.  But the author couldn't think of anything useful to do with free so he didn't make a free macro.  He should have.  There have been at least two times (that I know of) when it would have been tremendously useful.  One time we just wanted to log a message for each malloc and free because we suspected we were double-freeing.  Another time we thought that a freed structure was being accessed, and we wanted free to write a pattern on the memory before freeing.

"No problem," you say, "just create that macro and name it free."  Nope.  There are other mallocs: OUR_MALLOC_2, OUR_MALLOC_3, etc (no they aren't actually named that).  And some of the code doesn't use any of the macros, it just calls malloc directly!  For that second feature (writing a pattern before freeing), you need special code in the malloc part as well as the free part.  These all should have been done consistently so that OUR_MALLOC_1 only works with OUR_FREE_1, etc.  That would have allowed us to do powerful things, but as it is, we couldn't.


UNNATURAL CONDITIONAL ORDER

I bet there is an official name for this:

    if ('q' == cmd) break;

I've always felt that code should be written the way you would explain it to somebody.  Would you say, "If q is the command, then exit the loop."  No.  You would say, "If the command is q, then exit the loop."  I've seen this construct a lot where they put the constant in front of the comparison operator, and it is almost always awkward.

Oh, I know there's a good reason for it: it's a form of defensive programming.  Consider:

    if (cmd = 'q') break;

Oops, forgot an equals.  But the compiler will happily assign 'q' to cmd, test it against zero, and always do the break.  Swap the variable and the constant and you get a compile error.

But come on.  Gcc supports the "-Wparentheses" option, which warns about the above.  In fact, gcc supports a *lot* of nice optional warnings, which people should be using all the time.  "-Wall" is your friend!


DECLARE CLOSE TO FIRST USE

Ok, I'm more of a recent convert here.  I've written hundreds of functions like this:

    void my_funct(blah blah blah)
    {
        int something = 0;
        int something_else = 1;
        .... and another 10 or 15 variables

In the REALLY old days, you *had* to declare all local variables immediately after the function's open brace.  But modern C compilers let you declare variables anywhere you want.  Variables should be declared and initialized close to their first use.  This has two advantages: as the reader is reading the code and he encounters a variable, he can see right away what its type and initial value are without having to scroll up and back down again.  But more importantly, when the reader sees the variable declared, he knows that the variable is not used anywhere above that declaration.  Whereas if it is declared at the top of the function, he has to examine all the function code above the first use to make sure that it isn't secretly being changed from its initial value.
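
To contrast with the declare-everything-at-the-top example above, here is a sketch of the style I mean (the helper function and MAX_RETRIES are invented just so it compiles):

    #include <stdio.h>

    #define MAX_RETRIES 3

    static int try_connect(int client_num) { return client_num > 2; }  /* invented helper */

    static void my_funct(int client_num)
    {
        printf("handling client %d\n", client_num);     /* no locals needed yet */

        int num_retries = 0;    /* declared at the retry loop, not at the top of the function */
        while (num_retries < MAX_RETRIES && !try_connect(client_num)) {
            num_retries++;
        }

        int connected = (num_retries < MAX_RETRIES);    /* declared at first use */
        printf("connected=%d after %d retries\n", connected, num_retries);
    }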


POINTLESS COMMENTS

    i++;  /* increment i */

Oh, thank goodness somebody taught me what the "++" operator does!  Note to programmer: you don't need to teach me the language.

    fclose(file[i]);  /* close the ith file */

Are you sure?  I thought the "fclose" function *opened* a file.  Note to programmer: you don't have to teach me the standard library functions.  They all have man pages.

    session_cleanup(client[i]);  /* clean up the client session */

OK, session_cleanup() probably doesn't have a man page.  But the comment doesn't contain any more information than the function and variable names!

One might add comments to say what the variable "i" is for and what the files in the "file[]" array are for.  That at least gives information that a reader might not know.  But even those should not be comments, they should be descriptive variable names.

Most comments I've encountered should simply be removed as unnecessary clutter that makes it hard to read the code.  The best comments don't try to explain *what* the code is doing, they explain *why* it is being done.  Usually that is pretty obvious, so only give the "why" explanations when the answer may not be obvious.

    client_num++;

    /* Usually we would close the log file *after* we clean up the
     * session, in case the cleanup code wants to log an error.  But
     * the session_cleanup() function is legacy and is written to
     * open and close the log file itself on error.  */
    fclose(log_file[client_num]);
    session_cleanup(client[client_num]);

That's better.


PETTY PEEVE

"Really Steve?  Brace indention?  What are you, 12?"

(grumble)  I know, I should have grown past this way back when I realized that emacs vs. vi just doesn't matter.  And really, I don't much care between the various popular styles.  If writing from scratch, I put the open brace at the end of the "if", but I'm also happy to conform to almost any style.

But I just recently had the pleasure of peeking inside the OpenSSL source code and I saw something I hadn't seen for many years:

    if (blah)
        {
        code_to_handle_blah;
        }

It's called "Whitesmiths" and supposedly has advantages.  But those advantages seem more philosophical than practical ("the alignment of the braces with the block that emphasizes the fact that the entire block is conceptually (as well as programmatically) a single compound statement").  Um ... sorry, I prefer actual utility over making a subtle point.  Having the close brace unindented makes it easier to find the end of the block.  The Jargon File claims that, "Surveys have shown the Allman and Whitesmiths styles to be the most common, with about equal mind shares."  Really?  I've been a C programmer for over 20 years with 7 different companies, and none of those places used Whitesmiths.  This is only the second time I've even seen it.  Equal mind share?  Here's a survey which has Allman 17 times as popular as Whitesmiths.  I think the Jargon File got it wrong.

And yeah, I know that if I were immersed in it for a few months, I wouldn't even notice it any more.  Just like emacs vs. vi, it doesn't really matter, so long as a project is internally consistent.  But hey, this is *my* rant, and I don't like it!  So there.

Thursday, September 10, 2015

RS-232 serial ports

Remember RS-232?  It is rare nowadays to find a need for RS-232, but it can come up.  In my case, a couple of machines in our lab have their system consoles tied to dedicated serial lines which expect a "dumb-ish" terminal (I actually owned a real-life VT-100 for a while).  Again, it is rare that one needs the system console, but especially when you have hardware issues, sometimes it is the only way to get in.

RS-232 is an interface specification which was basically invented to do one thing: connect a MODEM to either a computer or a terminal.  If you look at the RS-232 spec, you will see very MODEM-ish signals, like carrier detect, ring indicator, etc.

The RS-232 interface has two different sides, one named "DTE" and the other named "DCE".  The DTE side is for a terminal or computer.  The DCE side is for a MODEM.  RS-232 is a standard which, among other things, defines which pins on a connector are for which purpose.  If you look at an RS-232 pinout, you will see names on the pins.  For example, on a DB-25 connector, pin 2 is labeled "Tx" (on a DB-9, it's pin 3).  It is named from the point of view of the DTE, which is to say that the DTE side of the interface is supposed to output on that pin the serial data it is transmitting to the DCE.  The DCE side is supposed to input from that pin the serial data it is receiving from the DTE.  So the names make sense for the DTE (it transmits on Tx and receives on Rx) and the names are backwards for the DCE (it receives on Tx and transmits on Rx).  DTE devices normally have male connectors (with pins) and DCE devices normally have female connectors, although back in the old days, I did find the odd device which violated this standard.

These days, we rarely use MODEMs.  Most of the time we deal with RS-232, we are trying to connect two DTE devices together, like connecting a terminal to a computer.  You can't just use a normal cable for that since both the terminal and the computer are DTEs and will try to transmit on Tx (pin 2).  So to get two DTEs to talk, you need a "null modem".  This is usually a specially-wired cable with a female connector on both ends.  For example, one side's Tx pin is wired to the other side's Rx pin, and vice versa.  See https://en.wikipedia.org/wiki/Null_modem for details.

Modern laptops often don't have serial ports any more, so you'll probably need a USB serial adapter.  I have a Belkin which connects to a Windows PC and presents as COM3 (a driver is required).  I use putty, which lets you select "serial" and specify which COM port.  I also have a TRENDnet which connects to my Mac and presents as /dev/tty.usbserial (a driver is also required).  I use the normal Mac "terminal" application and enter:

    screen /dev/tty.usbserial 9600

Both of my USB devices are male DTE.  To use either one as a serial console, I need to use a female-to-female null modem.

Saturday, July 18, 2015

Rsync, Ssh as Root, and Solaris

In my distant past, I dabbled in a bit of Unix system administration.  (How distantly in the past?  SunOS 4.2.  Yeah, I'm old.)

This year, my department lost its system administrator and the decision was made not to replace him. So I and a co-worker (who has also dabbled) are now responsible for administering a lab with 50+ machines, most of them Linux, Windows, and Solaris, but with a scattering of other Unixes like AIX, HP-UX, and FreeBSD.

To make our lives easier, I took the probably inadvisable step of using SSH shared keys to allow our primary file server to log into all other Unix machines as root without a password.  (Yes, I know - this allows an intruder who gains root access to our main file server to have root access to all other Unix systems in the lab.  However, understand that the main file server is the only system that particularly *needs* to be secure; the other systems are primarily for testing.  And I don't allow root access in the opposite direction, from test system to file server.)

Anyway, one thing I use this for is to back up selected system areas, like /etc, using rsync.  I was successfully using rsync to back up /etc on most of our test systems, but it wasn't working on Solaris.  A bit of detective work revealed the following (run from our main file server):

    # ssh SolarisBox set | grep PATH
    PATH=/usr/sbin:/usr/bin

This, in spite of the fact that root's .profile sets PATH to include /usr/local/bin, which is where rsync lives on our Solaris boxes.  It turns out that on Solaris, .profile is only executed for interactive shells.  I needed to edit /etc/default/login and add:

    SUPATH=desired path

Et voilà!  (Like my Unicode there?)  Backups achieved.

Don't worry, I don't imagine that I'm any kind of expert at system administration.  I won't be posting much on the subject.  :-)

Wednesday, July 8, 2015

UTF-8, Unicode, and International Character Sets (Oh My!)

I'm a little embarrassed to admit that in my entire 35-year career, I've never really learned anything about international character sets (Unicode, etc.).  And I still know very little.  But recently I had to learn enough to at least be able to talk somewhat intelligently, so I thought I would speed up somebody else's learning a bit.

A more in-depth page that I like is: http://www.cprogramming.com/tutorial/unicode.html

Unicode - a standard which, among other things, specifies a unique code point (numeric value) to correspond with each printed glyph (character form).  The standard defines a code point space from 0 to 1,114,111 (0x0 to 0x10FFFF), but as of June 2015 it only assigns 120,737 of those code points to actual glyphs.

UTF-8 - a standard which is basically a means to encode a stream of numeric values with a range from 1 to 2,147,483,647 (1 to 0x7FFFFFFF), using a variable number of bytes (1-6) to represent each value such that numerically smaller values are represented with fewer bytes.  For example, the number 17 requires a single byte to represent, whereas the number 2,000,000,000 requires 6 bytes.  Notice that the largest Unicode code point only requires 4 bytes to represent.


So the Unicode standard specifies that a particular numeric value corresponds to a specific printed character form, and UTF-8 specifies a way to encode those numeric values.  Although the UTF-8 encoding scheme could theoretically be used to encode numbers for any purpose, it was designed to encode Unicode characters reasonably efficiently.  The efficiency derives from the fact that Unicode biases the most-frequently used characters to smaller numeric values.

UTF-8 and Unicode were designed to be backward compatible with 7-bit ASCII, in that the ASCII characters have numeric values equal to the first 128 Unicode code points (0-127), and UTF-8 represents those values with a single byte.  Thus the string "ABC" is represented in ASCII as 3 bytes with values 0x41, 0x42, 0x43, and those same three bytes are the valid UTF-8 encoding of the same string in Unicode.  Thus, an application designed to read UTF-8 input is able to read plain ASCII text.  And an application which only understands 7-bit ASCII can read a UTF-8 file which restricts itself to those first 128 Unicode characters.

Another nice thing about UTF-8 is that the bytes of a multi-byte character cannot be confused with normal ASCII characters; every byte of a multi-byte character has the most-significant bit set.  For example, the tilde-n character "ñ" has Unicode code point 241, and the UTF-8 encoding of 241 is the two-byte sequence 0xC3, 0xB1.  It is also easy to differentiate the first byte of a multi-byte character from subsequent bytes.  Thus, if you pick a random byte in a UTF-8 buffer, it is easy to detect whether the byte is part of a multi-byte character, and easy to find that character's first byte, or to move past it to the next character.
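
As a small illustration of how easy that byte classification is, here is a sketch of counting Unicode characters in a UTF-8 string (a continuation byte always looks like 10xxxxxx, so everything else starts a character):

    #include <stdio.h>
    #include <string.h>

    /* Count Unicode characters in a UTF-8 string: continuation bytes always
     * look like 10xxxxxx, so count every byte that is NOT of that form. */
    static size_t utf8_strlen(const char *s)
    {
        size_t count = 0;
        for (; *s != '\0'; s++) {
            if (((unsigned char)*s & 0xC0) != 0x80) {
                count++;    /* first byte of a 1-byte or multi-byte character */
            }
        }
        return count;
    }

    int main(void)
    {
        const char *str = "a\xC3\xB1o";   /* "año": the ñ is the 2-byte sequence 0xC3 0xB1 */
        printf("bytes=%zu characters=%zu\n", strlen(str), utf8_strlen(str));  /* bytes=4 characters=3 */
        return 0;
    }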

One thing to notice about UTF-8 is that it is trivially easy to contrive input which is illegal and will not parse properly.  For example, the byte sequence 0xC3, 0x41 is illegal in UTF-8 (0xC3 introduces a 2-byte character, and every subsequent byte of a multi-byte character *must* have its most-significant bit set).


Other Encoding Schemes

There are other Unicode encoding schemes, such as UCS-2 and UTF-16, but their usage is declining.  UCS-2 cannot represent all characters of the current Unicode standard, and UTF-16 suffers from problems of ambiguous endianness (byte ordering).  Neither is backward compatible with ASCII text.

Another common non-Unicode-based encoding scheme is ISO-8859-1.  Its advantage is that all characters are represented in a single byte.  Its disadvantage is that it only covers a small fraction of the world's languages.  It is backward compatible with ASCII text, but it is *not* compatible with UTF-8.  For example, the byte sequence 0xC3, 0x41 is a perfectly valid ISO-8859-1 sequence ("ÃA") but is illegal in UTF-8.  According to Wikipedia, ISO-8859-1 usage has been declining since 2006 while UTF-8 has been increasing.

There are a bunch of other encoding schemes, most of which are variations on the ISO-8859-1 standard, but they represent a small installed base and are not growing nearly as fast as UTF-8.

Unfortunately, there is not a reliable way to detect the encoding scheme being used simply by examining the input.  The input data either needs to be paired with metadata which identifies the encoding (like a mime header), or the user simply has to know what he has and what his software expects.

Most Unixes have available a program named "iconv" which will convert files of pretty much any encoding scheme to any other.  The user is responsible for telling "iconv" what format the input file has.


Programming with Unicode Data

Java and C# have significant features which allow them to process Unicode strings fairly well, but not perfectly.  The Java "char" type is 16 bits, which at the time Java was being defined, was adequate to hold Unicode.  But Unicode evolved to cover more writing systems and 16 bits is no longer adequate, so Java now supports UTF-16 which encodes a Unicode character in either 1 or 2 of those 16-bit chars.  Not being much of a Java or C# programmer, I can't say much more about them.

In C, Unicode is not handled natively at all.

A programmer needs to decide on an encoding scheme for in-memory storage of text.  One disadvantage to using something like UTF-8 is that the number of bytes per character varies, making it difficult to randomly access characters by their offset.  If you want the 600th character, you have to start at the beginning and parse your way forward.  Thus, random access is O(n) time instead of the O(constant) time that usually accompanies arrays.

One approach that evolved a while ago was the use of a "wide character", with the type "wchar_t".  This would allow you to declare an array of "wchar_t" and be able to randomly access it in O(constant) time.  In earlier days, it was thought that Unicode could live within the range 1-65535, so the original "wchar_t" was 16 bits.  Some compilers still have "wchar_t" as 16 bits (most notably Microsoft Visual Studio).  Other compilers have "wchar_t" as 32 bits (most notably gcc), which makes them a candidate for use with full unicode.

Most recent advice I've seen tells programmers to avoid the use of "wchar_t" due to its portability problems and instead use a fixed 32-bit type, like "uint32_t", which sadly did not exist in Windows until Visual Studio 2010, so you still need annoying conditional compiles to make your code truly portable.  Also, an advantage of wchar_t over uint32_t is the availability of wide flavors of many standard C string handling functions (standardized in C99).

Other opinionated programmers have advised against the use of wide characters altogether, claiming that constant time lookup is vastly overrated since most text processing software spends most of its time stepping through text one character at a time.  The use of UTF-8 allows easy movement across multi-byte characters just by looking at the upper bits of each byte.  Also, the library libiconv provides an API to do the conversions that the "iconv" command does.
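
For example, here is a sketch of converting a small ISO-8859-1 buffer to UTF-8 with that iconv() API (on glibc the functions are built in; on other systems you may need to link with -liconv):

    #include <stdio.h>
    #include <string.h>
    #include <iconv.h>

    /* Minimal sketch (not production code): convert ISO-8859-1 to UTF-8. */
    int main(void)
    {
        char in[] = "\xC3\x41";            /* ISO-8859-1 for "ÃA" */
        char out[16];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out);

        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");   /* tocode, fromcode */
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            perror("iconv");
            return 1;
        }
        iconv_close(cd);

        printf("converted %zu bytes to %zu bytes\n",
               strlen(in), sizeof(out) - outleft);
        return 0;
    }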

And yet, I can understand the attraction of wide (32-bit) characters.  Imagine I have a large code base which does string manipulation.  The hope is that by changing the type from char to wchar_t (or uint32_t), my for loops and comparisons will "just work".  However, I've seen tons of code which assumes that a character is 1 byte (e.g. it will malloc only the number of characters without multiplying by the sizeof the type), so the chances seem small of any significant code "just working" after changing char to wchar_t or uint32_t.

Finally, note that UTF-8 is compatible with many standard C string functions because null can be safely used to indicate the end of string (some other encoding schemes can have null bytes sprinkled throughout the text).  However, note that the function strchr() is *not* generally UTF-8 compatible since it assumes that every character is a char.  But the function strstr() *is* compatible with UTF-8 (so long as *both* string parameters are encoded with UTF-8).


Bottom Line: No Free (or even cheap) Lunch

Unfortunately, there is no easy path.  If I am ever tasked with writing international software, I suspect I will bite the bullet and choose UTF-8 as my internal format.

Thursday, June 4, 2015

C Pointers: Never Too Old to Learn

I've been a C programmer for some 20 years now, and I learned something new about the language just a few days ago.

A less-experienced C coder sent me a question about pointers.  It was in an area which I have seen proficient C coders stumble on.  In my reply, I went on a few digressions to give a better understanding of C pointers.  At one point, I gave the following example:

    char *x1 = "123\n";
    char x2[] = "456\n";
    printf("%p %p", &x1, &x2);

&x1 is obvious - it evaluates to the address of the x1 variable (not the string it points at).  I was all ready to proudly claim that &x2 is illegal and generates a syntax error.  After all, x2 is an array, and referencing an array without an index evaluates to the address of the start of the array.  What does it mean to take the address of an address?    For that to make sense, the address of the array would need to be stored in a pointer variable.  But there *is* no pointer variable holding the address of the x2 array, that address is just an ephemeral value generally held in a register.  So &x2 should be illegal.

But it compiled clean, not even a warning!  I got out my trusty K&R 2nd edition (ANSI C) and it says, "The operand [to the & operator] must be an lvalue referring neither to a bit-field nor to an object declared as register, or must be of function type."  Must be an lvalue, and I had always thought of x2 by itself as NOT being an lvalue!  (My mental model was that an lvalue is something that can be assigned to: lvalue = rvalue;)  Let's try the C99 spec: "The operand of the unary & operator shall be either a function designator, the result of a [] or unary * operator, or an lvalue that designates an object that is not a bit-field and is not declared with the register storage-class specifier." Different wording, but still specifies an lvalue.

Finally, google led me to a stackoverflow where somebody asked the same question.  The interesting reply:
In C, when you used the name of an array in an expression (including passing it to a function), unless it is the operand of the address-of (&) operator or the sizeof operator, it decays to a pointer to its first element.
That is, in most contexts array is equivalent to &array[0] in both type and value.
In your example, my_array has type char[100] which decays to a char* when you pass it to printf.
&my_array has type char (*)[100] (pointer to array of 100 char). As it is the operand to &, this is one of the cases that my_array doesn't immediately decay to a pointer to its first element.
The pointer to the array has the same address value as a pointer to the first element of the array as an array object is just a contiguous sequence of its elements, but a pointer to an array has a different type to a pointer to an element of that array. This is important when you do pointer arithmetic on the two types of pointer.
(Thanks to Charles Bailey for that answer back in 2010.)

Sure enough, x2 and &x2 both evaluate to the same pointer value, but x2+1 adds 1 to the address, and &x2+1 adds 5 to the address.
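
You can see it with a quick test (just a sketch; the actual addresses printed will vary):

    #include <stdio.h>

    int main(void)
    {
        char x2[] = "456\n";

        printf("x2    = %p\n", (void *)x2);           /* address of the array */
        printf("x2+1  = %p\n", (void *)(x2 + 1));     /* 1 byte past it (char pointer arithmetic) */
        printf("&x2   = %p\n", (void *)&x2);          /* same address as x2... */
        printf("&x2+1 = %p\n", (void *)(&x2 + 1));    /* ...but 5 bytes past it (array pointer arithmetic) */
        return 0;
    }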

I can't find this mentioned in the formal C language specs I checked, but my experiments with the excellent site http://gcc.godbolt.org suggest that it is commonly implemented.  A less-formal C book also mentions it.


C Pointers vs. Arrays

So much for my newly-learned bit of C.  Since I'm on the subject, here's a rewrite of the digression that triggered my investigation.

What does the following line do?

    char *x1 = "123\n";

The quoted string causes C to allocate an anonymous 5-byte character array in memory and initialize it (at program load time) with 31 32 33 0a 00 (hex).  I call it "anonymous" because there is no identifier (variable name) which directly refers to that array.  All you have is the address of the array, which is what a quoted string evaluates to in an expression.  That address is assigned into a pointer variable, x1.

So far, so good.  Now:

    char x2[] = "456\n";

This does *NOT* allocate an anonymous character array.  It allocates a 5-character array, gives it the variable name "x2", and pre-initializes it with 34 35 36 0a 00 (hex).  So the x2 variable (an array) is completely different from x1 (a pointer).

However, you can use the two variables in the same ways.  For example, this will do what you expect:

    printf("%s%s", x1, x2);  /* prints two lines, "123" and "456" */

It is easy to think of x1 and x2 as more the same than they really are, but I think it helps to analyze it.  Parameters passed to a function are expressions.  The values actually passed are the results of evaluating those expressions.  The above printf function is passed three expressions: "%s%s", x1, and x2.  Here's how they are handled by C:

  • "%s%s" - as before, the quoted string causes C to allocate an anonymous character array and inits it with the supplied characters.  The quoted string evaluates to the address of that array, and that address is passed as the first parameter.
  • x1 - x1 is a simple variable, so the expression evaluates to the content of the variable.  In this case, that content happens to be the address of the string "123\n".
  • x2 - x2 is *not* a simple variable, it is an array.  Normally, an array should be accessed with an index (e.g. x2[1]).  However, when an unindexed array name appears in an expression, C returns the address of the start of the array.  I.e. x2 is the same as &x2[0].  So x2 passes the address of the string "456\n".  But understand the mechanism is completely different; x1 passes the *content* of the variable x1 while x2 passes the *address* of the variable x2.

Similarly, consider this:

    printf("%c%c\n", x1[1], x2[1]);  /* print "25" */

Here again, the results are similar, but the mechanism is different.

  • x1[1] - x1 is *not* an array, so what does x1[1] mean?  It depends on x1's type.  In this case, it is a character pointer, so C takes the address contained in x1 and adds 1*sizeof(*x1) to it, and fetches the character stored at that address.  Since x1 points to type character, sizeof(*x1) evaluates to 1.  So the expression x1[1] evaluates to the character stored at the address in x1 plus 1, which is the character '2'.
  • x2[1] - this is very simple.  Since x2 is an array, x2[1] simply evaluates to the contents of the second element in the array, which is the character '5'.  Note that C does not have to do any pointer arithmetic, it just accessed the array directly.

In the above two examples, the pointer and the array can be thought of similarly.  This is true most of the time, and often leads to people forgetting that they are actually different things.  For example:

    printf("%d %d\n", sizeof(x1), sizeof(x2));

This prints "4 5" on a 32-bit machine and "8 5" on a 64-bit machine.  The first number should be clear: x1 is a pointer, so sizeof(x1) should be the number of bytes in a pointer (4 or 8 depending on address length).  But what about the 5?  This is an interesting deviation from the patterns established previously.  If we followed the previous pattern, sizeof(x2) should deconstruct as x2 evaluating to the address of the start of the x2 array.  Then sizeof(x2) should return the size of the address.  However, sizeof is *not* a normal C function, it's a language built-in operator (the parentheses are optional).  It's operand is *not* interpreted as an expression, it's a type.  So sizeof(x1) is not actually the size of the x1 variable, it is the size of the *type* of x1.  Since the type of x2 is an array of 5 characters, sizeof(x2) evaluates to 5.

One more example of x1 and x2 being treated differently, which is the thing I just learned:

    if (x1 == &x1) printf("x1==&x1\n");
    if (x2 == &x2) printf("x2==&x2\n");

The output of the above two lines is the single line "x2==&x2".  Note that this code also generates a warning on the second line since x2 and &x2 are pointers to different types (pointer to a byte vs. pointer to an array of 5 bytes).

The moral?  Although the mechanisms are different, an array and a pointer to an array can *usually* be treated the same in normal code.  But not always.  It's best to have a deep understanding of what's going on.

Saturday, February 14, 2015

Rapid Cycle Testing with C++

I discovered the convenience of rapid cycle testing in Clojure with "midje" and liked it.  I then re-created it while learning some Ruby.  Now I'm refreshing my C++ skills and wanted the same thing, and decided to combine it with GoogleTest (gtest) and clang on my Mac.

As usual when combining tools, it took a bit of fiddling to get it all playing nicely.  In hopes of saving somebody else some time, I uploaded a sample project to GitHub.

See https://github.com/fordsfords/autotest-cpp

NOTE: has dependencies on other tools; see README.md for details.

I'll probably change my workflow a lot over time, but that minimal example seems to work pretty well.  Run ./autotest.sh in one window and an editor in another.  Every time you write a source file, the automated test runs.  For example:

$ ./autotest.sh 

Now I write "learn.cpp" out, and the autotest window spits out:

autotest: /Users/sford_itunes/drop_sford/Dropbox/sford/src/c++11/autotest-cpp/learn.cpp

[ 33%] Built target gtest
[ 66%] Built target gtest_main
Scanning dependencies of target runUnitTests
[100%] Building CXX object CMakeFiles/runUnitTests.dir/learn.cpp.o
Linking CXX executable runUnitTests
[100%] Built target runUnitTests
        1.77 real         1.40 user         0.19 sys

Running main() from gtest_main.cc
[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from CoutTests
[ RUN      ] CoutTests.Simple
Hello world
msg2
[       OK ] CoutTests.Simple (0 ms)
[----------] 1 test from CoutTests (0 ms total)

[----------] 1 test from VarInits
[ RUN      ] VarInits.Simple
[       OK ] VarInits.Simple (0 ms)
[----------] 1 test from VarInits (0 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (1 ms total)
[  PASSED  ] 2 tests.
        0.00 real         0.00 user         0.00 sys
tst Sun Feb 15 14:28:25 CET 2015

        1.79 real         1.40 user         0.20 sys

It's a little verbose; I'll probably tighten it up a bit.  Note that it takes almost 2 seconds, most of which is in the build step.  That's on a second generation Macbook Air.  I miss the sub-second cycle time of Clojure Midje and Ruby autotest, but oh well.

By the way, if you download the project from GitHub, you will note that there is no CMakeLists.txt file, and you'll wonder how it can be using cmake.  Just follow the readme instructions and run the ./rebuild.sh script.  That script creates the CMakeLists.txt file.  (It seemed like a good idea at the time...)

Friday, February 13, 2015

Go To Not So Harmful After All

As one who lived through the final throes of the "Go To Statement Considered Harmful" debate (I still feel like I need a shower on the rare occasions that I use goto), I find "An Empirical Study of goto in C Code" interesting.

(Credit to slashdot for the find.)

 

Edit: just stumbled on https://koblents.com/Ches/Links/Month-Mar-2013/20-Using-Goto-in-Linux-Kernel-Code/

I've been skimming some Linux kernel code, and I'm not sure they got it right, but I tend to agree in principle that gotos can make code more readable and maintainable, especially for languages that don't support exceptions.
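
For anyone who hasn't seen the pattern, here is a minimal sketch of the error-cleanup use of goto that the kernel code favors (the function and the 4096-byte read are made up for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: single exit point with cleanup, instead of nested ifs. */
    static int process_file(const char *path)
    {
        int rtn = -1;               /* assume failure until everything succeeds */
        FILE *fp = NULL;
        char *buf = NULL;

        fp = fopen(path, "r");
        if (fp == NULL) goto cleanup;

        buf = malloc(4096);
        if (buf == NULL) goto cleanup;

        if (fread(buf, 1, 4096, fp) == 0) goto cleanup;

        rtn = 0;                    /* success */

    cleanup:                        /* release resources in reverse order */
        free(buf);                  /* free(NULL) is a safe no-op */
        if (fp != NULL) fclose(fp);
        return rtn;
    }

    int main(int argc, char **argv)
    {
        if (argc > 1 && process_file(argv[1]) == 0) printf("ok\n");
        return 0;
    }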

Wednesday, February 11, 2015

How to improve your digital audio?

Have an extra $2194 and would prefer to give it to greedy liars instead of worthy charities?  Then buy yourself one of these 10-foot Ethernet cables.  According to the description:
Even finest high-purity metals have imperfections and grain structure, which due to the direction of casting and drawing become non-symmetrical structures without the conductor. In ways which are not fully understood, both analog and digital audio performance is significantly affected by these non-linearities.
Oh, I fully understand it.  Gullible people can be talked into anything.