Showing posts with label C. Show all posts

Thursday, April 2, 2026

lsim and ldraw

Huh. I've never really talked about lsim here. Strange since I'm fairly pleased with it. I made an allusion to it here, but didn't really talk about it. Hmm ... maybe it's because it's a pretty niche project - of no particular use to anybody except me.

HAH! Like that's ever stopped me.

LSIM

Lsim is a hardware logic simulator. You specify a set of devices, like NAND gates, latches, LEDs, and switches, and specify how they connect. Lsim then simulates the circuit. The non-NAND logic devices are simply composites of NAND gates; my goal is to design a simple CPU using only NANDs.
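
To illustrate the "everything is a composite of NANDs" idea, here is a minimal C sketch. It is not lsim code and does not use lsim's circuit language; the function names are made up for illustration, and signals are modeled as plain ints (0 or 1) with no timing or propagation delay.

```c
/* The one primitive: output is 0 only when both inputs are 1. */
int nand(int a, int b) { return !(a && b); }

/* Everything else is wired up from nand() alone. */
int not_gate(int a)        { return nand(a, a); }
int and_gate(int a, int b) { return not_gate(nand(a, b)); }
int or_gate(int a, int b)  { return nand(not_gate(a), not_gate(b)); }

/* Classic four-NAND XOR. */
int xor_gate(int a, int b)
{
    int n = nand(a, b);
    return nand(nand(a, n), nand(b, n));
}
```

Calling xor_gate(1, 0) gives 1 and xor_gate(1, 1) gives 0, as a truth table would predict.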

It's a tool that only I could love. There's no GUI. No blinking lights. No wave forms. It's pure text, both input and output. It's a pain to describe the circuit with the little language I devised, and it's a bigger pain to interpret the output to see if it does what it is supposed to. I'm so proud.

Claude has helped me with lsim, mostly by reviewing it for me and finding bugs. I think I had it write one or two little utility functions (who remembers how to write vararg code?), but 99% of it is mine. The code reviews saved me much debugging time. Thanks Claude!

Some day I might make a blog post about its internals - it has a few interesting aspects - but let's skip that for now.

Anyway, my biggest problem has been interpreting the output of LSIM. I find I need to look at a properly-drawn circuit diagram so that I can visually trace signals and verify that the printout is doing what I want. But drawing logic circuits is hard, and it's even harder to ensure that the diagram matches the circuit given to lsim.

LDRAW

So I started the ldraw project. This is a GUI drawing tool that lets me draw a circuit diagram using the devices that lsim supports. It can then export an lsim input file containing the lsim commands to define the devices and connect them. Now I can create a circuit and know that the lsim commands match the drawing. Saves time and is much less error-prone.

It is NOT a general-purpose tool with a large library of standard parts. It is intended to be used with lsim, so it only supports the lsim components.

A few quick notes:

  • It's JavaScript and CSS that lives inside a single HTML file and runs in the Chrome browser.
  • It was written by Claude.ai. I used the chat interface (Opus 4.6 in "extended thinking" mode).
  • I "vibe coded" it, a term I don't like, but I don't like "hallucination" either. Coiners of new lingo don't come to me for advice.

Regarding the "vibe coding", I don't know JavaScript, and I've never learned the libraries or environment of a browser. Sure, I could have learned it - what, maybe a week or two? - but I also have no interest in GUI work. I.e. it would be a chore. This is my hobby; I avoid chore work whenever possible. So I have not reviewed Claude's code.

Claude has. After every significant phase of development, I ask Claude to: "Perform a deep review for bugs, paying special attention to state management and potential opportunities to make the code more maintainable." Even though it just finished coding, it always finds a few things. One time it decreased code length by about 400 lines by replacing identical repeated code with a few helper functions.

And, of course, I've tested it. Given the nature of the program, most bugs show up pretty quickly. 

I'll post a few interesting details about the methodology we used in a different post.

WHY NOT CLAUDE CODE?

An obvious question: why use the chat interface and not Claude Code?

I tried CC for a different project. It failed. I had asked it to take my lsim language and convert it to a netlist that lcapy could use. Seemed like an easy enough project. CC cranked for an hour or two (with me having to be there the whole time to tell it to keep going). It kept getting errors from lcapy, and having to reverse-engineer the lcapy code to understand why. At the end it declared success. I fired up lcapy with the resulting .sch file, and it was complete garbage.

Now maybe this was just a fundamentally hard problem, and the web interface could not have done any better. Or maybe I didn't use it well (it was my first time trying it). But I can tell you this: using CC wasn't particularly fun. I enjoy the back-and-forth that the chatbot gives me. At the risk of over-anthropomorphizing a chatbot, it feels collaborative instead of directive.

It even laughs at my jokes. (Sort of...)


Thursday, August 14, 2025

Simple C REPL

I'm a fan of language REPLs (Read, Evaluate, Print, Loop). These are interactive programs that let you experiment with the language. For example, running the Python REPL lets you enter Python code interactively and get immediate output. No edit/compile/run cycle. REPLs are useful for experimenting with language features, exploring APIs, reproducing bugs, etc.

But I'm a C programmer, and C doesn't have a REPL. And sometimes I just want to explore details of the language, like sign extension rules and implicit type conversions. C just isn't well-suited to having a REPL. 

But since when has "not well-suited" ever stopped me?

Introducing crepl.sh : https://github.com/fordsfords/crepl

The doc is fairly comprehensive (Thanks Claude!), so I'll just show an annotated sample session:

$ ./crepl.sh
C REPL - Enter C statements or expressions
Type !help for commands
c> int x = 1;
c> x                         -- Omit semicolon to auto-print the expression.
i 1 (0x00000001)             -- The leading "i" indicates type, not the variable name
c> unsigned short j = 2      -- A declaration is not a legal expression; no autoprint!
Compilation error, line rejected. Enter '!errs' for details.
c> unsigned short j = 2;     -- Semicolon suppresses autoprint
c> j
us 2 (0x0002)                -- Autoprint knows it's an unsigned short.
c> x+j
i 3 (0x00000003)
c> j=j+1                     -- An assignment statement is an expression.
us 3 (0x0003)
c> char c = -1;
c> c
c -1 (0xff)
c> x = c
i -1 (0xffffffff)                     -- Nice sign extension!
c> int inc(int x) { x++; return x; }; -- Define a function all on a single line.
c> inc(88)
i 89 (0x00000059)
c> x
i -1 (0xffffffff)                     -- Naturally "inc()" has its own local x.
c> int inc_x() { ++x; };              -- New funct. Oops, I forgot to return something.
c> int inc_x() { ++x; return x; };    -- Fix the funct? No, you can't re-define it.
Compilation error, line rejected. Enter '!errs' for details.
c> !vi                                -- This edits the code so far. I deleted inc_x().
c> int inc_x() { ++x; return x; };    -- Now I can define it properly.
c> inc_x()
i 0 (0x00000000)                      -- It treated x as a global. But is it?
c> inc_x()
i 1 (0x00000001)
c> x
i 1 (0x00000001)
c> !help
Commands:
  !help  - Show this help.
  !errs  - Show compilation/runtime errors from last attempt. Note that line numbers refer to the 'crepl_temp.c' file.
  !new   - Clear all accumulated code.
  !list  - Show current accumulated code.
  !vi    - Edit accumulated code in vi.
  !source filename - read input from filename.
  !sh    - start an interactive subshell. Exit shell to return to crepl.
  !quit  - Exit the REPL
Autoprint types handled:
  char, unsigned char, short, unsigned short,
  int, unsigned int, long, unsigned long,
  long long, unsigned long long, float, double
c> !list
Current code:

int x = 1;
x;
unsigned short j = 2;
j;
x+j;
j=j+1;
char c = -1;
c;
x = c;
int inc(int x) { x++; return x; };
inc(88);
x;
;
int inc_x() { ++x; return x; };
inc_x();
inc_x();
x;
c> !quit
Goodbye!

So, is the variable 'x' a global? The inc_x() function incremented it, so it must be, right? (Spoiler, it's not. See this doc for explanation.)

Friday, December 13, 2024

Some Useful C Modules

 I'm working on a non-trivial bit of C programming, and I decided to externally modularize three parts of it as potentially reusable components:

  1. err - error-handling module.
  2. hmap - hash map module.
  3. cfg - configuration file loader module.
All three are intended to be simple and small. Note that "err" has no external dependencies, "hmap" leverages "err" but includes a copy of it in its repo, and "cfg" leverages both "err" and "hmap", and includes copies of them too. I know having all those copies seems wasteful, but C doesn't have the same kind of dependency and versioning infrastructure that Java has, and including the files makes each repo stand-alone.

Of the three, "err" will probably be the least likely to be reused by anybody but me, but in some ways is the most helpful, IMO. To quote from its doc:

The C language does not have a well-established methodology for APIs to report errors. Java has exceptions, but C does not. The closest thing that C has to a common methodology is the Unix common practice in which a function returns certain valid values that represent success (the values can vary according to the API function), and a certain invalid value for failure. Callers are expected to check the return value for validity, and refer to "errno" (when available) for information about the error.

In my experience, that kind of error reporting methodology is a recipe for unreliable programs that are hard to debug and fix. Thousands of lines of code written that call APIs without checking the return status, or does check but only prints something unhelpful even to the code maintainers.
See also my earlier post, "Error handling: the enemy of readability?". Note that the "err" system described here is NOT the same as described in that earlier post, but you'll see similarities. This current "err" system evolved the right way, by putting it to use in non-trivial coding efforts. It is battle-tested and has proven its worth ... at least to me.

Finally, full disclosure, Claude.ai helped me in many ways. A little bit with the coding itself, but much more so in many other ways. See "Claude as Coder's Assistant" for a longer description of how I use Claude.

As I alluded, the above three modules are just supporting cast members of a larger effort that I've been working on: lsim - a digital logic simulator. It's a hobby project - no self-respecting hardware engineer would actually use it for real work - but it's been a fun couple of months getting it working! And, as an actual user of the above three modules, it has evolved those modules into something more useful than they were when they started. It's hard to make a good API if you don't eat your own dogfood.

You might notice that lsim is not well-documented (ok, the hardware definition language IS well documented, thanks to Claude.ai). This is because lsim, while the most ambitious of these code bases, is also the least likely to have anything usable or re-usable by anybody but me.

If I'm wrong and you would like to take it for a test drive, let me know and I'll give you a hand.

Tuesday, November 26, 2024

Strdup Considered Harmful?

This should be short. I've been writing some code and decided to see if it was C99 compliant. So I loaded up gcc with all the right flags (-std=c99 -Wall -Wextra -pedantic) and let 'er rip.

Huh? What do you mean strdup() is implicitly declared? I'm including string.h!

Well, fancy that. Learn something new every day. The standard C library has a number of useful functions, like fopen(), strlen(), and ... not strdup(). Note I said "standard" there. The C standard specifies what must be available in the standard C runtime. And the strdup() function is not one of those things.

Sure, lots of runtimes have it - glibc has had it for I-don't-know how long. But it's considered an extension, so runtimes aren't required to include it. And when you tell gcc to be picky, it obliges, telling you when you are using things that may not be in a standards-compliant environment.

Now that is not to say that strdup() isn't in *any* standard. It is in POSIX. So a POSIX-compliant runtime will have it. But you can be C99 compliant but not POSIX compliant.

The latest C standard, C23, does include it. And it hasn't changed, so you don't have to re-write all your code. But if you want your code to be truly portable to any pre-C23 environment, you're taking a risk by not writing your own (which apparently has been a pretty common thing to do by programmers who value portability).
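
For what it's worth, rolling your own is only a few lines. Here is a minimal sketch; the name my_strdup() is made up for this example (it is not from any particular library), and like the real thing it returns NULL if malloc() fails and expects the caller to free() the result.

```c
#include <stdlib.h>
#include <string.h>

/* Portable stand-in for strdup(): uses only functions that C99 guarantees. */
char *my_strdup(const char *s)
{
    size_t len = strlen(s) + 1;   /* +1 for the terminating null */
    char *copy = malloc(len);
    if (copy != NULL) {
        memcpy(copy, s, len);
    }
    return copy;                  /* NULL if allocation failed */
}
```

If you only care about POSIX systems, another option is to keep -std=c99 and put "#define _POSIX_C_SOURCE 200809L" ahead of your includes, which asks the headers to declare strdup().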

(Thanks to chux for some of this info.)

Friday, August 25, 2023

Visual Studio Code

 I've been doing more coding than usual lately. As a vi user, I've been missing higher-level IDE-like functionality, like:

  • Shows input parameters to functions (without having to open the .h file and search).
  • Finds definitions of functions, variables, and macros.
  • Finds references to same.
  • Quickly jumping to locations of compile errors. (Most IDEs do syntax checking as you type.)
  • Source-level debugging.
There are other functions as well, like code-refactoring, static analysis, and "lint" capabilities, but the above are the biggies in my book.

Anyway, I've used Visual Studio, Eclipse, and JetBrains, and found those higher-level functions helpful. But I hate GUI-style text editors.

I've gotten good at using emacs and vi during my many years of editing source files. It takes time to get good at a text editor - training your fingers to perform common functions lightning fast, remembering common command sequences, etc. I finally settled on vi because it is already installed and ready to use on every Unix system on the planet. And my brain is not good at using vi one day and emacs the next. So I picked vi and got good at it. (I also mostly avoid advanced features that aren't available in plain vanilla vi, although I do like a few of the advanced regular expressions that VIM offers.)

So how do I get IDE-like functionality in a vi-like editor?

I looked at Vim and Neovim, both of which claim to have high-quality IDE plugins. And there are lots of dedicated users out there who sing their praises. But I've got a problem with that. I'm looking for a tool, not an ecosystem. If I were a young and hungry pup, I might dive into an ecosystem eagerly and spend months customizing it exactly to my liking. Now I'm a tired old coder who just wants an IDE. I don't want to spend a month just getting the right collection of plugins that work well together.

(BTW, the same thing is true for Emacs. A few years ago, I got into Clojure and temporarily switched back to Emacs. But again, getting the right collection of plugins that work well together was frustratingly elusive. I eventually gave up and switched back to vi.)

Anyway, as a tired old coder, I was about to give up on getting IDE functionality into a vi-like editor, but decided to flip the question around. What about getting vi-like editing into an IDE?

Turns out I'm not the first one to have that idea. Apparently most of the IDEs have vi editing plugins nowadays. This was NOT the case several years ago when I last used an IDE. I used a vi plugin for Eclipse which ... kind of worked, but had enough problems that it wasn't worth using.

That still leaves the question: which IDE to use. Each one has its fan base and I'm sure each one has some feature that it does MUCH better than the others. Since programming is not my primary job, I certainly won't become a power user. Basically, I suspect it hardly matters which one I pick.

I decided to start with Visual Studio Code for a completely silly reason: it has an easy integration with GitHub Copilot. I say it's silly because I don't plan to use Copilot any time soon! For one thing, I don't code enough to justify the $10/month. And for another, coding is my hobby. The small research I've done into Copilot suggests that to get the most out of it, you shift your activities towards less coding and more editing and reviewing. While that might be a good thing for a software company, it's not what I'm looking for in a hobby. But that's a different topic for a different post.

Anyway, I've only been using Visual Studio Code for about 30 minutes, and I'm already reasonably pleased with the vi plugin (but time will tell). And I was especially pleased that it has a special integration with Windows WSL (I'm not sure other IDEs have that). I was able to get one of my C programs compiled and tested. I even inserted a bug and tried debugging, which was mildly successful.


Monday, May 8, 2023

More C learning: Variadic Functions

This happens to me more often than I like to admit: there's a bit of programming magic that I don't understand, and almost never need to use, so I refuse to learn the method behind the magic. And on the rare occasions that I do need to use it, I copy-and-tweak some existing code. I know I'm not alone in this tendency.

The advantage is that I save a little time by not learning the method behind the magic.

The disadvantages are legion. Copy-and-tweak without understanding leads to bugs, some obvious, others not so much. Even the obvious bugs can take more time to track down and fix than it would have taken to just learn the magic in the first place.

Such was the case over the weekend when I wanted to write a printf-like function with added value (prepend a timestamp to the output). I knew that variadic functions existed, complete with the "..." in the formal parameter list and the "va_list", "va_start", etc. But I never learned it well enough to understand what is going on with them. So when I wanted variadic function A to call variadic function B which then calls vprintf, I could not get it working right.

Ugh. Guess I have to learn something.

And guess what. It took almost no time to understand, especially with the help of the comp.lang.c FAQ site. Specifically, Question 15.12: "How can I write a function which takes a variable number of arguments and passes them to some other function (which takes a variable number of arguments)?" Spoiler: you can't. Which makes sense when you think about how parameters are passed to a function. The longer answer: there's a reason for the "leading-v" versions of the printf family of functions. And the magic is not as magical as I imagined. All I needed to do is create my own non-variadic "leading-v" version of my function B which my variadic function A could call, passing in a va_list. See cprt_ts_printf().
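
Here is a minimal sketch of that pattern. It is NOT the actual cprt_ts_printf() code; the names ts_vsnprintf() and ts_snprintf() are made up for this example, and I format into a caller-supplied buffer (with a crude seconds-since-epoch prefix) just to keep the sketch small.

```c
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

/* The "leading-v" worker (function B): takes a va_list, not "...".
 * Writes "[seconds] " followed by the formatted text. */
int ts_vsnprintf(char *buf, size_t size, const char *fmt, va_list ap)
{
    int n = snprintf(buf, size, "[%ld] ", (long)time(NULL));
    if (n < 0 || (size_t)n >= size) {
        return n;   /* output error, or no room left after the prefix */
    }
    int m = vsnprintf(buf + n, size - (size_t)n, fmt, ap);
    return (m < 0) ? m : n + m;
}

/* The variadic wrapper (function A): packages its "..." into a va_list
 * and hands it to the leading-v worker. */
int ts_snprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int n = ts_vsnprintf(buf, size, fmt, ap);
    va_end(ap);
    return n;
}
```

Any other variadic function that wants the same behavior can do its own va_start() and call ts_vsnprintf() directly, which is exactly why the standard library ships vprintf() and friends.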

This post is only partly about variadic functions; it's also about the reluctance to learn something new. Why would an engineer do that? I could explain it in terms of schedule pressure and the urge to make visible progress ("stop thinking and start typing!"), but I think there's something deeper going on. Laziness? Fear of the unknown? I don't know, but I wish I didn't suffer from it.

By the way, that comp.lang.c FAQ has a ton of good content. Good thing to browse if you're still writing in C.

Tuesday, December 27, 2022

Tgen: Traffic Generator Scripting Language

Oh no, not another little language! (Perhaps better known as a "domain-specific language".)

Yep. I wrote a little scripting language module intended to assist in the design and implementation of a useful network traffic generator. It's called "tgen" and can be found at https://github.com/fordsfords/tgen. I won't write much about it here other than to mention that it implements a simple interpreter. In fact, the simplicity of the parser might be the thing I'm most pleased with, in terms of bang for buck. See the repo for details.

Little Languages Considered Harmful?

So yeah, I'm reasonably pleased with it. But I am also torn. Because "little languages" have both a good rap (Jon Bentley, 1986) and a bad rap (Olin Shivers, 1996).

In that second paper, Shivers complains that little languages:

  • Are usually ugly, idiosyncratic, and limited in expressiveness.
  • Basic linguistic elements such as loops, conditionals, variables, and subroutines must be reinvented and re-implemented. It is not an approach that is likely to produce a high-quality language design.
  • The designer is more interested in the task-specific aspects of his design, to the detriment of the language itself. For example, the little language often has a half-baked variable scoping discipline, weak procedural facilities, and a limited set of data types.
  • In practice, it often leads to fragile programs that rely on heuristic, error-prone parsers.
Of course, Shivers doesn't *really* think that little languages are a bad idea. He just thinks that they are usually implemented poorly, and his paper shows the right way to do it (in Scheme, a Lisp variant).

But there are some good arguments against developing little languages at all, even if implemented well. At my first job out of college, I wrote a little language to help in the implementation of a menu system. The menus were tedious and error-prone to write, and the little language improved my productivity. I was proud of it. An older and wiser colleague gently told me that there are some fundamental problems with the idea. His reasoning was as follows:

  • We already have a programming language that everybody knows and is rich and well-tested.
  • You've just invented a new programming language. It probably has bugs in the parser and interpreter that you'll have to find and fix. Maybe the time you spend doing that is paid for by the increased productivity in adding menus. Maybe not.
  • The new language is known by exactly one person on the planet. Someday you'll be on a different project or a different company, and we can't hire somebody who already knows it. There's an automatic learning curve.
  • Instead of writing an interpreter for a little language, you could have simply designed a good API for the functional elements of your language, and then used the base language to call those functions. Then you have all the base language features at your disposal, while still having a high level of abstraction to deal with the menus.

He was a nice guy, so while his criticism stung, he didn't make it personal, and he was tactful. I was able to see it as a learning experience. And ever since then, I've been skeptical of most little languages.

Then Why Did I Make a Little Language?

Well, first off, I *DID* create an API. So instead of writing scripts in the scripting language, you *can* write them in C/C++. I expect this to be interesting to QA engineers wanting to create automated tests that might need sophisticated usage patterns (like waiting to receive a message before sending a burst of outgoing traffic). I would not want to expand my scripting language enough to support that kind of script. So being able to write those tests in C gives me all the power of C while still giving me the high level of abstraction for sending traffic.

But also, a network traffic generator is a useful thing to be able to run interactively for ad-hoc testing or exploration. It would be annoying to have to recompile the whole tool from source each time you want to change the number or sizes of messages.

Of course, most traffic generation tools take care of that by letting you specify the number and sizes of the messages via command-line options or GUI dialogs. But most of them don't let you have a message rate that changes over time. My colleagues and I deal with bursty data. To properly test the networking software, you should be able to create bursty traffic. Send at this rate for X milliseconds, then at a much higher rate for Y milliseconds, etc. The "tgen" module lets you "shape" your data rate in non-trivial ways.

Did I get it right? My looping construct is ugly and could only be loved by an assembly language programmer. Maybe I should have made it better? Or just omitted it? Dunno. I'm open to discussion.

Anyway, I'm hoping that others will be able to take the tgen module and do something useful with it.

Thursday, August 5, 2021

Timing Short Durations

 I don't have time for a long post (HA!), but I wanted to add a pointer to https://github.com/fordsfords/nstm ("nstm" = "Nano Second Timer"). It's a small repo that provides a nanosecond-precision time stamp portably between MacOS, Linux, and Windows.

Note that I said precision, not resolution. I don't know of an API on Windows that gives nanosecond resolution. The one Microsoft says you should use (QueryPerformanceCounter()) always returns "00" as the last two decimal digits. I.e. it is 100 nanosecond resolution. They warn against using "rdtsc" directly, although I wonder if most of their arguments are no longer applicable. I would love to hear if anybody knows of a Windows method of getting nanosecond resolution timestamps that is reliable and efficient.

One way to measure a short duration "thing" is to time doing the "thing" a million times (or whatever) and take an average. One advantage of this approach is that taking a timestamp itself takes time; i.e. making the measurement changes the thing you are measuring. So amortizing that cost over many iterations minimizes its influence.

But sometimes, you just need to directly measure short things. Like if you are histogramming them to get the distribution of variations (jitter).

I put some results here: https://github.com/fordsfords/fordsfords.github.io/wiki/Timing-software


Wednesday, October 14, 2020

Strace Buffer Display

 The "strace" tool is powerful and very useful. Recently a user of our software sent us an strace output that included a packet send. Here's an excerpt:

sendmsg(88, {msg_name(16)={sa_family=AF_INET, sin_port=htons(14400), sin_addr=inet_addr("239.84.0.100")}, msg_iov(1)=[{"\2\0a\251C\27c;\0\0\2\322\0\0/\263\0\0\0\0\200\3\0\0", 24}], msg_controllen=0, msg_flags=0}, 0) = 24 <0.000076>

Obviously there are some binary bytes being displayed. I see a "\0a", so it's probably hex. But wait, there's also a \251. Does that mean 0x25 followed by ascii '1'? I decoded it assuming hex, and the packet wasn't valid.

So I did a bit of Googling. Unfortunately, I didn't note where I saw it, but somebody somewhere said that it follows the C string conventions. And C strings come from long ago, when phones had wires connecting them to wall jacks, stack overflow was a bug in a recursive program, and octal ruled the waves when it came to specifying binary data.

So \0a is 0x00 followed by ascii 'a' and \251 is 0xa9. Now the packet parses out correctly. (It's a "Session Message" if you're curious.)
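
If you don't want to decode those by hand, here is a minimal sketch of a decoder. The function name decode_c_escapes() is made up for this example. It assumes the usual C rules (a backslash followed by one to three octal digits, plus a few single-letter escapes) and does not handle hex escapes, which I believe strace only emits if you ask for them.

```c
#include <stddef.h>

/* Decode a C-style escaped string into raw bytes.
 * Returns the number of bytes written to out (never more than out_size). */
size_t decode_c_escapes(const char *in, unsigned char *out, size_t out_size)
{
    size_t n = 0;
    while (*in != '\0' && n < out_size) {
        if (*in != '\\') {               /* ordinary character */
            out[n++] = (unsigned char)*in++;
            continue;
        }
        in++;                            /* skip the backslash */
        if (*in >= '0' && *in <= '7') {  /* octal: 1 to 3 digits */
            unsigned int v = 0;
            int digits = 0;
            while (digits < 3 && *in >= '0' && *in <= '7') {
                v = v * 8 + (unsigned int)(*in - '0');
                in++;
                digits++;
            }
            out[n++] = (unsigned char)v;
        } else {
            switch (*in) {
            case 'n': out[n++] = '\n'; break;
            case 't': out[n++] = '\t'; break;
            case 'r': out[n++] = '\r'; break;
            case 'v': out[n++] = '\v'; break;
            case 'f': out[n++] = '\f'; break;
            case '\0': return n;         /* dangling backslash at end */
            default: out[n++] = (unsigned char)*in; break;  /* \\ and \" */
            }
            in++;
        }
    }
    return n;
}
```

Feeding it the first part of the excerpt above, \2\0a\251C\27c; (with the backslashes doubled if you paste it into C source), yields the eight bytes 02 00 61 a9 43 17 63 3b.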

So, I guess I'm a little annoyed by that choice as I would prefer hex, but I guess there's nothing all that wrong with it. Hex or octal: either way I need a calculator to convert to decimal. (And yes, I'm sure some Real Programmers can convert in their heads.)

Thursday, January 24, 2019

Volatile considered harmful

I happened on this today.  The article is narrowly-focused on Linux kernel work, but in my mind it helps to clarify a lot of "volatile" debate I've seen over the years.

https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html


I will note that when Corbet (the author) says, "the 'volatile' type class should not be used", what he really means is that you should not declare variables with volatile (or rather, almost never).  Corbet says, "the kernel primitives which make concurrent access to data safe ... If they are used properly, there will be no need to use volatile as well."  Some of those kernel primitives use volatile, but not in variable declarations.  Instead they use volatile in carefully-selected casts.

For example, as described in another Corbet article, he talks about another kernel primitive, "ACCESS_ONCE()".  It is defined as:

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

The variable being accessed is temporarily cast to volatile to allow code to be written that violates threading assumptions made by the compiler's optimizer.  Like here, for example.  :-)

I only bring this up to point out that Corbet is not arguing that volatile should never be used at all.  Rather he is arguing that programmers (specifically kernel programmers) should not declare variables to be volatile.  Generally, programmers should use threading primitives to ensure correct code, and if the code's requirements prevent the use of the usual threading primitives, then lower-level primitives (like "ACCESS_ONCE()") should be used to precisely target volatile's use.
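
To make the "cast, don't declare" idea concrete, here is a small user-space sketch. It is not kernel code: the flag and function names are made up, and I spell the macro with __typeof__, which assumes GCC or Clang. Note the volatile cast only forces the compiler to really perform each access; it provides no atomicity or memory ordering, so real multi-threaded code would still want proper threading primitives (locks or C11 atomics).

```c
/* Same shape as the kernel macro quoted above, using the GCC/Clang
 * __typeof__ spelling so it also compiles in strict -std=c99 mode. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int stop_requested;   /* deliberately NOT declared volatile */

void request_stop(void)
{
    ACCESS_ONCE(stop_requested) = 1;
}

/* Spin until the flag is seen or max_spins is reached.
 * Returns the number of spins performed. */
long spin_until_stopped(long max_spins)
{
    long spins = 0;
    /* The cast makes the compiler re-read the flag on every pass instead
     * of hoisting the load out of the loop. */
    while (spins < max_spins && !ACCESS_ONCE(stop_requested)) {
        spins++;
    }
    return spins;
}
```

Everywhere else in the program, stop_requested is an ordinary int that the optimizer is free to treat normally; only the accesses that need the special treatment pay for it.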

Tuesday, September 11, 2018

Safe sscanf() Usage

The scanf() family of functions (man page) is used to parse ascii input into fields and convert those fields into desired data types.  I have very rarely used them, due to the nature of the kind of software I write (system-level, not application), and I have always viewed them with suspicion.

Actually, scanf() is not safe.  Neither is fscanf().  But this is only because when an input conversion fails, it is not well-defined where the file pointer is left.  Fortunately, this is pretty easy to deal with - just use fgets() to read a line and then use sscanf().

You can use sscanf() safely, but you need to follow some rules.  For one thing, you should include field widths for pretty much all of the fields, not just strings.  More on the rules later.


Input Validity Checking with sscanf()

I am a great believer in doing a good job of checking validity of inputs (see strtol preferred over atoi).  It's not enough to process your input "safely" (i.e. preventing buffer overflows, etc.), you should also detect data errors and report them.  For example, if I want to input the value "21" and I accidentally type "2`", I want the program to complain and re-prompt, not simply take the value "2" and ignore the "`".

One problem with doing a good job of input validation is that it often leads to lots of checking code.  In strtol preferred over atoi, my preferred code consumes 4 lines, and even that is cheating (combining two statements on one line, putting close brace at end of line).  And that's for a single integer!  What if you have a line with multiple pieces of information on it?

Let's say I have input lines with 3 pipe-separated fields:

  id_num|name|age

where "id_num" and "age" are 32-bit unsigned integers and "name" is alphabetic with a few non-alphas.  We'll further assume that a name must not contain a pipe character.

Bottom line: here is my preferred code:

#define NAME_MAX_LEN 60
. . .
unsigned int id;
char name[NAME_MAX_LEN+1];  /* allow for null */
unsigned int age;
int null_ofs = 0;

(void)sscanf(iline,
    "%9u|"  /* id */
    "%" STR(NAME_MAX_LEN) "[-A-Za-z,. ]|"  /* name */
    "%3u\n"  /* age */
    "%n",  /* null_ofs */
    &id, name, &age, &null_ofs);

if (null_ofs == 0 || iline[null_ofs] != '\0') {
    printf("PARSE ERROR!\n");
}


Error Checking

My long-time readers (all one of you!) will take issue with me throwing away the sscanf() return value.  However, it turns out that sscanf()'s return value really isn't as useful as it should be.  It tells you how many successful conversions were done, but doesn't tell you if there was extra garbage at the end of the line.  A better way to detect success is to add the "%n" at the end to get the string offset where conversion stopped, and make sure it is at the end of the string.

If any of the conversions fail, the "%n" will not be assigned, leaving it at 0, which is interpreted as an error.  Or, if all conversions succeed, but there is garbage at the end, "iline[null_ofs]" will not be pointing at the string's null, which is also interpreted as an error.


STR Macro

More interesting is the "STR(NAME_MAX_LEN)" construct.  This is a specialized macro I learned about on StackOverflow:

#define STR2(_s) #_s
#define STR(_s) STR2(_s)

Yes, both macros are needed.  So the construct:
  "%" STR(NAME_MAX_LEN) "[-A-Za-z,. ]"
gets compiled to the string:
  "%60[-A-Za-z,. ]"

So why use that macro?  If I had hard-coded the format string as "%60[-A-Za-z,. ]", it would risk getting out of sync with the actual size of "name" if we find somebody with a 65 character name.  (I would like it even more if I could use "sizeof(name)", but macro expansion and "sizeof()" are handled by different compiler phases.)
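
Putting the pieces together, here is the same technique as a self-contained function. The struct and the name parse_record() are made up for this sketch; the format string, the "%n" check, and the STR macro are the ones described above.

```c
#include <stdio.h>

#define NAME_MAX_LEN 60
#define STR2(_s) #_s
#define STR(_s) STR2(_s)

struct record {
    unsigned int id;
    char name[NAME_MAX_LEN + 1];  /* allow for null */
    unsigned int age;
};

/* Parse "id_num|name|age" from iline. Returns 0 on success, -1 on any
 * conversion failure or trailing garbage. */
int parse_record(const char *iline, struct record *rec)
{
    int null_ofs = 0;

    (void)sscanf(iline,
        "%9u|"  /* id */
        "%" STR(NAME_MAX_LEN) "[-A-Za-z,. ]|"  /* name */
        "%3u\n"  /* age */
        "%n",  /* null_ofs */
        &rec->id, rec->name, &rec->age, &null_ofs);

    if (null_ofs == 0 || iline[null_ofs] != '\0') {
        return -1;  /* a conversion failed, or extra characters follow */
    }
    return 0;
}
```

A line like "123|Smith, John|42\n" parses cleanly, while "12`|Bob|3\n" and "1|Bob|42x\n" are both rejected.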


Newline Specifier

This one is a little subtle.  The third conversion specifier is "%3u\n".  This tells sscanf() to match a newline at the end of the line.  But guess what!  If I don't include a newline on the string, the sscanf() call still succeeds, including setting null_ofs to the input string's null.  In my opinion, this is a slight imperfection in sscanf(): I said I needed a newline, but sscanf() accepts my input without the newline.  It turns out sscanf() treats a newline in the format as generic whitespace, which matches any amount of whitespace in the input, including none at all.

If I *really* want to require the newline, I can do:

/* Only valid after the null_ofs check above has passed (null_ofs > 0). */
if (iline[null_ofs - 1] != '\n') { printf("PARSE ERROR!\n"); }


Integer Field Widths

Finally, I add field widths to the unsigned integer conversions.  Why?  If you leave it out, then it will do a fine job of converting numbers from 0..4294967295.  But try the next number, 4294967296, and what will you get?  Arithmetic overflow, with a result of 0 (although the C standards do not guarantee that value).  And sscanf() will not return any error.  So I restrict the number of digits to prevent overflow.  The disadvantage is that with "%9u" you cannot input the full range of an unsigned integer.  An alternative is to use "%10llu" and supply an unsigned long long variable.  Then range check it in your code.  Or you could just use "%8x" and have the user input hexadecimal.  Heck, you could even input it as a 10-character string and use strtol()!

I guess "%9u" was good enough for my needs.  I just wish you could just tell sscanf() to stop converting on arithmetic overflow!


Not Perfect

I already mentioned arithmetic overflow. Any other problems with the above code?  Well, yeah, I guess.  Even though I specify "id" and "age" to be unsigned integers, I notice that sscanf() will allow negative numbers to be inputted.  Sometimes we engineers consider this a feature, not a bug; we like to use 0xFFFFFFFF as a flag value, and it's easier to type "-1".  But I admit it still bothers me.  If I input an age of -1, we end up with the 4-billion-year-old man (apologies to Mel Brooks).
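If the overflow and the sneaky minus sign both bother you, one option is to read the field as a string and convert it yourself.  This is only a sketch; parse_u32() is a hypothetical helper, and I'm using strtoull() (rather than the strtol() mentioned above) so that the full 32-bit range fits comfortably before the range check.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: convert a decimal string to a 32-bit unsigned value.
 * Returns 1 on success, 0 on a sign, garbage, empty string, or overflow. */
static int parse_u32(const char *s, uint32_t *out)
{
    char *end;
    unsigned long long v;

    /* Insist on a leading digit: rejects "-1", "+1", leading spaces, "". */
    if (!isdigit((unsigned char)s[0])) return 0;

    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno == ERANGE) return 0;   /* too big even for unsigned long long */
    if (*end != '\0') return 0;      /* trailing garbage */
    if (v > UINT32_MAX) return 0;    /* too big for 32 bits */

    *out = (uint32_t)v;
    return 1;
}
```

With this, "4294967295" converts, while "4294967296" and "-1" are both rejected instead of quietly turning into something else.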


Tuesday, September 4, 2018

Safe C?

A Slashdot post led me to some good pages related to C safety.
(Update: actually I did go ahead and order the tee shirt.)


Monday, August 27, 2018

Safer Malloc for C

Here's a fragment of code that I recently wrote.  See anything wrong with it?

#define VOL(_var, _type) (*(volatile _type *)&(_var))
. . .
lbmbt_rcv_binding_t *binding = NULL;
. . .
binding = (lbmbt_rcv_binding_t *)malloc(sizeof(lbmbt_rcv_t));
memset(binding, 0xa5, sizeof(lbmbt_rcv_t));
binding->signature = SIGNATURE_OK;
. . .
VOL(binding->signature, int) = SIGNATURE_FREE;
free(binding);

No, the problem isn't in that strange VOL() macro; that is explained here.

The problem is that I am trying to allocate an "lbmbt_rcv_binding_t" structure, but because of cut-and-paste programming and being in a hurry, I used the size of a different (and smaller) structure: lbmbt_rcv_t.  Since the "signature" field is the last field in the structure, the assignments to it write past the end of the allocated memory block.  GAH!

But we all knew that C is a dangerous language, with its casting and sizeof just begging to be used wrong.  But could it be at least a *little* safer?


A Safer Malloc

#define VOL(_var, _type) (*(volatile _type *)&(_var))

/* Malloc "n" copies of a type. */
#define SAFE_MALLOC(_var, _type, _n) do {\
  _var = (_type *)malloc((_n) * sizeof(_type));\
  if (_var == NULL) abort();\
} while (0)
. . .
lbmbt_rcv_binding_t *binding = NULL;
. . .
SAFE_MALLOC(binding, lbmbt_rcv_t, 1);
memset(binding, 0xa5, sizeof(lbmbt_rcv_t));
binding->signature = SIGNATURE_OK;
. . .
VOL(binding->signature, int) = SIGNATURE_FREE;
free(binding);

The above code still has the bug -- it passed in the wrong type -- but at least it generates a compile warning.  The cast of malloc doesn't match the type of the variable.  Since I insist on getting rid of compiler warnings, that would have flagged the bug to me.


If you don't need portability, you can even use the gcc-specific extension "typeof()" in BOTH macros:

#define VOL(_var) (*(volatile typeof(_var) *)&(_var))

/* Malloc "n" copies of a type. */
#define SAFE_MALLOC(_var, _n) do {\
  _var = (typeof(*_var) *)malloc((_n) * sizeof(typeof(*_var)));\
  if (_var == NULL) abort();\
} while (0)
. . .
lbmbt_rcv_binding_t *binding = NULL;
. . .
SAFE_MALLOC(binding, 1);
memset(binding, 0xa5, sizeof(lbmbt_rcv_t));
binding->signature = SIGNATURE_OK;
. . .
VOL(binding->signature) = SIGNATURE_FREE;
free(binding);

Now the malloc bug is gone ... by design!
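If you *do* need portability, a similar effect seems achievable in standard C by taking the size from the variable itself with "sizeof(*var)".  This is a sketch, not what I actually used; SAFE_MALLOC_STD, demo_binding_t, and demo_binding_create() are made-up names for illustration.

```c
#include <stdlib.h>

/* Malloc "n" copies of whatever "_var" points at.  No typeof needed:
 * sizeof(*(_var)) is computed from the pointer's own type, and in C the
 * void pointer returned by malloc converts without a cast. */
#define SAFE_MALLOC_STD(_var, _n) do {\
  (_var) = malloc((_n) * sizeof(*(_var)));\
  if ((_var) == NULL) abort();\
} while (0)

/* Hypothetical stand-in for the real binding structure. */
typedef struct {
    char payload[100];
    int signature;  /* last field, like in the buggy example */
} demo_binding_t;

static demo_binding_t *demo_binding_create(void)
{
    demo_binding_t *binding = NULL;
    SAFE_MALLOC_STD(binding, 1);
    binding->signature = 1;  /* safely inside the allocated block */
    return binding;
}
```

Since the type name never appears in the call, there is no wrong type to paste in.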


Safer Malloc and Memset

Finally, notice that the memset is also wrong.  Since I frequently like to init my malloced segments to a known pattern (0xa5 is my habit), I can define a second macro:

#define VOL(_var) (*(volatile typeof(_var) *)&(_var))

/* Malloc "n" copies of a type and set its contents to a pattern. */
#define SAFE_MALLOC_SET(_var, _pat, _n) do {\
  _var = (typeof(*_var) *)malloc((_n) * sizeof(typeof(*_var)));\
  if (_var == NULL) abort();\
  memset(_var, _pat, (_n) * sizeof(typeof(*_var)));\
} while (0)
. . .
lbmbt_rcv_binding_t *binding = NULL;
. . .
SAFE_MALLOC_SET(binding, 0xa5, 1);
binding->signature = SIGNATURE_OK;
. . .
VOL(binding->signature) = SIGNATURE_FREE;
free(binding);

No more bugs.


P.S.

It took me WAY longer than it should have to track this down (an entire weekend).  Why?  The bug manifested as a segmentation fault.  So I figured I could just valgrind it.  But lo and behold, I didn't have Valgrind on my Mac.  "No Problem," sez I.  I sez, "I'll just install it."

$ brew install valgrind
valgrind: This formula either does not compile or function as expected on macOS
versions newer than Sierra due to an upstream incompatibility.

GAH!  "Still no problem," sez I.  I sez, "I'll just debug it on Linux."

No seg fault.  Program worked fine.  Valgrind makes no complaint.  So I spent the weekend doing a divide-and-conquer trackdown of the bug in a large codebase to find the "Mac-specific" bug.  And finally found the bad malloc.

But wait!  That's not Mac-specific!  What's going on here?  After wasting some more time, I finally printed the sizeof() for each structure type.  On Mac, the sizes are different.  On Linux, the structures are the same size.  Of course valgrind on Linux said it was OK -- the malloc size was correct on Linux!

And now I have Valgrind on my Mac, which pinpointed the bug immediately.  (Thanks Güngör Budak!)

Monday, May 22, 2017

Some multicast programming tips

Never too old to learn.  :-)

There are lots of multicast example programs out there, so I won't try to compete with them.  But I did run across several things that weren't explained very well.


Single Socket, Multiple Groups

Yes, you can create a single socket and have it receive datagrams from multiple multicast groups.  Just include multiple calls to:
  setsockopt(recv_sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, ...


Multiple Sockets, One Group per Socket

This is another common use case, where you create multiple sockets for receiving, with each socket joined to a different multicast group.


Binding the Receive Socket

Since a socket needs to be bound to a port to receive any kind of UDP datagram, multicast or unicast, you need to include a call to bind().  You pass in a sockaddr_in with the sin_port set as desired (remember to pass it in network order).  But what about the sin_addr?  What do you set that to?

Many people set it to INADDR_ANY, which is what I did in a recent program.  But in the multiple sockets, different group per socket case, it had an unexpected side effect.  All of my sockets were bound to the same destination port, but joined to different multicast groups.  With sin_addr set to INADDR_ANY, the kernel took each received datagram, replicated it, and delivered a copy to *every* socket, even if the datagram's destination group is different from the one joined to the socket! I.e. simply doing the IP_ADD_MEMBERSHIP on a socket didn't filter datagrams based on the desired group.  When a multicast datagram was received, the kernel just used the destination port and delivered a copy to every UDP socket bound to that port and INADDR_ANY.

I had to do some extra searching to find out that you can set the bind's sin_addr to the multicast group.  I have some reason to suspect that this is not portable across all operating systems, but at least it works on Linux.  Now I can have 10 sockets, each bound to the same port (don't forget SO_REUSEADDR) but different multicast groups.  When a multicast datagram is received, it is delivered *only* to the socket which is bound to the right port/multicast group pair.


Single Socket, Multiple Groups, reprise

So, what about the case where you have a single socket joined to multiple groups?  In that case, you *do* want to use INADDR_ANY in the bind.


Mix and Match?

I guess this poses a restriction.  You can't have, say, 2 sockets that you distribute 4 multicast groups across, with two groups each.  Why would you want to do that?  Maybe to load-balance across threads.  But assuming they all want to bind to the same port, you can't do it.  Setting the sin_addr to INADDR_ANY prevents filtering, and will mean that both sockets will receive a copy of every datagram sent. But you can't set sin_addr to multiple multicast groups.

So if you want to have multiple sockets, multiple groups, and the same destination port, you need to have one group per socket, and bind that socket to the group.

Sunday, June 26, 2016

snprintf: bug detector or bug preventer?

Pop quiz time!

When you use snprintf() instead of sprintf(), are you:
   A. Writing code that proactively detects bugs.
   B. Writing code that proactively prevents bugs.

Did you answer "B"?  TRICK QUESTION!  The correct answer is:
  C. Writing code that proactively hides bugs.

Here's a short program that takes a directory name as an argument and prints the first line of the file "tst.c" in that directory:
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
  char path[20];
  char iline[4];
  snprintf(path, sizeof(path), "%s/tst.c", argv[1]);
  FILE *fp = fopen(path, "r");
  fgets(iline, sizeof(iline), fp);
  fclose(fp);
  printf("iline='%s'\n", iline);
  return 0;
}
Nice and safe, right?  Both snprintf() and fgets() do a great job of not overflowing their buffers.  Let's run it:

$ ./tst .
iline='#inc'

Hmm ... didn't get the full input line.  I guess my iline array was too small.  But hey, at least it didn't seg fault, like it might have if I had just used scanf() or something dangerous like that!  No seg faults for me.

$ ./tst ././././././././.
Segmentation fault: 11

Um ... oh, silly me.  My path array was too small.  fopen() failed, and I didn't check its return status.

So I could, and should, check fopen()'s return status.  But that just gives me a more user-friendly error message.  It doesn't tell me *why* the file name is wrong.  Imagine the snprintf() being in a completely different area of the code.  Yes, you discover there's a bug by checking fopen(), but it's nowhere near where the bug actually is.  Same thing, by the way, with the fgets() not reading the entire line.  Who knows how much more code is going to be executed before the program misbehaves because it didn't get the entire line?

And that is my point.  Most of these "safe" functions work the same way: you pass in the size of your buffer, and the functions guarantee that they won't overrun your buffer, but they do nothing to make you *notice* that they truncated.  (To be fair, snprintf() does return the length it wanted to write, but as with most return values, hardly anybody checks it.)  I.e. they don't loudly tell you when your buffer is too small.  It's not until later that something visibly misbehaves, and that wastes time and effort working your way back to the root cause.

Now I'm not suggesting that we throw away snprintf() in favor of sprintf().  I'm suggesting that using snprintf() is only half the job.  How about this:

#include <stdio.h>
#include <string.h>
#include <assert.h>
#define BUF2SMALL(_s) do {\
  assert(strnlen(_s, sizeof(_s)) < sizeof(_s)-1);\
} while (0)

int main(int argc, char **argv)
{
  char path[21];
  char iline[5];
  snprintf(path, sizeof(path), "%s/tst.c", argv[1]); BUF2SMALL(path);
  FILE *fp = fopen(path, "r");  assert(fp != NULL);
  fgets(iline, sizeof(iline), fp); BUF2SMALL(iline);
  fclose(fp);
  printf("iline='%s'\n", iline);
  return 0;
}

Now let's run it:

$ ./tst ./.
Assertion failed: (strnlen(iline, sizeof(iline)) < sizeof(iline)-1), function main, file tst.c, line 15.
Abort trap: 6
$ ./tst ././././././././.
Assertion failed: (strnlen(path, sizeof(path)) < sizeof(path)-1), function main, file tst.c, line 13.
Abort trap: 6

There.  My bugs are reported *much* closer to where they really are.

The essence of the BUF2SMALL() macro is that you should use a buffer which is at least one character larger than the maximum size you think you need.  So if you want an answer string to be able to hold either "yes" or "no", don't make it "char ans[4]", make it at least "char ans[5]".  BUF2SMALL() asserts an error if the string consumes the whole array.

One final warning.  Note that in BUF2SMALL() I use "strnlen()" instead of "strlen()".   I wrote BUF2SMALL() to be a general-purpose error checker after a variety of "safe" functions.  For example, maybe I want to use it after a "strncpy()".  Look at what the man page for "strncpy()" says:
Warning:  If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.
If you use "strncpy()" to copy a string, the string might not be null-terminated, and  strlen() has a good chance of segfaulting.  So I used strnlen(), which is only "safe" in that it won't segfault.  But it doesn't tell me that the string isn't null-terminated!  So I still need my macro to tell me that the buffer is too small.  The "safe" functions only make the fuse a little longer on the stick of dynamite in your program.
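For snprintf() specifically, there is another way to catch truncation right where it happens: its return value is the number of characters it *wanted* to write (not counting the null), so a value at or past the buffer size means the output was cut short.  A minimal sketch, with build_path() as a made-up helper name:

```c
#include <stdio.h>

/* Hypothetical helper: build "<dir>/tst.c" into path.
 * Returns 1 if it fit, 0 if snprintf() failed or had to truncate. */
static int build_path(char *path, size_t path_size, const char *dir)
{
    int n = snprintf(path, path_size, "%s/tst.c", dir);

    /* n < 0: output error.  n >= path_size: the result was truncated. */
    return (n >= 0 && (size_t)n < path_size);
}
```

Unlike BUF2SMALL(), this doesn't need the extra spare byte in the buffer, but it only helps with functions that report the would-be length.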

Saturday, June 25, 2016

Of compiler warnings and asserts in a throw-away society

Many people despair at today's "throw away" society.  If you don't want it, just throw it away.

Programmers know this is not a recent phenomenon; they've been throwing stuff away since the dawn of high-level languages.

Actual line from code I'm doing some work on:
    write(fd, str_gpio, len);

The "write" function returns a value, which the programmer threw away.  And I know why without even asking him.  If you were to challenge him, he would probably say, "I don't need the return value, and as for prudent error checking, this program has been running without a glitch for years."

Ugh.  It's never a *good* idea to throw away return values, but I've been known to do it.  But I really REALLY don't like compiler warnings:
warning: ignoring return value of 'write', declared with attribute warn_unused_result [-Wunused-result]
     write(fd, str_gpio, len);
     ^

Well, I didn't feel like analyzing the code to see how errors *should* be handled, so I just cast "write" to void to get rid of the compile warning:
    (void)write(fd, str_gpio, len);

Hmm ... still same warning.  Apparently over 10 years ago, glibc decided to make a whole lot of functions have an attribute that makes them throw that warning if the return value is ignored, and GCC decided that functions with that attribute will throw the warning *even if cast to void*.  If you like reading flame wars, the Interwebs are chock full of arguments over this.

And you know what?  Even though I'm not sure I agree with that GCC policy, it did cause me to re-visit the code and add some actual error checking.  I figured that if write() returning an error was something that "could never happen", then let's enshrine that fact in the code:
    s = write(fd, str_gpio, len);  assert(s == len);

Hmm ... different warning:
warning: unused variable 's' [-Wunused-variable]
     s = write(fd, str_gpio, len);  assert(s == len);
     ^

Huh?  I'm using it right there!  Back to Google.  Apparently, you can define a preprocessor variable (NDEBUG) to inhibit the assert code.  Some programmers like to have their asserts enabled during testing, but disabled for production for improved efficiency.  The compiler sees that the condition testing code is conditionally compiled, and decides to play it safe and throw the warning that "s" isn't used, even if the condition code is compiled in.  And yes, this also featured in the same flame wars over void casting.  I wasn't the first person to use exactly this technique to try to get rid of warnings.

*sigh*

So I ended up doing what lots of the flame war participants bemoaned having to do: writing my own assert:
#define ASSRT(cond_expr) do {\
  if (!(cond_expr)) {\
    fprintf(stderr, "ASSRT failed at %s:%d (%s)", __FILE__, __LINE__, #cond_expr);\
    fflush(stderr);\
    abort();\
} } while (0)
...
    s = write(fd, str_gpio, len);  ASSRT(s == len);

Finally, no warnings!  And better code too (not throwing away the return value).  I just don't like creating my own assert. :-(

Wednesday, September 30, 2015

Coding Pet Peeves

Ok, these things are not the end of the world.  During a code inspection, I might not even bring them up.  Or maybe I would - it would depend on my mood.


INT OR BOOLEAN

    if (use_tls != 0) {
        /* Code having to do with TLS */
    }

So, what is "use_tls"?  It's obviously somewhat of a flag - if it's non-zero then we need to do TLS stuff - but is it a count?  If it's a count, then it should have been named "use_tls_cnt", or maybe just "tls_cnt".  If it's being used as a boolean, then it should have tested as a boolean:

    if (use_tls) {
        /* Code having to do with TLS */
    }

And yes, this is actual code that I've worked on.  It was being used as a boolean, and should have been tested as one.  Is this a big deal?  No, like I said, I might not have brought it up in a review.  But I am a believer that the code should communicate information to the reader.  The boolean usage tells the reader more than the numeric version (and is easier on the eyes as well).


MALLOC AND FREE

    state_info = OUR_MALLOC_1(sizeof(state_info_t));
    /* Code using state_info */
    free(state_info);

Again, actual code I've worked on.  We have a code macro for malloc which does some useful things.  But the author couldn't think of anything useful to do with free so he didn't make a free macro.  He should have.  There have been at least two times (that I know of) when it would have been tremendously useful.  One time we just wanted to log a message for each malloc and free because we suspected we were double-freeing.  Another time we thought that a freed structure was being accessed, and we wanted free to write a pattern on the memory before freeing.

"No problem," you say, "just create that macro and name it free."  Nope.  There are other mallocs: OUR_MALLOC_2, OUR_MALLOC_3, etc (no they aren't actually named that).  And some of the code doesn't use any of the macros, it just calls malloc directly!  For that second feature (writing a pattern before freeing), you need special code in the malloc part as well as the free part.  These all should have been done consistently so that OUR_MALLOC_1 only works with OUR_FREE_1, etc.  That would have allowed us to do powerful things, but as it is, we couldn't.

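To make the point concrete, here is a sketch of what a matched malloc/free pair might look like.  These are not the real macros from that code base; our_malloc_1(), our_free_1(), and the header layout are all invented for illustration (and it assumes C11 for max_align_t).  The free side can only write a pattern because the malloc side remembered the size.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hidden header in front of each block.  The union with max_align_t
 * keeps the caller's data suitably aligned. */
typedef union {
    size_t size;
    max_align_t align;
} our_hdr_t;

/* Hypothetical: allocate "size" bytes, pre-filled with 0xa5. */
static void *our_malloc_1(size_t size)
{
    our_hdr_t *hdr = malloc(sizeof(our_hdr_t) + size);
    if (hdr == NULL) abort();
    hdr->size = size;             /* remembered for our_free_1() */
    memset(hdr + 1, 0xa5, size);
    return hdr + 1;               /* caller sees only the data area */
}

/* Hypothetical: overwrite the block with 0x5a, then free it.
 * Must ONLY be given pointers that came from our_malloc_1(). */
static void our_free_1(void *ptr)
{
    our_hdr_t *hdr;
    if (ptr == NULL) return;
    hdr = (our_hdr_t *)ptr - 1;
    memset(ptr, 0x5a, hdr->size);
    free(hdr);
}
```

Be aware that an optimizing compiler is allowed to drop a memset that comes right before a free, so a real version would need something like a volatile write to make the pattern stick.  A log message in each function would cover the double-free hunt too.
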

UNNATURAL CONDITIONAL ORDER

I bet there is an official name for this:

    if ('q' == cmd) break;

I've always felt that code should be written the way you would explain it to somebody.  Would you say, "If q is the command, then exit the loop."  No.  You would say, "If the command is q, then exit the loop."  I've seen this construct a lot where they put the constant in front of the comparison operator, and it is almost always awkward.

Oh, I know there's a good reason for it: it's a form of defensive programming.  Consider:

    if (cmd = 'q') break;

Oops, forgot an equals.  But the compiler will happily assign 'q' to cmd, test it against zero, and always do the break.  Swap the variable and the constant and you get a compile error.

But come on.  Gcc supports the "-Wparentheses" option, which warns about the above.  In fact, gcc supports a *lot* of nice optional warnings, which people should be using all the time.  "-Wall" is your friend!


DECLARE CLOSE TO FIRST USE

Ok, I'm more of a recent convert here.  I've written hundreds of functions like this:

    void my_funct(blah blah blah)
    {
        int something = 0;
        int something_else = 1;
        .... and another 10 or 15 variables

In the REALLY old days, you *had* to declare all local variables immediately after the function's open brace.  But modern C compilers let you declare variables anywhere you want.  Variables should be declared and initialized close to their first use.  This has two advantages: as the reader is reading the code and he encounters a variable, he can see right away what its type and initial value are without having to scroll up and back down again.  But more importantly, when the reader sees the variable declared, he knows that the variable is not used anywhere above that declaration.  Whereas if it is declared at the top of the function, he has to examine all the function code above the first use to make sure that it isn't secretly being changed from its initial value.


POINTLESS COMMENTS

    i++;  /* increment i */

Oh, thank goodness somebody taught me what the "++" operator does!  Note to programmer: you don't need to teach me the language.

    fclose(file[i]);  /* close the ith file */

Are you sure?  I thought the "fclose" function *opened* a file.  Note to programmer: you don't have to teach me the standard library functions.  They all have man pages.

    session_cleanup(client[i]);  /* clean up the client session */

OK, session_cleanup() probably doesn't have a man page.  But the comment doesn't contain any more information than the function and variable names!

One might add comments to say what the variable "i" is for and what the files in the "file[]" array are for.  That at least gives information that a reader might not know.  But even those should not be comments, they should be descriptive variable names.

Most comments I've encountered should simply be removed as unnecessary clutter that makes it hard to read the code.  The best comments don't try to explain *what* the code is doing, they explain *why* it is being done.  Usually that is pretty obvious, so only give the "why" explanations when the answer may not be obvious.

    client_num++;

    /* Usually we would close the log file *after* we clean up the
     * session, in case the cleanup code wants to log an error.  But
     * the session_cleanup() function is legacy and is written to
     * open and close the log file itself on error.  */
    fclose(log_file[client_num]);
    session_cleanup(client[client_num]);

That's better.


PETTY PEEVE

"Really Steve?  Brace indention?  What are you, 12?"

(grumble)  I know, I should have grown past this way back when I realized that emacs vs. vi just doesn't matter.  And really, I don't much care between the various popular styles.  If writing from scratch, I put the open brace at the end of the "if", but I'm also happy to conform to almost any style.

But I just recently had the pleasure of peeking inside the OpenSSL source code and I saw something I hadn't seen for many years:

    if (blah)
        {
        code_to_handle_blah;
        }

It's called "Whitesmiths" and supposedly has advantages.  But those advantages seem more philosophical than practical ("the alignment of the braces with the block that emphasizes the fact that the entire block is conceptually (as well as programmatically) a single compound statement").  Um ... sorry, I prefer actual utility over making a subtle point.  Having the close brace unindented makes it easier to find the end of the block.  The Jargon File claims that, "Surveys have shown the Allman and Whitesmiths styles to be the most common, with about equal mind shares."  Really?  I've been a C programmer for over 20 years with 7 different companies, and none of those places used Whitesmiths.  This is only the second time I've even seen it.  Equal mind share?  Here's a survey which has Allman 17 times as popular as Whitesmiths.  I think the Jargon File got it wrong.

And yeah, I know that if I were immersed in it for a few months, I wouldn't even notice it any more.  Just like emacs vs. vi, it doesn't really matter, so long as a project is internally consistent.  But hey, this is *my* rant, and I don't like it!  So there.

Wednesday, July 8, 2015

UTF-8, Unicode, and International Character Sets (Oh My!)

I'm a little embarrassed to admit that in my entire 35-year career, I've never really learned anything about international character sets (unicode, etc).  And I still know very little.  But recently I had to learn enough to at least be able to talk somewhat intelligently, so I thought I would speed up somebody else's learnings a bit.

A more in-depth page that I like is: http://www.cprogramming.com/tutorial/unicode.html

Unicode - a standard which, among other things, specifies a unique code point (numeric value) to correspond with a printed glyph (character form).  The standard defines a code point space from 0 to 1,114,111 (0 to 0x10FFFF), but as of June 2015 it only assigns 120,737 of those code points to actual glyphs.

UTF-8 - a standard which is basically a means to encode a stream of numeric values with a range from 0 to 2,147,483,647 (0 to 0x7FFFFFFF), using a variable number of bytes (1-6) to represent each value such that numerically smaller values are represented with fewer bytes.  For example, the number 17 requires a single byte to represent, whereas the number 2,000,000,000 requires 6 bytes.  Notice that the largest Unicode code point only requires 4 bytes to represent; the current UTF-8 specification (RFC 3629) actually limits encodings to those 4 bytes.
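To see where those byte counts come from, here is a small sketch that computes how many bytes the original 1-6 byte scheme needs for a given value.  The function name is made up; the thresholds follow from the number of payload bits each length can carry (7, 11, 16, 21, 26, and 31).

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes the original (1-6 byte) UTF-8
 * scheme needs to encode "value", or 0 if the value is out of range. */
static int utf8_encoded_len(uint32_t value)
{
    if (value < 0x80) return 1;          /*  7 payload bits */
    if (value < 0x800) return 2;         /* 11 payload bits */
    if (value < 0x10000) return 3;       /* 16 payload bits */
    if (value < 0x200000) return 4;      /* 21 payload bits */
    if (value < 0x4000000) return 5;     /* 26 payload bits */
    if (value <= 0x7FFFFFFF) return 6;   /* 31 payload bits */
    return 0;
}
```

So 17 needs 1 byte, 2,000,000,000 needs 6, and the largest Unicode code point (0x10FFFF) lands in the 4-byte range.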


So the Unicode standard specifies that a particular numeric value corresponds to a specific printed character form, and UTF-8 specifies a way to encode those numeric values.  Although the UTF-8 encoding scheme could theoretically be used to encode numbers for any purpose, it was designed to encode Unicode characters reasonably efficiently.  The efficiency derives from the fact that Unicode biases the most-frequently used characters to smaller numeric values.

UTF-8 and Unicode were designed to be backward compatible with 7-bit ASCII, in that the "printable" ASCII characters have numeric values equal to the first 127 Unicode characters (1-127), and UTF-8 can represent those values with a single byte.  Thus the string "ABC" is represented in ASCII as 3 bytes with values 0x41, 0x42, 0x43, and those same three bytes are the valid UTF-8 encoding of the same string in Unicode.  Thus, an application designed to read UTF-8 input is able to read plain ASCII text.  And an application which only understands 7-bit ASCII can read a UTF-8 file which restricts itself to the first 127 Unicode characters.

Another nice thing about UTF-8 is that the bytes of a multi-byte character cannot be confused with normal ASCII characters; every byte of a multi-byte character has the most-significant bit set.  For example, the tilde-n character "ñ" has a Unicode code point 241, and the UTF-8 encoding of 241 is the two-byte sequence 0xC3, 0xB1.  It is also easy to differentiate the first byte of a multi-byte character from subsequent bytes.  Thus, if you pick a random byte in a UTF-8 buffer, it is easy to detect whether the byte is part of a multi-byte character, and easy to find that character's first byte, or to move past it to the next character.
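Here is a sketch of that "pick a random byte" trick.  The function names are invented, and the code assumes the buffer holds valid UTF-8.  The key fact is that every non-first byte of a multi-byte character has the bit pattern 10xxxxxx, and no first byte does.

```c
#include <stddef.h>

/* True if "b" is a continuation (non-first) byte: bit pattern 10xxxxxx. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: index of the first byte of the character that
 * contains buf[i]. */
static size_t utf8_char_start(const unsigned char *buf, size_t i)
{
    while (i > 0 && utf8_is_continuation(buf[i])) i--;
    return i;
}

/* Hypothetical helper: index of the first byte of the character after
 * the one starting at buf[i], or "len" if there isn't one. */
static size_t utf8_next_char(const unsigned char *buf, size_t len, size_t i)
{
    i++;
    while (i < len && utf8_is_continuation(buf[i])) i++;
    return i;
}
```

For the bytes 0x61, 0xC3, 0xB1, 0x7A ("a", "ñ", "z"), landing on index 2 backs up to index 1, and stepping forward from index 1 goes to index 3.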

One thing to notice about UTF-8 is that it is trivially easy to contrive input which is illegal and will not properly parse.  For example, the byte pair 0xC3, 0x41 is an illegal sequence in UTF-8 (0xC3 introduces a 2-byte character, and all bytes in a multi-byte character *must* have the most-significant bit set).


Other Encoding Schemes

There are other Unicode encoding schemes, such as UCS-2 and UTF-16, but their usage is declining.  UCS-2 cannot represent all characters of the current Unicode standard, and UTF-16 suffers from problems of ambiguous endianness (byte ordering).  Neither is backward compatible with ASCII text.

Another common non-Unicode-based encoding scheme is ISO-8859-1.  Its advantage is that all characters are represented in a single byte.  Its disadvantage is that it only covers a small fraction of the world's languages.  It is backward compatible with ASCII text, but it is *not* compatible with UTF-8.  For example, the byte sequence 0xC3, 0x41 is a perfectly valid ISO-8859-1 sequence ("ÃA") but is illegal in UTF-8.  According to Wikipedia, ISO-8859-1 usage has been declining since 2006 while UTF-8 has been increasing.

There are a bunch of other encoding schemes, most of which are variations on the ISO-8859-1 standard, but they represent a small installed base and are not growing nearly as fast as UTF-8.

Unfortunately, there is not a reliable way to detect the encoding scheme being used simply by examining the input.  The input data either needs to be paired with metadata which identifies the encoding (like a mime header), or the user simply has to know what he has and what his software expects.

Most Unixes have available a program named "iconv" which will convert files of pretty much any encoding scheme to any other.  The user is responsible for telling "iconv" what format the input file has.


Programming with Unicode Data

Java and C# have significant features which allow them to process Unicode strings fairly well, but not perfectly.  The Java "char" type is 16 bits, which at the time Java was being defined, was adequate to hold Unicode.  But Unicode evolved to cover more writing systems and 16 bits is no longer adequate, so Java now supports UTF-16 which encodes a Unicode character in either 1 or 2 of those 16-bit chars.  Not being much of a Java or C# programmer, I can't say much more about them.

In C, Unicode is not handled natively at all.

A programmer needs to decide an encoding scheme for in-memory storage of text.  One disadvantage to using something like UTF-8 is that the number of bytes per character varies, making it difficult to randomly access characters by their offset.  If you want the 600th character, you have to start at the beginning and parse your way forward.  Thus, random access is O(n) time instead of O(constant) time that usually accompanies arrays.

One approach that evolved a while ago was the use of a "wide character", with the type "wchar_t".  This would allow you to declare an array of "wchar_t" and be able to randomly access it in O(constant) time.  In earlier days, it was thought that Unicode could live within the range 1-65535, so the original "wchar_t" was 16 bits.  Some compilers still have "wchar_t" as 16 bits (most notably Microsoft Visual Studio).  Other compilers have "wchar_t" as 32 bits (most notably gcc), which makes them a candidate for use with full unicode.

Most recent advice I've seen tells programmers to avoid the use of "wchar_t" due to its portability problems and instead use a fixed 32-bit type, like "uint32_t", which sadly did not exist in Windows until Visual Studio 2010, so you still need annoying conditional compiles to make your code truly portable.  Also, an advantage of wchar_t over uint32_t is the availability of wide flavors of many standard C string handling functions (standardized in C99).

Other opinionated programmers have advised against the use of wide characters altogether, claiming that constant time lookup is vastly overrated since most text processing software spends most of its time stepping through text one character at a time.  The use of UTF-8 allows easy movement across multi-byte characters just by looking at the upper bits of each byte.  Also, the library libiconv provides an API to do the conversions that the "iconv" command does.

And yet, I can understand the attraction of wide (32-bit) characters.  Imagine I have a large code base which does string manipulation.  The hope is that by changing the type from char to wchar_t (or uint32_t), my for loops and comparisons will "just work".  However, I've seen tons of code which assumes that a character is 1 byte (e.g. it will malloc only the number of characters without multiplying by the sizeof the type), so the chances seem small of any significant code "just working" after changing char to wchar_t or uint32_t.

Finally, note that UTF-8 is compatible with many standard C string functions because null can be safely used to indicate the end of string (some other encoding schemes can have null bytes sprinkled throughout the text).  However, note that the function strchr() is *not* generally UTF-8 compatible since it assumes that every character is a char.  But the function strstr() *is* compatible with UTF-8 (so long as *both* string parameters are encoded with UTF-8).
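Here is a small sketch of the strchr() versus strstr() difference.  The helper names are hypothetical; the only library calls are the standard strstr() and strchr().

```c
#include <string.h>

/* Locate one UTF-8 encoded character (passed as its own short string)
 * inside UTF-8 text.  Returns NULL if it is absent. */
static const char *utf8_find_char(const char *text, const char *ch)
{
    return strstr(text, ch);
}

/* What strchr() can do: find a single byte.  For a multi-byte character
 * that is only a lead byte, which several characters may share. */
static const char *find_byte(const char *text, unsigned char byte)
{
    return strchr(text, byte);
}
```

For example, take the text "a", then n-tilde (bytes C3 B1), then e-acute (bytes C3 A9).  Searching for e-acute with utf8_find_char() lands on the e-acute, but find_byte() with 0xC3 stops at the n-tilde, because both characters begin with the same byte.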


Bottom Line: No Free (or even cheap) Lunch

Unfortunately, there is no easy path.  If I am ever tasked with writing international software, I suspect I will bite the bullet and choose UTF-8 as my internal format.

Thursday, June 4, 2015

C Pointers: Never Too Old to Learn

I've been a C programmer for some 20 years now, and I learned something new about the language just a few days ago.

A less-experienced C coder sent me a question about pointers.  It was in an area which I have seen proficient C coders stumble on.  In my reply, I went on a few digressions to give a better understanding of C pointers.  At one point, I gave the following example:

    char *x1 = "123\n";
    char x2[] = "456\n";
    printf("%p %p", (void *)&x1, (void *)&x2);

&x1 is obvious - it evaluates to the address of the x1 variable (not the string it points at).  I was all ready to proudly claim that &x2 is illegal and generates a syntax error.  After all, x2 is an array, and referencing an array without an index evaluates to the address of the start of the array.  What does it mean to take the address of an address?  For that to make sense, the address of the array would need to be stored in a pointer variable.  But there *is* no pointer variable holding the address of the x2 array; that address is just an ephemeral value generally held in a register.  So &x2 should be illegal.

But it compiled clean, not even a warning!  I got out my trusty K&R 2nd edition (ANSI C) and it says, "The operand [to the & operator] must be an lvalue referring neither to a bit-field nor to an object declared as register, or must be of function type."  Must be an lvalue, and x2 by itself is NOT an lvalue!  (An lvalue can be assigned to: lvalue = rvalue;)  Let's try the C99 spec: "The operand of the unary & operator shall be either a function designator, the result of a [] or unary * operator, or an lvalue that designates an object that is not a bit-field and is not declared with the register storage-class specifier." Different wording, but still specifies an lvalue.

Finally, Google led me to a Stack Overflow question where somebody asked the same thing.  The interesting reply:
In C, when you used the name of an array in an expression (including passing it to a function), unless it is the operand of the address-of (&) operator or the sizeof operator, it decays to a pointer to its first element.
That is, in most contexts array is equivalent to &array[0] in both type and value.
In your example, my_array has type char[100] which decays to a char* when you pass it to printf.
&my_array has type char (*)[100] (pointer to array of 100 char). As it is the operand to &, this is one of the cases that my_array doesn't immediately decay to a pointer to its first element.
The pointer to the array has the same address value as a pointer to the first element of the array as an array object is just a contiguous sequence of its elements, but a pointer to an array has a different type to a pointer to an element of that array. This is important when you do pointer arithmetic on the two types of pointer.
(Thanks to Charles Bailey for that answer back in 2010.)

Sure enough, x2 and &x2 both evaluate to the same pointer value, but x2+1 adds 1 to the address, and &x2+1 adds 5 to the address.
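The difference in pointer arithmetic can be measured directly.  A minimal sketch (the function names are mine, purely for illustration) that reports how many bytes "+1" moves each kind of pointer:

```c
#include <stddef.h>

static char x2[] = "456\n";   /* 5 bytes: '4' '5' '6' '\n' '\0' */

/* x2 decays to char *, so +1 steps over one char. */
static ptrdiff_t step_from_element_pointer(void)
{
    return (char *)(x2 + 1) - (char *)x2;
}

/* &x2 has type char (*)[5], so +1 steps over the whole array. */
static ptrdiff_t step_from_array_pointer(void)
{
    return (char *)(&x2 + 1) - (char *)&x2;
}
```

The casts to char * are only there so the two addresses can be subtracted as byte counts; &x2 + 1 points one past the end of the array, which is legal to compute but not to dereference.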

I can't find this mentioned in the formal C language specs I checked, but my experiments with the excellent site http://gcc.godbolt.org suggest that it is commonly implemented.  A less-formal C book also mentions it.


C Pointers vs. Arrays

So much for my newly-learned bit of C.  Since I'm on the subject, here's a rewrite of the digression that triggered my investigation.

What does the following line do?

    char *x1 = "123\n";

The quoted string causes C to allocate an anonymous 5-byte character array in memory and initializes it (at program load time) with 31 32 33 0a 00 (hex).  I call it "anonymous" because there is no identifier (variable name) which directly refers to that array.  All you have is the address of the array, which is what a quoted string evaluates to in an expression.  That address is assigned into a pointer variable, x1.

So far, so good.  Now:

    char x2[] = "456\n";

This does *NOT* allocate an anonymous character array.  It allocates a 5-character array, assigns it the variable name "x2", and pre-initializes it with 34 35 36 0a 00 (hex).  So the x2 variable (an array) is completely different from x1 (a pointer).
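The byte values are easy to check for yourself.  A sketch of a small dump helper (the name dump_string_bytes is hypothetical) that formats a string's bytes, terminating NUL included, as hex:

```c
#include <stdio.h>
#include <string.h>

/* Write the bytes of s, including its terminating NUL, into out as
 * space-separated two-digit hex.  Output is truncated if out is too small. */
static void dump_string_bytes(const char *s, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len = strlen(s) + 1;   /* +1 to include the NUL */
    size_t used = 0;

    if (out_size == 0)
        return;
    out[0] = '\0';
    /* Each byte needs at most 3 characters, plus room for the final NUL. */
    for (size_t i = 0; i < len && used + 3 < out_size; i++)
        used += (size_t)snprintf(out + used, out_size - used,
                                 i ? " %02x" : "%02x", (unsigned)p[i]);
}
```

It works the same whether it is handed the pointer x1 or the array x2.  On an ASCII system the newline shows up as 0a and the terminator as 00.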

However, you can use the two variables in the same ways.  For example, this will do what you expect:

    printf("%s%s", x1, x2);  /* prints two lines, "123" and "456" */

It is easy to think of x1 and x2 as more the same than they really are, but I think it helps to analyze it.  Parameters passed to a function are expressions.  The values actually passed are the results of evaluating those expressions.  The above printf function is passed three expressions: "%s%s", x1, and x2.  Here's how they are handled by C:

  • "%s%s" - as before, the quoted string causes C to allocate an anonymous character array and inits it with the supplied characters.  The quoted string evaluates to the address of that array, and that address is passed as the first parameter.
  • x1 - x1 is a simple variable, so the expression evaluates to the content of the variable.  In this case, that content happens to be the address of the string "123\n".
  • x2 - x2 is *not* a simple variable, it is an array.  Normally, an array should be accessed with an index (e.g. x2[1]).  However, when an unindexed array name appears in an expression, C returns the address of the start of the array.  I.e. x2 is the same as &x2[0].  So x2 passes the address of the string "456\n".  But understand the mechanism is completely different; x1 passes the *content* of the variable x1 while x2 passes the *address* of the variable x2.

Similarly, consider this:

    printf("%c%c\n", x1[1], x2[1]);  /* print "25" */

Here again, the results are similar, but the mechanism is different.

  • x1[1] - x1 is *not* an array, so what does x1[1] mean?  It depends on x1's type.  In this case, it is a character pointer, so C takes the address contained in x1 and adds 1*sizeof(*x1) to it, and fetches the character stored at that address.  Since x1 points to type character, sizeof(*x1) evaluates to 1.  So the expression x1[1] evaluates to the character stored at the address in x1 plus 1, which is the character '2'.
  • x2[1] - this is very simple.  Since x2 is an array, x2[1] simply evaluates to the contents of the second element in the array, which is the character '5'.  Note that C does not have to do any pointer arithmetic, it just accesses the array directly.
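The two indexing cases can be written out explicitly.  This is just a sketch with made-up function names.  One fact worth adding: the language formally defines a[i] as *(a + i) for both pointers and arrays, even though a compiler is free to access an array element directly, as described above.

```c
static const char *x1 = "123\n";   /* pointer to an anonymous array */
static char x2[] = "456\n";        /* the array itself */

/* Indexing through a pointer: take the address stored in x1,
 * add 1 * sizeof(*x1), and fetch the char found there. */
static char second_via_pointer(void)
{
    return *(x1 + 1);
}

/* Indexing an array: fetch the second element of x2. */
static char second_of_array(void)
{
    return x2[1];
}
```

Both return the second character of their string, '2' and '5' respectively.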

In the above two examples, the pointer and the array can be thought of similarly.  This is true most of the time, and often leads to people forgetting that they are actually different things.  For example:

    printf("%zu %zu\n", sizeof(x1), sizeof(x2));

This prints "4 5" on a 32-bit machine and "8 5" on a 64-bit machine.  The first number should be clear: x1 is a pointer, so sizeof(x1) should be the number of bytes in a pointer (4 or 8 depending on address length).  But what about the 5?  This is an interesting deviation from the patterns established previously.  If we followed the previous pattern, sizeof(x2) should deconstruct as x2 evaluating to the address of the start of the x2 array.  Then sizeof(x2) should return the size of the address.  However, sizeof is *not* a normal C function, it's a language built-in operator (the parentheses are optional when the operand is an expression).  Its operand is *not* evaluated as a normal expression; only its type matters.  So sizeof(x1) is not actually the size of the x1 variable, it is the size of the *type* of x1.  Since the type of x2 is an array of 5 characters, sizeof(x2) evaluates to 5.
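A sketch that makes the sizeof behavior testable (function names are hypothetical).  The third function forces the array into an ordinary expression, x2 + 0, where it does become a pointer, to show what the "previous pattern" would have produced:

```c
#include <stddef.h>

static const char *x1 = "123\n";
static char x2[] = "456\n";

/* Size of the pointer variable: 4 or 8 on common machines. */
static size_t size_of_pointer(void)
{
    return sizeof x1;
}

/* Size of the array type char[5]; the array name is not converted here. */
static size_t size_of_array(void)
{
    return sizeof x2;
}

/* In x2 + 0 the array name becomes a char *, so this is a pointer size. */
static size_t size_after_conversion(void)
{
    return sizeof (x2 + 0);
}
```

Note that sizeof is used here without parentheses around a plain variable name, which is allowed because the operand is an expression rather than a type name.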

One more example of x1 and x2 being treated differently, which is the thing I just learned:

    if (x1 == &x1) printf("x1==&x1\n");
    if (x2 == &x2) printf("x2==&x2\n");

The output of the above two lines is the single line "x2==&x2".  Note that this code also generates compiler warnings, since each comparison is between pointers to different types (on the second line, pointer to a byte vs. pointer to an array of 5 bytes).
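If you want the same comparisons without the type mismatch, one option is to cast both sides to a void pointer first.  A sketch, with function names invented for the example:

```c
static const char *x1 = "123\n";
static char x2[] = "456\n";

/* Does the pointer hold its own address?  No: x1 holds the address of
 * the anonymous string, while &x1 is where the pointer itself lives. */
static int pointer_equals_own_address(void)
{
    return (const void *)x1 == (const void *)&x1;
}

/* Does the array name evaluate to the array's own address?  Yes: the
 * first element and the array as a whole start at the same place. */
static int array_equals_own_address(void)
{
    return (void *)x2 == (void *)&x2;
}
```

The casts discard the type difference on purpose, so only the address values are compared.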

The moral?  Although the mechanisms are different, an array and a pointer to an array can *usually* be treated the same in normal code.  But not always.  It's best to have a deep understanding of what's going on.